Another problem faced when collecting data is that some data may be missing. For example, in conducting a survey with ten questions, perhaps some of the people who take the survey don’t answer all ten questions. In Identifying Outliers and Missing Data we show how to identify missing data using a supplemental data analysis tool provided in the Real Statistics Resource Pack.
A simple approach for dealing with missing data is to throw out all the data for any sample missing one or more data elements. One problem with this approach is that the sample size will be reduced. This is particularly relevant when the reduced sample size is too small to obtain significant results in the analysis. In this case additional sample data elements may need to be collected. This problem is a bigger than might first be evident. E.g. if a questionnaire with 5 questions is randomly missing 10% of the data, then on average almost 60% of the sample will have at least one question missing.
Also it is often the case that the missing data is not randomly distributed. E.g. people filling out a long questionnaire may give up at some point and not answer any further questions, or they may be offended or embarrassed by a particular question and choose not to answer it. These are characteristics that might be quite relevant to the analysis.
In general there are the following types of remedies for missing data:
- Delete the samples with any missing data elements
- Impute the value of the missing data
- Remove a variable (e.g. a particular question in the case of a questionnaire or survey) which has a high incidence of missing data, especially if there are other variables (e.g. questions) which measure similar aspects of the characteristics being studied.
Deleting Missing Data
Of particular importance is the randomness of the missing data. E.g. suppose a lot of people didn’t answer question 5 but everyone answered question 7. If the frequency of the responses to question 7 changes significantly when samples which are missing responses to question 5 are dropped, then the missing data is not random, and so dropping samples can bias the results of the analysis. In this case either another remedy should be employed or the analysis should be run twice: once with samples with missing data retained (e.g. by adding a “no response” for missing data) and once with these samples dropped.
Missing data can be removed by using the following supplemental Excel functions found in the Real Statistics Resource Pack.
Real Statistics Functions:
DELBLANK(R1, s) – fills the highlighted range with the data in range R1 (by columns) omitting any empty cells
DELNonNum(R1, s) – fills the highlighted range with the data in range R1 (by columns) omitting any non-numeric cells
DELROWBLANK(R1, b) – fills the highlighted range with the data in range R1 omitting any row which has one or more empty cells; if b is True then the first row of R1 (presumably containing column headings) is always copied (even if it contains an empty cell); this argument is optional and defaults to b = False.
DELROWNonNum(R1, b) – fills the highlighted range with the data in range R1 omitting any row which has one or more non-numeric cells; if b is True then the first row of R1 (presumably containing column headings) is always is always copied (even if it contains a non-numeric cell); this argument is optional and defaults to b = False.
The string s is used as a filler in case the output range has more cells than R1. This argument is optional and defaults to the error value #N/A. See Data Conversion and Reformatting for an example of the use of these functions.
Also see Data Conversion and Reformatting for how to use the supplemental Reformat Data Range data analysis tool found in the Real Statistics Resource Pack to accomplish the same objectives.
Imputing the values for missing data
Some techniques for imputing values for missing data include:
- Substituting the missing data with another observation which is considered similar, either taken from another sample or from a previous study
- Using the mean of all the non-missing data elements for that variable. This might be acceptable in cases with a small number of missing data elements, but otherwise it can distort the distribution of the data (e.g. by reducing the variance) or by lowering the observed correlations (see Basic Concepts of Correlation).
- Using regression techniques. In this approach regression (as described in Regression and Multiple Regression) is used to predict the value of the missing data element based on the relationship between that variable and other variables. This approach reinforces existing relationships and so makes it more likely that the analysis will characterize the sample and not the general population.