Testing for outliers
We have the following ways of identifying the presence of outliers:
- Side by side plotting of the raw data (histograms and box plots)
- Examination of residuals
Residuals are defined as for Levene’s test, namely:

e_ij = x_ij − x̄_j

where x_ij is the i-th observation in the j-th group and x̄_j is the j-th group mean. The residual is a measure of how far away an observation is from its group mean value (our best estimate of that value). If an observation has a large residual, we consider it a potential outlier. To determine how large a residual must be before it is classified as an outlier, we use the fact that if the population is normally distributed, then the residuals are also normally distributed with mean 0 and standard deviation σ, the common within-group standard deviation (estimated by sW = SQRT(MSW)).
Thus approximately 4.55% of the observations should be more than 2 standard deviations away from their group mean, 1.24% should be more than 2.5 standard deviations away, 0.27% should be more than 3 standard deviations away, etc. Any departure from these norms can be viewed as indicating the presence of potential outliers.
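These percentages follow directly from standard normal tail probabilities. A quick Python check, using the identity P(|Z| > k) = erfc(k/√2) for a standard normal Z (equivalent to 2*(1 − NORMSDIST(k)) in Excel terms):

```python
from math import erfc, sqrt

# Two-sided tail probability P(|Z| > k) for a standard normal Z;
# erfc(k / sqrt(2)) equals 2 * (1 - NORMSDIST(k)) in Excel terms
for k in (2, 2.5, 3):
    print(f"P(|Z| > {k}) = {erfc(k / sqrt(2)):.2%}")
# Prints: 4.55%, 1.24%, 0.27%
```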
Example 1: Determine whether there are any outliers for the data in Example 2 of Basic Concepts for ANOVA if we change the first sample for Method 4 to 185 (instead of 85).
Figure 1 – Identifying outliers for data in Example 1
In Figure 1, we calculate MSW as we did in Example 3 of Basic Concepts for ANOVA. From this we calculate sW = SQRT(MSW) = 24.2, the pooled estimate of the within-group standard deviation. Since the residuals are presumed to be normally distributed with mean 0 and standard deviation estimated by sW, we can standardize the residuals to obtain the results in range B21:E28. For example, cell B21 contains the formula =(B4-B$12)/$F$16.
We define an outlier based on a chosen significance level. Here we take α = .02, and so we label an original data item as a potential outlier if its standardized residual falls in the critical range defined by α, i.e. if NORMSDIST(ABS(standardized residual)) > 1 − α/2.
For example, we see that the first sample element for Method 4 is labeled an outlier (in cell J21) since NORMSDIST(ABS(E21)) = NORMSDIST(4.16) > .99 = 1 − α/2.
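Outside of Excel, the same standardization can be sketched in a few lines. The following Python sketch uses made-up data (not the actual values from Example 1); the function name flag_outliers and the group values are hypothetical:

```python
import math
from statistics import NormalDist

def flag_outliers(groups, alpha=0.02):
    """Return (group, value, standardized residual) for potential outliers.

    groups: dict mapping group name -> list of observations.
    An observation is flagged when |x - group mean| / s_W exceeds the
    two-sided critical value z_{alpha/2}, where s_W = sqrt(MSW).
    """
    means = {g: sum(xs) / len(xs) for g, xs in groups.items()}
    ssw = sum((x - means[g]) ** 2 for g, xs in groups.items() for x in xs)
    dfw = sum(len(xs) - 1 for xs in groups.values())
    sw = math.sqrt(ssw / dfw)                    # s_W = sqrt(MSW)
    zcrit = NormalDist().inv_cdf(1 - alpha / 2)  # about 2.326 for alpha = .02
    return [(g, x, (x - means[g]) / sw)
            for g, xs in groups.items() for x in xs
            if abs(x - means[g]) / sw > zcrit]

# Hypothetical data: one inflated value (185) in the last group
data = {
    "Method 1": [51, 87, 50, 48, 79, 61],
    "Method 2": [82, 91, 92, 80, 52, 79],
    "Method 3": [79, 84, 74, 98, 63, 83],
    "Method 4": [185, 80, 65, 68, 71, 67],
}
print(flag_outliers(data))  # flags only the 185 in Method 4
```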
Dealing with outliers
Once a potential outlier has been identified, first check the data to make sure it is not a data-entry or data-coding error. If it is not, you can conduct a sensitivity analysis as follows to see how much the outlying observations affect your results.
- Run ANOVA on the entire data.
- Remove outlier(s) and rerun the ANOVA.
- If the results are essentially the same, then you can report the analysis of the full data and note that the outliers did not influence the results.
- If the results are different, try running a non-parametric test (e.g. Kruskal-Wallis) or simply report your analysis with and without the outlier.
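The steps above can be sketched in Python with scipy (the samples below are hypothetical; scipy.stats.f_oneway and scipy.stats.kruskal carry out one-way ANOVA and the Kruskal-Wallis test):

```python
from scipy.stats import f_oneway, kruskal

# Hypothetical samples; the 185 in the last group is the suspected outlier
g1 = [51, 87, 50, 48, 79, 61]
g2 = [82, 91, 92, 80, 52, 79]
g3 = [185, 80, 65, 68, 71, 67]

f_full, p_full = f_oneway(g1, g2, g3)      # step 1: ANOVA on the full data
f_trim, p_trim = f_oneway(g1, g2, g3[1:])  # step 2: rerun with the outlier removed
h_stat, p_kw = kruskal(g1, g2, g3)         # fallback: rank-based test

print(f"full data:       p = {p_full:.3f}")
print(f"outlier removed: p = {p_trim:.3f}")
print(f"Kruskal-Wallis:  p = {p_kw:.3f}")
```

If the two ANOVA p-values lead to the same conclusion, the full-data analysis can be reported; otherwise the rank-based result offers a robustness check.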
Two other approaches for dealing with outliers are to use trimmed means or Winsorized samples (as described in Outliers and Robustness) or to use a transformation. In particular, a reciprocal transformation f(x) = 1/x can be useful.
For example, suppose that in measuring response times for rats in a maze, the following times were recorded:
20, 21, 24, 26, 30, 31, 33, 95, 230
It is quite possible that two of the rats simply got bored or distracted, and so these two long times distort the results. In this case, a reciprocal transformation tends to reduce the effect of the long times; effectively, you are transforming time into speed. The transformed data are:
.0500, .0476, .0417, .0385, .0333, .0323, .0303, .0105, .0043
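The transformed values are easy to verify; a one-line Python check (rounding to four decimal places, as in the list above):

```python
times = [20, 21, 24, 26, 30, 31, 33, 95, 230]
speeds = [round(1 / t, 4) for t in times]
print(speeds)
# The two largest times (95 and 230) are pulled in toward the rest of the data
```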