**Testing for outliers**

We have the following ways of identifying the presence of outliers:

- Side by side plotting of the raw data (histograms and box plots)
- Examination of residuals

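Residuals are defined as for Levene’s test, namely:

$$e_{ij} = x_{ij} - \bar{x}_j$$

where $x_{ij}$ is the $i$th observation in group $j$ and $\bar{x}_j$ is the mean of group $j$.

The residual is a measure of how far away an observation is from its group mean value (our best guess of the value). If an observation has a large residual, we consider it a potential outlier. To determine how large a residual must be to be classified as an outlier, we use the fact that if the population is normally distributed, then the residuals are also normally distributed, with mean 0 and a standard deviation that can be estimated by $s_W = \sqrt{MS_W}$, the square root of the within-groups mean square.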

Thus approximately 4.55% of the observations should be more than 2 standard deviations away from their group mean, 1.24% should be more than 2.5 standard deviations away, 0.27% should be more than 3 standard deviations away, etc. Any deviations from these norms can be viewed as indicating the presence of potential outliers.
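As a quick check, the two-sided tail percentages of a standard normal distribution can be computed directly. This is a minimal sketch using only Python's standard library; the function name `two_sided_tail` is an illustrative choice, not something from the original text.

```python
import math

def two_sided_tail(k: float) -> float:
    """Return P(|Z| > k) for a standard normal variable Z."""
    # For a standard normal, P(|Z| > k) equals erfc(k / sqrt(2)).
    return math.erfc(k / math.sqrt(2))

for k in (2.0, 2.5, 3.0):
    print(f"more than {k} standard deviations away: {two_sided_tail(k):.2%}")
```

The same function can be used with any other cutoff, for example 3.5 standard deviations.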

**Example 1**: Determine whether there are any outliers for the data in Example 2 of Basic Concepts for ANOVA if we change the first sample for Method 4 to 185 (instead of 85).

**Figure 1 – Identifying outliers for data in Example 1**

In Figure 1, we calculate $MS_W$ as we did in Example 3 of Basic Concepts for ANOVA. From this we calculate the standard error $s_W = \sqrt{MS_W} = 24.2$. Since residuals are presumed to be normally distributed with mean 0 and standard deviation $s_W$, we can standardize the residuals to obtain the results in range B21:E28. For example, cell B21 contains the formula =(B4-B$12)/$F$16.

We define an outlier based on an alpha value. Here we choose *α* = .02, and so we label any of the original data items as an outlier if its standardized residual has normal value in the critical range defined by alpha.

E.g. we see that the first sample element for Method 4 is labeled an outlier (in cell J21) since NORMSDIST(ABS(E21)) = NORMSDIST(4.16) > .99 = 1 – *α*/2.
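The spreadsheet procedure can also be sketched in Python. The data below are hypothetical (they are not the values from Figure 1), and the function name `flag_outliers` is an assumption made for illustration. The sketch computes the group means, the within-groups standard deviation, the standardized residuals, and then applies the same rule as NORMSDIST(ABS(z)) > 1 – *α*/2.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical data for four methods; NOT the data from Figure 1.
groups = {
    "Method 1": [50, 55, 60, 65, 70],
    "Method 2": [45, 50, 55, 60, 65],
    "Method 3": [60, 62, 64, 66, 68],
    "Method 4": [185, 70, 75, 80, 90],
}

def flag_outliers(groups, alpha=0.02):
    """Return (group, value, standardized residual) for each flagged value."""
    n_total = sum(len(values) for values in groups.values())
    df_within = n_total - len(groups)
    means = {name: mean(values) for name, values in groups.items()}
    # Within-groups sum of squares: squared residuals pooled over all groups.
    ss_within = sum(
        (x - means[name]) ** 2 for name, values in groups.items() for x in values
    )
    s_w = sqrt(ss_within / df_within)  # square root of MS_W
    std_normal = NormalDist()
    flagged = []
    for name, values in groups.items():
        for x in values:
            z = (x - means[name]) / s_w  # standardized residual
            if std_normal.cdf(abs(z)) > 1 - alpha / 2:
                flagged.append((name, x, round(z, 2)))
    return flagged

print(flag_outliers(groups))
```

With these made-up numbers, only the value 185 in Method 4 is flagged. Lowering `alpha` makes the rule stricter, so fewer values are labeled as outliers.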

**Dealing with outliers**

Once a potential outlier has been identified, first check the data to make sure the outlier is not a data entry or data coding error. If it is not, you can conduct a sensitivity analysis as follows to see how much the outlying observations affect your results.

- Run ANOVA on the entire data set.
- Remove outlier(s) and rerun the ANOVA.
- If the results are the same then you can report the analysis on the full data and report that the outliers did not influence the results.
- If the results are different, try running a non-parametric test (e.g. Kruskal-Wallis) or simply report your analysis with and without the outlier.
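The sensitivity analysis above can be sketched as follows. This is an illustration under stated assumptions: the data are hypothetical, the helper `one_way_f` is a name chosen here, and only the F statistic is computed because Python's standard library has no F distribution. To obtain p-values you could use a statistics package, for example SciPy's `scipy.stats.f_oneway` (or `scipy.stats.kruskal` for Kruskal-Wallis).

```python
from statistics import mean

def one_way_f(samples):
    """Return the one-way ANOVA F statistic for a list of samples."""
    all_values = [x for sample in samples for x in sample]
    grand_mean = mean(all_values)
    k, n = len(samples), len(all_values)
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    ss_within = sum((x - mean(s)) ** 2 for s in samples for x in s)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical data; 185 in the last sample is the suspected outlier.
full = [
    [50, 55, 60, 65, 70],
    [45, 50, 55, 60, 65],
    [60, 62, 64, 66, 68],
    [185, 70, 75, 80, 90],
]
# Same data with the suspected outlier removed.
without_outlier = full[:3] + [[x for x in full[3] if x != 185]]

f_full = one_way_f(full)
f_reduced = one_way_f(without_outlier)
print(f"F with outlier:    {f_full:.2f}")
print(f"F without outlier: {f_reduced:.2f}")
```

If the two F statistics (and their p-values) lead to the same conclusion, the outlier has little influence; if they differ, report both analyses or switch to a non-parametric test.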

Two other approaches for dealing with outliers are to use trimmed means or Winsorized samples (as described in Outliers and Robustness) or to use a transformation. In particular, a reciprocal transformation *f*(*x*) = 1/*x* can be useful.

For example, suppose the following response times were recorded for rats running a maze:

20, 21, 24, 26, 30, 31, 33, 95, 230

It is quite possible that two of the rats simply got bored or got distracted and so the results are quite distorted. In this case, the use of a reciprocal transformation tends to reduce the effect of long times. Effectively you are transforming time into speed. The transformed data are:

.0500, .0476, .0417, .0385, .0333, .0323, .0303, .0105, .0043
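The transformation is easy to reproduce. The sketch below recomputes the reciprocals and, as one rough way to see the effect, compares how many sample standard deviations the most extreme observation lies from the mean before and after the transformation. The helper `extreme_z` is an illustrative name, not part of the original text.

```python
from statistics import mean, stdev

times = [20, 21, 24, 26, 30, 31, 33, 95, 230]
speeds = [1 / t for t in times]  # reciprocal transformation f(x) = 1/x

formatted = [f"{s:.4f}" for s in speeds]
print(formatted)

def extreme_z(values):
    """Largest absolute distance from the mean, in sample standard deviations."""
    m, s = mean(values), stdev(values)
    return max(abs(v - m) / s for v in values)

print(f"most extreme time:  {extreme_z(times):.2f} standard deviations")
print(f"most extreme speed: {extreme_z(speeds):.2f} standard deviations")
```

On the reciprocal scale the longest time (230) becomes the smallest speed and sits noticeably closer to the rest of the data.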

More informative

Hi

I have a nested design for Gage R&R.

How can I detect outliers in this nested design, which is based on ANOVA? Is it the same way that you mentioned above, or is there a different way? What software could help me detect outliers in nested Gage R&R, and which methods can deal with these outliers?

Mohammed,

Unless I have misunderstood your question, whether a data element is an outlier is not dependent on whether you are conducting a Gage R&R (nested or not) or some other research. An outlier is simply a data element that is unusually large or small compared to either other data elements or your expectations. If data is normally distributed, then a data element that is 3 standard deviations higher or lower than the mean occurs about once out of every 370 times. If you have a random sample with 1,000 elements, you would expect to have one or more elements that are 3 standard deviations above or below the mean, but if your sample size is 50, you may not expect to have such an outlier.

There is also nothing magical about the value of 3 standard deviations. You could use 2.5 standard deviations instead (or some other metric). Also, just because a data element is an outlier doesn’t necessarily mean that you should ignore it.

The Real Statistics website provides various tests and tools for identifying potential outliers. See the following webpage:

Charles

Hi Charles

First, my objective is to:

- take data without outliers and analyze the data
- put outliers in the data (one on each operator and one on all)
- analyze the data with the outliers
- identify the outliers in the data and handle them
- find the best method to identify and handle the outliers

My data contains 30 measurements (3 operators, 5 parts, 2 replications).

My question is: is there any method that can identify and handle outliers in nested ANOVA?

The webpage you mentioned above does not appear.

Thanks

Mohammed,

I can’t think of any reasons why dealing with outliers is different for nested ANOVA.

Which webpage does not appear?

Charles

None of the pages appear.

Mohammed,

I don’t know why the pages don’t appear. They appear on my computer. If necessary, copy each link manually into your browser.

Charles

Dear Charles

Can you send the webpages by email?

Thanks

meysm27@gmail.com

Mohammed,

I don’t know why you can’t access the webpages via your browser. They are clearly accessible and so I don’t want to find myself sending out emails with this information.

Charles

Hi Charles,

How did you calculate the residual values in this example?

Thank you,

Jonathan,

E.g. cell B21 contains the formula =(B4-B$12)/$F$16.

Charles