We now look at how to detect outliers that have an undue influence on the multiple regression model. Keep in mind that since we are dealing with a multi-dimensional model, there may be data points that look perfectly fine in any single dimension but are multivariate outliers. For example, in the general population there is nothing unusual about a 6-foot man or a 125-pound man, but a 6-foot man who weighs 125 pounds is unusual.

**Definition 1**: The following parameters are indicators that a sample point (*x_{i1}, …, x_{ik}, y_{i}*) is an outlier:

**Distance** – The residual *e_{i}* is the measure of distance of the *i*th sample point from the regression line. Points with large residuals are potential outliers.

**Leverage** – By Property 1 of Method of Least Squares for Multiple Regression, *Y*-hat = *HY* where *H* is the *n* × *n* hat matrix = [*h_{ij}*]. Thus for the *i*th point in the sample,

$$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{in} y_n$$

where each *h_{ij}* only depends on the *x* values in the sample. Thus the strength of the contribution of sample value *y_{i}* on the predicted value *ŷ_{i}* is determined by the coefficient *h_{ii}*, which is called the **leverage** and usually abbreviated as *h_{i}*.

**Observation**: Where there is only one independent variable, we have

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$$

Leverage measures how far away the data point is from the mean value. In general 1/*n* ≤ *h_{i}* ≤ 1. Where there are *k* independent variables in the model, the mean value for leverage is (*k*+1)/*n*. A rule of thumb (Steven’s) is that values 3 times this mean value are considered large.
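As a rough illustration of the leverage calculation outside of Excel, here is a minimal Python sketch. It assumes NumPy is available and uses synthetic data (not the data from the examples on this site); the function name `leverage` is chosen here for illustration and is not the Real Statistics LEVERAGE function.

```python
import numpy as np

def leverage(X):
    """Leverage values h_i: the diagonal of the hat matrix for a model with intercept."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:                       # allow a single predictor given as a 1-D array
        X = X.reshape(-1, 1)
    A = np.column_stack([np.ones(len(X)), X])   # design matrix with an intercept column
    Q, _ = np.linalg.qr(A)                # A = QR, so H = Q Q^T (A assumed full column rank)
    return np.sum(Q ** 2, axis=1)         # diagonal entries of Q Q^T

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))
X[3] = [4.0, -4.0]                        # one point placed far from the rest
n, k = X.shape

h = leverage(X)
mean_leverage = (k + 1) / n               # mean leverage for k independent variables
high_leverage = np.flatnonzero(h > 3 * mean_leverage)   # rule-of-thumb cutoff
print(np.round(h, 3))
print("indices above 3 x mean leverage:", high_leverage)
```

The QR factorization is used only to avoid forming and inverting *X*ᵀ*X* explicitly; computing the full hat matrix and reading off its diagonal would give the same values.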

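As we saw in Residuals, the standard error of the residual *e_{i}* is

$$\text{s.e.}(e_i) = \sqrt{MS_E (1 - h_i)}$$

and so the studentized residuals *s_{i}* are obtained by dividing each residual by its standard error:

$$s_i = \frac{e_i}{\sqrt{MS_E (1 - h_i)}}$$

We will use this measure when we define Cook’s distance below. For our purposes now, we need to look at the version of the studentized residual when the *i*th observation is removed from the model, i.e.

$$s_{(i)} = \frac{e_i}{\sqrt{MS_{E(i)} (1 - h_i)}}$$

where *MS_{E(i)}* is the mean square error of the regression model fitted without the *i*th observation. Under the usual regression assumptions, this statistic has a t distribution with *df_{E}* − 1 = *n* − *k* − 2 degrees of freedom, which is the basis of the t-test used in the examples below. *MS_{E(i)}* can be calculated without refitting the model:

$$MS_{E(i)} = \left(MS_E - \frac{e_i^2}{(1 - h_i)\, df_E}\right) \frac{df_E}{df_E - 1}$$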
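The two kinds of studentized residual can be sketched in Python as follows. This is an illustrative sketch assuming NumPy and synthetic data; the function name `studentized_residuals` is hypothetical. The version with the *i*th observation removed is computed from a single fit, using the adjusted mean square error rather than *n* separate regressions.

```python
import numpy as np

def studentized_residuals(X, y):
    """Return (s, s_del): studentized residuals using MS_E and using MS_E
    with the i-th observation removed."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])      # design matrix with intercept
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q ** 2, axis=1)                # leverage values
    e = y - Q @ (Q.T @ y)                     # residuals y - H y
    df_e = n - k - 1
    mse = e @ e / df_e
    s = e / np.sqrt(mse * (1 - h))
    # MS_E after deleting observation i, obtained without refitting
    mse_del = (mse - e ** 2 / ((1 - h) * df_e)) * df_e / (df_e - 1)
    s_del = e / np.sqrt(mse_del * (1 - h))
    return s, s_del

# Synthetic simple-regression data with one shifted observation (index 3).
rng = np.random.default_rng(1)
x = np.arange(15, dtype=float)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)
y[3] += 8.0

s, s_del = studentized_residuals(x, y)
print(np.round(s, 2))
print(np.round(s_del, 2))
```

If SciPy is installed, a two-tailed p-value for each observation could be obtained with `2 * scipy.stats.t.sf(abs(s_del), n - k - 2)`.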

**Definition 2**: If we remove a point from the sample, then the equation of the regression line changes. Points that have the most **influence** produce the largest change in the equation of the regression line. A measure of this influence is called **Cook’s distance**. For the *i*th point in the sample, Cook’s distance is defined as

$$D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{(k+1)\, MS_E}$$

where ŷ_{j(i)} is the prediction of *y_{j}* by the revised regression model when the point (*x_{i1}, …, x_{ik}, y_{i}*) is removed from the sample.

Another measure of influence is **DFFITS**, which is defined by the formula

$$DFFITS_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{MS_{E(i)}\, h_i}}$$

where *MS_{E(i)}* is the mean square error of the model fitted without the *i*th point.

Whereas Cook’s distance is a measure of the change in the mean vector when the *i*th point is removed, DFFITS is a measure of the change in the *i*th mean when the *i*th point is removed.

**Property 1**: Cook’s distance can be given by the following equation:

$$D_i = \frac{e_i^2}{(k+1)\, MS_E} \cdot \frac{h_i}{(1 - h_i)^2}$$

**Observation**: Property 1 means that we don’t need to perform repeated regressions to obtain Cook’s distance. Furthermore, Cook’s distance combines the effects of distance and leverage to obtain one metric. This definition of Cook’s distance is equivalent to

$$D_i = \frac{s_i^2}{k+1} \cdot \frac{h_i}{1 - h_i}$$

where *s_{i}* is the studentized residual.

Values of Cook’s distance of 1 or greater are generally viewed as high.
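To see that the single-regression formula agrees with the leave-one-out definition, the following Python sketch computes Cook’s distance both ways. It assumes NumPy and synthetic data; the function names are hypothetical and are not part of the Real Statistics Resource Pack.

```python
import numpy as np

def _design(X):
    """Design matrix with an intercept column (helper for this sketch)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    return np.column_stack([np.ones(len(X)), X])

def cooks_distance(X, y):
    """Cook's distance from residuals and leverage, using one regression."""
    A = _design(X)
    y = np.asarray(y, dtype=float)
    n, p = A.shape                         # p = k + 1
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q ** 2, axis=1)
    e = y - Q @ (Q.T @ y)
    mse = e @ e / (n - p)
    return e ** 2 / (p * mse) * h / (1 - h) ** 2

def cooks_distance_refit(X, y):
    """Cook's distance from its definition: refit n times, omitting one point each time."""
    A = _design(X)
    y = np.asarray(y, dtype=float)
    n, p = A.shape
    y_hat = A @ np.linalg.lstsq(A, y, rcond=None)[0]
    mse = np.sum((y - y_hat) ** 2) / (n - p)
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = np.linalg.lstsq(A[keep], y[keep], rcond=None)[0]
        d[i] = np.sum((y_hat - A @ beta_i) ** 2) / (p * mse)
    return d

# Synthetic data with one shifted response value (index 3).
rng = np.random.default_rng(2)
X = rng.normal(size=(14, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=14)
y[3] += 5.0

d_fast = cooks_distance(X, y)
d_slow = cooks_distance_refit(X, y)
print(np.round(d_fast, 3))
print("indices with D >= 1:", np.flatnonzero(d_fast >= 1))
```

The refitting version is included only as a check; in practice the closed-form version is the one to use.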

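Similarly, DFFITS can be calculated without repeated regressions, as shown by Property 2.

**Property 2**: DFFITS can be given by the following equation:

$$DFFITS_i = s_{(i)} \sqrt{\frac{h_i}{1 - h_i}}$$

where *s_{(i)}* is the studentized residual when the *i*th observation is removed from the model and *h_{i}* is the leverage of the *i*th observation.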

**Observation**: Values of |DFFITS| > 1 are potential problems in small to medium samples and values of |DFFITS| > 2 are potential problems in large samples.
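A minimal Python sketch of the DFFITS calculation follows, assuming NumPy and synthetic data; the function name `dffits` is hypothetical. It multiplies the studentized residual with the *i*th observation removed by the square root of *h_i*/(1 − *h_i*), so only one regression is needed.

```python
import numpy as np

def dffits(X, y):
    """DFFITS for each observation, computed from a single regression."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])      # design matrix with intercept
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q ** 2, axis=1)                # leverage values
    e = y - Q @ (Q.T @ y)                     # residuals
    df_e = n - k - 1
    mse = e @ e / df_e
    # MS_E with observation i removed, without refitting
    mse_del = (mse - e ** 2 / ((1 - h) * df_e)) * df_e / (df_e - 1)
    s_del = e / np.sqrt(mse_del * (1 - h))    # studentized residual, i-th point removed
    return s_del * np.sqrt(h / (1 - h))

# Synthetic data with one shifted response value (index 3).
rng = np.random.default_rng(3)
X = rng.normal(size=(14, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=14)
y[3] += 5.0

values = dffits(X, y)
print(np.round(values, 3))
print("indices with |DFFITS| > 1:", np.flatnonzero(np.abs(values) > 1))
```

The cutoff of 1 used in the last line is the small-to-medium sample guideline; for large samples the guideline of 2 would be substituted.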

**Example 1**: Find any outliers or influencers for the data in Example 1 of Regression Analysis. What happens if we change the Life Expectancy measure for the fourth data element from 53 to 83?

Figure 1 implements the various statistics described above for the data in Example 1.

Figure 2 describes some of the formulas in the worksheet in Figure 1:

**Figure 2 – Formulas in Figure 1**

The formulas for the first observation in the table (row 6) of Figure 1 are displayed in Figure 3. The formulas for the other observations are similar.

**Figure 3 – Formulas for observation 1 in Figure 1**

As you can see, no data element in Figure 1 has a significant t-test (< .05), a high Cook’s distance (> 1) or a high DFFITS (> 1). We now repeat the same analysis where the Life Expectancy measure for the fourth data element (cell B9) is changed from 53 to 83 (Figure 4).

This time we see that the fourth observation has a significant t-test (.0096 < .05) indicating a potential outlier and a high Cook’s distance (1.58 > 1) and high DFFITS (2.71 > 1) indicating an influencer. Observation 13 also has a significant t-test (.034 < .05). Observations 3 and 14 are also close to having a significant t-test and observation 14 is close to having a high DFFITS value. The charts in Figure 5 graphically show the influence the change in observation 4 has on the regression line.

We can also see the change in the plot of the studentized residuals vs. *x* data elements. Here it is even more apparent that the revised fourth observation is an outlier (in Version 2).

In the simple regression case, it is relatively easy to spot potential outliers. This is not so in the multiple regression case, which we consider in the next example.

**Example 2**: Find any outliers or influencers for the data in Example 1 of Method of Least Squares for Multiple Regression.

The approach is similar to that used in Example 1. The output of the analysis is given in Figure 7.

The formulas in Figure 7 refer to cells described in Figure 3 of Method of Least Squares for Multiple Regression and Figure 1 of Residuals, which contain references to *n*, *k*, *MS_{E}*, *df_{E}* and *Y*-hat.

All the calculations are similar to those in Example 1, except that this time we need to use the hat matrix *H* to calculate leverage. E.g. to calculate the leverage for the first observation (cell X22) we use the following formula

=INDEX(Q4:AA14,Q22,Q22)

i.e. the diagonal values in the hat matrix contained in range Q4:AA14 (see Figure 1 of Residuals).

Alternatively, we can calculate the *n* × 1 vector of leverage entries, using the DIAG supplemental function (see Basic Concepts of Matrices), as follows:

=DIAG(Q4:AA14)

Note too that we have added the standardized residuals (column W), which we didn’t show in Figure 1. We use *MS _{E}* as an estimate of the variance, and so e.g. cell W22 contains the formula =V22/SQRT($O$19). Note that this estimate of variance is different from the one used in Excel’s Regression data analysis tool (see Figure 6 of Multiple Regression Analysis).

As we can see from Figure 7 there are no clear outliers or influencers, although the t-test for the first observation is .050942, which is close to being significant (as a potential outlier). This is consistent with what we can observe in Figure 8 of Multiple Regression Analysis and Figure 2 of Residuals.

**Real Statistics Function**: The Real Statistics Resource Pack provides the following supplemental array function where R1 is an *n* × *k* array containing X sample data.

**LEVERAGE**(R1) = *n* × 1 vector of leverage values

**Real Statistics Data Analysis Tool**: The Real Statistics Resource Pack also provides a Cook’s D supplemental data analysis tool which outputs a table similar to that shown in Figure 7.

To use this tool for Example 2, perform the following steps: Press **Ctrl-m** and then select **Linear Regression** from the menu. A dialog box will then appear as in Figure 12 of Multiple Regression Analysis. Next enter A3:B14 for **Input Range X** and C3:C14 for **Input range Y**, click on **Column Headings included with data**, retain the value .05 for **Alpha**, select the **Residuals and Cook’s D** option and click on the **OK** button. The output is similar to the table displayed in Figure 7.

Hi

Your results in Figure 3 for Mod MSE, Rstudent , T-test and DFFITS are incorrect.

To calculate Mod MSE in Figure 3 you used MS = 63.5 from Figure 1 rather than MS=90.29 from Figure 3

Dear Charles, when reading about tests on residuals, homogeneity of variance in the data is always assumed, but this is not the case for surveying data. What would you suggest? Thank you.

Claudio,

Looking at residuals is important for many reasons, not just for analyzing homogeneity of variances.

I am not sure what you mean by “What do you suggest?”, though.

Charles

Charles, I performed a least squares adjustment where the data have different variances, applying a weight matrix (the inverse of the covariance matrix). Is it possible to use the QQ plot of residuals when the data have different variances? Do I have to standardize the residuals first?

Claudio

Claudio,

Sorry, but I don’t understand your question. You are creating a QQ plot of what data?

Charles

Sorry for the delay. The data are from a levelling network and I’m going to use different methods to find outliers. I’ll plot studentized residuals vs. adjusted height differences.

thank you

Hi Charles,

Could you please provide me with some information on how I should delete outlier data in each data set other than multi regression?

Thanks,

Ehsan

Ehsan,

There is more information about outliers on the following webpages:

Outliers and Robustness

Identifying Outliers

Grubbs Test and ESD

Charles

Thanks for the reply, Charles. Can you provide any insight into whether or not IVs are automatically excluded from the model if they have a high degree of multicollinearity? It seems SPSS does this automatically and I’m thinking it could account for some of the difference. Also, is yours an SS type III estimator?

Thanks again; Best, Phillip

Phillip,

Excel drops one of the variables if there is perfect collinearity. Otherwise, Excel doesn’t do anything. See the webpage http://www.real-statistics.com/multiple-regression/collinearity/.

I am using SS type III, which is relevant for unbalanced ANOVA models. See http://www.real-statistics.com/multiple-regression/unbalanced-factorial-anova/.

Charles

Your equation for Mod MSE in Figure 3 has an error. You seem to have flipped df and MSe. Should be (I$25-I6^2/(1-J6)/I$24)*….

Kevin,

The formula I actually used (as shown in the Examples Workbook) is =(I$25-I6^2/((1-J6)*$I$24))*$I$24/($I$24-1). I believe this is the same as the formula you listed (let me know if this is not so). The formula that is shown in Figure 3 of the referenced webpage is wrong. I will correct this shortly. Thanks for catching this error.

Charles

Thanks for producing these Charles. Cool stuff! I’m running some Cook’s D compared to the Cook’s D output from SPSS. They are very close, but not exactly the same results. Any clues as to where the difference may be? (in my case, your Cook’s D results are slightly lower — from .001 to .01 lower — than the SPSS results; not a huge difference but it led to different classification of 1% of cases)

Phillip,

I don’t know what the difference is. I don’t have SPSS and so I can’t check. I will try to investigate this in the future.

Charles

Sir

You wrote: “cell W22 contains the formula =V22/SQRT($O$19)” Or should it be V22/SQRT($O$19*(1-hi))?

Colin

Colin,

Cell W22 only contains the standardized residual and so the formula is the one given. Perhaps I should have used STDRESID (or something similar) instead of SRESIDUAL to make it clear that the S stands for “standard” not “studentized”.

Charles

Hi Rich

Charles does have a formula in the tool that shows how to do this. It’s the residual divided by the standard deviation of the residuals.

Regards

Declan

Hi there Charles

Hope you are well and Merry Christmas to you.

I think the RStudent formula needs to be relooked at, as it does not give the most accurate answer.

However, when it is done via the hat matrix, your answers are spot on!

Kind regards

Declan

Declan,

Thanks for the heads-up. I will take a look at these in the new year.

Charles

Declan,

Merry Christmas and Happy New Year to you as well.

Charles

Dear Sir,

With regards to Declan’s comment on the RStudent formula, I would be most grateful if you could let me know if it has been looked at.

Thanks for a brilliant site, makes it clear even to a maths dunderhead like me 🙂

Malcolm,

The only place I used RStudent is in the Residuals and Cook’s D data analysis tool, which I understood Declan thought was accurate. I also checked with Howell’s textbook to make sure that I implemented it correctly and found that it checked out perfectly.

Charles

Hi, Charles

Thought the new Regression Tool that incorporates the Excel data and your Residual and Cook’s Distance info is really great. Noticed that the Standardized Residuals from Excel’s output didn’t make it into your version. Wondered if a later update might include this column also.

Thanks

Rich

Might you include DFFITS in a future release?

Thanks,

Rich

Hi Rich,

Although I don’t support DFFITS currently, I do support studentized residuals and Cook’s Distance. A description of these can be found on webpage http://www.real-statistics.com/multiple-regression/outliers-and-influencers/. You can calculate DFFITS by multiplying the studentized residual when the ith observation is removed from the model, i.e. s_{(i)}, by (h_{ii}/(1-h_{ii}))^{1/2}, where h_{ii} is the ith position on the main diagonal of the hat matrix H. I will shortly add DFFITS to the Cook’s Distance data analysis tool found in the Real Statistics Resource Pack (just press Ctrl-m to see it on the list).

Charles

Update: DFFITS has now been added to the Cook’s Distance data analysis tool