Outliers and Influencers

We now look at how to detect outliers that have an undue influence on the multiple regression model. Keep in mind that since we are dealing with a multi-dimensional model, there may be data points that look perfectly fine in any single dimension but are multivariate outliers. E.g. for the general population there is nothing unusual about a 6-foot man or a 125-pound man, but a 6-foot man who weighs 125 pounds is unusual.

Definition 1: The following measures are indicators that a sample point $(x_{i1}, …, x_{ik}, y_i)$ is an outlier:

Distance – the raw residual

$e_i = y_i - \hat{y}_i$

is the measure of the distance of the ith sample point from the regression line. Points with large residuals are potential outliers.

Leverage – By Property 1 of Method of Least Squares for Multiple Regression, $\hat{Y} = HY$ where H is the n × n hat matrix $[h_{ij}]$. Thus for the ith point in the sample,

$\hat{y}_i = \sum_{j=1}^n h_{ij} y_j$

where each $h_{ij}$ depends only on the x values in the sample. Thus the strength of the contribution of the observed value $y_i$ to the predicted value $\hat{y}_i$ is determined by the coefficient $h_{ii}$, which is called the leverage and is usually abbreviated $h_i$.

Observation: When there is only one independent variable, we have

$h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2}$

Leverage therefore measures how far the x value of a data point is from the mean of the x values. In general $1/n \le h_i \le 1$. When there are k independent variables in the model, the mean value of the leverage is (k+1)/n. A rule of thumb (Stevens) is that values more than 3 times this mean value are considered large.
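The hat-matrix calculation can also be sketched outside Excel. The following Python snippet is a minimal sketch using made-up data and variable names of my own choosing (it is not the Real Statistics implementation): it builds $H = X(X^TX)^{-1}X^T$, reads the leverage values off the diagonal, and confirms the single-variable shortcut and the mean value (k+1)/n mentioned above.

```python
import numpy as np

# Illustrative data (not from the examples on this site): one independent variable
x = np.array([4.0, 7.0, 11.0, 15.0, 16.0, 20.0, 22.0, 25.0])
y = np.array([33.0, 40.0, 38.0, 47.0, 56.0, 52.0, 60.0, 63.0])
n, k = len(x), 1

# Design matrix with a column of 1s for the intercept
X = np.column_stack([np.ones(n), x])

# Hat matrix H = X (X'X)^-1 X'; the leverage h_i is its ith diagonal element
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# With one independent variable this matches 1/n + (x_i - x_bar)^2 / SS_x
shortcut = 1/n + (x - x.mean())**2 / ((x - x.mean())**2).sum()
print(np.allclose(leverage, shortcut))   # True
print(leverage.mean(), (k + 1)/n)        # mean leverage equals (k+1)/n
```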

As we saw in Residuals, the standard error of the residual $e_i$ is

$s_{e_i} = \sqrt{MSE\,(1 - h_i)}$

and so the studentized residuals $s_i$ are given by

$s_i = \dfrac{e_i}{s_{e_i}} = \dfrac{e_i}{\sqrt{MSE\,(1 - h_i)}}$

We will use this measure when we define Cook's distance below. For our purposes now, we also need the version of the studentized residual when the ith observation is removed from the model (RStudent), i.e.

$t_i = \dfrac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_i)}}$

where

$MSE_{(i)} = \dfrac{(n-k-1)\,MSE - \dfrac{e_i^2}{1 - h_i}}{n-k-2}$

i.e.

$t_i = e_i \sqrt{\dfrac{n-k-2}{SSE\,(1 - h_i) - e_i^2}}$
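These quantities are easy to compute once the residuals and leverages are known. Here is a hedged sketch in Python (illustrative data and variable names again, not the worksheet from the examples) that calculates the studentized residuals $s_i$ and the deleted studentized residuals $t_i$ from the formulas above, and checks that the two forms of $t_i$ agree.

```python
import numpy as np

# Illustrative data: one independent variable
x = np.array([4.0, 7.0, 11.0, 15.0, 16.0, 20.0, 22.0, 25.0])
y = np.array([33.0, 40.0, 38.0, 47.0, 56.0, 52.0, 60.0, 63.0])
n, k = len(x), 1

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta                               # raw residuals
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)  # leverage values

df_e = n - k - 1
sse = (e**2).sum()
mse = sse / df_e

s = e / np.sqrt(mse * (1 - h))                         # studentized residuals
mse_del = (df_e * mse - e**2 / (1 - h)) / (df_e - 1)   # MSE with observation i removed
t = e / np.sqrt(mse_del * (1 - h))                     # deleted studentized residuals (RStudent)

# equivalent closed form for the deleted studentized residuals
t_alt = e * np.sqrt((n - k - 2) / (sse * (1 - h) - e**2))
print(np.allclose(t, t_alt))                           # True
```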

Definition 2: If we remove a point from the sample, then the equation of the regression line changes. Points that have the most influence produce the largest change in the equation of the regression line. A measure of this influence is called Cook's distance. For the ith point in the sample, Cook's distance is defined as

$D_i = \dfrac{\sum_{j=1}^n (\hat{y}_j - \hat{y}_{j(i)})^2}{(k+1)\,MSE}$

where $\hat{y}_{j(i)}$ is the prediction of $y_j$ by the revised regression model when the point $(x_{i1}, …, x_{ik}, y_i)$ is removed from the sample.
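This definition can be applied directly by refitting the regression once for each observation. The following Python sketch (same illustrative data as above; variable names are hypothetical) computes Cook's distance exactly as defined, comparing the full-model predictions $\hat{y}_j$ with the predictions $\hat{y}_{j(i)}$ of the model refit without observation i.

```python
import numpy as np

x = np.array([4.0, 7.0, 11.0, 15.0, 16.0, 20.0, 22.0, 25.0])
y = np.array([33.0, 40.0, 38.0, 47.0, 56.0, 52.0, 60.0, 63.0])
n, k = len(x), 1

X = np.column_stack([np.ones(n), x])
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
mse = ((y - y_hat)**2).sum() / (n - k - 1)

cooks_d = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i                                  # drop observation i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    y_hat_i = X @ beta_i                                      # predictions for all n points
    cooks_d[i] = ((y_hat - y_hat_i)**2).sum() / ((k + 1) * mse)

print(np.round(cooks_d, 4))
```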

Another measure of influence is DFFITS, which is defined by the formula

$DFFITS_i = \dfrac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{MSE_{(i)}\,h_i}}$

Whereas Cook's distance is a measure of the change in the whole vector of predicted values when the ith point is removed, DFFITS is a measure of the change in the ith predicted value when the ith point is removed.
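As with Cook's distance, the definition can be checked by brute force. In the sketch below (again illustrative data and hypothetical names), the model is refit without observation i and only the prediction for that one point is compared.

```python
import numpy as np

x = np.array([4.0, 7.0, 11.0, 15.0, 16.0, 20.0, 22.0, 25.0])
y = np.array([33.0, 40.0, 38.0, 47.0, 56.0, 52.0, 60.0, 63.0])
n, k = len(x), 1

X = np.column_stack([np.ones(n), x])
y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
e = y - y_hat
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
df_e = n - k - 1
mse = (e**2).sum() / df_e
mse_del = (df_e * mse - e**2 / (1 - h)) / (df_e - 1)   # MSE with observation i removed

dffits = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    y_hat_i = X[i] @ beta_i                            # prediction of y_i without point i
    dffits[i] = (y_hat[i] - y_hat_i) / np.sqrt(mse_del[i] * h[i])

print(np.round(dffits, 4))
```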

Property 1: Cook’s distance can be given by the following equation:

$D_i = \dfrac{e_i^2}{(k+1)\,MSE} \cdot \dfrac{h_i}{(1 - h_i)^2}$

Observation: Property 1 means that we don’t need to perform repeated regressions to obtain Cook’s distance. Furthermore, Cook’s distance combines the effects of distance and leverage to obtain one metric. This definition of Cook’s distance is equivalent to

$D_i = \dfrac{s_i^2}{k+1} \cdot \dfrac{h_i}{1 - h_i}$

Values of Cook’s distance of 1 or greater are generally viewed as high.
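Property 1 can be verified against the brute-force calculation above. The sketch below (same illustrative data) computes Cook's distance from the residuals and leverages alone, with no refitting, and flags any values of 1 or more.

```python
import numpy as np

x = np.array([4.0, 7.0, 11.0, 15.0, 16.0, 20.0, 22.0, 25.0])
y = np.array([33.0, 40.0, 38.0, 47.0, 56.0, 52.0, 60.0, 63.0])
n, k = len(x), 1

X = np.column_stack([np.ones(n), x])
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
mse = (e**2).sum() / (n - k - 1)

# Property 1: Cook's distance without rerunning the regression n times
cooks_d = e**2 * h / ((k + 1) * mse * (1 - h)**2)

# Equivalent form via the studentized residuals s_i
s = e / np.sqrt(mse * (1 - h))
print(np.allclose(cooks_d, s**2 * h / ((k + 1) * (1 - h))))   # True
print(cooks_d > 1)                                            # values of 1 or more are viewed as high
```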

Similarly DFFITS can be calculated without repeated regressions as shown by Property 2.

Property 2: DFFITS can be given by the following equation:

$DFFITS_i = t_i \sqrt{\dfrac{h_i}{1 - h_i}}$

where $t_i$ is the studentized residual with the ith observation removed, as defined above.

Observation: Values of |DFFITS| > 1 are potential problems in small to medium samples, and values of |DFFITS| > $2\sqrt{k/n}$ are potential problems in large samples.
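Property 2 and these cutoffs can be sketched the same way (illustrative data once more; the thresholds are the rules of thumb stated above).

```python
import numpy as np

x = np.array([4.0, 7.0, 11.0, 15.0, 16.0, 20.0, 22.0, 25.0])
y = np.array([33.0, 40.0, 38.0, 47.0, 56.0, 52.0, 60.0, 63.0])
n, k = len(x), 1

X = np.column_stack([np.ones(n), x])
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
df_e = n - k - 1
mse = (e**2).sum() / df_e

# deleted studentized residual t_i, then Property 2
mse_del = (df_e * mse - e**2 / (1 - h)) / (df_e - 1)
t = e / np.sqrt(mse_del * (1 - h))
dffits = t * np.sqrt(h / (1 - h))

print(np.round(dffits, 4))
print(np.abs(dffits) > 1)                      # rule of thumb for small/medium samples
print(np.abs(dffits) > 2 * np.sqrt(k / n))     # rule of thumb for large samples
```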

Example 1: Find any outliers or influencers for the data in Example 1 of Regression Analysis. What happens if we change the Life Expectancy measure for the fourth data element from 53 to 83?

Figure 1 implements the various statistics described above for the data in Example 1.


Figure 1 – Test for outliers and influencers for data in Example 1

Figure 2 describes some of the formulas in the worksheet in Figure 1:


Figure 2 – Formulas in Figure 1

The formulas for the first observation in the table (row 6) of Figure 1 are displayed in Figure 3. The formulas for the other observations are similar.


Figure 3 – Formulas for observation 1 in Figure 1

As you can see, no data element in Figure 1 has a significant t-test (p < .05), a high Cook's distance (> 1) or a high DFFITS (> 1). We now repeat the same analysis where the Life Expectancy measure for the fourth data element (cell B9) is changed from 53 to 83 (Figure 4).


Figure 4 – Test for outliers and influencers for revised data

This time we see that the fourth observation has a significant t-test (.0096 < .05), indicating a potential outlier, as well as a high Cook's distance (1.58 > 1) and a high DFFITS (2.71 > 1), indicating an influencer. Observation 13 also has a significant t-test (.034 < .05). Observations 3 and 14 are also close to having a significant t-test, and observation 14 is close to having a high DFFITS value. The charts in Figure 5 graphically show the influence that the change in observation 4 has on the regression line.


Figure 5 – Change in regression lines

We can also see the change in the plot of the studentized residuals vs. x data elements. Here it is even more apparent that the revised fourth observation is an outlier (in Version 2).


Figure 6 – Change in studentized residuals

In the simple regression case it is relatively easy to spot potential outliers. This is much harder in the multivariate case, as we see in the next example.

Example 2: Find any outliers or influencers for the data in Example 1 of Method of Least Squares for Multiple Regression.

The approach is similar to that used in Example 1. The output of the analysis is given in Figure 7.


Figure 7 – Test for outliers and influencers for data in Example 2

The formulas in Figure 7 refer to cells described in Figure 3 of Method of Least Squares for Multiple Regression and Figure 1 of Residuals, which contain references to n, k, MSE, dfE and Y-hat.

All the calculations are similar to those in Example 1, except that this time we need to use the hat matrix H to calculate leverage. E.g. to calculate the leverage for the first observation (cell X22) we use the following formula

=INDEX(Q4:AA14,Q22,Q22)

i.e. the diagonal values in the hat matrix contained in range Q4:AA14 (see Figure 1 of Residuals).

Alternatively, we can calculate the n × 1 vector of leverage entries, using the DIAG supplemental function (see Basic Concepts of Matrices), as follows:

=DIAG(Q4:AA14)

Note too that we have added the standardized residuals (column W), which we didn’t show in Figure 1. We use MSE as an estimate of the variance, and so e.g. cell W22 contains the formula =V22/SQRT($O$19). Note that this estimate of variance is different from the one used in Excel’s Regression data analysis tool (see Figure 6 of Multiple Regression Analysis).

As we can see from Figure 7 there are no clear outliers or influencers, although the t-test for the first observation is .050942, which is close to being significant (as a potential outlier). This is consistent with what we can observe in Figure 8 of Multiple Regression Analysis and Figure 2 of Residuals.

Real Statistics Function: The Real Statistics Resource Pack provides the following supplemental array function, where R1 is an n × k array containing the X sample data.

LEVERAGE(R1) = n × 1 vector of leverage values

Real Statistics Data Analysis Tool: The Real Statistics Resource Pack also provides a Cook’s D supplemental data analysis tool which outputs a table similar to that shown in Figure 7.

To use this tool for Example 2, perform the following steps: Enter Ctrl-m and then select Linear Regression from the menu. A dialog box will then appear as in Figure 12 of Multiple Regression Analysis. Next enter A3:B14 for Input Range X and C3:C14 for Input Range Y, click on Column Headings included with data, retain the value .05 for Alpha, select the Residuals and Cook's D option and click on the OK button. The output is similar to the table displayed in Figure 7.

28 Responses to Outliers and Influencers

  1. emre says:

    The leverage values calculated by SPSS are different from these leverage values. What is the reason for this?

    • Charles says:

      I don’t know since I don’t use SPSS. If you send me an Excel file with your data and calculations I will try to understand why there may be a difference.
      Charles

  2. Artur says:

    Hi

    Your results in Figure 3 for Mod MSE, Rstudent , T-test and DFFITS are incorrect.

    To calculate Mod MSE in Figure 3 you used MS = 63.5 from Figure 1 rather than MS=90.29 from Figure 3

  3. Claudio says:

    Dear Charles, when reading about testing on residuals, homogeneity of variance in the data is always assumed, but this is not the case with surveying data. What would you suggest? Thank you.

    • Charles says:

      Claudio,
      Looking at residuals is important for many reasons, not just for analyzing homogeneity of variances.
      I am not sure what you mean by “What do you suggest?”, though.
      Charles

      • Claudio says:

        Charles, I performed a least squares adjustment where the data have different variances, applying a weight matrix (the inverse of the covariance matrix). Is it possible to make a QQ plot of the residuals when the data have different variances? Do I have to standardize the residuals first?
        Claudio

        • Charles says:

          Claudio,
          Sorry, but I don’t understand your question. You are creating a QQ plot of what data?
          Charles

          • CLAUDIO JUSTO says:

            Sorry for the delay. The data are a levelling network and I'm going to use different methods to find outliers. I'll plot studentized residuals vs. adjusted height differences.

  4. Ehsan says:

    Hi Charles,

    Could you please provide me with some information on how I should delete outlier data in each data set other than multi regression?

    Thanks,
    Ehsan

  5. Phillip says:

    Thanks for the reply Charles. Can you provide any insight into whether or not IVs are automatically excluded from the model if they have a high degree of multi-collinearity? It seems SPSS does this automatically and I'm thinking it could account for some of the difference. Also, is yours a SS type III estimator?
    Thanks again; Best, Phillip

  6. Kevin says:

    Your equation for Mod MSE in Figure 3 has an error. You seem to have flipped df and MSe. Should be (I$25-I6^2/(1-J6)/I$24)*….

    • Charles says:

      Kevin,
      The formula I actually used (as shown in the Examples Workbook) is =(I$25-I6^2/((1-J6)*$I$24))*$I$24/($I$24-1). I believe this is the same as the formula you listed (let me know if this is not so). The formula that is shown in Figure 3 of the referenced webpage is wrong. I will correct this shortly. Thanks for catching this error.
      Charles

  7. Phillip says:

    Thanks for producing these Charles. Cool stuff! I’m running some Cook’s D compared to the Cook’s D output from SPSS. They are very close, but not exactly the same results. Any clues as to where the difference may be? (in my case, your Cook’s D results are slightly lower — from .001 to .01 lower — than the SPSS results; not a huge difference but it led to different classification of 1% of cases)

    • Charles says:

      Phillip,
      I don’t know what the difference is. I don’t have SPSS and so I can’t check. I will try to investigate this in the future.
      Charles

  8. Colin says:

    Sir

    You wrote: “cell W22 contains the formula =V22/SQRT($O$19)” Or should it be V22/SQRT($O$19*(1-hi))?

    Colin

    • Charles says:

      Colin,
      Cell W22 only contains the standardized residual and so the formula is the one given. Perhaps I should have used STDRESID (or something similar) instead of SRESIDUAL to make it clear that the S stands for “standard” not “studentized”.
      Charles

  9. Declan says:

    Hi Rich

    Charles does have a formula in the tool that shows how to do this. It’s the residual divided by the standard deviation of the residuals.

    Regards
    Declan

  10. Declan says:

    Hi there Charles

    Hope you are well and Merry Christmas to you.

    I think the RStudent formula needs to be relooked at, as it does not give the most accurate answer.

    However, when it is done via the hat matrix, your answers are spot on!

    Kind regards
    Declan

    • Charles says:

      Declan,

      Thanks for the heads-up. I will take a look at these in the new year.

      Charles

    • Charles says:

      Declan,

      Merry Christmas and Happy New Year to you as well.

      Charles

      • Malcolm says:

        Dear Sir,

        With regards to Declan’s comment on the RStudent formula, I would be most grateful if you could let me know if it has been looked at.

        Thanks for a brilliant site, makes it clear even to a maths dunderhead like me 🙂

        • Charles says:

          Malcolm,
          The only place I used RStudent is in the Residuals and Cook’s D data analysis tool, which I understood Declan thought was accurate. I also checked with Howell’s textbook to make sure that I implemented it correctly and found that it checked out perfectly.
          Charles

  11. Rich says:

    Hi, Charles

    Thought the new Regression Tool that incorporates the Excel data and your Residual and Cook’s Distance info is really great. Noticed that the Standardized Residuals from Excel’s output didn’t make it into your version. Wondered if a later update might include this column also.

    Thanks
    Rich

  12. Rich says:

    Might you include DFFITS in a future release?
    Thanks,
    Rich

    • Charles says:

      Hi Rich,
      Although I don't support DFFITS currently, I do support studentized residuals and Cook's Distance. A description of these can be found on webpage http://www.real-statistics.com/multiple-regression/outliers-and-influencers/. You can calculate DFFITS by multiplying the studentized residual when the ith observation is removed from the model, i.e. s(i), by (hii/(1-hii))^(1/2) where hii is the ith position on the main diagonal of the hat matrix H.
      I will shortly add DFFITS to the Cook’s Distance data analysis tool found in the Real Statistics Resource Pack (just press Ctrl-m to see it on the list).
      Charles

      Update: DFFITS has now been added to the Cook’s Distance data analysis tool
