We now look at how to detect outliers that have an undue influence on the multiple regression model. Keep in mind that since we are dealing with a multi-dimensional model, there may be data points that look perfectly fine in any single dimension but are multivariate outliers. For example, in the general population there is nothing unusual about a 6-foot man or a 125-pound man, but a 6-foot man who weighs 125 pounds is unusual.
Definition 1: The following parameters are indicators that a sample point (xi1, …, xik, yi) is an outlier:
Distance – The residual $e_i = y_i - \hat{y}_i$ is the measure of distance of the ith sample point from the regression line. Points with large residuals are potential outliers.
Leverage – By Property 1 of Method of Least Squares for Multiple Regression, Ŷ = HY where H = [hij] is the n × n hat matrix. Thus for the ith point in the sample,
$$\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{in} y_n = \sum_{j=1}^{n} h_{ij} y_j$$
where each hij only depends on the x values in the sample. Thus the strength of the contribution of sample value yi on the predicted value ŷi is determined by the coefficient hii, which is called the leverage and usually abbreviated as hi.
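As a rough illustration of how leverage falls out of the hat matrix, the following Python sketch builds H from the x values alone and reads the leverages off its diagonal. It assumes NumPy is available and that the model includes an intercept; the function name `hat_matrix` and the data are made up for illustration and are not taken from the examples on this page.

```python
import numpy as np

def hat_matrix(X):
    """Hat matrix H = A (A'A)^-1 A', where A is X with a column of ones prepended.

    Assumes the model has an intercept and that A has full column rank.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    A = np.column_stack([np.ones(X.shape[0]), X])
    # For a full-column-rank A, A @ pinv(A) equals A (A'A)^-1 A'
    return A @ np.linalg.pinv(A)

# Hypothetical sample: n = 6 observations, k = 2 independent variables
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 7], [6, 5]], dtype=float)
y = np.array([3.1, 3.9, 7.2, 7.8, 12.5, 11.9])

H = hat_matrix(X)        # depends only on the x values
leverage = np.diag(H)    # h_i = h_ii
y_hat = H @ y            # predicted values, Y-hat = HY

print("leverage:", np.round(leverage, 3))
print("sum of leverages:", round(leverage.sum(), 6))
```

The diagonal entries sum to k + 1 (the trace of H), which is where the mean leverage of (k+1)/n comes from.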
Observation: Where there is only one independent variable, we have
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$$
Leverage measures how far the x values of a data point are from the mean of the x values. In general 1/n ≤ hi ≤ 1. Where there are k independent variables in the model, the mean value for leverage is (k+1)/n. A rule of thumb (Stevens’) is that leverage values more than 3 times this mean value are considered large.
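The rule of thumb can be sketched in a few lines of Python for the one-variable case, using the standard simple-regression leverage formula h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)². The x values below are hypothetical and chosen so that one point lies far from the rest; NumPy is assumed.

```python
import numpy as np

# Hypothetical x values; the last one lies far from the others
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 40], dtype=float)
n, k = len(x), 1  # k = number of independent variables

# Leverage for simple regression: 1/n plus the squared, scaled distance from the mean
dev = x - x.mean()
h = 1 / n + dev**2 / np.sum(dev**2)

mean_leverage = (k + 1) / n
cutoff = 3 * mean_leverage              # rule-of-thumb threshold
flagged = np.flatnonzero(h > cutoff)    # 0-based indices of high-leverage points

print("leverages:", np.round(h, 3))
print("mean leverage:", mean_leverage, "cutoff:", cutoff)
print("flagged indices (0-based):", flagged)
```

Note that with very small samples the cutoff 3(k+1)/n can reach or exceed 1, in which case no point can be flagged by this rule.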
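As we saw in Residuals, the standard error of the residual ei is
$$s.e.(e_i) = \sqrt{MS_E (1 - h_i)}$$
where MSE is the mean square error of the regression model, and so the studentized residuals si satisfy
$$s_i = \frac{e_i}{\sqrt{MS_E (1 - h_i)}}$$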
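We will use this measure when we define Cook’s distance below. For our purposes now, we need to look at the version of the studentized residual when the ith observation is removed from the model, i.e.
$$t_i = \frac{e_i}{\sqrt{MS_{E(i)} (1 - h_i)}}$$
where MSE(i) is the mean square error of the model fitted without the ith observation. Under the usual regression assumptions, ti follows a t distribution with n − k − 2 degrees of freedom, which is the basis for the t-test used in the examples below.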
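Both versions of the studentized residual can be computed from a single fit. The sketch below assumes the usual formulas s_i = e_i / √(MSE·(1 − h_i)) and t_i = e_i / √(MSE(i)·(1 − h_i)), and uses the standard identity SSE(i) = SSE − e_i²/(1 − h_i) to obtain the leave-one-out MSE without refitting. The function name and the data are hypothetical; NumPy is assumed.

```python
import numpy as np

def studentized_residuals(X, y):
    """Return (s, t): studentized residuals with and without point i in the MSE."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])       # design matrix with intercept
    h = np.diag(A @ np.linalg.pinv(A))         # leverages
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ coef                           # residuals
    df = n - k - 1
    sse = e @ e
    mse = sse / df
    s = e / np.sqrt(mse * (1 - h))
    # Leave-one-out MSE without refitting: SSE(i) = SSE - e_i^2 / (1 - h_i)
    mse_del = (sse - e**2 / (1 - h)) / (df - 1)
    t = e / np.sqrt(mse_del * (1 - h))
    return s, t

# Hypothetical data: the sixth y value sits well above the trend
x_demo = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 15.0, 14.2, 15.9])
s_demo, t_demo = studentized_residuals(x_demo, y_demo)

print("s:", np.round(s_demo, 3))
print("t:", np.round(t_demo, 3))
```

A two-tailed p-value for each t_i could then be taken from a t distribution with n − k − 2 degrees of freedom, for example with `scipy.stats.t.sf` if SciPy is installed.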
Definition 2: If we remove a point from the sample, then the equation of the regression line changes. Points that have the most influence produce the largest change in the equation of the regression line. A measure of this influence is called Cook’s distance. For the ith point in the sample, Cook’s distance is defined as
$$D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{(k+1)\, MS_E}$$
where ŷj(i) is the prediction of yj by the revised regression model when the point (xi1, …, xik, yi) is removed from the sample, and MSE is the mean square error of the full model.
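Another measure of influence is DFFITS, which is defined by the formula
$$DFFITS_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{MS_{E(i)}\, h_i}}$$
where ŷi(i) and MSE(i) are calculated from the model with the ith point removed.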
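To make the two definitions concrete, the following sketch computes Cook's distance and DFFITS the slow way, by refitting the model n times with one observation left out each time. It assumes the standard definitions D_i = Σ_j (ŷ_j − ŷ_j(i))² / ((k+1)·MSE) and DFFITS_i = (ŷ_i − ŷ_i(i)) / √(MSE(i)·h_i). The function name and the data are hypothetical; NumPy is assumed.

```python
import numpy as np

def influence_by_refitting(X, y):
    """Cook's distance and DFFITS computed directly from their definitions."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    h = np.diag(A @ np.linalg.pinv(A))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    mse = np.sum((y - y_hat) ** 2) / (n - k - 1)

    cooks_d = np.empty(n)
    dffits = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                  # drop observation i
        coef_i, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        y_hat_i = A @ coef_i                      # revised predictions for all n points
        resid_i = y[keep] - A[keep] @ coef_i
        mse_i = resid_i @ resid_i / (n - k - 2)   # MSE of the revised model
        cooks_d[i] = np.sum((y_hat - y_hat_i) ** 2) / ((k + 1) * mse)
        dffits[i] = (y_hat[i] - y_hat_i[i]) / np.sqrt(mse_i * h[i])
    return cooks_d, dffits

# Hypothetical data: the sixth y value sits well above the trend
x_demo = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 15.0, 14.2, 15.9])
cooks_d, dffits = influence_by_refitting(x_demo, y_demo)

print("Cook's D:", np.round(cooks_d, 3))
print("DFFITS:  ", np.round(dffits, 3))
```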
Property 1: Cook’s distance can be given by the following equation:
$$D_i = \frac{e_i^2}{(k+1)\, MS_E} \cdot \frac{h_i}{(1 - h_i)^2}$$
Observation: Property 1 means that we don’t need to perform repeated regressions to obtain Cook’s distance. Furthermore, Cook’s distance combines the effects of distance and leverage to obtain one metric. The formula in Property 1 is equivalent to
$$D_i = \frac{s_i^2}{k+1} \cdot \frac{h_i}{1 - h_i}$$
where si is the studentized residual of the ith point.
Values of Cook’s distance of 1 or greater are generally viewed as high.
Similarly DFFITS can be calculated without repeated regressions as shown by Property 2.
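Property 2: DFFITS can be given by the following equation:
$$DFFITS_i = t_i \sqrt{\frac{h_i}{1 - h_i}}$$
where ti is the studentized residual with the ith observation removed.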
Observation: Values of |DFFITS| > 1 are potential problems in small to medium samples and values of |DFFITS| > 2 are potential problems in large samples.
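The shortcut formulas can be sketched as follows. This assumes the usual closed forms D_i = e_i² / ((k+1)·MSE) · h_i / (1 − h_i)² and DFFITS_i = t_i·√(h_i / (1 − h_i)), so that everything comes from one regression fit. The function name, the data and the choice to flag |DFFITS| > 1 (the small-to-medium-sample cutoff) are illustrative assumptions; NumPy is assumed.

```python
import numpy as np

def influence_closed_form(X, y):
    """Leverage, deleted studentized residuals, Cook's D and DFFITS from one fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    h = np.diag(A @ np.linalg.pinv(A))
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ coef
    sse = e @ e
    mse = sse / (n - k - 1)
    # MSE with point i removed, obtained without refitting
    mse_del = (sse - e**2 / (1 - h)) / (n - k - 2)
    t = e / np.sqrt(mse_del * (1 - h))
    cooks_d = e**2 / ((k + 1) * mse) * h / (1 - h) ** 2
    dffits = t * np.sqrt(h / (1 - h))
    return h, t, cooks_d, dffits

# Hypothetical sample with k = 2 independent variables
X_demo = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 7],
                   [6, 5], [7, 8], [8, 6], [9, 10], [10, 9]], dtype=float)
y_demo = np.array([3.2, 3.8, 7.1, 7.9, 12.4, 11.8, 15.9, 14.7, 19.6, 25.0])

h, t, cooks_d, dffits = influence_closed_form(X_demo, y_demo)
high_cook = np.flatnonzero(cooks_d >= 1)           # Cook's D of 1 or greater
high_dffits = np.flatnonzero(np.abs(dffits) > 1)   # small-to-medium sample cutoff

print("Cook's D:", np.round(cooks_d, 3))
print("DFFITS:  ", np.round(dffits, 3))
print("high Cook's D (0-based):", high_cook)
print("high |DFFITS| (0-based):", high_dffits)
```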
Example 1: Find any outliers or influencers for the data in Example 1 of Regression Analysis. What happens if we change the Life Expectancy measure for the fourth data element from 53 to 83?
Figure 1 implements the various statistics described above for the data in Example 1.
Figure 2 describes some of the formulas in the worksheet in Figure 1:
Figure 2 – Formulas in Figure 1
The formulas for the first observation in the table (row 6) of Figure 1 are displayed in Figure 3. The formulas for the other observations are similar.
Figure 3 – Formulas for observation 1 in Figure 1
As you can see, no data element in Figure 1 has a significant t-test (p-value < .05), a high Cook’s distance (> 1) or a high DFFITS (> 1). We now repeat the same analysis where the Life Expectancy measure for the fourth data element (cell B9) is changed from 53 to 83 (Figure 4).
This time we see that the fourth observation has a significant t-test (.0096 < .05) indicating a potential outlier and a high Cook’s distance (1.58 > 1) and high DFFITS (2.71 > 1) indicating an influencer. Observation 13 also has a significant t-test (.034 < .05). Observations 3 and 14 are also close to having a significant t-test and observation 14 is close to having a high DFFITS value. The charts in Figure 5 graphically show the influence the change in observation 4 has on the regression line.
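The same kind of before-and-after comparison can be sketched in Python. The data below are made up (they are not the Life Expectancy data from Example 1); the fourth y value is raised by 30 to mimic the effect of a single changed observation. Cook's distance is computed with the simple-regression forms of the leverage and Property 1 formulas, and NumPy is assumed.

```python
import numpy as np

def fit_and_cooks(x, y):
    """Fit y = intercept + slope * x and return (slope, intercept, Cook's D)."""
    slope, intercept = np.polyfit(x, y, 1)
    e = y - (intercept + slope * x)
    n = len(x)
    dev = x - x.mean()
    h = 1 / n + dev**2 / np.sum(dev**2)        # leverage, one independent variable
    mse = e @ e / (n - 2)
    d = e**2 / (2 * mse) * h / (1 - h) ** 2    # Cook's D with k + 1 = 2
    return slope, intercept, d

# Hypothetical data lying close to the line y = 1 + 2x
x = np.arange(1, 11, dtype=float)
y = np.array([3.1, 4.8, 7.15, 8.9, 11.05, 12.95, 15.2, 16.85, 19.1, 20.9])

y_changed = y.copy()
y_changed[3] += 30.0                           # alter the fourth observation only

slope_before, intercept_before, d_before = fit_and_cooks(x, y)
slope_after, intercept_after, d_after = fit_and_cooks(x, y_changed)

print("before: slope %.3f, intercept %.3f" % (slope_before, intercept_before))
print("after:  slope %.3f, intercept %.3f" % (slope_after, intercept_after))
print("Cook's D before:", np.round(d_before, 3))
print("Cook's D after: ", np.round(d_after, 3))
```

In this made-up sample the altered point has the largest Cook's distance after the change, and both the slope and the intercept of the fitted line shift.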
We can also see the change in the plot of the studentized residuals vs. x data elements. Here it is even more apparent that the revised fourth observation is an outlier (in Version 2).
In the simple regression case it is relatively easy to spot potential outliers. This is not so in the multiple regression case. We consider this in the next example.
Example 2: Find any outliers or influencers for the data in Example 1 of Method of Least Squares for Multiple Regression.
The approach is similar to that used in Example 1. The output of the analysis is given in Figure 7.
All the calculations are similar to those in Example 1, except that this time we need to use the hat matrix H to calculate leverage. E.g. to calculate the leverage for the first observation (cell X22) we use the following formula
i.e. the diagonal values in the hat matrix contained in range Q4:AA14 (see Figure 1 of Residuals).
Alternatively, we can calculate the n × 1 vector of leverage entries, using the DIAG supplemental function (see Basic Concepts of Matrices), as follows:
Note too that we have added the standardized residuals (column W), which we didn’t show in Figure 1. We use MSE as an estimate of the variance, and so e.g. cell W22 contains the formula =V22/SQRT($O$19). Note that this estimate of variance is different from the one used in Excel’s Regression data analysis tool (see Figure 6 of Multiple Regression Analysis).
As we can see from Figure 7 there are no clear outliers or influencers, although the t-test for the first observation is .050942, which is close to being significant (as a potential outlier). This is consistent with what we can observe in Figure 8 of Multiple Regression Analysis and Figure 2 of Residuals.
Real Statistics Function: The Real Statistics Resource Pack provides the following supplemental array function where R1 is an n × k array containing X sample data.
LEVERAGE(R1) = n × 1 vector of leverage values
Real Statistics Data Analysis Tool: The Real Statistics Resource Pack also provides a Cook’s D supplemental data analysis tool which outputs a table similar to that shown in Figure 7.
To use this tool for Example 2, perform the following steps: Press Ctrl-m and then select Linear Regression from the menu. A dialog box will then appear as in Figure 12 of Multiple Regression Analysis. Next enter A3:B14 for Input Range X and C3:C14 for Input Range Y, click on Column Headings included with data, retain the value .05 for Alpha, select the Residuals and Cook’s D option and click on the OK button. The output is similar to the table displayed in Figure 7.