Residuals

In Multiple Regression Analysis, we noted that the assumptions for the regression model can be expressed in terms of the error random variables εi as follows:

  • Linearity: The εi have mean of 0
  • Independence: The εi are independent
  • Normality: The εi are normally distributed
  • Homogeneity of variances: The εi have the same variance σ²

If these assumptions are satisfied, then the random errors εi can be regarded as a random sample from a N(0, σ) distribution. It is natural, therefore, to test our assumptions for the regression model by investigating the sample observations of the residuals

$e_i = y_i - \hat{y}_i$

where ŷi is the value of y predicted by the regression model for the i-th observation.

It turns out that the raw residuals ei have the distribution

$e_i \sim N\left(0, \sigma\sqrt{1-h_{ii}}\right)$

where the hii are the terms in the diagonal of the hat matrix defined in Definition 3 of Method of Least Squares for Multiple Regression. It also turns out that raw residuals are not independent.
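
Observation: Both facts follow from writing the residuals in terms of the hat matrix. The following is a brief sketch, assuming the usual matrix form of the model, y = Xβ + ε with H = X(XᵀX)⁻¹Xᵀ. Here I is the n × n identity matrix and the hij are the off-diagonal elements of H.

```latex
\begin{aligned}
e &= y - \hat{y} = (I - H)y = (I - H)(X\beta + \varepsilon) = (I - H)\varepsilon
  && \text{since } HX = X \\
\operatorname{Var}(e) &= (I - H)\,\sigma^2 I\,(I - H)^T = \sigma^2 (I - H)
  && \text{since } I - H \text{ is symmetric and idempotent} \\
\operatorname{Var}(e_i) &= \sigma^2 (1 - h_{ii}), \qquad
\operatorname{Cov}(e_i, e_j) = -\sigma^2 h_{ij} \quad (i \ne j)
\end{aligned}
```

Since the hij are generally not zero, the raw residuals are correlated, and each has standard deviation σ√(1 − hii), which is no larger than σ.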

By Property 3b of Expectation, we know that

$r_i = \frac{e_i}{\sigma\sqrt{1-h_{ii}}} \sim N(0, 1)$

The ri have the desired distribution, but they are still not independent. If, however, the hii are reasonably close to zero, then the ri can be considered to be approximately independent. As usual, MSE can be used as an estimate for σ² (and hence √MSE as an estimate for σ).

Definition 1: The studentized residuals are defined by

$t_i = \frac{e_i}{\sqrt{MS_E (1-h_{ii})}}$

Observation: If the εi have the same variance σ², then the studentized residuals have a Student’s t distribution, namely

$\frac{e_i}{\sqrt{MS_E (1-h_{ii})}} \sim T(n-k-1)$

where n = the number of elements in the sample and k = the number of independent variables.

We can now use the studentized residuals to test the various assumptions of the multiple regression model. In particular, we can use the various tests described in Testing for Normality and Symmetry, especially QQ plots, to test for normality, and we can use the tests found in Homogeneity of Variance to test whether the homogeneity of variance assumption is met.
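
Observation: As a rough illustration of the QQ plot check, the following Python sketch pairs the sorted residuals with standard normal quantiles and reports their correlation. It is not part of the Real Statistics Resource Pack. The function names qq_points and pearson, the plotting positions (i − 0.5)/n, and the sample values are all assumptions made for illustration only.

```python
from statistics import NormalDist, mean


def qq_points(residuals):
    """Pair each sorted residual with a standard normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Plotting positions (i - 0.5)/n are one common convention (assumed here).
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def pearson(pairs):
    """Pearson correlation between the two coordinates of a list of pairs."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Made-up studentized residuals, purely for illustration.
sample = [-1.4, -0.7, -0.3, 0.1, 0.2, 0.6, 0.9, 1.5]
points = qq_points(sample)
r = pearson(points)
print(f"QQ correlation: {r:.3f}")
```

Points lying close to a straight line (a correlation near 1) are consistent with normality, although with small samples this is only a rough guide.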

It should also be noted that if the linearity and homogeneity of variances assumptions are met, then a plot of the studentized residuals should show a random pattern. If this is not the case, then at least one of these assumptions is not being met. This approach works quite well when there is only one independent variable. With multiple independent variables, a plot of the residuals against each independent variable is necessary, and even then multi-dimensional issues may not be captured.

Example 1: Check the assumptions of regression analysis for the data in Example 1 of Method of Least Squares for Multiple Regression by using the studentized residuals.

We start by calculating the studentized residuals (see Figure 1).

Figure 1 – Hat matrix and studentized residuals for data in Example 1

First we calculate the hat matrix H (from the data in Figure 1 of Multiple Regression Analysis in Excel) by using the array formula

=MMULT(MMULT(E4:G14,E17:G19),TRANSPOSE(E4:G14))

where E4:G14 contains the design matrix X and E17:G19 contains the matrix (XᵀX)⁻¹, so that the formula computes H = X(XᵀX)⁻¹Xᵀ. Alternatively, H can be calculated using the supplemental function HAT(A4:B14). From H, the vector of studentized residuals is calculated by the array formula

=O4:O14/SQRT(O19*(1-INDEX(Q4:AA14,AB4:AB14,AB4:AB14)))

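where O4:O14 contains the column vector of raw residuals E and O19 contains MSRes (the residual mean square, i.e. MSE). Here Q4:AA14 holds the hat matrix H, and AB4:AB14 is assumed to hold the indices 1, …, 11, so that INDEX picks out the diagonal elements hii. See Example 2 in Matrix Operations for more information about extracting the diagonal elements from a square matrix.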
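
For readers working outside Excel, the same calculation can be sketched in plain Python. This is an illustration under stated assumptions, not the Real Statistics implementation. The helper names are hypothetical, the data are made up (they are not the data of Example 1), and an intercept column is added to the predictors to form the design matrix.

```python
import math


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[p][c]) < 1e-12:
            raise ValueError("X'X is singular")
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def studentized_residuals(x_rows, y):
    """x_rows: n rows of k predictor values (no intercept column); y: n responses."""
    n, k = len(x_rows), len(x_rows[0])
    X = [[1.0] + [float(v) for v in row] for row in x_rows]  # design matrix
    Xt = transpose(X)
    H = matmul(matmul(X, inverse(matmul(Xt, X))), Xt)        # H = X (X'X)^-1 X'
    y_hat = [sum(hij * yj for hij, yj in zip(row, y)) for row in H]
    e = [yi - fi for yi, fi in zip(y, y_hat)]                # raw residuals
    mse = sum(ei * ei for ei in e) / (n - k - 1)             # residual mean square
    h = [H[i][i] for i in range(n)]                          # diagonal of H
    t = [ei / math.sqrt(mse * (1 - hi)) for ei, hi in zip(e, h)]
    return h, e, t


# Made-up data with two independent variables, for illustration only.
x_rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 12.0]
h, e, t = studentized_residuals(x_rows, y)
for hi, ei, ti in zip(h, e, t):
    print(f"h={hi:.3f}  e={ei:+.3f}  t={ti:+.3f}")
```

Two quick sanity checks on any such calculation are that the hii sum to k + 1 (the trace of H) and that the raw residuals sum to zero when the model includes an intercept.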

We now plot the studentized residuals against the predicted values of y (in cells M4:M14 of Figure 2).

Figure 2 – Studentized residual plot for Example 1

The values are reasonably spread out, although there does seem to be a pattern of rising values on the right; with such a small sample it is difficult to tell. Some of this might also be due to the presence of outliers (see Outliers and Influencers).

Real Statistics Excel Function: The Real Statistics Resource Pack provides the following supplemental array function where R1 is an n × k array containing X sample data and R2 is an n × 1 array containing Y sample data.

RegStudE(R1, R2) =  n × 1 vector of studentized residuals

Thus, the values in the range AC4:AC14 of Figure 1 can be generated via the array formula =RegStudE(A4:B14,C4:C14), again referring to the data in Figure 1 of Multiple Regression Analysis in Excel.

9 Responses to Residuals

  1. Alhanouf says:

    Can you please explain to me what is the Raw Residual ?
    Thanks in advance !

  2. Tara says:

    Hi, Charles. Thank you for your website. Above, you state: “First we calculate the hat matrix H (from the data in Figure 3 of Method of Least Squares for Multiple Regression) by using the array formula…”. There is no Figure 3 of Method of Least Squares for Multiple Regression.

    Regards,
    Tara

    • Charles says:

      Tara,
      Thanks for catching this mistake. The reference should be to Figure 1 of Multiple Regression Analysis in Excel. I have now corrected the referenced webpage.
      Charles

  3. Declan says:

    Hi Charles,

    Thanks, much appreciated.

    Kind regards
    Declan

  4. Declan says:

    Hi there

    The SUMXMY2 formula is great for Durbin Watson test.

    Kind regards
    Declan

  5. Rich says:

    Perhaps include PRESS and Durbin-Watson statistics in a future release?
    Thanks, Rich
