In Multiple Regression Analysis, we noted that the assumptions for the regression model can be expressed in terms of the error random variables as follows:
- Linearity: The ε_i have a mean of 0
- Independence: The ε_i are independent
- Normality: The ε_i are normally distributed
- Homogeneity of variances: The ε_i have the same variance σ²
If these assumptions are satisfied, then the random errors ε_i can be regarded as a random sample from an N(0, σ) distribution. It is natural, therefore, to test the assumptions for the regression model by investigating the sample observations of the residuals e_i = y_i − ŷ_i.
It turns out that the raw residuals e_i have the distribution

e_i ∼ N(0, σ√(1 − h_ii))

where the h_ii are the terms on the diagonal of the hat matrix defined in Definition 3 of Method of Least Squares for Multiple Regression. It also turns out that the raw residuals are not independent.
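The key properties of the hat matrix can be verified numerically. The following is a minimal numpy sketch using made-up data (not the Example 1 worksheet): it builds H = X(X′X)⁻¹X′ and checks that H is symmetric and idempotent, that its diagonal entries (the leverages h_ii) lie in [0, 1], and that its trace equals the number of columns of the design matrix.

```python
import numpy as np

# Made-up data: 6 observations, 2 independent variables
rng = np.random.default_rng(0)
X0 = rng.normal(size=(6, 2))
X = np.column_stack([np.ones(6), X0])   # design matrix with intercept column

# Hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                          # leverages h_ii

# H is symmetric and idempotent, its diagonal lies in [0, 1],
# and its trace equals the number of columns of X (here k + 1 = 3)
print(np.allclose(H, H.T))         # True
print(np.allclose(H @ H, H))       # True
print(h.min() >= 0, h.max() <= 1)  # True True
print(round(np.trace(H), 6))       # 3.0
```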
By Property 3b of Expectation, we know that the standardized residuals

r_i = e_i / (σ√(1 − h_ii)) ∼ N(0, 1)

The r_i have the desired distribution, but they are still not independent. If, however, the h_ii are reasonably close to zero, then the r_i can be regarded as approximately independent. As usual, MSE can be used as an estimate for σ², and so √MSE as an estimate for σ.
Definition 1: The studentized residuals are defined by

t_i = e_i / √(MSE (1 − h_ii))
Observation: If the ε_i have the same variance σ², then the studentized residuals have a Student's t distribution, namely

t_i ∼ T(n − k − 2)

where n = the number of elements in the sample and k = the number of independent variables.
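The definition above can be sketched in numpy. This example uses made-up data (not the Example 1 worksheet): it fits a least-squares model, forms the raw residuals e = y − Xβ̂, estimates MSE with n − k − 1 degrees of freedom, and then computes the studentized residuals t_i = e_i / √(MSE·(1 − h_ii)).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 12, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Least-squares fit and raw residuals e = y - X beta_hat
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat

# MSE = SSE / (n - k - 1)
mse = e @ e / (n - k - 1)

# Leverages h_ii and studentized residuals t_i = e_i / sqrt(MSE (1 - h_ii))
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
t = e / np.sqrt(mse * (1 - h))
print(np.round(t, 3))
```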
We can now use the studentized residuals to test the various assumptions of the multiple regression model. In particular, we can use the various tests described in Testing for Normality and Symmetry, especially QQ plots, to test for normality, and we can use the tests found in Homogeneity of Variance to test whether the homogeneity of variance assumption is met.
It should also be noted that if the linearity and homogeneity of variances assumptions are met, then a plot of the studentized residuals against the predicted values should show a random pattern. If this is not the case, then one of these assumptions is being violated. This approach works quite well when there is only one independent variable. With multiple independent variables, the residuals must be plotted against each independent variable separately, and even then multi-dimensional effects may not be captured.
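To illustrate how a non-random residual pattern flags a violated linearity assumption, this sketch (entirely made-up data) fits a straight line to clearly quadratic data. The studentized residuals then trace a systematic U-shape in x rather than a random scatter, which shows up as a strong correlation between t_i and x_i².

```python
import numpy as np

n = 20
x = np.linspace(-2, 2, n)
y = 1 + x + x**2                       # truly quadratic relationship

X = np.column_stack([np.ones(n), x])   # misspecified straight-line model
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)
mse = e @ e / (n - 2)
t = e / np.sqrt(mse * (1 - h))

# A random pattern would give near-zero correlation between t_i and x_i^2;
# the U-shape here produces a correlation close to 1.
print(np.corrcoef(t, x**2)[0, 1])
```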
Example 1: Check the assumptions of regression analysis for the data in Example 1 of Method of Least Squares for Multiple Regression by using the studentized residuals.
We start by calculating the studentized residuals (see Figure 1).
Figure 1 – Hat matrix and studentized residuals for data in Example 1
First we calculate the hat matrix H (from the data in Figure 1 of Multiple Regression Analysis in Excel) by using the array formula

=MMULT(E4:G14,MMULT(MINVERSE(MMULT(TRANSPOSE(E4:G14),E4:G14)),TRANSPOSE(E4:G14)))

where E4:G14 contains the design matrix X. Alternatively, H can be calculated using the supplemental function HAT(A4:B14). From H, the vector of studentized residuals is calculated by an array formula of the form

=O4:O14/SQRT(O19*(1−D))

where O4:O14 contains the matrix of raw residuals E, O19 contains MSRes, and D is a placeholder for the column range holding the diagonal elements h_ii of H. See Example 2 in Matrix Operations for more information about extracting the diagonal elements from a square matrix.
We now plot the studentized residuals against the predicted values of y (in cells M4:M14 of Figure 2).
Figure 2 – Studentized residual plot for Example 1
The values are reasonably spread out, although there does seem to be a pattern of rising values on the right; with such a small sample it is difficult to tell. Some of this might also be due to the presence of outliers (see Outliers and Influencers).
Real Statistics Excel Function: The Real Statistics Resource Pack provides the following supplemental array function where R1 is an n × k array containing X sample data and R2 is an n × 1 array containing Y sample data.
RegStudE(R1, R2) = n × 1 vector of studentized residuals
Thus, the values in the range AC4:AC14 of Figure 1 can be generated via the array formula RegStudE(A4:B14,C4:C14), again referring to the data in Figure 3 of Method of Least Squares for Multiple Regression.