Definition 1: If y is a dependent variable and x1, …, xk are independent variables, then the general multiple regression model provides a prediction of y from the xi of the form
where β0 + β1x1 + … + βkxk is the deterministic portion of the model and ε is the random error. We further assume that for any given values of the xi the random error ε is normally and independently distributed with mean zero.
Observation: In practice we will build the multiple regression model from the sample data using the least squares method. Thus we seek coefficients bj such that
For the data in our sample we will have
where ŷi is the y value predicted by the model for the sample data xi1, …, xik. Thus the ith error term for the model is given by
As we will see shortly, all the results presented in Linear Regression regarding linear regression extend to multiple regression. First we reformulate the multiple linear regression model using matrix notation.
Definition 2: Let X, Y and B be as defined as in Definition 3 of Method of Least Squares for Multiple Regression and let E be an n × 1 column vector with entries e1, …, en. Then the regression model can be expressed as
Observation: B can be expressed as in Property 1 of Method of Least Squares for Multiple Regression, namely
Definition 3: We use the same terminology as in Definition 3 of Regression Analysis, except that the degrees of freedom dfRes and dfReg are modified to account for the number k of dependent variables.
Proof: The proof is the same as for Property 1 of Regression Analysis.
Property 2: Where R is the multiple correlation coefficient (defined in Definition 1 of Multiple Correlation)
Proof: These properties are the multiple regression counterparts to Property 2, 3 and 5f of Regression Analysis, respectively, and their proofs are similar.
Observation: From Property 2 and the second assertion of Property 3,
which is the multivariate version of Property 1 of Basic Concepts of Correlation.
Observation: The multiple regression model is based on the following assumptions:
- Linearity: The mean E[y] of the independent variable y can be expressed as a linear combination of the dependent variables x1, …, xk.
- Independence: Observations yi are selected independently and randomly from the population
- Normality: Observations yi are normally distributed
- Homogeneity of variances: Observations yi have the same variance
These assumptions can be expressed in terms of the error random variables :
- Linearity: The εi have a mean of 0
- Independence: The εi are independent
- Normality: The εi are normally distributed
- Homogeneity of variances: The εi have the same variance σ2
These requirements are the same as for the simple linear regression model described in Regression Analysis. The main difference is that instead of requiring that the variables (or the error term based on these variables) have a bivariate normal distribution, we now require that they have a multivariate normal distribution (i.e. normality in k + 1 dimensions). See Multivariate Normal Distribution for more details. Also see Residuals for further discussion about residuals.
Observation: The table in Figure 1 summarizes the minimum sample size and value of R2 that is necessary for a significant fit for the regression model (with a power of at least 0.80) based on the given number of independent variables and value of α.
E.g. with 5 independent variables and α = .05, a sample of 50 is sufficient to detect values of R2 ≥ 0.23.
With too small a sample, the model may overfit the data, meaning that it fits the sample data well, but does not generalize to the entire population.
Excel Functions TREND and LINEST and Regression data analysis tool: The functions SLOPE, INTERCEPT, STEYX and FORECAST don’t work for multiple regression, but the functions TREND and LINEST do support multiple regression as does the Regression data analysis tool.
TREND works exactly as described in Method of Least Squares, except that the second parameter R2 will now contain data for all the independent variables.
LINEST works just as in the simple linear regression case, except that instead of using a 5 × 2 region for the output a 5 × k region is required where k = the number of independent variables + 1. Thus for a model with 3 independent variables you need to highlight a blank 5 × 4 region. As before, you need to manually add the appropriate labels for clarity.
The Regression data analysis tool works exactly as in the simple linear regression case, except that additional charts are produced for each of the independent variables.
Example 1: We revisit Example 1 of Multiple Correlation, analyzing the model in which the poverty rate can be estimated as a linear combination of the infant mortality rate, the percentage of the population that is white and the violent crime rate (per 100,000 people).
We need to find the parameters b0, b1 and such that
Poverty (predicted) = b0 + b1 ∙ Infant + b2 ∙ White + b3 ∙ Crime.
We illustrate how to use TREND and LINEST in Figure 2.
Here we show the data for the first 15 of 50 states (columns A through E) and the percentage of poverty forecasted when infant mortality, percentage of whites in the population and crime rate are as indicated (range G6:J8). Highlighting the range J6:J8, we enter the array formula =TREND(B4:B53,C4:E53,G6:I8). As we can see from Figure 2, the model predicts a poverty rate of 12.87% when infant mortality is 7.0, whites make up 80% of the population and violent crime is 400 per 100,000 people.
Figure 2 also shows the output from LINEST after we highlight the shaded range H13:K17 and enter =LINEST(B4:B53,C4:E53,TRUE,TRUE). The column headings b1, b2, b3 and intercept refer to the first two rows only (note the order of the coefficients). The remaining three rows have two values each, labeled on the left and the right.
Thus, we see that the regression line is
Poverty = 0.437 + 1.279 ∙ Infant Mortality + .0363 ∙ White + 0.00142 ∙ Crime
Here Poverty represents the predicted value. We also see that R Square is .337 (i.e. 33.7% of the variance in the poverty rate is explained by the model), the standard error of the estimate is 2.47, etc.
We can also use the Regression data analysis tool to produce the output in Figure 3.
Since the p-value = 0.00026 < .05 = α, we conclude that the regression model is a significantly good fit; i.e. there is only a 0.026% possibility of getting a correlation this high (.58) assuming that the null hypothesis is true.
Note that the p-values for all the coefficients with the exception of the coefficient for infant mortality are bigger than .05. This means that we cannot reject the hypothesis that they are zero (and so can be eliminated from the model). This is also confirmed from the fact that 0 lies in the interval between the lower 95% and upper 95% (i.e. the 95% confidence interval) for each of these coefficients.
If we rerun the Regression data analysis tool only using the infant mortality variable we get the results shown in Figure 4.
Once again we see that the model Poverty = 4.27 + 1.23 ∙ Infant Mortality is a good fit for the data (p-value = 1.96E-05 < .05). We also see that both coefficients are significant. Most importantly we see that R Square is 31.9%, which is not much smaller than the R Square value of 33.7% that we obtained from the larger model (in Figure 3). All of this indicates that the White and Crime variables are not contributing much to the model and can be dropped.
See Testing the Significance of Extra Variables on the Regression Model for more information about how to test whether independent variables can be eliminated from the model.
Click here to see an alternative way for determining whether the regression model is a good fit.
Example 2: Determine whether the regression model for the data in Example 1 of Method of Least Squares for Multiple Regression is a good fit using the Regression data analysis tool.
The results of the analysis are displayed in Figure 5.
Since the p-value = 0.00497 < .05, we reject the null hypothesis and conclude that the regression model of Price = 1.75 + 4.90 ∙ Color + 3.76 ∙ Quality is a good fit for the data. Note that all the coefficients are significant. That R square = .85 indicates that a good deal of the variability of price is captured by the model.
Observation: We can calculate all the entries in the Regression data analysis in Figure 5 using Excel formulas as follows:
- Multiple R – SQRT(F7) or calculate from Definition 1 of Multiple Correlation
- R Square = G14/G16
- Adjusted R Square – calculate from R Square using Definition 2 of Multiple Correlation
- Standard Error = SQRT(H14)
- Observations = COUNT(A4:A14)
- SST = DEVSQ(C4:C14)
- SSReg = DEVSQ(M4:M14) from Figure 3 of Method of Least Squares for Multiple Regression
- SSRes = G16-G14
- All the other entries can be calculated in a manner similar to how we calculated the ANOVA values for Example 18.3 (see Figure 18.3).
Coefficients (in third table) – we show how to calculate the intercept fields; the color and quality fields are similar
- The coefficient and standard error can be calculated as in Figure 3 of Method of Least Squares for Multiple Regression
- t Stat = F19/G19
- P-value = TDIST(ABS(H19),F15,2)
- Lower 95% = F19-TINV(0.05,F15)*G19
- Upper 95% = F19+TINV(0.05,F15)*G19
The remaining output from the Regression data analysis is shown in Figure 6.
Observations 1 through 11 correspond to the raw data in A4:C14 (from Figure 5). In particular, the entries for Observation 1 can be calculated as follows:
- Predicted Price =F19+A4*F20+B4*F21 (from Figure 5)
- Residuals =C4-F26
- Std Residuals =G26/STDEV(G26:G36)
- Percentile: cell J26 contains the formula =100/(2*E36), cell J27 contains the formula =J26+100/E36 (and similarly for cells J28 through J36)
- Price: these are simply the price values in the range C4:C14 (from Figure 5) in sorted order. E.g. the supplemental array formula =QSORT(C4:C14) can be placed in range K26:K36.
Finally, the data analysis tool produces the following scatter diagrams.
Normal Probability Plot
- This plots the Percentile vs. Price from the table output in Figure 6. This plot is used to determine whether the data fits a normal distribution. It can be helpful to add the trend line to see whether the data fits a straight line. This is done by clicking on the plot and selecting Layout > Analysis|Trendline and choosing Linear Trendline.
- It plays the same role as the QQ plot. In fact except for the scale it generates the same plot as the QQ plot generated by the supplemental data analysis tool (switching the axes).
The plot in Figure 7 shows that the data is a reasonable fit with the normal assumption.
- One plot is generated for each independent variable. For Example 2, two plots are generated: Color vs. Residuals and Quality vs. Residuals.
- These plots are used to determine whether the data fits the linearity and homogeneity of variance assumptions. For the homogeneity of variance assumption to be met each plot should show a random pattern of points. If the a definitive shape of dots emerges or if the vertical spread of points is not constant over similar length horizontal intervals, then this indicates that the homogeneity of variances assumption is violated.
- For the linearity assumption to be met the residuals should have a mean of 0, which is indicated by an approximately equal spread of dots above and below the x-axis.
The Color Residual plot in Figure 8 shows a reasonable fit with the linearity and homogeneity of variance assumptions. The Quality Residual plot is a little less definitive, but for so few sample points it is not a bad fit.
The two plots in Figure 9 show clear problems. Fortunately, these are not based on the data in Example 2.
For the chart on the left of Figure 9 the vertical spread of dots on the right side of the chart is larger than on the left. This is a clear indication that the variances are not homogeneous. For the chart on the right the dots don’t seem to be random and also few of the points are below the x-axis (which indicates a violation of linearity). The chart in Figure 10 is ideally what we are looking for: a random spread of dots, with an equal number above and below the x-axis.
Figure 10 – Residuals showing fit for linearity and variance assumptions
Line Fit Plots
- One plot is generated for each independent variable. For Example 2, two plots are generated: one for Color and one for Quality. For each chart the observed y values (Price) and predicted y values are plotted against the observed values of the independent variable.
Observation: The Excel Regression data analysis tool is limited to 16 independent variables. The LINEST function as well as the various Real Statistics functions and the Real Statistics data analysis tool described below have a higher limit, namely 64 independent variables.
Real Statistics Excel Functions: The Real Statistics Resource Pack supplies the following supplemental functions. Here R1 is an n × k array containing the X sample data and R2 is an n × 1 array containing the Y sample data.
A second R2 parameter can be used with each of the df functions above, although this parameter is not used. Similarly you can use SSRegTot(R1, R2) and its value will be equivalent to SSRegTot(R2).
There is also a second form of the RSquare function in which RSquare(R1, k) = R2 where the X data consist of all the columns in R1 except the kth column and the Y data consist of the kth column of R1.
Observation: As we observed previously
It turns out that R2 = RSquare(R1, k) can also be calculated by first finding the correlation matrix for R1, then taking its inverse and then selecting the kth element on the diagonal. If this has the value c then the desired value of R2 is the square root of 1–1/c.
Thus R2 can also be calculated by the formula:
Real Statistics Data Analysis Tool: The Real Statistics Resource Pack provides a data analysis tool whose output is similar to that of the Excel Regression data analysis tool shown in Figure 5, except that, as usual, the output will contain formulas instead of values. In this way if you change any of the input data, the output will automatically be updated without having to rerun the data analysis tool.
To use the tool for Example 2 you perform the following steps: Enter Ctrl-m and then select Linear Regression from the menu. A dialog box will then appear as in Figure 12.
Figure 12 – Dialog box for Linear Regression Data Analysis Tool
Now enter A3:B14 for Input Range X and C3:C14 for Input range Y, click on Column Headings included with data, retain the value .05 for Alpha, select the Regression Analysis option and click on OK. The output appears in Figure 13.