Multiple Regression Analysis

Definition 1: If y is a dependent variable and x1, …, xk are independent variables, then the general multiple regression model provides a prediction of y from the xi of the form

y = β0 + β1x1 + … + βkxk + ε

where β0 + β1x1 + … + βkxk is the deterministic portion of the model and ε is the random error. We further assume that for any given values of the xi the random error ε is normally and independently distributed with mean zero.

Observation: In practice we will build the multiple regression model from the sample data using the least squares method. Thus we seek coefficients bj such that

ŷ = b0 + b1x1 + … + bkxk

For the data in our sample we will have

ŷi = b0 + b1xi1 + … + bkxik

where ŷi is the y value predicted by the model for the sample data xi1, …, xik. Thus the ith error term for the model is given by

ei = yi − ŷi

As we will see shortly, all the results presented in Linear Regression extend to multiple regression. First we reformulate the multiple linear regression model using matrix notation.

Definition 2: Let X, Y and B be defined as in Definition 3 of Method of Least Squares for Multiple Regression and let E be an n × 1 column vector with entries e1, …, en. Then the regression model can be expressed as

Y = XB + E

Observation: B can be expressed as in Property 1 of Method of Least Squares for Multiple Regression, namely

B = (XᵀX)⁻¹XᵀY
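This formula is easy to check numerically. The following is a minimal sketch in Python/NumPy with made-up data (all names are ours, not part of Excel or Real Statistics); in practice np.linalg.lstsq is preferable to forming the inverse explicitly:

import numpy as np

# Hypothetical sample: n = 5 observations, k = 2 independent variables
X_data = np.array([[1.0, 2.0],
                   [2.0, 1.0],
                   [3.0, 4.0],
                   [4.0, 3.0],
                   [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 8.1, 10.8])

# Prepend a column of ones so that the intercept b0 is part of B
X = np.column_stack([np.ones(len(y)), X_data])

# B = (X'X)^-1 X'Y, the least squares solution
B = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically more stable alternative
B_alt, *_ = np.linalg.lstsq(X, y, rcond=None)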

Definition 3: We use the same terminology as in Definition 3 of Regression Analysis, except that the degrees of freedom dfRes and dfReg are modified to account for the number of independent variables.

dfReg = k        dfRes = n − k − 1        dfT = n − 1

where n = the sample size and k = the number of independent variables. As before, each mean square is the corresponding sum of squares divided by its degrees of freedom (e.g. MSRes = SSRes/dfRes).

Property 1:
SST = SSReg + SSRes        dfT = dfReg + dfRes

Proof: The proof is the same as for Property 1 of Regression Analysis.

Property 2: Where R is the multiple correlation coefficient (defined in Definition 1 of Multiple Correlation)

SSReg = R²·SST        SSRes = (1 − R²)·SST        se² = (1 − R²)·sy²·(n − 1)/(n − k − 1)

Proof: These properties are the multiple regression counterparts of Properties 2, 3 and 5f of Regression Analysis, respectively, and their proofs are similar.

Observation: From Property 1 and the second assertion of Property 2,

R² = SSReg/SST = 1 − SSRes/SST

which is the multivariate version of Property 1 of Basic Concepts of Correlation.
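Continuing the NumPy sketch above (X, B and y as before), these identities can be verified directly from the fitted values:

y_hat = X @ B                          # predicted values
SST = np.sum((y - y.mean())**2)        # total sum of squares
SSReg = np.sum((y_hat - y.mean())**2)  # regression sum of squares
SSRes = np.sum((y - y_hat)**2)         # residual sum of squares

assert np.isclose(SST, SSReg + SSRes)  # Property 1
R2 = SSReg / SST                       # R square, as in the observation above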

Observation: The multiple regression model is based on the following assumptions:

  • Linearity: The mean E[y] of the dependent variable y can be expressed as a linear combination of the independent variables x1, …, xk.
  • Independence: Observations yi are selected independently and randomly from the population
  • Normality: Observations yi are normally distributed
  • Homogeneity of variances: Observations yi have the same variance

These assumptions can be expressed in terms of the error random variables εi:

  • Linearity: The εi have a mean of 0
  • Independence: The εi are independent
  • Normality: The εi are normally distributed
  • Homogeneity of variances: The εi have the same variance σ2

These requirements are the same as for the simple linear regression model described in Regression Analysis. The main difference is that instead of requiring that the variables (or the error term based on these variables) have a bivariate normal distribution, we now require that they have a multivariate normal distribution (i.e. normality in k + 1 dimensions). See Multivariate Normal Distribution for more details. Also see Residuals for further discussion about residuals.

Observation: The table in Figure 1 summarizes the minimum sample size and value of R2 necessary for a significant fit of the regression model (with a power of at least 0.80), based on the given number of independent variables and value of α.


Figure 1 – Minimum sample size needed for regression model

E.g. with 5 independent variables and α = .05, a sample of 50 is sufficient to detect values of R2 ≥ 0.23.

With too small a sample, the model may overfit the data, meaning that it fits the sample data well, but does not generalize to the entire population.
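The power calculation behind a table such as Figure 1 can be approximated with the noncentral F distribution. A minimal Python sketch (assuming Cohen's conventions f² = R²/(1 − R²) and noncentrality λ = f²·n; the function name is ours) checks the example above:

from scipy.stats import f, ncf

def regression_power(R2, n, k, alpha=0.05):
    # Approximate power of the overall F test for a multiple
    # regression with k independent variables and n observations
    f2 = R2 / (1 - R2)                  # Cohen's effect size f^2
    nc = f2 * n                         # noncentrality parameter
    dfn, dfd = k, n - k - 1
    f_crit = f.ppf(1 - alpha, dfn, dfd)
    return ncf.sf(f_crit, dfn, dfd, nc)

print(regression_power(R2=0.23, n=50, k=5))  # at least the 0.80 target, per Figure 1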

Excel Functions TREND and LINEST and Regression data analysis tool: The functions SLOPE, INTERCEPT, STEYX and FORECAST don’t work for multiple regression, but the functions TREND and LINEST do support multiple regression as does the Regression data analysis tool.

TREND works exactly as described in Method of Least Squares, except that the second parameter (the range containing the x data) will now contain the data for all the independent variables.

LINEST works just as in the simple linear regression case, except that instead of using a 5 × 2 region for the output, a 5 × (k + 1) region is required, where k = the number of independent variables. Thus for a model with 3 independent variables you need to highlight a blank 5 × 4 region. As before, you need to manually add the appropriate labels for clarity.

The Regression data analysis tool works exactly as in the simple linear regression case, except that additional charts are produced for each of the independent variables.

Example 1: We revisit Example 1 of Multiple Correlation, analyzing the model in which the poverty rate can be estimated as a linear combination of the infant mortality rate, the percentage of the population that is white and the violent crime rate (per 100,000 people).

We need to find the parameters b0, b1, b2 and b3 such that

Poverty (predicted) = b0 + b1 ∙ Infant + b2 ∙ White + b3 ∙ Crime.

We illustrate how to use TREND and LINEST in Figure 2.


Figure 2 – TREND and LINEST for data in Example 1

Here we show the data for the first 15 of 50 states (columns A through E) and the percentage of poverty forecasted when infant mortality, percentage of whites in the population and crime rate are as indicated (range G6:J8). Highlighting the range J6:J8, we enter the array formula =TREND(B4:B53,C4:E53,G6:I8). As we can see from Figure 2, the model predicts a poverty rate of 12.87% when infant mortality is 7.0, whites make up 80% of the population and violent crime is 400 per 100,000 people.

Figure 2 also shows the output from LINEST after we highlight the shaded range H13:K17 and enter =LINEST(B4:B53,C4:E53,TRUE,TRUE). The column headings b1, b2, b3 and intercept refer to the first two rows only (note that LINEST outputs the coefficients in reverse order, with the intercept last). The remaining three rows have two values each, labeled on the left and the right.

Thus, we see that the regression line is

Poverty = 0.437 + 1.279 ∙ Infant Mortality + .0363 ∙ White + 0.00142 ∙ Crime

Here Poverty represents the predicted value. We also see that R Square is .337 (i.e. 33.7% of the variance in the poverty rate is explained by the model), the standard error of the estimate is 2.47, etc.
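As a check, substituting Infant Mortality = 7.0, White = 80 and Crime = 400 into this equation gives 0.437 + 1.279(7.0) + 0.0363(80) + 0.00142(400) = 0.437 + 8.953 + 2.904 + 0.568 ≈ 12.86, which, up to the rounding of the coefficients, matches the TREND prediction of 12.87 in Figure 2.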

We can also use the Regression data analysis tool to produce the output in Figure 3.


Figure 3 – Output from Regression data analysis tool

Since the p-value = 0.00026 < .05 = α, we conclude that the regression model is a significantly good fit; i.e. there is only a 0.026% possibility of getting a correlation this high (.58) assuming that the null hypothesis is true.

Note that the p-values for all the coefficients, with the exception of the coefficient for infant mortality, are greater than .05. This means that we cannot reject the hypothesis that these coefficients are zero (and so the corresponding variables can be eliminated from the model). This is also confirmed by the fact that 0 lies in the interval between the Lower 95% and Upper 95% values (i.e. the 95% confidence interval) for each of these coefficients.

If we rerun the Regression data analysis tool only using the infant mortality variable we get the results shown in Figure 4.


Figure 4 – Reduced regression model for Example 1

Once again we see that the model Poverty = 4.27 + 1.23 ∙ Infant Mortality is a good fit for the data (p-value = 1.96E-05 < .05). We also see that both coefficients are significant. Most importantly we see that R Square is 31.9%, which is not much smaller than the R Square value of 33.7% that we obtained from the larger model (in Figure 3). All of this indicates that the White and Crime variables are not contributing much to the model and can be dropped.

See Testing the Significance of Extra Variables on the Regression Model for more information about how to test whether independent variables can be eliminated from the model.

Click here to see an alternative way for determining whether the regression model is a good fit.

Example 2: Determine whether the regression model for the data in Example 1 of Method of Least Squares for Multiple Regression is a good fit using the Regression data analysis tool.

The results of the analysis are displayed in Figure 5.


Figure 5 – Output from Regression data analysis tool for Example 2

Since the p-value = 0.00497 < .05, we reject the null hypothesis and conclude that the regression model Price = 1.75 + 4.90 ∙ Color + 3.76 ∙ Quality is a good fit for the data. Note that all the coefficients are significant. The value R Square = .85 indicates that a good deal of the variability of price is captured by the model.

Observation: We can calculate all the entries in the Regression data analysis in Figure 5 using Excel formulas as follows:

Regression Statistics

  • Multiple R – SQRT(F7) or calculate from Definition 1 of Multiple Correlation
  • R Square = G14/G16
  • Adjusted R Square – calculate from R Square using Definition 2 of Multiple Correlation
  • Standard Error = SQRT(H14)
  • Observations = COUNT(A4:A14)
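These statistics can also be computed directly from the sums of squares. A sketch continuing the earlier NumPy example (n = number of observations, k = number of independent variables):

n, k = X.shape[0], X.shape[1] - 1               # X includes the column of ones
multiple_R = np.sqrt(R2)                        # Multiple R
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)   # Adjusted R Square
std_error = np.sqrt(SSRes / (n - k - 1))        # Standard Error = sqrt(MSRes)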

ANOVA

  • SST = DEVSQ(C4:C14)
  • SSReg = DEVSQ(M4:M14) from Figure 3 of Method of Least Squares for Multiple Regression
  • SSRes = G16-G14
  • All the other entries can be calculated in a manner similar to how we calculated the ANOVA values for Example 18.3 (see Figure 18.3); a code sketch follows this list.
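A minimal sketch of the ANOVA computations (SSReg and SSRes as in the earlier sketch; the function name is ours):

from scipy.stats import f

def anova_f_test(SSReg, SSRes, n, k):
    dfReg, dfRes = k, n - k - 1       # degrees of freedom, as in Definition 3
    MSReg = SSReg / dfReg
    MSRes = SSRes / dfRes
    F = MSReg / MSRes                 # F statistic for the overall fit
    p = f.sf(F, dfReg, dfRes)         # Significance F
    return F, p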

Coefficients (in third table) – we show how to calculate the intercept fields; the color and quality fields are similar. A code sketch follows the list.

  • The coefficient and standard error can be calculated as in Figure 3 of Method of Least Squares for Multiple Regression
  • t Stat = F19/G19
  • P-value = TDIST(ABS(H19),F15,2)
  • Lower 95% = F19-TINV(0.05,F15)*G19
  • Upper 95% = F19+TINV(0.05,F15)*G19
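A sketch of the same calculations with scipy (coef and se stand for a coefficient and its standard error; dfRes = n − k − 1; the function name is ours):

from scipy.stats import t

def coef_stats(coef, se, dfRes, alpha=0.05):
    t_stat = coef / se                         # t Stat
    p_value = 2 * t.sf(abs(t_stat), dfRes)     # two-tailed p-value, as TDIST(., df, 2)
    margin = t.ppf(1 - alpha / 2, dfRes) * se  # TINV(alpha, df) * standard error
    return t_stat, p_value, (coef - margin, coef + margin)  # incl. 95% confidence interval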

The remaining output from the Regression data analysis is shown in Figure 6.


Figure 6 – Residuals/percentile output from Regression data analysis

Residual Output

Observations 1 through 11 correspond to the raw data in A4:C14 (from Figure 5). In particular, the entries for Observation 1 can be calculated as follows (a code sketch appears after the list):

  • Predicted Price =F19+A4*F20+B4*F21 (from Figure 5)
  • Residuals =C4-F26
  • Std Residuals =G26/STDEV(G26:G36)
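A NumPy rendering of the same three columns (X, B and y as in the earlier sketches; ddof=1 gives the sample standard deviation, matching STDEV):

predicted = X @ B                                   # Predicted Price
residuals = y - predicted                           # Residuals
std_residuals = residuals / residuals.std(ddof=1)   # Std Residuals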

Probability Output

  • Percentile: cell J26 contains the formula =100/(2*E36), cell J27 contains the formula =J26+100/E36 (and similarly for cells J28 through J36)
  • Price: these are simply the price values in the range C4:C14 (from Figure 5) in sorted order. E.g. the supplemental array formula =QSORT(C4:C14) can be placed in range K26:K36.
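The percentile column thus follows the pattern 100(i − 0.5)/n for i = 1, …, n, pairing the sorted prices with equally spaced percentiles; a short sketch:

n = len(y)
percentiles = [100 * (i - 0.5) / n for i in range(1, n + 1)]  # 100/(2n), then steps of 100/n
prices_sorted = np.sort(y)                                    # what QSORT produces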

Finally, the data analysis tool produces the following scatter diagrams.

Normal Probability Plot

  • This plots the Percentile vs. Price from the table output in Figure 6. This plot is used to determine whether the data fits a normal distribution. It can be helpful to add the trend line to see whether the data fits a straight line. This is done by clicking on the plot and selecting Layout > Analysis|Trendline and choosing Linear Trendline.
  • It plays the same role as the QQ plot. In fact, except for the scale, it generates the same plot as the QQ plot generated by the supplemental data analysis tool (with the axes switched).

Figure 7 – Normal Probability Plot

The plot in Figure 7 shows that the data is a reasonable fit with the normal assumption.

Residual Plots

  • One plot is generated for each independent variable. For Example 2, two plots are generated: Color vs. Residuals and Quality vs. Residuals.
  • These plots are used to determine whether the data fit the linearity and homogeneity of variance assumptions. For the homogeneity of variance assumption to be met, each plot should show a random pattern of points. If a definitive shape of dots emerges, or if the vertical spread of points is not constant over similar-length horizontal intervals, then the homogeneity of variances assumption is violated.
  • For the linearity assumption to be met the residuals should have a mean of 0, which is indicated by an approximately equal spread of dots above and below the x-axis.

Figure 8 – Residual Plots

The Color Residual plot in Figure 8 shows a reasonable fit with the linearity and homogeneity of variance assumptions. The Quality Residual plot is a little less definitive, but for so few sample points it is not a bad fit.
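Such a residual plot is easy to reproduce outside Excel as well. A minimal matplotlib sketch (X_data and residuals as in the earlier sketches; the axis labels assume the Color variable of Example 2):

import matplotlib.pyplot as plt

# Residuals vs. one independent variable: a random, mean-zero band of
# points supports the linearity and homogeneity of variance assumptions
plt.scatter(X_data[:, 0], residuals)
plt.axhline(0, color="gray", linewidth=1)
plt.xlabel("Color")
plt.ylabel("Residuals")
plt.show()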

The two plots in Figure 9 show clear problems. Fortunately, these are not based on the data in Example 2.


Figure 9 – Residual Plots showing violation of assumptions

For the chart on the left of Figure 9 the vertical spread of dots on the right side of the chart is larger than on the left. This is a clear indication that the variances are not homogeneous. For the chart on the right the dots don't seem to be random, and few of the points are below the x-axis (which indicates a violation of linearity). The chart in Figure 10 is ideally what we are looking for: a random spread of dots, with an equal number above and below the x-axis.


Figure 10 – Residuals showing fit for linearity and variance assumptions

Line Fit Plots

  • One plot is generated for each independent variable. For Example 2, two plots are generated: one for Color and one for Quality. For each chart the observed y values (Price) and predicted y values are plotted against the observed values of the independent variable.

Figure 11 – Line fit plots for Example 2

Observation: The Excel Regression data analysis tool is limited to 16 independent variables. The LINEST function as well as the various Real Statistics functions and the Real Statistics data analysis tool described below have a higher limit, namely 64 independent variables.

Real Statistics Excel Functions: The Real Statistics Resource Pack supplies the following supplemental functions. Here R1 is an n × k array containing the X sample data and R2 is an n × 1 array containing the Y sample data.

[Table of Real Statistics regression functions, including the df functions, SSRegTot and RSquare referenced below]

A second R2 parameter can be used with each of the df functions above, although its value is ignored. Similarly, you can use SSRegTot(R1, R2), whose value is equivalent to SSRegTot(R2).

There is also a second form of the RSquare function in which RSquare(R1, k) = R2 where the X data consist of all the columns in R1 except the kth column and the Y data consist of the kth column of R1.

Observation: As we observed previously

R² = SSReg/SST = 1 − SSRes/SST

It turns out that R2 = RSquare(R1, k) can also be calculated by first finding the correlation matrix for R1, then taking its inverse and then selecting the kth element on the diagonal. If this element has the value c, then the desired value of R2 is 1 − 1/c.

Thus R2 can also be calculated by the formula:

=1-1/INDEX(DIAG(MINVERSE(CORR(R1))),k)
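A NumPy rendering of the same computation (R1 as an n × k array; k is 1-based, as above; the function name is ours):

import numpy as np

def rsquare_from_corr(R1, k):
    # R^2 for regressing the kth column of R1 on the remaining columns,
    # via the inverse of the correlation matrix
    C = np.corrcoef(R1, rowvar=False)    # correlation matrix (CORR)
    c = np.linalg.inv(C)[k - 1, k - 1]   # kth diagonal element (MINVERSE, DIAG, INDEX)
    return 1 - 1 / c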

Real Statistics Data Analysis Tool: The Real Statistics Resource Pack provides a data analysis tool whose output is similar to that of the Excel Regression data analysis tool shown in Figure 5, except that, as usual, the output will contain formulas instead of values. In this way if you change any of the input data, the output will automatically be updated without having to rerun the data analysis tool.

To use the tool for Example 2 you perform the following steps: Enter Ctrl-m and then select Linear Regression from the menu. A dialog box will then appear as in Figure 12.


Figure 12 – Dialog box for Linear Regression Data Analysis Tool

Now enter A3:B14 for Input Range X and C3:C14 for Input Range Y, click on Column Headings included with data, retain the value .05 for Alpha, select the Regression Analysis option and click on OK. The output appears in Figure 13.


Figure 13 – Real Statistics Linear Regression Data Analysis

9 Responses to Multiple Regression Analysis

  1. Neetu sharma says:

Very knowledgeable article.

  2. Michael says:

Very useful and practical. In particular, the section on coefficient analysis is substantive.

  3. Tapan kumar mahanta says:

Sir, I am unable to get the result by applying the formula =TREND(B4:B53,C4:E53,G6:I8) in Figure 2 of Multiple Regression Analysis.
    Please tell me how I can overcome the problem and get the same result that you got in Figure 2.

    • Charles says:

      TREND is an array formula. It is essential that you press Ctrl-Shift-Enter (and not just Enter) after inserting the formula. If this is not the problem, let me know.
      Charles

  4. Tapan kumar mahanta says:

I am pressing Ctrl-Shift-Enter, but the result is not coming; the final result appears as #VALUE!.
    For my problem Y = b0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + b6x6,
    the Y matrix is 22 × 1,
    the X matrix is (x1 x2 x3 x4 x5 x6),
    where the x1…x6 matrix is 22 × 6.
    Please tell me how I can solve this problem using multiple regression with the TREND and LINEST commands.

  5. umesh patil india says:

    Compressive Strength Tensile Strength % polymer added
    25.33 2.44 0
    28 2.69 1
    33.63 3.11 2
    34.52 3.3 3
    32.74 3.06 4
    28.89 2.78 5
    25.93 2.55 6
    24 2.12 7
    21.62 1.93 8
The third column values are independent and the remaining 2 columns are dependent. How do I do multiple regression? Please help.

    • Charles says:

      The usual multiple regression model would have two independent variables and one dependent variable. Since you have two dependent variables and one independent, you can create two separate regression models with one dependent and one independent variable. If you want a single regression model with two dependent variables then you probably want multivariate regression. R provides this capability.
      Charles

  6. just a guy says:

    According to this formula
    Poverty = 0.457 + .00142 ∙ Infant Mortality + .0363 ∙ White + 1.279 ∙ Crime

    for a certain case of Alabama where Infant Mortality =9, White= 71 and Crime= 448
Poverty should be 0.457 + 0.01278 + 2.577 + 573 ≈ 576

And not around 15.7 as it is in the table. The result is clearly off, and the contribution from Crime is significantly higher than the contribution from the first 2 factors. This would make sense if the first 2 factors did not correlate at all with Poverty, but I think this is not the case.

    I might be wrong but it appears to me that something is a bit off…

    • Charles says:

      You are correct. I made two typing mistakes when I copied the coefficients in Figure 3. The correct regression line should be:

      Poverty (predicted) = 0.437 + 1.279 ∙ Infant Mortality + .0363 ∙ White + 0.00142 ∙ Crime

      Thanks for catching this error.

      Charles
