In Example 1 of Multiple Regression Analysis we used 3 independent variables: Infant Mortality, White and Crime, and found that the regression model was a significant fit for the data. We also commented that the White and Crime variables could be eliminated from the model without significantly impacting the accuracy of the model. The following property can be used to test whether all of these variables add significantly to the model.

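**Property 1**:

$$F = \frac{(SS'_E - SS_E)/m}{MS_E} \sim F(m, df_E)$$

where *m* = number of independent variables being tested for elimination and $SS'_E$ is the value of $SS_E$ for the model without these variables.

E.g. suppose we consider the multiple regression model

$$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4 + b_5 x_5$$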

and want to determine whether $b_3$, $b_4$ and $b_5$ add significant benefit to the model (i.e. whether the **reduced** model $y = b_0 + b_1 x_1 + b_2 x_2$ is not significantly worse than the **complete** model). The null hypothesis $H_0$: $b_3 = b_4 = b_5 = 0$ is tested using the statistic *F* as described in Property 1 where *m* = 3 and $SS'_E$ references the reduced model, while $SS_E$, $MS_E$ and $df_E$ refer to the complete model.

**Example 1**: Determine whether the White and Crime variables can be eliminated from the regression model for Example 1 of Multiple Regression Analysis.

Figure 1 implements the test described in Property 1 (using the output in Figures 3 and 4 of Multiple Regression Analysis to determine the values of cells AD4, AD5, AD6, AE4 and AE5).

**Figure 1 – Determine if White and Crime can be eliminated**

Since p-value = .536 > .05 = *α*, we cannot reject the null hypothesis, and so we conclude that White and Crime do not add significantly to the model and can be eliminated.
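The test in Property 1 can be sketched in Python using only the standard library. This is a minimal sketch, assuming the statistic has the usual partial F form, i.e. the drop in error sum of squares per eliminated variable divided by the MS_E of the complete model. The function names (`partial_f_test`, `f_sf`, `betainc`) and the two sums of squares in the usage lines are hypothetical and are not the values shown in Figure 1; only *m* = 2 and df_E = 50 − 3 − 1 = 46 follow from Example 1.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for i in range(1, max_iter + 1):
        # Each pass applies the even and then the odd term of the fraction.
        terms = (
            i * (b - i) * x / ((a + 2 * i - 1) * (a + 2 * i)),
            -(a + i) * (a + b + i) * x / ((a + 2 * i) * (a + 2 * i + 1)),
        )
        for term in terms:
            d = 1.0 + term * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + term / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log1p(-x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def f_sf(f, df1, df2):
    """Upper-tail probability P(F > f) for an F(df1, df2) distribution."""
    if f <= 0:
        return 1.0
    return betainc(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f))


def partial_f_test(sse_reduced, sse_full, m, df_e):
    """Return (F, p-value) for dropping m variables from the complete model."""
    ms_e = sse_full / df_e                      # MS_E of the complete model
    f_stat = ((sse_reduced - sse_full) / m) / ms_e
    return f_stat, f_sf(f_stat, m, df_e)


# Hypothetical sums of squares; m = 2 variables dropped, df_E = 46.
f_stat, p_value = partial_f_test(sse_reduced=100.0, sse_full=97.0, m=2, df_e=46)
print(f"F = {f_stat:.4f}, p-value = {p_value:.4f}")
```

If the p-value exceeds the chosen *α*, the reduced model is retained, which is the same decision rule applied to Figure 1.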

**Observation**: An alternative way of determining whether certain independent variables are making a significant contribution to the regression model is to use the following property.

**Property 2**:

$$F = \frac{(R^2 - R'^2)/m}{(1 - R^2)/df_E} \sim F(m, df_E)$$

where $R^2$ and $df_E$ are the values for the full model, *m* = number of independent variables being tested for elimination and $R'^2$ is the value of $R^2$ for the model without these variables (i.e. the reduced model).

**Observation**: If we redo Example 1 using Property 2, once again we see that the White and Crime variables do not make a significant contribution (see Figure 2, which uses the output in Figures 3 and 4 of Multiple Regression Analysis to determine the values of cells AD14, AD15, AE14 and AE15).

**Figure 2 – Using R-square to decide whether to drop variables**
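The agreement between the two tests is not a coincidence. The short derivation below is a sketch of why they give the same F statistic. It assumes that Property 2 has the usual form in terms of R-square, writes $R'^2$ for the R-square of the reduced model and $SS_T$ for the total sum of squares (both symbols are notation introduced here), and uses $R^2 = 1 - SS_E/SS_T$ and $R'^2 = 1 - SS'_E/SS_T$.

$$
\begin{aligned}
F &= \frac{(R^2 - R'^2)/m}{(1 - R^2)/df_E}
   = \frac{\left(\dfrac{SS'_E - SS_E}{SS_T}\right)/m}{\left(\dfrac{SS_E}{SS_T}\right)/df_E} \\
  &= \frac{(SS'_E - SS_E)/m}{SS_E/df_E}
   = \frac{(SS'_E - SS_E)/m}{MS_E}
\end{aligned}
$$

Both models are fitted to the same dependent variable data, so they share the same $SS_T$, which cancels. This is why the R-square version returns the same p-value (.536 for Example 1) as the sum of squares version.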

**Observation**: When there are a large number of potential independent variables which can be used to model the dependent variable, the general approach is to use the smallest number of independent variables that accounts for a sufficiently large portion of the variance (as measured by $R^2$). Of course, you may prefer to include certain variables based on theoretical criteria rather than on purely statistical considerations.

If your only objective is to explain the greatest amount of variance with the fewest independent variables, generally the independent variable *x* with the largest correlation coefficient with the dependent variable y should be chosen first. Additional independent variables can then be added until the desired level of accuracy is achieved.

In particular, the **stepwise estimation method** is as follows:

- Select the independent variable $x_1$ which most highly correlates with the dependent variable y. This provides the simple regression model $y = b_0 + b_1 x_1$.
- Examine the partial correlation coefficients to find the independent variable $x_2$ that explains the largest significant portion of the unexplained (error) variance from among the remaining independent variables. This yields the regression equation $y = b_0 + b_1 x_1 + b_2 x_2$.
- Examine the partial *F* value for $x_1$ in the model to determine whether it still makes a significant contribution. If it does not, then eliminate this variable.
- Continue the procedure by examining all independent variables not in the model to determine whether one would make a significant addition to the current equation. If so, select the one that makes the highest contribution, generate a new regression model and then examine all the other independent variables in the model to determine whether they should be kept.
- Stop the procedure when no additional independent variable makes a significant contribution to the predictive accuracy. This occurs when all the remaining partial regression coefficients are non-significant.
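The steps above can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: instead of p-values it uses fixed partial F thresholds (`f_enter` and `f_remove`, hypothetical values chosen for the example), it picks the candidate with the largest partial F (which is also the candidate with the largest partial correlation in absolute value), and it rechecks only the weakest earlier variable after each addition. The helper names and the small data set are made up for illustration, and the code assumes that no model fits the data perfectly (otherwise MS_E would be zero).

```python
def solve(a, b):
    """Solve the linear system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def sse(columns, y):
    """SS_E of the least-squares fit of y on an intercept plus the given columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    coef = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(coef, r))) ** 2 for r, yi in zip(rows, y))


def stepwise(x, y, f_enter=4.0, f_remove=3.9):
    """x maps variable names to data columns; returns the names kept, in entry order."""
    n, selected = len(y), []
    for _ in range(2 * len(x)):          # hard cap so the loop always terminates
        current = sse([x[v] for v in selected], y)
        df_e = n - len(selected) - 2     # error df once one more variable is added
        best, best_f = None, f_enter
        for v in x:
            if v in selected:
                continue
            trial = sse([x[u] for u in selected + [v]], y)
            f = (current - trial) / (trial / df_e)
            if f > best_f:
                best, best_f = v, f
        if best is None:
            break                        # nothing left makes a significant addition
        selected.append(best)
        # Re-check the variables already in the model; drop the weakest if it fails.
        full = sse([x[v] for v in selected], y)
        df_e = n - len(selected) - 1
        partial = {
            v: (sse([x[u] for u in selected if u != v], y) - full) / (full / df_e)
            for v in selected[:-1]
        }
        if partial and min(partial.values()) < f_remove:
            selected.remove(min(partial, key=partial.get))
    return selected


# Made-up data: y depends on both x1 and x2, plus a small disturbance.
x = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, -1, 1, -1, -1, 1, -1, 1],
}
noise = [0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5]
y = [2 + 3 * a + 5 * b + e for a, b, e in zip(x["x1"], x["x2"], noise)]
print(stepwise(x, y))
```

Keeping `f_remove` slightly below `f_enter` is a common guard against a variable being added and removed repeatedly.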

From Property 2 of Multiple Correlation, we know that

Thus we are seeking the order $x_1, x_2, \ldots, x_k$ such that the leftmost terms on the right side of the equation above explain the most variance. In fact the goal is to choose an *m < k* such

**Observation**: We can use the following alternatives to this approach:

- Start with all independent variables and remove variables one at a time until there is a significant loss in accuracy.
- Look at all combinations of independent variables to see which ones generate the best model. For *k* independent variables there are $2^k$ such combinations.
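To make the count in the second alternative concrete, the following sketch enumerates every candidate model for the three independent variables of Example 1. The function name `all_subsets` is hypothetical; each subset would still need to be fitted and scored by whatever criterion you choose.

```python
from itertools import combinations


def all_subsets(names):
    """Yield every subset of the variable names, from the empty model to the full model."""
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            yield subset


names = ["Infant Mortality", "White", "Crime"]
subsets = list(all_subsets(names))
for subset in subsets:
    print(subset)
print(len(subsets), "candidate models")
```

Since the count doubles with each additional variable, this approach quickly becomes impractical as *k* grows.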

Since multiple significance tests are performed, when using the stepwise procedure it is better to have a larger sample and to employ more conservative thresholds when adding and deleting variables (e.g. *α* = .01). In fact, it is better not to use a mechanized approach and instead evaluate the significance of adding or deleting variables based on theoretical considerations.

Note that if two independent variables are highly correlated (multicollinearity) then if one of these is used in the model, it is highly unlikely that the other will enter the model. One should not conclude, however, that the second independent variable is inconsequential.

**Observation**: In the approaches considered thus far, we compare a complete model with a reduced model. We can also compare models using Akaike’s Information Criterion (AIC).

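**Definition 1**: For multiple linear regression models, **Akaike’s Information Criterion** (**AIC**) is defined by

$$AIC = n \ln\left(\frac{SS_E}{n}\right) + 2(k+2)$$

where *n* = the sample size and *k* = the number of independent variables.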

When *n* < 40(*k*+2) it is better to use the following modified version

$$AIC_c = AIC + \frac{2(k+2)(k+3)}{n-k-3}$$

Another such measure is the **Schwarz Bayesian Criterion** (**SBC**), which puts more weight on the sample size.

**Observation**: All things being equal it is better to choose a model with lower AIC, although given two models with similar AICs there is no test to determine whether the difference in AIC values is significant.
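The calculation can be sketched in Python as follows. This is a minimal sketch, assuming the least-squares form AIC = n ln(SS_E/n) + 2(k+2), where k counts only the independent variables and the extra 2 covers the intercept and variance parameters. The small-sample correction is the Burnham and Anderson form 2K(K+1)/(n−K−1) with K = k+2. The function names and the SS_E value in the usage lines are hypothetical.

```python
import math


def aic(sse, n, k):
    """AIC for a linear regression with k independent variables (assumed form)."""
    return n * math.log(sse / n) + 2 * (k + 2)


def aicc(sse, n, k):
    """Small-sample corrected AIC, with K = k + 2 estimated parameters."""
    big_k = k + 2   # independent variables + intercept + variance
    return aic(sse, n, k) + 2 * big_k * (big_k + 1) / (n - big_k - 1)


# Hypothetical SS_E; n = 50 and k = 3 match the dimensions of Example 1.
sse_full, n, k = 270.0, 50, 3
correction = aicc(sse_full, n, k) - aic(sse_full, n, k)
print(f"AIC = {aic(sse_full, n, k):.2f}, AICc = {aicc(sse_full, n, k):.2f}")
print(f"correction = {correction:.4f}")
```

For n = 50 and k = 3 the correction is 60/44 ≈ 1.36 regardless of SS_E, which is consistent with the gap between the RegAIC (94.26) and RegAICc (95.63) values reported for Example 1.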

**Example 2**: Determine whether the regression model for Example 1 with the White and Crime variables is better than the model without these variables.

**Figure 3 – Comparing the two models using AIC**

Since the AIC and SBC for the reduced model are lower than the AIC and SBC for the complete model, once again we see that the reduced model is a better choice.

**Observation**: AIC (or SBC) can be useful when deciding whether or not to use a transformation for one or more independent variables, since Property 1 and Property 2 can’t be used in this case (the transformed model is not obtained from the original model by dropping variables). AIC is calculated for each model, and all other things being equal the model with the lower AIC (or SBC) should be chosen.

**Observation**: Augmented versions of AIC and SBC, used in some texts, are as follows:

**Real Statistics Excel Functions**: The Real Statistics Resource Pack contains the following functions where R1 is an *n* × *k* array containing the X sample data and R2 is an *n* × 1 array containing the Y sample data.

**RegAIC**(R1, R2,, *aug*) = AIC for regression model for the data in R1 and R2

**RegAICc**(R1, R2,, *aug*) = AICc for regression model for the data in R1 and R2

**RegSBC**(R1, R2,, *aug*) = SBC for regression model for the data in R1 and R2

If *aug* = FALSE (default), the first versions of AIC, AICc and SBC are returned, while if *aug* = TRUE, then the augmented versions are returned.

We also have the following Real Statistics function where R1 is an *n* × *k* array containing the X sample data for the full model, R3 contains the X sample data for the reduced model and R2 is an *n* × 1 array containing the Y sample data.

**RSquareTest**(R1, R3, R2) = the p-value of the test defined by Property 2

Thus for the data in Example 1 (referring to Figure 2 of Multiple Regression Analysis), we have RegAIC(C4:E53,B4:B53) = 94.26, RegAICc(C4:E53,B4:B53) = 95.63 and RSquareTest(C4:E53,C4:C53,B4:B53) = .536.

**Observation**: This webpage focuses on whether some of the independent variables make a significant contribution to the accuracy of a regression model. The same approach can be used to determine whether interactions between variables, or the squares or higher orders of some variables, make a significant contribution.

**Observation**: You can also ask the question, which of the independent variables has the largest effect? There are two ways of addressing this issue.

- You standardize each of the independent variables (e.g. by using the STANDARDIZE function) before conducting the regression. In this case, the variable whose regression coefficient is highest (in absolute value) has the largest effect. If you don’t standardize the variables first, then the variable with the highest regression coefficient is not necessarily the one with the largest effect (since the units are different).
- You rerun the regression removing one independent variable from the model and record the value of R-square. If you have *k* independent variables you will run *k* reduced regression models. The model which has the smallest value of R-square corresponds to the variable which has the largest effect. This is because the removal of that variable reduces the fit of the model the most.
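The standardization step in the first approach can be sketched in Python as follows. This is a minimal sketch that assumes z-scores based on the sample mean and the sample standard deviation; the function name `standardize` and the income figures are hypothetical. It shows that the standardized values do not depend on the units in which a variable is recorded, which is what makes the resulting regression coefficients comparable.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - sample mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# The same hypothetical variable recorded in dollars and in thousands of dollars.
income_dollars = [31000, 42000, 38500, 55000, 47250]
income_thousands = [v / 1000 for v in income_dollars]

z_dollars = standardize(income_dollars)
z_thousands = standardize(income_thousands)
print([round(z, 3) for z in z_dollars])
print([round(z, 3) for z in z_thousands])
```

Both lists of z-scores are the same, so a regression on the standardized variable gives the same coefficient whichever unit was used for the raw data.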

Can you comment on the AIC correction that you use? In reviewing Burnham and Anderson 2002, they provide the correction as AICc = AIC + [(2K(K+1))/(n-K-1)]. I was looking for a reference to the correction you use here. Thanks.

Tim,

It all depends on what k is. I understood that in the formula that you are using, k includes all the model parameters. These are the variables, intercept and variance. In the formula I am using k = the number of independent variables, and so I need to add 2 for the intercept and variance parameters.

There is a discussion of these issues on the following website

http://stats.stackexchange.com/questions/69723/two-different-formulas-for-aicc

I believe that the formula I am using is consistent with that found on the website

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.506.1715&rep=rep1&type=pdf

Charles

i need to solve the question pls. help me

y – Production of beef in kg/ha;

x1 – Average quantity of potatoes (kg) intended for the animal feed within 1 day;

x2 – The farm size in hectares;

x3 – The average purchase price of beef in a given region (in zł/kg);

x4 – Number of employed persons on the farm.

| y | x1 | x2 | x3 | x4 |
|---|---|---|---|---|
| 1950 | 20 | 10 | 5 | 4 |
| 2200 | 24 | 13 | 5,4 | 4 |
| 2600 | 25 | 15 | 5,6 | 5 |
| 2900 | 33 | 20 | 5,2 | 6 |
| 3000 | 32 | 20 | 5,3 | 7 |
| 3750 | 38 | 25 | 5,8 | 7 |
| 4900 | 49 | 30 | 6 | 9 |
| 5100 | 50 | 35 | 5,2 | 9 |
| 5800 | 60 | 37 | 5,9 | 10 |

Based on data from nine farms build a linear econometric model. Perform appropriate statistical tests (F test and t-Student tests), remove an independent variable and recalculate the model if necessary

To build the multiple linear regression model see the Multiple Regression webpage. You can perform this analysis manually, using Excel or using the Real Statistics tools. To determine what happens when you remove one independent variable, see the referenced webpage.

Charles

excellent course. thanks