# Collinearity

From Definition 3 and Property 1 of Method of Least Squares for Multiple Regression, recall that

where

If XTX is singular, i.e. doesn’t have an inverse (see Matrix Operations), then B won’t be defined, and as a result Y-hat will not be defined. This occurs when one column in X is a non-trivial linear combination of some other columns in X, i.e. one independent variable is a non-trivial linear combination of the other independent variables. Even when XTX is almost singular, i.e. det(XTX) is close to zero, the values of B and Y-hat will be unstable; i.e. small changes in X may result in significant changes in B and Y-hat. Such a situation is called multicollinearity, or simply collinearity, and should be avoided.

E.g., in the following table, X1 is double X2. Thus XTX is singular. Excel detects this and creates a regression model equivalent to that obtained by simply eliminating column X2.

Fig 1 – Collinearity

Figure 1 – Collinearity

Observation: In the case where k = 2, the coefficient estimates produced by the least square process turn out to be

If x1 and x2 are correlated, i.e.

then the denominators of the coefficients are zero and so the coefficients are undefined.

Observation: Unfortunately, you can’t always count on one column being an exact linear combination of the others. Even when one column is almost a linear combination of the other columns, an unstable situation can result. We now define some metrics that help determine whether such a situation is likely.

Definition 1: Tolerance is 1 – $R_j^2$, where

i.e. the multiple coefficient between xj and all the other independent variables. The variance inflation factor (VIF) is the reciprocal of the tolerance.

Observation: Tolerance ranges from 0 to 1. We want a low value of VIF and a high value of tolerance. A tolerance value of less than 0.1 is a red alert, while values below 0.2 can be cause for concern.

Real Statistics Excel Functions: The Real Statistics Resource contains the following two functions:

TOLERANCE(R1, j) = Tolerance of the jth variable for the data in range R1; i.e. 1 – $R_j^2$

VIF(R1, j) = VIF of the jth variable for the data in range R1

Observation: TOLERANCE(R1, j) = 1–RSquare(R1, j)

Example 1: Check the Tolerance and VIF for the data displayed in Figure 1 of Multiple Correlation (i.e. the data for the first 12 states in Example 1 of Multiple Correlation).

The top part of Figure 2 shows the data for the first 12 states in Example 1. From these data we can calculate the Tolerance and VIF for each of the 8 independent variables.

Figure 2 – Tolerance and VIF

Figure 2 – Tolerance and VIF

For example, to calculate the Tolerance for Crime we need to run the Regression data analysis tool for the data in the range C4:J15 excluding the E column as the Input X vs. the data in the E column as Input Y. (To do this we first need to copy the data so that Input X consists of contiguous cells). We then see that Multiple R Square = .518, and so Tolerance = 1 – .518 = .482 and VIF = 1/.482 = 2.07.

Alternatively we can use the supplemental functions TOLERANCE(C4:J15,3) = .482 and VIF(C4:J15,3) = 2.07 (since Crime is the 3rd variable). The results are shown in the bottom part of Figure 1. Note that we should be concerned about the Traffic Deaths and University variables since their tolerance values are about .1.

### 5 Responses to Collinearity

1. Ryan says:

In layman’s terms can you explain the danger in multicollinearity when performing multiple linear regression for predictive modeling purposes. I understand the issues it brings with interpretation of the model (strange parameter coefficients and p-values) but with respect to making predictions, what else should I know?

• Charles says:

Ryan,

Essentially it means that one of the independent variables is not really necessary to the model because its effect/impact on the model is already captured by some of the other variables. This variable is not contributing anything extra to the predictions and can be removed. The danger is mathematical since it makes the model unstable in the sense that a small change in the values of this variable can have a big impact on the model.

You can think of it almost like 1/(a-b). If a and b are almost equal then the value of 1/(a-b) is very large; if a = b then its value is undefined (or infinity).

If you have true multicollinearity, then the “problem” variable will automatically be deleted by Excel. The real problem occurs when you don’t have exact multicollinearity (similar to the case where a = b), but close to multicollinearity (similar to the case where a is close to b). In this case, depending on the sophistication of the regression model, the “problem” variable won’t be eliminated, and the unstable situation described above can result.

Charles

2. Faseeh says:

Sir,

I tried entering VER() and that give me 3.6.2 in the cell. (That should be the version number). But still VIF is not been calculated.

My Data in present in range A1:K23 with Header Row(Row 1) and Header Column(Col A), and formula i am using is =VIF(B2:K23,2)

Can i attach a file to this message? I am into the final Thesis of MBA program and this add-in could be a real help for me. Thank you.

3. Faseeh says:

Sir,

I am trying to use your add-in for Excel 2007 to calculate the VIF for my data but is giving me error:

” Compiler error in hidden module ”

and results in value error. Please guide what i am doing wrong.

Thank you.
Faseeh

• Charles says:

Faseeh,
Try entering the formula =VER() in any blank cell in a worksheet. If you get an error, then the add-in was not installed properly. If you get the release number (e.g. 3.6.2) of the Real Statistics add-in then the cause is different and we will need to diagnose the problem in a different way.
Charles