Collinearity

From Definition 3 and Property 1 of Method of Least Squares for Multiple Regression, recall that

$$\hat{Y} = XB$$

where

$$B = (X^T X)^{-1} X^T Y$$
If X^TX is singular, i.e. doesn’t have an inverse (see Matrix Operations), then B won’t be defined, and as a result Y-hat won’t be defined either. This occurs when one column in X is a non-trivial linear combination of some other columns in X, i.e. one independent variable is a non-trivial linear combination of the other independent variables. Even when X^TX is almost singular, i.e. det(X^TX) is close to zero, the values of B and Y-hat will be unstable; i.e. small changes in X may result in significant changes in B and Y-hat. Such a situation is called multicollinearity, or simply collinearity, and should be avoided.
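For readers who want to see this numerically, here is a minimal Python/NumPy sketch (my own illustration with made-up data, not part of the original page) showing that when one column of X is an exact linear combination of other columns, X^TX loses rank and its determinant is effectively zero:

```python
import numpy as np

# Build a design matrix whose last column is an exact linear
# combination of the previous two (x3 = x1 + 2*x2).
rng = np.random.default_rng(0)
x1 = rng.normal(size=20)
x2 = rng.normal(size=20)
x3 = x1 + 2 * x2                       # exact collinearity
X = np.column_stack([np.ones(20), x1, x2, x3])

XtX = X.T @ X
print(np.linalg.det(XtX))              # ~0 up to floating-point noise
print(np.linalg.matrix_rank(XtX))      # 3 rather than 4: X^T X is singular

# Since X^T X has no inverse, B = (X^T X)^-1 X^T Y is undefined,
# which is why the regression coefficients cannot be computed.
```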

For example, in the table shown in Figure 1, X1 is double X2, and thus X^TX is singular. Excel detects this and creates a regression model equivalent to the one obtained by simply eliminating column X2.

Figure 1 – Collinearity

Observation: In the case where k = 2, the coefficient estimates produced by the least-squares process turn out to be

$$b_1 = \frac{r_{x_1 y} - r_{x_1 x_2}\, r_{x_2 y}}{1 - r_{x_1 x_2}^2} \cdot \frac{s_y}{s_{x_1}} \qquad\qquad b_2 = \frac{r_{x_2 y} - r_{x_1 x_2}\, r_{x_1 y}}{1 - r_{x_1 x_2}^2} \cdot \frac{s_y}{s_{x_2}}$$

If x1 and x2 are perfectly correlated, i.e.

$$r_{x_1 x_2} = \pm 1$$

then the denominators of the coefficients are zero, and so the coefficients are undefined.
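To make these formulas concrete, here is a small sketch (an illustration of mine, with artificial data) that computes b1 and b2 from the correlation form above and checks them against a direct least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = 0.6 * x1 + rng.normal(size=50)          # correlated, but not collinear
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=50)

r12 = np.corrcoef(x1, x2)[0, 1]
ry1 = np.corrcoef(y, x1)[0, 1]
ry2 = np.corrcoef(y, x2)[0, 1]
sy, s1, s2 = np.std(y, ddof=1), np.std(x1, ddof=1), np.std(x2, ddof=1)

# Correlation form of the k = 2 coefficients; the denominator 1 - r12**2
# vanishes when x1 and x2 are perfectly correlated (r12 = +/-1).
b1 = (ry1 - r12 * ry2) / (1 - r12**2) * sy / s1
b2 = (ry2 - r12 * ry1) / (1 - r12**2) * sy / s2

# Check against an ordinary least-squares fit with an intercept.
X = np.column_stack([np.ones_like(x1), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
print(b1, b2)          # matches b[1] and b[2] below
print(b[1], b[2])
```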

Observation: Unfortunately, you can’t always count on collinearity showing up as one column being an exact linear combination of the others. Even when one column is only approximately a linear combination of the other columns, an unstable situation can result. We now define some metrics that help determine whether such a situation is likely.

Definition 1: The tolerance of the jth independent variable x_j is 1 – R_j^2, where R_j is the multiple correlation coefficient between x_j and all the other independent variables; i.e. R_j^2 is the coefficient of determination obtained when x_j is regressed on the other independent variables. The variance inflation factor (VIF) is the reciprocal of the tolerance.

Observation: Tolerance ranges from 0 to 1, and so VIF ranges from 1 upward. We want a low value of VIF and a high value of tolerance. A tolerance value of less than 0.1 (equivalently, VIF above 10) is a red alert, while values below 0.2 (VIF above 5) can be cause for concern.
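If you want to compute these metrics outside Excel, the definition translates directly into code. Below is a minimal Python/NumPy sketch (the function name vif_tolerance is mine, not part of the Real Statistics add-in): it regresses x_j on the remaining independent variables and returns 1 – R_j^2 and its reciprocal.

```python
import numpy as np

def vif_tolerance(X, j):
    """Tolerance and VIF of the j-th column of X (independent variables only).

    Regresses x_j on the remaining columns (with an intercept) and returns
    (tolerance, VIF) = (1 - R_j^2, 1/(1 - R_j^2)). Note that j is 0-based
    here, whereas the TOLERANCE/VIF worksheet functions use 1-based j.
    """
    X = np.asarray(X, dtype=float)
    y = X[:, j]                              # x_j plays the role of Y
    others = np.delete(X, j, axis=1)         # the remaining predictors
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    tol = 1 - r2
    return tol, 1 / tol
```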

Real Statistics Excel Functions: The Real Statistics Resource Pack provides the following two functions:

TOLERANCE(R1, j) = Tolerance of the jth variable for the data in range R1; i.e. 1 – R_j^2

VIF(R1, j) = VIF of the jth variable for the data in range R1

Observation: TOLERANCE(R1, j) = 1 – RSquare(R1, j)

Example 1: Check the Tolerance and VIF for the data displayed in Figure 1 of Multiple Correlation (i.e. the data for the first 12 states in Example 1 of Multiple Correlation).

The top part of Figure 2 shows the data for the first 12 states in Example 1. From these data we can calculate the Tolerance and VIF for each of the 8 independent variables.

Figure 2 – Tolerance and VIF

For example, to calculate the Tolerance for Crime we need to run the Regression data analysis tool with the data in range C4:J15, excluding column E, as Input X and the data in column E (Crime) as Input Y. (To do this we first need to copy the data so that Input X consists of contiguous cells.) We then see that R Square = .518 (this is R_j^2 where x_j is Crime), and so Tolerance = 1 – .518 = .482 and VIF = 1/.482 = 2.07.

Alternatively, we can use the supplemental functions TOLERANCE(C4:J15,3) = .482 and VIF(C4:J15,3) = 2.07 (since Crime is the 3rd variable). The results are shown in the bottom part of Figure 2. Note that we should be concerned about the Traffic Deaths and University variables since their tolerance values are close to .1.

18 Responses to Collinearity

  1. Guero says:

    Charles,
    Would performing PCA on a set of variables allow us to avoid multicollinearity? I mean, the variability/variance effects could not then be linearly related, since, by orthogonality, the components are pairwise (and overall) linearly independent?
    Thanks in Advance.

  2. TooLegit says:

    Charles,
    What can I conclude if my correlation matrix has a lot of values close to 1, can I conclude (almost) collinearity from a high degree of correlation?

    Also, say I have a model Y = a1x1 + a2x2 + … + akxk.

    And I compute Y|Xi, i.e., I regress Y with respect to any of the Xi’s, and I get a coefficient ai close to 1 (and a high R^2). Can I conclude Xi is not needed in the model? If Y|Xi = c + Xi then Xi is just a translate of Y and so provides no new information?
    Thanks.

    • Charles says:

      You only need one value in the correlation matrix off the main diagonal to be close to 1 to have collinearity. If the two independent variables with a correlation of 1 are v1 and v2, then this means that v1 can be expressed as v1 = b * v2 + a for some constants a and b.

      Note that you could have collinearity even when none of the elements in the correlation matrix are near 1. This could happen if you have three independent variables v1, v2 and v3 where v3 = v1 + v2.

      Regarding your second question, I guess it depends on how you look at things. In some sense Xi is a great predictor of Y, and so the other variables may not be needed (or at least they need to cancel each other out in some way). If indeed Xi is essentially Y (just a translate of it), then there is no point in including it among the independent variables.

      Charles
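
      As a quick numerical illustration of Charles’s second paragraph (a sketch of my own, not part of the original reply): three variables can be exactly collinear even though no pairwise correlation is close to 1:

```python
import numpy as np

rng = np.random.default_rng(2)
v1 = rng.normal(size=100)
v2 = rng.normal(size=100)
v3 = v1 + v2                          # exact collinearity

V = np.column_stack([v1, v2, v3])
print(np.corrcoef(V, rowvar=False).round(2))
# Off-diagonal correlations are only about 0 and 0.7, yet the
# matrix has rank 2, so X^T X for these three columns is singular.
print(np.linalg.matrix_rank(V))       # 2
```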

  3. Matt says:

    Hi Charles,
    So unless Tolerance is low and VIF is high, we should not worry about multicollinearity? Even if two variables have very similar VIF and Tolerance values, but not necessarily low (Tolerance) and high (VIF)?

  4. Wytek says:

    Hi Charles,

    I am wondering why VIF consistently produces exactly 1.0 as the first VIF value. I just used the RAND() function to generate a bunch of columns and then used =VIF($F$58:$H$68,1). The output was exactly 1.0. I then proceeded to calculate VIF according to VIF_i = 1/(1 – R_i^2). To get exactly 1.0, the R_i^2 value would have to be exactly 0.

    Can you shed some light on it?

  5. Emmanuel Israel Edache says:

    Good day sir. Please, I want to know how to calculate the mean effect. In short, I need the step-by-step procedure for calculating the mean effect. The equation I have is confusing me.
    Thanks. Emmanuel.

  6. Ryan says:

    In layman’s terms, can you explain the danger of multicollinearity when performing multiple linear regression for predictive modeling purposes? I understand the issues it brings with interpretation of the model (strange parameter coefficients and p-values), but with respect to making predictions, what else should I know?

    • Charles says:

      Ryan,

      Essentially it means that one of the independent variables is not really necessary to the model because its effect/impact on the model is already captured by some of the other variables. This variable is not contributing anything extra to the predictions and can be removed. The danger is mathematical since it makes the model unstable in the sense that a small change in the values of this variable can have a big impact on the model.

      You can think of it almost like 1/(a-b). If a and b are almost equal then the value of 1/(a-b) is very large; if a = b then its value is undefined (or infinity).

      If you have true multicollinearity, then the “problem” variable will automatically be deleted by Excel. The real problem occurs when you don’t have exact multicollinearity (similar to the case where a = b), but close to multicollinearity (similar to the case where a is close to b). In this case, depending on the sophistication of the regression model, the “problem” variable won’t be eliminated, and the unstable situation described above can result.

      Charles
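
      The 1/(a – b) analogy can be seen directly in a regression (again a hedged sketch of my own, with artificial data): with two nearly identical predictors, a nudge to the data far smaller than the measurement noise completely changes the fitted coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)   # x2 is almost exactly x1
y = x1 + rng.normal(scale=0.1, size=n)

def fit(x2):
    # Ordinary least squares with an intercept.
    X = np.column_stack([np.ones(n), x1, x2])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_before = fit(x2)
b_after = fit(x2 + 1e-4 * rng.normal(size=n))  # tiny nudge to x2
print(b_before.round(1))
print(b_after.round(1))
# The individual coefficients are typically huge and swing wildly
# between the two fits, although their sum stays near the true
# combined effect of 1.
```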

      • Jonathan says:

        Great explanation – very clear, thank you!

      • MSIS says:

        Charles,
        Would it be helpful to do Gaussian elimination on X^TX in order to determine collinearity? Or maybe we can somehow use the size of det(X^TX); if it is close to 0 we suspect collinearity? I suspect there is a connection between PCA and collinearity; after PCA, it seems any collinearity would disappear, since the axes are pairwise perpendicular and are therefore linearly independent?
        Thanks.

  7. Faseeh says:

    Sir,

    I tried entering =VER() and that gives me 3.6.2 in the cell (that should be the version number), but VIF is still not being calculated.

    My data is present in range A1:K23 with a header row (row 1) and a header column (column A), and the formula I am using is =VIF(B2:K23,2).

    Can I attach a file to this message? I am working on the final thesis of my MBA program and this add-in could be a real help for me. Thank you.

  8. Faseeh says:

    Sir,

    I am trying to use your add-in for Excel 2007 to calculate the VIF for my data, but it is giving me the error:

    ” Compiler error in hidden module ”

    and results in a value error. Please advise what I am doing wrong.

    Thank you.
    Faseeh

    • Charles says:

      Faseeh,
      Try entering the formula =VER() in any blank cell in a worksheet. If you get an error, then the add-in was not installed properly. If you get the release number (e.g. 3.6.2) of the Real Statistics add-in then the cause is different and we will need to diagnose the problem in a different way.
      Charles
