We can also calculate the correlation between more than two variables.
Definition 1: Given variables x, y and z, we define the multiple correlation coefficient

Rz,xy = √[(rxz² + ryz² − 2·rxz·ryz·rxy) / (1 − rxy²)]
where rxz, ryz, rxy are as defined in Definition 2 of Basic Concepts of Correlation. Here x and y are viewed as the independent variables and z is the dependent variable.
We also define the multiple coefficient of determination to be the square of the multiple correlation coefficient.
Often the subscripts are dropped, and the multiple correlation coefficient and multiple coefficient of determination are written simply as R and R² respectively. These definitions can also be extended to more than two independent variables. With just one independent variable, the multiple correlation coefficient is simply r.
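Although the original presentation uses Excel, Definition 1 can be computed directly from the three pairwise correlations. Here is a minimal Python sketch (the function name multiple_r is our own, for illustration only):

```python
import math

def multiple_r(r_xz, r_yz, r_xy):
    # Definition 1: R_z,xy from the pairwise correlations,
    # with x and y independent and z dependent
    r2 = (r_xz**2 + r_yz**2 - 2 * r_xz * r_yz * r_xy) / (1 - r_xy**2)
    return math.sqrt(r2)
```

Note that when rxy = 0 the formula reduces to √(rxz² + ryz²), consistent with Property 2 below.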
Unfortunately, R is not an unbiased estimate of the population multiple correlation coefficient; the bias is especially noticeable for small samples. A relatively unbiased version of R is given by the adjusted value Radj.
Definition 2: If R is Rz,xy as defined above (or similarly for more variables) then the adjusted multiple coefficient of determination is

R²adj = 1 − (1 − R²)(n − 1)/(n − k − 1)
where k = the number of independent variables and n = the number of data elements in the sample for z (which should be the same as the samples for x and y).
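Definition 2 translates directly into code. A minimal Python sketch (the function name adjusted_r2 is our own, not an Excel or Real Statistics function):

```python
def adjusted_r2(r2, n, k):
    # Definition 2: R^2_adj = 1 - (1 - R^2)(n - 1)/(n - k - 1)
    # r2 = multiple coefficient of determination
    # n  = sample size, k = number of independent variables
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Since (n − 1)/(n − k − 1) > 1 whenever k ≥ 1, the adjusted value is always smaller than R² itself.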
Excel Data Analysis Tools: In addition to the various correlation functions described elsewhere, Excel provides the Covariance and Correlation data analysis tools. The Covariance tool calculates the pairwise population covariances for all the variables in the data set. Similarly the Correlation tool calculates the various correlation coefficients as described in the following example.
Example 1: We expand the data in Example 2 of Correlation Testing via the t Test to include a number of other statistics. The data for the first few states are shown in Figure 1:
Figure 1 – Data for Example 1
Using Excel’s Correlation data analysis tool we can compute the pairwise correlation coefficients for the various variables in the table in Figure 1. The results are shown in Figure 2.
Figure 2 – Correlation coefficients for data in Example 1
We can also single out the first three variables, poverty, infant mortality and white (i.e. the percentage of the population that is white), and calculate the multiple correlation coefficients, taking poverty as the dependent variable, as defined in Definitions 1 and 2. We use the pairwise correlations in Figure 2 to obtain the values of Rz,xy, R²z,xy and R²adj.
Definition 3: Given x, y and z as in Definition 1, the partial correlation of x and z holding y constant is defined as follows:

rzx,y = (rzx − rzy·rxy) / √[(1 − rzy²)(1 − rxy²)]
In the semi-partial correlation rz(x,y), the correlation between x and y is eliminated, but not the correlations of z with x and with y:

rz(x,y) = (rzx − rzy·rxy) / √(1 − rxy²)
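Both quantities in Definition 3 depend only on the three pairwise correlations. A Python sketch (function names are our own, for illustration):

```python
import math

def partial_r(r_zx, r_zy, r_xy):
    # Partial correlation of x and z holding y constant
    return (r_zx - r_zy * r_xy) / math.sqrt((1 - r_zy**2) * (1 - r_xy**2))

def semipartial_r(r_zx, r_zy, r_xy):
    # Semi-partial correlation: the x-y correlation is removed,
    # but z's correlations with x and y are retained
    return (r_zx - r_zy * r_xy) / math.sqrt(1 - r_xy**2)
```

When rxy = 0, the semi-partial correlation reduces to rzx, as expected.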
Observation: Suppose we look at the relationship between GPA (grade point average) and Salary 5 years after graduation and discover there is a high correlation between these two variables. As has been mentioned elsewhere, this is not to say that doing well in school causes a person to get a higher salary. In fact it is entirely possible that there is a third variable, say IQ, that correlates well with both GPA and Salary (although this would not necessarily imply that IQ is the cause of the higher GPA and higher salary).
In this case, it is possible that the correlation between GPA and Salary is a consequence of the correlation between IQ and GPA and between IQ and Salary. To test this we need to determine the correlation between GPA and Salary eliminating the influence of IQ from both variables, i.e. the partial correlation of GPA and Salary holding IQ constant.
Property 1:

rz(x,y)² = R²z,xy − rzy²

rzx,y² = (R²z,xy − rzy²) / (1 − rzy²)

Proof: The first assertion follows since

R²z,xy − rzy² = (rzx² + rzy² − 2·rzx·rzy·rxy)/(1 − rxy²) − rzy² = (rzx² − 2·rzx·rzy·rxy + rzy²·rxy²)/(1 − rxy²) = (rzx − rzy·rxy)²/(1 − rxy²) = rz(x,y)²

The second assertion follows since:

rzx,y² = (rzx − rzy·rxy)²/[(1 − rzy²)(1 − rxy²)] = rz(x,y)²/(1 − rzy²) = (R²z,xy − rzy²)/(1 − rzy²)
Example 2: Calculate the partial correlation rzx,y and the semi-partial correlation rz(x,y) for the data in Example 1.
We can see that Property 1 holds for these data, since the computed values satisfy rz(x,y)² = R²z,xy − rzy² and rzx,y² = (R²z,xy − rzy²)/(1 − rzy²).
Observation: Since the coefficient of determination is a measure of the portion of variance attributable to the variables involved, we can look at the meaning of the concepts defined above using the following Venn diagram, where the rectangle represents the total variance of the poverty variable.
Figure 3 – Breakdown of variance of poverty
Using the data from Example 1, we can calculate the breakdown of the variance for poverty in Figure 4:
Figure 4 – Breakdown of variance of poverty continued
Note that we can calculate B in a number of ways: (A + B) – A, (B + C) – C, (A + B + C) – (A + C), etc., and get the same answer in each case. Also note that A + B + C = R²z,xy, so the unexplained portion of the variance is

D = 1 – (A + B + C)
Figure 5 – Breakdown of variance of poverty continued
Property 2: From Property 1, it follows that:

R²z,xy = rzy² + rz(x,y)² = rzy² + rzx,y²·(1 − rzy²)

If the independent variables are mutually independent (i.e. rxy = 0), this reduces to

R²z,xy = rzx² + rzy²
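Property 2 can be checked numerically from any set of pairwise correlations. The following Python sketch uses illustrative values (not the values from Example 1):

```python
import math

# Illustrative pairwise correlations (not taken from Example 1)
r_zx, r_zy, r_xy = 0.5, 0.4, 0.3

R2 = (r_zx**2 + r_zy**2 - 2*r_zx*r_zy*r_xy) / (1 - r_xy**2)       # Definition 1
sp = (r_zx - r_zy*r_xy) / math.sqrt(1 - r_xy**2)                  # semi-partial
pr = (r_zx - r_zy*r_xy) / math.sqrt((1 - r_zy**2)*(1 - r_xy**2))  # partial

# Property 2: both decompositions recover R^2
assert abs(R2 - (r_zy**2 + sp**2)) < 1e-12
assert abs(R2 - (r_zy**2 + pr**2 * (1 - r_zy**2))) < 1e-12
```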
Real Statistics Functions: The Real Statistics Resource Pack contains the following supplemental functions:
CORREL_ADJ(R1, R2) = adjusted correlation coefficient for the data sets defined by ranges R1 and R2
MCORREL(R, R1, R2) = multiple correlation of dependent variable z with x and y where the samples for z, x and y are the ranges R, R1 and R2 respectively
Observation: Definition 1 defines the multiple correlation coefficient Rz,xy and corresponding multiple coefficient of determination for three variables x, y and z. These definitions can be extended to more than three variables as described in Advanced Multiple Correlation.
E.g. if R1 is an m × n data range containing the data for n variables then the supplemental function RSquare(R1, k) calculates the multiple coefficient of determination for the kth variable with respect to the other variables in R1. The multiple correlation coefficient for the kth variable with respect to the other variables in R1 can be calculated by the formula =SQRT(RSquare(R1, k)).
Thus if R1, R2 and R3 are the three columns of the m × 3 data range R, with R1 and R2 containing the samples for the independent variables x and y and R3 containing the sample data for dependent variable z, then =MCORREL(R3, R1, R2) yields the same result as =SQRT(RSquare(R, 3)).
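MCORREL and RSquare are Real Statistics functions, but the equivalence described here can be checked independently: for two predictors, the R² obtained from the standardized regression coefficients (via the 2×2 normal equations) agrees with the Definition 1 formula. A Python sketch under that assumption (function names are our own):

```python
def r2_from_formula(r_zx, r_zy, r_xy):
    # Definition 1, squared: the multiple coefficient of determination
    return (r_zx**2 + r_zy**2 - 2*r_zx*r_zy*r_xy) / (1 - r_xy**2)

def r2_from_regression(r_zx, r_zy, r_xy):
    # Standardized regression coefficients from the 2x2 normal equations
    b_x = (r_zx - r_zy*r_xy) / (1 - r_xy**2)
    b_y = (r_zy - r_zx*r_xy) / (1 - r_xy**2)
    # For standardized variables, R^2 = b_x*r_zx + b_y*r_zy
    return b_x*r_zx + b_y*r_zy
```

Both functions return the same value for any admissible correlations, which is why taking the square root of the regression R² recovers the multiple correlation coefficient.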
Observation: Similarly the definition of the partial correlation coefficient (Definition 3) can be extended to more than three variables as described in Advanced Multiple Correlation.