Multiple Correlation

In Correlation Basic Concepts we define the correlation coefficient, which measures the size of the linear association between two variables. We now extend this definition to the situation where there are more than two variables.

Multiple Correlation Coefficient

Definition 1: Given variables x, y, and z, we define the multiple correlation coefficient

$$R_{z,xy} = \sqrt{\frac{r_{xz}^2 + r_{yz}^2 - 2 r_{xz} r_{yz} r_{xy}}{1 - r_{xy}^2}}$$

where r_{xz}, r_{yz}, and r_{xy} are as defined in Definition 2 of Basic Concepts of Correlation. Here x and y are viewed as the independent variables and z is the dependent variable.
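As a minimal sketch, the multiple correlation coefficient can be computed from the three pairwise correlations, assuming the standard three-variable formula R^2 = (r_xz^2 + r_yz^2 - 2 r_xz r_yz r_xy) / (1 - r_xy^2). The function name `multiple_correlation` and the input values are hypothetical and chosen only for illustration.

```python
import math

def multiple_correlation(r_xz, r_yz, r_xy):
    """Multiple correlation of z with x and y, from pairwise Pearson correlations."""
    if abs(r_xy) >= 1:
        raise ValueError("r_xy must satisfy |r_xy| < 1")
    r_squared = (r_xz**2 + r_yz**2 - 2 * r_xz * r_yz * r_xy) / (1 - r_xy**2)
    return math.sqrt(r_squared)

# Hypothetical pairwise correlations (not taken from the article's data)
R = multiple_correlation(r_xz=0.5, r_yz=0.4, r_xy=0.3)
print(round(R, 4))
```

When the two independent variables are uncorrelated (r_xy = 0), the expression under the square root reduces to the sum of the two squared pairwise correlations.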

Coefficient of Determination

We also define the multiple coefficient of determination to be the square of the multiple correlation coefficient.

Often the subscripts are dropped and the multiple correlation coefficient and multiple coefficient of determination are written simply as R and R^2 respectively. These definitions may also be extended to more than two independent variables. With just one independent variable, the multiple correlation coefficient is simply |r|, the absolute value of the ordinary correlation coefficient, since R is defined as a non-negative square root.

Unfortunately, R is a biased estimate of the population multiple correlation coefficient, and the bias is most noticeable for small samples. A less biased version of R is given by R adjusted.

Definition 2: If R is R_{z,xy} as defined above (or similarly for more variables), then the adjusted multiple coefficient of determination is

$$R_{adj}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

where k = the number of independent variables and n = the number of data elements in the sample for z (which should be the same as the samples for x and y).
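The adjustment can be sketched in a few lines, assuming the usual form 1 - (1 - R^2)(n - 1)/(n - k - 1). The function name `adjusted_r_squared` and the sample values are hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical values: R^2 = 0.5 from 50 observations and 2 independent variables
adj = adjusted_r_squared(0.5, n=50, k=2)
print(round(adj, 4))
```

The adjusted value is never larger than the unadjusted R^2, and the gap widens as n shrinks or k grows.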

Data Analysis Tools

Excel Data Analysis Tools: In addition to the various correlation functions described elsewhere, Excel provides the Covariance and Correlation data analysis tools. The Covariance tool calculates the pairwise population covariances for all the variables in the data set. Similarly, the Correlation tool calculates the various correlation coefficients as described in the following example.

Example 1: We expand the data in Example 2 of Correlation Testing via the t Test to include a number of other statistics. The data for the first few states are displayed in Figure 1.

US states statistics

Figure 1 – Data for Example 1

Using Excel’s Correlation data analysis tool we can compute the pairwise correlation coefficients for the various variables in the table in Figure 1. The results are shown in Figure 2.

Correlation coefficients array

Figure 2 – Correlation coefficients for data in Example 1
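Outside Excel, the same two tables can be approximated with NumPy. This is only a sketch on synthetic data (the values below are randomly generated and are not the state statistics of Figure 1); `bias=True` is used so that the covariances are population covariances, matching the description of the Covariance tool above.

```python
import numpy as np

# Synthetic stand-in for a data table: 50 rows (observations), 3 columns (variables)
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))

# Pairwise Pearson correlation coefficients (columns are variables)
corr = np.corrcoef(data, rowvar=False)

# Pairwise population covariances (normalized by n rather than n - 1)
pop_cov = np.cov(data, rowvar=False, bias=True)

print(corr.round(3))
print(pop_cov.round(3))
```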

We can also single out the first three variables, poverty, infant mortality, and white (i.e. the percentage of the population that is white) and calculate the multiple correlation coefficients, assuming poverty is the dependent variable, as defined in Definitions 1 and 2. We use the data in Figure 2 to obtain the values r_{PW}, r_{PI} and r_{WI}.

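$$R_{P,WI} = \sqrt{\frac{r_{PW}^2 + r_{PI}^2 - 2 r_{PW} r_{PI} r_{WI}}{1 - r_{WI}^2}}$$

$$R_{adj}^2 = 1 - \frac{(1 - R_{P,WI}^2)(n - 1)}{n - 3}$$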

Partial and Semi-Partial Correlation

Definition 3: Given x, y, and z as in Definition 1, the partial correlation of x and z holding y constant is defined as follows:

$$r_{xz,y} = \frac{r_{xz} - r_{xy} r_{yz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{yz}^2)}}$$

In the semi-partial correlation, the correlation between x and y is eliminated, but not the correlation between x and z or between y and z:

$$r_{z(x,y)} = \frac{r_{xz} - r_{xy} r_{yz}}{\sqrt{1 - r_{xy}^2}}$$
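A minimal sketch of both quantities, assuming the standard first-order formulas: the partial correlation divides r_xz - r_xy r_yz by sqrt((1 - r_xy^2)(1 - r_yz^2)), while the semi-partial correlation divides the same numerator by sqrt(1 - r_xy^2) only. The function names and input values are hypothetical.

```python
import math

def partial_correlation(r_xz, r_xy, r_yz):
    """Correlation of x and z with y held constant in both."""
    return (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy**2) * (1 - r_yz**2))

def semipartial_correlation(r_xz, r_xy, r_yz):
    """Correlation of z with the part of x that is not explained by y."""
    return (r_xz - r_xy * r_yz) / math.sqrt(1 - r_xy**2)

# Hypothetical pairwise correlations
p = partial_correlation(r_xz=0.5, r_xy=0.3, r_yz=0.4)
sp = semipartial_correlation(r_xz=0.5, r_xy=0.3, r_yz=0.4)
print(round(p, 4), round(sp, 4))
```

Because the partial correlation has the extra factor sqrt(1 - r_yz^2) in its denominator, its absolute value is never smaller than that of the semi-partial correlation.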

Causation

Suppose we look at the relationship between GPA (grade point average) and Salary 5 years after graduation and discover there is a high correlation between these two variables. As has been mentioned elsewhere, this is not to say that doing well in school causes a person to get a higher salary. In fact, it is entirely possible that there is a third variable, say IQ, that correlates well with both GPA and Salary (although this would not necessarily imply that IQ is the cause of the higher GPA and higher salary).

In this case, it is possible that the correlation between GPA and Salary is a consequence of the correlation between IQ and GPA and between IQ and Salary. To test this we need to determine the correlation between GPA and Salary eliminating the influence of IQ from both variables, i.e. the partial correlation r_{GS,I}.
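The idea can be illustrated with a small simulation. This is a sketch on synthetic data, built under the assumption that a third variable drives both of the others and that they have no direct link; the variable names mirror the example, but the numbers are randomly generated and say nothing about real GPA, salary, or IQ data.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Synthetic model: IQ influences both GPA and Salary; GPA does not influence Salary
iq = rng.normal(size=n)
gpa = iq + rng.normal(size=n)
salary = iq + rng.normal(size=n)

# Pairwise correlations (each row of the input is one variable)
r = np.corrcoef([gpa, salary, iq])
r_gs, r_gi, r_si = r[0, 1], r[0, 2], r[1, 2]

# Partial correlation of GPA and Salary with IQ held constant
partial_gs_i = (r_gs - r_gi * r_si) / np.sqrt((1 - r_gi**2) * (1 - r_si**2))

print(round(r_gs, 3), round(partial_gs_i, 3))
```

In this setup the raw correlation between GPA and Salary is clearly positive, while the partial correlation is close to zero, which is the pattern expected when the association is entirely due to the third variable.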

Property

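Property 1:

$$r_{xz,y}^2 = \frac{R_{z,xy}^2 - r_{yz}^2}{1 - r_{yz}^2} \qquad r_{z(x,y)}^2 = R_{z,xy}^2 - r_{yz}^2$$

Proof: The first assertion follows since

$$\frac{R_{z,xy}^2 - r_{yz}^2}{1 - r_{yz}^2} = \frac{r_{xz}^2 + r_{yz}^2 - 2 r_{xz} r_{yz} r_{xy} - r_{yz}^2 (1 - r_{xy}^2)}{(1 - r_{xy}^2)(1 - r_{yz}^2)} = \frac{(r_{xz} - r_{xy} r_{yz})^2}{(1 - r_{xy}^2)(1 - r_{yz}^2)} = r_{xz,y}^2$$

The second assertion follows since:

$$R_{z,xy}^2 - r_{yz}^2 = \frac{r_{xz}^2 - 2 r_{xz} r_{yz} r_{xy} + r_{xy}^2 r_{yz}^2}{1 - r_{xy}^2} = \frac{(r_{xz} - r_{xy} r_{yz})^2}{1 - r_{xy}^2} = r_{z(x,y)}^2$$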
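The two identities can be spot-checked numerically. The sketch below assumes Property 1 states that the squared partial correlation equals (R^2 - r_yz^2)/(1 - r_yz^2) and that the squared semi-partial correlation equals R^2 - r_yz^2; the pairwise correlations are hypothetical.

```python
import math

# Hypothetical pairwise correlations
r_xz, r_yz, r_xy = 0.5, 0.4, 0.3

# Multiple coefficient of determination of z on x and y
R2 = (r_xz**2 + r_yz**2 - 2 * r_xz * r_yz * r_xy) / (1 - r_xy**2)

# Partial and semi-partial correlations from their first-order formulas
partial = (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy**2) * (1 - r_yz**2))
semi = (r_xz - r_xy * r_yz) / math.sqrt(1 - r_xy**2)

# Both identities hold up to floating-point rounding
print(math.isclose(partial**2, (R2 - r_yz**2) / (1 - r_yz**2)))
print(math.isclose(semi**2, R2 - r_yz**2))
```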

Example 2: Calculate r_{PW,I} and r_{P(W,I)} for the data in Example 1.

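$$r_{PW,I} = \frac{r_{PW} - r_{PI} r_{WI}}{\sqrt{(1 - r_{PI}^2)(1 - r_{WI}^2)}}$$

$$r_{P(W,I)} = \frac{r_{PW} - r_{PI} r_{WI}}{\sqrt{1 - r_{WI}^2}}$$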

We can see that Property 1 holds for this data since

image1622 image5041

Partitioning Variance

Since the coefficient of determination is a measure of the portion of variance attributable to the variables involved, we can look at the meaning of the concepts defined above using the following Venn diagram, where the rectangle represents the total variance of the poverty variable.

Partitioning variance

Figure 3 – Breakdown of variance for poverty

Using the data from Example 1, we can calculate the breakdown of the variance for poverty in Figure 4:

Variance breakdown

Figure 4 – Breakdown of variance for poverty continued

Note that we can calculate B in a number of ways: (A + B) – A, (B + C) – C, (A + B + C) – (A + C), etc., and get the same answer in each case. Also note that

image5043 image5042

where D = 1 – (A + B + C).

Variance breakdown correlation

Figure 5 – Breakdown of variance for poverty continued

Property 2: From Property 1, it follows that:

$$R_{z,xy}^2 = r_{yz}^2 + r_{z(x,y)}^2$$

If the independent variables are mutually independent (so that r_{xy} = 0), this reduces to

$$R_{z,xy}^2 = r_{xz}^2 + r_{yz}^2$$
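The uncorrelated case can be checked against an ordinary least squares fit. The sketch below uses small made-up arrays in which x and y are exactly uncorrelated, and it assumes the reduced form of Property 2 is R^2 = r_xz^2 + r_yz^2; R^2 is computed independently from the regression residuals.

```python
import numpy as np

# Made-up data: x and y are exactly uncorrelated by construction
x = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
z = np.array([2.0, 0.5, 1.0, -1.5, 2.5, 0.0, 0.5, -1.0])

r = np.corrcoef([x, y, z])
r_xy, r_xz, r_yz = r[0, 1], r[0, 2], r[1, 2]

# R^2 from a least squares regression of z on x and y (with intercept)
X = np.column_stack([np.ones_like(x), x, y])
coef, *_ = np.linalg.lstsq(X, z, rcond=None)
resid = z - X @ coef
centered = z - z.mean()
r_squared = 1 - (resid @ resid) / (centered @ centered)

print(np.isclose(r_squared, r_xz**2 + r_yz**2))
```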

Worksheet Functions

Real Statistics Functions: The Real Statistics Resource Pack contains the following functions where the samples for z, x, and y are contained in the arrays or ranges R, R1, and R2 respectively.

CORREL_ADJ(R1, R2) = adjusted correlation coefficient for the data sets defined by ranges R1 and R2

MCORREL(R, R1, R2) = multiple correlation of dependent variable z with x and y

PART_CORREL(R, R1, R2) = partial correlation r_{zx,y} of variables z and x holding y constant

SEMIPART_CORREL(R, R1, R2) = semi-partial correlation r_{z(x,y)}

Multiple Correlation for more than 3 variables

Definition 1 defines the multiple correlation coefficient R_{z,xy} and the corresponding multiple coefficient of determination for three variables x, y, and z. We can extend these definitions to more than three variables as described in Advanced Multiple Correlation.

E.g. if R1 is an m × n array containing the data for n variables then the Real Statistics function RSquare(R1, k) calculates the multiple coefficient of determination for the kth variable with respect to the other variables in R1. The multiple correlation coefficient for the kth variable with respect to the other variables in R1 can then be calculated by the formula =SQRT(RSquare(R1, k)).

Thus if R1, R2, and R3 are the three columns of the m × 3 data array or range R, with R1 and R2 containing the samples for the independent variables x and y and R3 containing the sample data for dependent variable z, then =MCORREL(R3, R1, R2) yields the same result as =SQRT(RSquare(R, 3)).
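For readers working outside Excel, a rough Python analogue can be sketched as follows. It is not the Real Statistics implementation: it relies on the known identity that the R^2 of variable k regressed on the remaining variables equals 1 - 1/c_kk, where c_kk is the k-th diagonal element of the inverse correlation matrix. The function name `r_square` and the synthetic data are hypothetical.

```python
import numpy as np

def r_square(data, k):
    """R^2 of column k (1-based) regressed on all other columns of data."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    return 1 - 1 / inv[k - 1, k - 1]

# Synthetic data: columns are x, y, z, with z depending on x and y
rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = rng.normal(size=40)
z = 0.8 * x - 0.5 * y + rng.normal(size=40)
data = np.column_stack([x, y, z])

# Multiple correlation of the third column with the first two
R = np.sqrt(r_square(data, 3))

# Same quantity from the three-variable formula based on pairwise correlations
r = np.corrcoef(data, rowvar=False)
r_xy, r_xz, r_yz = r[0, 1], r[0, 2], r[1, 2]
R_pairwise = np.sqrt((r_xz**2 + r_yz**2 - 2 * r_xz * r_yz * r_xy) / (1 - r_xy**2))

print(np.isclose(R, R_pairwise))
```

The matrix form works for any number of columns, which is why it is convenient when there are more than two independent variables.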

Similarly, the definition of the partial correlation coefficient (Definition 3) can be extended to more than three variables as described in Advanced Multiple Correlation.



315 thoughts on “Multiple Correlation”

  1. Hi Charles!

    Thanks for the great article. I wanted to ask how you would determine whether the calculated multiple correlation coefficient is significant. For example, when working with two variables, it’s enough to get the p-value to know whether the computed correlation is statistically significant, but since this formula already uses the pairwise correlations between our variables, should we first make sure that all of those are significant (i.e. p-value < 0.05)?

    Thanks!

    • Hello Matias,
      This coefficient is calculated when using ordinary linear regression. The regression data analysis tool reports whether the value is significant or not (i.e. significantly different from zero).
      Charles

  2. Say there’s a collection of 1000 products in a particular category. And there are six different methodologies for ranking the desirability of each product, where each methodology produces ranks 1-1000, with room for ties within. Usually there is some general agreement amongst a majority of the methodologies, while one or two of them might widely disagree. My question is: How can I determine which ranking methodology is “most in agreement” with “most of the other” methodologies, for that particular product category? Which statistical approach would be appropriate for making such a determination? Thank you.

      • Thank you. How would you go about isolating the winning methodology (“agrees the most with most of the others”)? Would you do multiple pair-wise runs and aggregate the W scores? Or is that too simplistic? (Or use the derived r correlation metric instead of W?)

        Also is there a typo on that page for when W=0? “If all the Ri are the same (i.e. the raters are in complete agreement), then as we have seen, W = 0. In fact, it is always the case that 0 ≤ W ≤ 1. If W = 0 then there is no agreement among the raters.” (Observations section)

        • 1. This approach doesn’t measure which methodology is best, only whether there is agreement. One approach that might be appropriate is to leave out one of the methodologies, one at a time, and compare which methodology results in the smaller W. You could also use a different approach entirely; e.g. ANOVA.
          2. Yes, you are correct. There is a typo. It should read W = 1 instead of W = 0. I have now corrected this on the webpage. Thank you very much for catching this error.
          Charles

  3. real stats seems to have stopped working, i’ve tried to uninstall, even uninstalled and reinstalled office. not sure how to fix it at this point without uninstalling windows

    • Hello Everett,
      I don’t have enough information to understand why it stopped working. That is unusual.
      In any case, since you have uninstalled it, I suggest that you download a new copy of the Real Statistics software from the website. Then rename the file to ZRealStats.xlam and install it as described on the website.
      Charles

      • OK. I had actually reinstalled everything. But what I discovered in trying to follow this instruction is that only some components of the add-in are not working, in particular MCORREL but some other functions as well. I can’t open the VBA code, so it’s hard to see exactly where the problem is occurring.

  4. Hi Mr. Charles Zaiontz,

    I am working on a study correlating two independent variables and one dependent variable. I referenced your theory about the triple correlation coefficient, but I am supposed to show proof of how we got to that formula. Would you be able to offer some clarifications on how to get to that formula from individual correlation coefficients? Thank you

    • Hi Ben,
      If you are referring to Definition 1 of multiple correlation, it is a definition, and not a theorem or property, and so there is no proof.
      There are properties about multiple correlation that will require a proof, but you need to state which property you have in mind.
      It turns out that this definition yields the same results as the square root of the coefficient of determination from regression. (This is also the motivation behind Definition 1.) This assertion can be proved. In fact, if you follow the steps in the section on multiple regression, you will be able to create the proof.
      Charles

  5. Hi Mr. Charles,

    How would I go about using the formula for the multiple correlation coefficient when I am trying to determine the correlation between 2 dependent variables and one independent since the formula presented above states its functionality only for 2 independent and 1 dependent.

  6. Dear Charles,

    I have a dataset based on measurements from the same individuals in three different specimens: serum (blood), tissue, and lung lavage. I wanted to check whether serum could be a good surrogate marker for the immune response going on in the lungs and therefore performed MCORREL with serum as z and the other two as x and y (independent variables). But I’m not entirely confident that this is allowed. It seems like the measurements have to be completely independent to be able to do this.
    Can you advise on this?

    Kind regards,
    Fien

    • For each subject, the three measurements (serum tissue, lung lavage, immune response) don’t have to be independent. The subjects have to be independent of each other.
      Charles

  7. I wanted to know if there is a relationship between the variables Product Quality, Competitive Price, Accessibility, and Advertising Capability towards Customer Satisfaction what appropriate statistical tool should be used?

    • It depends on what you mean by relationship, but you could look into using multiple correlation. This is equivalent to performing regression with Customer Satisfaction as the dependent variable and the others as independent variables. The R-square (or R) value gives an indication of the size of the relationship (R = the multiple correlation coefficient). The regression coefficients provide an indication of the significance of each independent variable.
      Charles

  8. I intend to determine the correlation between instructional resource availability, adequacy, and student performance. Please, which correlation should I use, and what would the result look like?

    • I can’t answer your question without more specific information. Are you trying to find the correlation between two variables or more than two variables? If more than two variables, are you looking for (1) the correlation between multiple independent variables and one dependent variable or (2) pairwise correlations between each pair of variables?
      Charles

  9. I am doing a study to find the relationship between students’ academic performance, aptitude, instruction, and environment. Which statistic can I use for analysis?

    • Hello,
      There are many possible statistics that might be appropriate for such an analysis. Which statistic to use depends on the nature of the data and what you mean by “relationship”. One approach is to use multiple correlation or equivalently regression.
      Charles

  10. hi Charles,
    I am doing a study to find the relationship amongst mindset, perceived stress and self-esteem. Which statistic can I use for analysis?

  11. Hello, I hope you can enlighten me on this one.
    We are aiming to study the correlation of sleepwear color to the sleeping pattern, what should we use to compute and analyze the data we will obtain?

    • Hello Ralph,
      If sleepwear color and sleeping pattern take numeric values, then presumably you could calculate Pearson’s correlation coefficient, but the appropriate correlation depends on the nature of your data.
      How are you measuring the sleepwear color? How are you measuring the sleeping pattern?
      Charles

  12. Hi,

    I want to know the correlation between the number of cases selected for PCR testing, the number of cases that tested positive, and the number of cases who got out of COVID-19 in a day, where the independent variable is the number of cases who got out of COVID-19 in a day. Could you please give me the details?

    • Let x = # of cases selected for PCR test
      Let y = # of cases that test positive (presumably via the PCR test)
      Let z = # of cases who got out of covid-19
      I presume that you mean that z is the dependent variable. Thus you calculate the multiple correlation coefficient R_z.xy as described on this webpage
      Charles

  13. Hey Charles,

    I have a dataset containing household income, no. of store purchases and no. of web purchases made.
    I want to find the correlation between income and choice of sales channel (store v/s web).
    What is the best way to approach this?

  14. hi,
    I want to study the relationship between gender, stress levels and frequency of consumption of food groups. What is the best statistical method that can be used?

  15. Dear Charles,

    While trying to calculate Pearson’s coefficient with three variables, I came across your site. I would like to ask a question: in passing from Pearson’s two-variable formula to the three-variable formula, which mathematical steps were taken?
    If the answer is too long, could you send me an email?

    Thanks for your work and for your time.
    Valentina.

      • Really sorry Charles but I don’t understand the exact point on this webpage where from the two-variable formula you get the three-variable formula (the formula that appears first on this page I mean). Can you tell me?

        • Valentina,
          I don’t explain further where the formula comes from. I use it as the definition of the multiple correlation coefficient.
          It is the same correlation that is reported from the multiple linear regression analysis.
          Charles

  16. Hi Charles,

    After calculating the Pearson r correlation between three variables, how can I calculate a p-value? I am trying to test for high correlation between the variables (r > 0.75 and p-value < 0.05). So H0: r <= 0.75 and H1: r > 0.75.

    Most calculations of p-values (using the t-test or the Fisher z-transformation) seem intended for 2 variables and are based on H0: r = 0. Does the method change for 3 variables or for H0: r <= 0.75?

    Thanks for your great site! It has been very helpful.

    • Chris,
      First, note that the correlation between y and x1, x2 is equal to the correlation between y and y-pred. Here, y-pred is the forecasted value of y based on the regression of y on x1 and x2. This means that you can use the techniques for two variables (y and y-pred) to analyze three or more variables.
      Charles
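The equivalence described in this reply can be checked numerically. The sketch below uses synthetic data and an ordinary least squares fit via NumPy; it assumes the three-variable formula for the multiple correlation coefficient in terms of pairwise correlations, and all variable names are illustrative.

```python
import numpy as np

# Synthetic data: y depends on x1 and x2 plus noise
rng = np.random.default_rng(7)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Predicted values from the regression of y on x1 and x2 (with intercept)
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ coef

# Ordinary two-variable correlation between y and its prediction
r_y_ypred = np.corrcoef(y, y_pred)[0, 1]

# Multiple correlation from the pairwise correlations
r = np.corrcoef([x1, x2, y])
r_12, r_1y, r_2y = r[0, 1], r[0, 2], r[1, 2]
R = np.sqrt((r_1y**2 + r_2y**2 - 2 * r_1y * r_2y * r_12) / (1 - r_12**2))

print(np.isclose(r_y_ypred, R))
```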

  17. Hello Charles, I am trying to prove that the partial correlation coefficient between X and Y, holding Z fixed, r_{XY.Z}, is equal to the linear correlation coefficient between the residuals e_i and v_i, where e_i are the residuals of the simple linear regression of Y on Z and v_i are the residuals of the simple linear regression of X on Z. I was able to verify this numerically, but I would like to work out the theoretical proof. If you could give me a hint toward the proof, I would be very grateful.
    Your contribution to statistics is sublime.
    Thank you very much

  18. Hello Charles,

    Thank you for the sharing the useful information on this website.
    I have a question regarding how to calculate the fourth order correlation coefficients among four variables. If we have the variable x,y,z,and w.
    Also for the case we have vectors or matrices.
    I would appreciate if you can help me.

  19. Is it reasonable for the multiple correlation coefficient calculated by hand to differ slightly from the result calculated by SPSS?

    • Hello Herlan,
      I can’t think of any reason why the results for multiple correlation would be different between Real Statistics and SPSS.
      Can you send me an example where there is a difference?
      Charles

  20. Hi,
    For my research work, I am comparing one variable with 6 or 7 variables, as you have shown in Figure 2. How can I compile my results into one, and how should I report them? My results are around 0.8 or 0.9.

  21. This question isn’t specifically related to this. Here’s my question:

    How can I find out the independent correlation of a confounding variable with Y:

    I know that the correlation between 1 variable (X1) with Y is = 0.11, and that there’s a confounding variable (let’s call this variable X2), which has a correlation of 0.46 with X1. The only reason why X1 correlates with Y is because of the confounding influence with X2. How can I figure out what the independent correlation X2 (the confounding variable) has with Y? What is the correlation between X2 and Y then? There must be a way to know how to calculate this).

    • Hello Emil,

      I don’t believe this is possible. E.g. here is a counter-example where knowing the correl(X1,Y) and correl(X1,X2) values doesn’t mean that you know correl(X2,Y).

      Data sets A
      X1 = 1, 3, 3.55542145030932
      X2 = 3, 2, 3.50170050934321
      Y = 1, 3, 4

      Data sets B
      X1 = 0.699565395107396, 2.7, 0
      X2 = 3, 1.5, 4.5
      Y = 2.5, 5.5, 3

      You will see that although correl(X1,Y) is the same for both sets and correl(X1,X2) is the same for both sets, correl(X2,Y) is not the same for both sets.

      Charles

  22. Hello Charles,

    I read both this article and the Advanced Multiple Correlation article; however, I still cannot fully understand how to adapt the Definition 1 (this article) formula in order to calculate the multiple correlation coefficient for 3 or more independent variables and 1 dependent variable.

    Can you provide me with some explanation on how the formula would look for 3 or more independent variables?

    Kind regards,

    Martin

    • Martin,
      I don’t know how you would expand the formula. The calculation with multiple independent variables generally uses matrices, so you could break down the matrix calculations to create a formula. I haven’t needed to do this since the approach described in the Advanced Multiple Correlation article works fine.
      Charles
