Multiple Correlation

In Correlation Basic Concepts we define the correlation coefficient, which measures the size of the linear association between two variables. We now extend this definition to the situation where there are more than two variables.

Multiple Correlation Coefficient

Definition 1: Given variables x, y, and z, we define the multiple correlation coefficient

$$R_{z,xy} = \sqrt{\frac{r_{xz}^2 + r_{yz}^2 - 2 r_{xz} r_{yz} r_{xy}}{1 - r_{xy}^2}}$$

where $r_{xz}$, $r_{yz}$, $r_{xy}$ are as defined in Definition 2 of Basic Concepts of Correlation. Here x and y are viewed as the independent variables and z is the dependent variable.
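As a rough illustration, the sketch below computes the multiple correlation coefficient from the three pairwise correlations, assuming the standard closed form $R_{z,xy}^2 = (r_{xz}^2 + r_{yz}^2 - 2 r_{xz} r_{yz} r_{xy}) / (1 - r_{xy}^2)$, and cross-checks it against the R obtained from a least-squares regression of z on x and y. The function name and the small data set are hypothetical and are not the data of Example 1.

```python
import numpy as np

def multiple_correlation(r_xz, r_yz, r_xy):
    """Multiple correlation R_z,xy from the three pairwise correlations."""
    r2 = (r_xz**2 + r_yz**2 - 2 * r_xz * r_yz * r_xy) / (1 - r_xy**2)
    return np.sqrt(r2)

# Hypothetical illustrative samples (x, y independent; z dependent).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
z = np.array([1.5, 2.0, 3.5, 3.0, 5.5, 6.5])

r_xz = np.corrcoef(x, z)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]
r_xy = np.corrcoef(x, y)[0, 1]
R = multiple_correlation(r_xz, r_yz, r_xy)

# Cross-check: regress z on x and y (with intercept) and take sqrt(R^2).
X = np.column_stack([np.ones_like(x), x, y])
coef, *_ = np.linalg.lstsq(X, z, rcond=None)
z_pred = X @ coef
R_reg = np.sqrt(1 - np.sum((z - z_pred) ** 2) / np.sum((z - z.mean()) ** 2))

print(R, R_reg)  # the two values should agree up to rounding error
```

The agreement between the two values reflects the fact that the definition yields the same result as the square root of the coefficient of determination from regression.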

Coefficient of Determination

We also define the multiple coefficient of determination to be the square of the multiple correlation coefficient.

Often the subscripts are dropped and the multiple correlation coefficient and multiple coefficient of determination are written simply as R and R² respectively. These definitions may also be expanded to more than two independent variables. With just one independent variable the multiple correlation coefficient is simply |r|, the absolute value of the ordinary correlation coefficient.

Unfortunately, R is a biased estimate of the population multiple correlation coefficient, and the bias is most evident for small samples. A less biased version is given by the adjusted R.

Definition 2: If R is $R_{z,xy}$ as defined above (or similarly for more variables) then the adjusted multiple coefficient of determination is

$$R_{adj}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

where k = the number of independent variables and n = the number of data elements in the sample for z (which should be the same as the samples for x and y).
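A minimal sketch of the adjustment, assuming the usual form $R_{adj}^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)$. The function name and the input values are hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical inputs: R^2 = 0.5 from n = 11 observations and k = 2 predictors.
adj = adjusted_r_squared(0.5, n=11, k=2)
print(adj)  # 1 - 0.5 * 10 / 8 = 0.375
```

The adjusted value is never larger than the unadjusted one, and the gap shrinks as n grows relative to k.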

Data Analysis Tools

Excel Data Analysis Tools: In addition to the various correlation functions described elsewhere, Excel provides the Covariance and Correlation data analysis tools. The Covariance tool calculates the pairwise population covariances for all the variables in the data set. Similarly, the Correlation tool calculates the various correlation coefficients as described in the following example.
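For readers who want to reproduce the same kind of pairwise tables outside Excel, here is a hedged sketch using NumPy (an assumption of this example; the article itself works in Excel). The data array is hypothetical, with one row per observation and one column per variable. Passing `ddof=0` requests population covariances, matching the description of the Covariance tool above.

```python
import numpy as np

# Hypothetical data: 5 observations of 3 variables (one variable per column).
data = np.array([
    [10.0, 2.1, 30.0],
    [12.0, 2.9, 28.0],
    [ 9.0, 1.8, 35.0],
    [15.0, 3.5, 25.0],
    [11.0, 2.4, 31.0],
])

# Pairwise population covariances (divide by n rather than n - 1).
cov_pop = np.cov(data, rowvar=False, ddof=0)

# Pairwise Pearson correlation coefficients.
corr = np.corrcoef(data, rowvar=False)

print(cov_pop)
print(corr)
```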

Example 1: We expand the data in Example 2 of Correlation Testing via the t Test to include a number of other statistics. The data for the first few states are displayed in Figure 1.

Figure 1 – Data for Example 1

Using Excel’s Correlation data analysis tool we can compute the pairwise correlation coefficients for the various variables in the table in Figure 1. The results are shown in Figure 2.

Figure 2 – Correlation coefficients for data in Example 1

We can also single out the first three variables, poverty, infant mortality, and white (i.e. the percentage of the population that is white) and calculate the multiple correlation coefficients, assuming poverty is the dependent variable, as defined in Definitions 1 and 2. We use the data in Figure 2 to obtain the values $r_{PW}$, $r_{PI}$ and $r_{WI}$.

Partial and Semi-Partial Correlation

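Definition 3: Given x, y, and z as in Definition 1, the partial correlation of x and z holding y constant is defined as follows:

$$r_{xz,y} = \frac{r_{xz} - r_{xy} r_{yz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{yz}^2)}}$$

In the semi-partial correlation, the correlation between x and y is eliminated, but not the correlation between x and z and y and z:

$$r_{z(x,y)} = \frac{r_{xz} - r_{xy} r_{yz}}{\sqrt{1 - r_{xy}^2}}$$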
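The two quantities can be sketched in a few lines, assuming the standard formulas: the partial correlation divides $r_{xz} - r_{xy} r_{yz}$ by $\sqrt{(1 - r_{xy}^2)(1 - r_{yz}^2)}$, while the semi-partial correlation divides it by $\sqrt{1 - r_{xy}^2}$ only. The function names and the three input correlations are hypothetical.

```python
import math

def partial_correlation(r_xz, r_xy, r_yz):
    """Partial correlation of x and z holding y constant."""
    return (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy**2) * (1 - r_yz**2))

def semipartial_correlation(r_xz, r_xy, r_yz):
    """Semi-partial correlation: y is removed from x but not from z."""
    return (r_xz - r_xy * r_yz) / math.sqrt(1 - r_xy**2)

# Hypothetical pairwise correlations.
r_xz, r_xy, r_yz = 0.6, 0.5, 0.4

partial = partial_correlation(r_xz, r_xy, r_yz)
semipartial = semipartial_correlation(r_xz, r_xy, r_yz)
print(partial, semipartial)
```

Because the partial correlation has the extra factor $\sqrt{1 - r_{yz}^2} \le 1$ in its denominator, its absolute value is never smaller than that of the semi-partial correlation.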

Causation

Suppose we look at the relationship between GPA (grade point average) and Salary 5 years after graduation and discover there is a high correlation between these two variables. As has been mentioned elsewhere, this is not to say that doing well in school causes a person to get a higher salary. In fact, it is entirely possible that there is a third variable, say IQ, that correlates well with both GPA and Salary (although this would not necessarily imply that IQ is the cause of the higher GPA and higher salary).

In this case, it is possible that the correlation between GPA and Salary is a consequence of the correlation between IQ and GPA and between IQ and Salary. To test this we need to determine the correlation between GPA and Salary eliminating the influence of IQ from both variables, i.e. the partial correlation $r_{GS,I}$.
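The scenario can be illustrated with a small simulation. This is only a sketch on synthetic data: the variables are generated so that a hypothetical IQ score drives both GPA and Salary and there is no direct link between the two. The coefficients, sample size, and seed are arbitrary assumptions, and the partial correlation uses the standard three-variable formula.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 20_000

# Synthetic data: IQ influences both GPA and Salary; no direct GPA -> Salary link.
iq = rng.normal(size=n)
gpa = 0.7 * iq + rng.normal(size=n)
salary = 0.7 * iq + rng.normal(size=n)

r = np.corrcoef([gpa, salary, iq])
r_gs, r_gi, r_si = r[0, 1], r[0, 2], r[1, 2]

# Partial correlation of GPA and Salary holding IQ constant.
r_gs_i = (r_gs - r_gi * r_si) / np.sqrt((1 - r_gi**2) * (1 - r_si**2))

print(r_gs)    # clearly positive raw correlation
print(r_gs_i)  # close to zero once IQ is held constant
```

In this constructed case the raw correlation is sizeable while the partial correlation is near zero, which is the pattern one would expect when a third variable accounts for the association.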

Property

Property 1:

Proof: The first assertion follows since

The second assertion follows since:
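The algebra can be worked through as follows. This is a sketch that assumes Property 1 consists of the two identities $r_{z(x,y)}^2 = R_{z,xy}^2 - r_{yz}^2$ and $r_{xz,y}^2 = (R_{z,xy}^2 - r_{yz}^2)/(1 - r_{yz}^2)$, and that the multiple, partial, and semi-partial correlations take their standard three-variable forms; both identities do hold under those forms.

```latex
% Semi-partial identity: subtract r_{yz}^2 from the multiple R^2.
R_{z,xy}^2 - r_{yz}^2
  = \frac{r_{xz}^2 + r_{yz}^2 - 2 r_{xz} r_{yz} r_{xy} - r_{yz}^2 (1 - r_{xy}^2)}{1 - r_{xy}^2}
  = \frac{(r_{xz} - r_{xy} r_{yz})^2}{1 - r_{xy}^2}
  = r_{z(x,y)}^2

% Partial identity: divide the same quantity by 1 - r_{yz}^2.
\frac{R_{z,xy}^2 - r_{yz}^2}{1 - r_{yz}^2}
  = \frac{(r_{xz} - r_{xy} r_{yz})^2}{(1 - r_{xy}^2)(1 - r_{yz}^2)}
  = r_{xz,y}^2
```

The key step in the first line is that the numerator collapses to a perfect square, $r_{xz}^2 - 2 r_{xz} r_{yz} r_{xy} + r_{xy}^2 r_{yz}^2$.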

Example 2: Calculate $r_{PW,I}$ and $r_{P(W,I)}$ for the data in Example 1.

We can see that Property 1 holds for this data since

Partitioning Variance

Since the coefficient of determination is a measure of the portion of variance attributable to the variables involved, we can look at the meaning of the concepts defined above using the following Venn diagram, where the rectangle represents the total variance of the poverty variable.

Figure 3 – Breakdown of variance for poverty

Using the data from Example 1, we can calculate the breakdown of the variance for poverty in Figure 4:

Figure 4 – Breakdown of variance for poverty continued

Note that we can calculate B in a number of ways: (A + B) – A, (B + C) – C, (A + B + C) – (A + C), etc., and get the same answer in each case. Also note that

where D = 1 – (A + B + C).

Figure 5 – Breakdown of variance for poverty continued

Property 2: From Property 1, it follows that:

If the independent variables are mutually independent, this reduces to
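A quick numerical check, assuming Property 2 is the decomposition $R_{z,xy}^2 = r_{yz}^2 + r_{z(x,y)}^2$, which reduces to $R_{z,xy}^2 = r_{xz}^2 + r_{yz}^2$ when the independent variables are uncorrelated ($r_{xy} = 0$). The helper names and the correlation values are hypothetical.

```python
def r_squared(r_xz, r_yz, r_xy):
    """Multiple coefficient of determination from pairwise correlations."""
    return (r_xz**2 + r_yz**2 - 2 * r_xz * r_yz * r_xy) / (1 - r_xy**2)

def semipartial_squared(r_xz, r_xy, r_yz):
    """Squared semi-partial correlation of z with x, removing y from x."""
    return (r_xz - r_xy * r_yz) ** 2 / (1 - r_xy**2)

# Correlated predictors (hypothetical values).
r_xz, r_yz, r_xy = 0.6, 0.4, 0.5
lhs = r_squared(r_xz, r_yz, r_xy)
rhs = r_yz**2 + semipartial_squared(r_xz, r_xy, r_yz)

# Uncorrelated predictors: R^2 is the sum of the squared pairwise correlations.
lhs_uncorrelated = r_squared(r_xz, r_yz, 0.0)
rhs_uncorrelated = r_xz**2 + r_yz**2

print(lhs, rhs)
print(lhs_uncorrelated, rhs_uncorrelated)
```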

Worksheet Functions

Real Statistics Functions: The Real Statistics Resource Pack contains the following functions where the samples for z, x, and y are contained in the arrays or ranges R, R1, and R2 respectively.

CORREL_ADJ(R1, R2) = adjusted correlation coefficient for the data sets defined by ranges R1 and R2

MCORREL(R, R1, R2) = multiple correlation of dependent variable z with x and y

PART_CORREL(R, R1, R2) = partial correlation r_zx,y of variables z and x holding y constant

SEMIPART_CORREL(R, R1, R2) = semi-partial correlation r_z(x,y)

Multiple Correlation for more than 3 variables

Definition 1 defines the multiple correlation coefficient $R_{z,xy}$ and the corresponding multiple coefficient of determination for three variables x, y, and z. We can extend these definitions to more than three variables as described in Advanced Multiple Correlation.

E.g. if R1 is an m × n array containing the data for n variables then the Real Statistics function RSquare(R1, k) calculates the multiple coefficient of determination for the kth variable with respect to the other variables in R1. The multiple correlation coefficient for the kth variable with respect to the other variables in R1 can then be calculated by the formula =SQRT(RSquare(R1, k)).

Thus if R1, R2, and R3 are the three columns of the m × 3 data array or range R, with R1 and R2 containing the samples for the independent variables x and y and R3 containing the sample data for dependent variable z, then =MCORREL(R3, R1, R2) yields the same result as =SQRT(RSquare(R, 3)).
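The idea behind this calculation can be sketched in Python. The code below is not the Real Statistics implementation: `r_square` and `r_square_from_corr` are hypothetical helpers that mimic the described behaviour, one by least-squares regression and one through the correlation matrix via the standard identity $R^2 = c^T R_{xx}^{-1} c$, where $c$ holds the correlations of the kth variable with the others and $R_{xx}$ is the correlation matrix of the other variables. The data array is hypothetical.

```python
import numpy as np

def r_square(data, k):
    """R^2 of column k (1-based) regressed on the remaining columns."""
    data = np.asarray(data, dtype=float)
    target = data[:, k - 1]
    others = np.delete(data, k - 1, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)

def r_square_from_corr(data, k):
    """Same quantity computed from the correlation matrix alone."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    idx = [i for i in range(corr.shape[0]) if i != k - 1]
    c = corr[idx, k - 1]
    return c @ np.linalg.solve(corr[np.ix_(idx, idx)], c)

# Hypothetical 6 x 3 array: columns are x, y (independent) and z (dependent).
data = np.array([
    [1.0, 2.0, 1.5],
    [2.0, 1.0, 2.0],
    [3.0, 4.0, 3.5],
    [4.0, 3.0, 3.0],
    [5.0, 6.0, 5.5],
    [6.0, 5.0, 6.5],
])

r2_fit = r_square(data, 3)
r2_corr = r_square_from_corr(data, 3)
multiple_r = np.sqrt(r2_fit)
print(r2_fit, r2_corr, multiple_r)
```

The correlation-matrix form works for any number of independent variables, which is why the general case is usually handled with matrices rather than an expanded scalar formula.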

Similarly, the definition of the partial correlation coefficient (Definition 3) can be extended to more than three variables as described in Advanced Multiple Correlation.



315 thoughts on “Multiple Correlation”

Mfon umoren

October 30, 2022 at 7:51 pm

Data of 2 related variables of 20 data points
- Charles
  
  October 30, 2022 at 9:35 pm
  
  Sorry, but I don’t understand your comment.
  Charles
kirito

May 20, 2022 at 2:52 am

How to solve 1 dependent variable and 3 independent variable
- Charles
  
  May 20, 2022 at 8:18 am
  
  See https://www.real-statistics.com/multiple-regression/multiple-correlation-advanced/
  Charles
Matías

April 27, 2022 at 10:19 pm

Hi Charles!

Thanks for the great article, I wanted to ask how would you measure if the multiple correlation coefficient calculated is significant or not. For example, when working with two variables, it’s enough to get the p-value so we can know if the computed correlation is statistically significant, but since in this formula we’re already using the pairwise correlations between our variables, should we first make sure that all of those are significant (i.e. p-value < 0.05)?

Thanks!
- Charles
  
  April 28, 2022 at 11:27 am
  
  Hello Matias,
  This coefficient is calculated when using ordinary linear regression. The regression data analysis tool reports whether the value is significant or not (i.e. significantly different from zero).
  Charles
B M R

April 26, 2022 at 9:40 am

Say there’s a collection of 1000 products in a particular category. And there are six different methodologies for ranking the desirability of each product, where each methodology produces ranks 1-1000, with room for ties within. Usually there is some general agreement amongst a majority of the methodologies, while one or two of them might widely disagree. My question is: How can I determine which ranking methodology is “most in agreement” with “most of the other” methodologies, for that particular product category? Which statistical approach would be appropriate for making such a determination? Thank you.
- Charles
  
  April 26, 2022 at 4:32 pm
  
  Perhaps the approach described at the following webpage is appropriate
  https://www.real-statistics.com/reliability/interrater-reliability/kendalls-w/
  Charles
  - B M R
    
    April 27, 2022 at 1:45 am
    
    Thank you. How would you go about isolating the winning methodology (“agrees the most with most of the others”)? Would you do multiple pair-wise runs and aggregate the W scores? Or is that too simplistic? (Or use the derived r correlation metric instead of W?)
    
    Also is there a typo on that page for when W=0? “If all the Ri are the same (i.e. the raters are in complete agreement), then as we have seen, W = 0. In fact, it is always the case that 0 ≤ W ≤ 1. If W = 0 then there is no agreement among the raters.” (Observations section)
    - Charles
      
      April 27, 2022 at 8:16 am
      
      1. This approach doesn’t measure which methodology is best, only whether there is agreement. One approach that might be appropriate is to leave out one of the methodologies, one at a time, and compare which methodology results in the smaller W. You could also use a different approach entirely; e.g. ANOVA.
      2. Yes, you are correct. There is a typo. It should read W = 1 instead of W = 0. I have now corrected this on the webpage. Thank you very much for catching this error.
      Charles
E

April 12, 2022 at 1:35 am

real stats seems to have stopped working, i’ve tried to uninstall, even uninstalled and reinstalled office. not sure how to fix it at this point without uninstalling windows
- Charles
  
  April 12, 2022 at 12:15 pm
  
  Hello Everett,
  I don’t have enough information to understand why it stopped working. That is unusual.
  In any case, since you have uninstalled it, I suggest that you download a new copy of the Real Statistics software from the website. Then rename the file to ZRealStats.xlam and install it as described on the website.
  Charles
  - E
    
    April 12, 2022 at 4:42 pm
    
    ok. I had actually reinstalled everything. But what I discovered in trying to follow this instruction is that it’s only some components of the addin that are not working. In particular MCORREL is not working but some other functions as well. I can’t open the VBA code so it’s hard to see exactly where the problem is occurring.
    - E
      
      April 12, 2022 at 5:00 pm
      
      yeah just was able to point at the different file and get RSquare function to work, but point at the new xlam doesn’t
    - Charles
      
      April 13, 2022 at 8:22 pm
      
      Can you email me an Excel spreadsheet with the functions that you have found not to work, including the data and any error messages?
      Charles
Ben Olken

December 11, 2021 at 7:45 am

Hi Mr. Charles Zaiontz,

I am working on a study correlating two independent variables and one dependent variable. I referenced your theory about the triple correlation coefficient, but I am supposed to show proof of how we got to that formula. Would you be able to offer some clarifications on how to get to that formula from individual correlation coefficients? Thank you
- Charles
  
  December 11, 2021 at 9:36 am
  
  Hi Ben,
  If you are referring to Definition 1 of multiple correlation, it is a definition, and not a theorem or property, and so there is no proof.
  There are properties about multiple correlation that will require a proof, but you need to state which property you have in mind.
  It turns out that this definition yields the same results as the square root of the coefficient of determination from regression. (This is also the motivation behind Definition 1.) This assertion can be proved. In fact, if you follow the steps in the section on multiple regression, you will be able to create the proof.
  Charles
Max

December 8, 2021 at 4:12 pm

Hi Mr. Charles,

How would I go about using the formula for the multiple correlation coefficient when I am trying to determine the correlation between 2 dependent variables and one independent since the formula presented above states its functionality only for 2 independent and 1 dependent.
- Charles
  
  December 8, 2021 at 9:43 pm
  
  Hi Max,
  See https://www.researchgate.net/post/How_can_I_measure_the_relationship_between_one_independent_variable_and_two_or_more_dependent_variables2
  Charles
Fien De Winter

October 27, 2021 at 5:22 pm

Dear Charles,

I have this dataset based on measurements from the same individuals in three different specimens: serum (blood), tissue, and lung lavage. I wanted to check if serum could be a good surrogate marker for the immune response going on in the lungs and therefore performed the MCORREL on serum as z and the other two as x and y (independent variables). But I’m not entirely confident that this is allowed. It seems like the measurements have to be completely independent to be able to do this.
Can you advise on this?

Kind regards,
Fien
- Charles
  
  October 27, 2021 at 6:00 pm
  
  For each subject, the three measurements (serum tissue, lung lavage, immune response) don’t have to be independent. The subjects have to be independent of each other.
  Charles
eldin lumanog

October 24, 2021 at 5:44 am

I wanted to know if there is a relationship between the variables Product Quality, Competitive Price, Accessibility, and Advertising Capability towards Customer Satisfaction what appropriate statistical tool should be used?
- Charles
  
  October 24, 2021 at 12:54 pm
  
  It depends on what you mean by relationship, but you could look into using multiple correlation. This is equivalent to performing regression with Customer Satisfaction as the dependent variable and the others as independent variables. The R-square (or R) value gives an indication of the size of the relationship (R = the multiple correlation coefficient). The regression coefficients provide an indication of the significance of each independent variable.
  Charles
Baba Jauro

August 10, 2021 at 8:29 am

I intend to determine the correlation of instructional resource availability, adequacy and students performance. Pls which of the correlation should I use, and how would the result look like?
- Charles
  
  August 10, 2021 at 9:49 pm
  
  I can’t answer your question without more specific information. Are you trying to find the correlation between two variables or more than two variables? If more than two variables, are you looking for (1) the correlation between multiple independent variables and one dependent variable or (2) pairwise correlations between each pair of variables?
  Charles
rio

July 28, 2021 at 11:23 am

I am doing a study to find the relationship between students’ academic performance, aptitude, instruction, and environment. Which statistic can I use for analysis?
- Charles
  
  July 28, 2021 at 6:25 pm
  
  Hello,
  There are many possible statistics that might be appropriate for such an analysis. Which statistic to use depends on the nature of the data and what you mean by “relationship”. One approach is to use multiple correlation or equivalently regression.
  Charles
Jessica

July 13, 2021 at 11:42 am

hi Charles,
I am doing a study to find the relationship amongst mindset, perceived stress and self-esteem. Which statistic can I use for analysis?
- Charles
  
  July 13, 2021 at 3:45 pm
  
  This depends on the details and your objective in the analysis. Possible analysis tools include: regression, ANOVA, correlation.
  Charles
  - Jessica
    
    July 14, 2021 at 11:52 am
    
    well I am trying to find correlation amongst these three. Which correlation can I use for three variables?
Ralph

June 20, 2021 at 10:50 pm

Hello, I hope you may enlighten me on this one.
We are aiming to study the correlation of sleepwear color to the sleeping pattern, what should we use to compute and analyze the data we will obtain?
- Charles
  
  June 21, 2021 at 10:13 pm
  
  Hello Ralph,
  If sleepwear color and sleeping pattern take numeric values, then presumably you could calculate Pearson’s correlation coefficient, but the appropriate correlation depends on the nature of your data.
  How are you measuring the sleepwear color? How are you measuring the sleeping pattern?
  Charles
S.Jeyarajan

June 1, 2021 at 8:40 pm

Hi,

I wanted to know the correlation between number of cases selected for PCR test, cases tested for positive and cases who got out of COVID 19 in a day where independent variable is cases who got out of COVID 19 in a day. Could you please give me the detail.
- Charles
  
  June 2, 2021 at 11:50 am
  
  Let x = # of cases selected for PCR test
  Let y = # of cases that test positive (presumably via the PCR test)
  Let z = # of cases who got out of covid-19
  I presume that you mean that z is the dependent variable. Thus you calculate the multiple correlation coefficient R_z.xy as described on this webpage
  Charles
Prakhar

May 27, 2021 at 9:18 pm

Hey Charles,

I have a dataset containing household income, no. of store purchases and no. of web purchases made.
I want to find the correlation between income and choice of sales channel (store v/s web).
What is the best way to approach this?
- Charles
  
  May 28, 2021 at 8:43 am
  
  I am not clear what you are looking for. What is the dependent variable and the independent variables?
  Charles
Aishwarya Deorane

April 14, 2021 at 5:35 am

hi,
I want to study the relationship between gender, stress levels and frequency of consumption of food groups. What is the best statistical method that can be used?
- Charles
  
  April 14, 2021 at 8:19 am
  
  It depends on the details, but regression, ANOVA, correlation are typical approaches.
  Charles
Valentina

January 17, 2021 at 7:08 pm

Dear Charles,

trying to calculate Pearson’s coefficient with three variables, I came across your site. I would like to ask a question: in passing from the Pearson’s two-variables formula to the three-variables formula which mathematical steps have been made?
If the answer is too long, could you send me an email?

Thanks for your work and for your time.
Valentina.
- Charles
  
  January 17, 2021 at 8:00 pm
  
  Valentina,
  All steps are described on this webpage. Do you have questions about the formula?
  Charles
  - Valentina
    
    January 17, 2021 at 11:26 pm
    
    Really sorry Charles but I don’t understand the exact point on this webpage where from the two-variable formula you get the three-variable formula (the formula that appears first on this page I mean). Can you tell me?
    - Charles
      
      January 18, 2021 at 10:02 pm
      
      Valentina,
      I don’t explain further where the formula comes from. I use it as the definition of the multiple correlation coefficient.
      It is the same correlation that is reported from the multiple linear regression analysis.
      Charles
Chris Moirano

November 24, 2020 at 9:10 am

Hi Charles,

After calculating the pearson r correlation between three variables, how can I calculate a p-value? I am trying to test for high correlation between the variables (r > 0.75 and p-value < 0.05). So H0: r <= 0.75 and H1: r > 0.75.

Most calculations of p-values (using t-test or Fisher z-transformation) seem intended for 2 variables and are based on H0: r = 0. Does the method change for 3 variables or for H0: r <= 0.75?

Thanks for your great site! It has been very helpful.
- Charles
  
  November 26, 2020 at 9:58 pm
  
  Chris,
  First, note that the correlation between y and x1, x2 is equal to the correlation between y and y-pred. Here, y-pred is the forecasted value of y based on the regression of y on x1 and x2. This means that you can use the techniques for two variables (y and y-pred) to analyze three or more variables.
  Charles
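The equivalence mentioned in this reply can be checked numerically. The sketch below uses simulated data (an assumption; the seed, sample size, and coefficients are arbitrary) and fits y on x1 and x2 by least squares with NumPy, then compares the squared correlation between y and y-pred with the regression R².

```python
import numpy as np

rng = np.random.default_rng(1)  # hypothetical simulated data
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 - 0.5 * x2 + rng.normal(size=n)

# Least-squares fit of y on x1 and x2 (with intercept), then predictions.
design = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
y_pred = design @ coef

# Ordinary two-variable correlation between y and its predicted values.
r_y_ypred = np.corrcoef(y, y_pred)[0, 1]

# Multiple coefficient of determination from the regression.
r_squared = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

print(r_y_ypred**2, r_squared)  # should agree up to rounding error
```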
Susana

July 23, 2020 at 7:04 pm

Hello Carlos, I am trying to prove that the partial correlation coefficient between X and Y holding Z fixed, rXY.Z, is equal to the linear correlation coefficient between the residuals ei and vi, where ei are the residuals of the simple linear regression of Y on Z and vi are the residuals of the simple linear regression of X on Z. I was able to verify this numerically, but I would like to work out the theoretical proof. If you could give me a hint on how to approach the proof, I would be very grateful.
Your contribution to statistics is sublime.
Thank you very much
- Charles
  
  July 23, 2020 at 7:36 pm
  
  Hello Susana,
  Were you able to demonstrate that this was true for a simple example with say 10 values for (x, y, z)?
  Charles
  - Susana Pasciullo
    
    July 24, 2020 at 2:29 pm
    
    Yes, I demonstrated that this is true for a few values X, Y, Z
    - Charles
      
      August 6, 2020 at 8:43 pm
      
      Hello Susana,
      Since you were able to demonstrate this for some values of X, Y and Z, the assertion is likely to be true, but I haven’t had the time to look into this further.
      Charles
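The numerical check described in this thread can be sketched as follows, using simulated data (an assumption; the seed and coefficients are arbitrary). The helper `residuals` is hypothetical and uses `np.polyfit` for the two simple regressions on Z; the formula value uses the standard three-variable partial correlation.

```python
import numpy as np

rng = np.random.default_rng(2)  # hypothetical simulated data
n = 40
z = rng.normal(size=n)
x = 0.6 * z + rng.normal(size=n)
y = 0.4 * z + 0.5 * x + rng.normal(size=n)

def residuals(target, predictor):
    """Residuals of the simple linear regression of target on predictor."""
    slope, intercept = np.polyfit(predictor, target, 1)
    return target - (slope * predictor + intercept)

e = residuals(y, z)  # residuals of Y on Z
v = residuals(x, z)  # residuals of X on Z
r_resid = np.corrcoef(e, v)[0, 1]

r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]
r_formula = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

print(r_resid, r_formula)  # should agree up to rounding error
```

As a hint for the theoretical argument: writing the residuals as e = Y − bZ and v = X − cZ (centered variables, with b and c the regression slopes), one finds cov(e, v) = s_x s_y (r_xy − r_xz r_yz), var(e) = s_y²(1 − r_yz²), and var(v) = s_x²(1 − r_xz²), and the ratio gives the partial correlation formula.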
INTISAR ALBANDAR

June 20, 2020 at 5:40 am

Thank you so much, I just wanted to ask about R2: if I got R2 = 0.42, is this accepted?
- Charles
  
  June 20, 2020 at 10:13 am
  
  I am not sure what you are referring to, but R-square = .331 based on the calculations right after Figure 2.
  Charles
Zara

June 15, 2020 at 6:07 pm

Hello Charles,

Thank you for the sharing the useful information on this website.
I have a question regarding how to calculate the fourth order correlation coefficients among four variables. If we have the variable x,y,z,and w.
Also for the case we have vectors or matrices.
I would appreciate if you can help me.
- Charles
  
  June 16, 2020 at 11:19 am
  
  Hello Zara,
  See Advanced Multiple Correlation
  Charles
  - Zara
    
    June 17, 2020 at 4:05 pm
    
    Thank you so much.
HERLAN

April 16, 2020 at 6:57 am

Is it logic, if the results of manually counting the multiple correlation coefficients, are slightly different from the results of SPSS calculations?
- Charles
  
  April 18, 2020 at 11:02 am
  
  Hello Herlan,
  I can’t think of any reason why the results for multiple correlation would be different between Real Statistics and SPSS.
  Can you send me an example where there is a difference?
  Charles
Kavya

April 13, 2020 at 5:34 pm

Hi,
For my research work , i am comparing one variable with 6 or 7 variables as you shown in Fig.2 how i could compile my result into one. How i could write my result. My results are in 0.8 or 0.9
- Charles
  
  April 13, 2020 at 6:00 pm
  
  Sorry, but I don’t understand your question.
  Charles
Emil Kristensen

April 4, 2020 at 6:45 pm

This question isn’t specifically related to this. Here’s my question:

How can I find out the independent correlation of a confounding variable with Y:

I know that the correlation between 1 variable (X1) with Y is = 0.11, and that there’s a confounding variable (let’s call this variable X2), which has a correlation of 0.46 with X1. The only reason why X1 correlates with Y is because of the confounding influence with X2. How can I figure out what the independent correlation X2 (the confounding variable) has with Y? What is the correlation between X2 and Y then? There must be a way to know how to calculate this.
- Charles
  
  April 7, 2020 at 10:50 am
  
  Hello Emil,
  
  I don’t believe this is possible. E.g. here is a counter-example where knowing the correl(X1,Y) and correl(X1,X2) values doesn’t mean that you know correl(X2,Y).
  
  Data sets A
  X1 = 1, 3, 3.55542145030932
  X2 = 3, 2, 3.50170050934321
  Y = 1, 3, 4
  
  Data sets B
  X1 = 0.699565395107396, 2.7, 0
  X2 = 3, 1.5, 4.5
  Y = 2.5, 5.5, 3
  
  You will see that although correl(X1,Y) is the same for both sets and correl(X1,X2) is the same for both sets, correl(X2,Y) is not the same for both sets.
  
  Charles
Martin

December 4, 2019 at 3:01 pm

Hello Charles,

I read both this article and the Advanced Multiple Correlation article, however I still cannot fully understand how to adapt the Definition 1 (this article) formula in order to calculate the multiple correlation coefficient for 3 or more independent variables and 1 dependent variable.

Can you provide me with some explanation on how the formula would look for 3 or more independent variables?

Kind regards,

Martin
- Charles
  
  December 4, 2019 at 7:04 pm
  
  Martin,
  I don’t know how you would expand the formula. The calculation with multiple independent variables generally uses matrices and so you could break down the matrix calculations to create a formula. I haven’t needed to do this since the approach described in the Advanced Multiple Correlation article works fine.
  Charles