In Multiple Correlation we explore correlations with three random variables. We now extend some of these results to more than three variables.

**Correlation Coefficient and Coefficient of Determination**

Here we summarize some of the results from Multiple Regression Analysis about the correlation coefficient and coefficient of determination for any number of variables. In particular, the coefficient of determination can be calculated by

The Real Statistics Resource Pack provides the array function **RSquare**(R1, R2) which computes this value where R1 is an *n* × *k* array containing the *X* sample data and R2 is an *n* × 1 array containing the *Y* sample data.

Thus the correlation coefficient can be calculated by the formula =SQRT(RSquare(R1, R2)).

Alternatively, the correlation coefficient and coefficient of determination can be calculated using either Excel’s **Regression** data analysis tool or the Real Statistics **Linear Regression** data analysis tool.

**Observation**: As mentioned in Multiple Regression Analysis, there is also a second form of the RSquare function in which **RSquare**(R1, *j*) = *R*^{2} where the *X* data consist of all the columns in R1 except the *j*th column and the *Y* data consist of the *j*th column of R1.

Thus if R1 is an *n* × *k* array consisting of all the *X* data followed by a column of *Y* data, then *R*^{2} = RSquare(R1, *n*).

In particular, if R1, R2 and R3 are the three columns of the *n* × 3 data range R, with R1 and R2 containing the samples for the independent variables *x*_{1} and *x*_{2} and R3 containing the sample data for dependent variable y, then =MCORREL(R3, R1, R2) yields the same result as =SQRT(RSquare(R, 3)).

**Observation**: It turns out that *R*^{2} = RSquare(R1, *j*) can also be calculated by first finding the correlation matrix for R1, then taking its inverse and then selecting the *j*th element on the diagonal. If this has the value *c* then the desired value of *R*^{2} is .

Thus *R*^{2} can also be calculated by the formula:

=1–1/INDEX(DIAG(MINVERSE(CORR(R1))), *j*)

**Partial Correlation Coefficient**

**Definition 1**: Suppose that we have random variables *x*_{1}*, …, x _{k}* and for each

*x*we have a sample of size

_{j}*n*. Now suppose that Z consists of all the random variables

*x*

_{1}

*,…, x*excluding

_{k}*x*and

_{i}*x*where

_{j}*i ≠ j*. Then the

**partial correlation coefficient**between variables

*x*and

_{i}*x*is the correlation coefficient between

_{j}*x*and

_{i}*x*controlling for all the other variables (i.e. keeping all the variables in Z constant).

_{j}**Observation**: This is consistent with Definition 3 of Multiple Correlation where there are only three variables (i.e. *k* = 3). It turns out that the partial correlation coefficient can be calculated as described in the following property.

**Property 1**: Suppose that *x*_{1}*, …, x _{k}* and

*Z*are as in Definition 1 and let

*X*be the

*n*×

*k*matrix

*X*= [

*x*]. The partial correlation coefficient between variables

_{ij}*x*and

_{i}*x*where

_{j}*i ≠ j*controlling for all the other variables is given by the formula

where the inverse of the correlation matrix *R* of X is *R*^{-1} = [*p _{ij}*].

**Example 1**: Calculate the partial correlation coefficient between Crime and Doctor controlling for Traffic Deaths and University based on the data in Figure 1 (which is a subset of the data for Example 1 of Multiple Correlation).

**Figure 1 – US State Data**

We now calculate the correlation matrix and inverse correlation for the data in Figure 1.

**Figure 2 – Inverse of Correlation Matrix**

The correlation matrix (range H4:K7) can be calculated as described in Multiple Regression Least Squares. Thus the inverse of the correlation matrix (range H11:K14) can be calculated via the formula =MINVERSE(CORR(B4:E18)).

Based on Property 1, the partial correlation coefficient between Crime and Doctor controlling for Traffic Deaths and University can be calculated by =–I11/SQRT(H11*I12), yielding a value of 0.0919913.

In general, the partial correlation coefficient between the *i*th and *j*th variables is given by the formula

=INDEX(-H11:K14/MMULT(M11:M14,TRANSPOSE(M11:M14)), *i*,* j*)

Or alternatively the following formula which is less resource intensive to calculate:

=INDEX(-H11:K14/(M11:M14*TRANSPOSE(M11:M14)), *i*,* j*)

Here the range M11:M14 represents the square root of the elements on the diagonal of the inverse of the correlation matrix. This can be calculated using the array formula

=SQRT(INDEX(H11:K14,G11:G14,G11:G14))

Or alternatively the array formula =SQRT(DIAG(H11:K14).

**Observation**: As described above, we can calculate *R*^{2}* _{C,DTU}* using any of the following formulas to get the value .103292:

=RSquare(C4:E18,B4:B18)

=Rsquare(B4:E18,1)

=1-1/H11

**Definition 2**: Suppose that *x*_{1}, …, *x _{k}* and

*X*are as described in Definition 1 and Property 1. Then the

**partial correlation matrix**of

*X*is the

*k*×

*k*matrix

*S*= [

*s*] where for all

_{ij}*i*≠

*j*

**Example 2**: Calculate the partial correlation matrix for the data in Figure 1.

The result is shown in Figure 3.

**Figure 3 – Partial Correlation Matrix**

The partial correlation matrix in range H19:K22 is calculated using the array formula

=-H11:K14/(M11:M14*TRANSPOSE(M11:M14))

**Observation**: Note that Definition 2 does not define the values on the diagonal of the partial correlation matrix (i.e. where *i* = *j*). The formula above results in a matrix whose main diagonal consists of minus ones. If we prefer to have the main diagonal consist of all ones, we can use the following modified formula:

=-H11:K14/(M11:M14*TRANSPOSE(M11:M14))+2*IDENTITY()

**Real Statistics Function**: The Real Statistics Resource Pack provides the following functions where R1 is an *n* × *k* matrix representing the samples of size *n* for the random variables .

**PCORREL**(R1, *i*, *j*) = the partial correlation coefficient of *x _{i}*

_{ }with

*x*based on the data in R1

_{j}**PCORR**(R1) is an array function which outputs the partial correlation matrix for the data in R1

**Observation**: For the data in Figure 1, PCORREL(B4:E18, 1, 2) = 0.0919913 and PCORR(B4:E18) is the range H19:K22 of Figure 3 (except that the main diagonal consists only of ones).

When we use this application, we can calculate the partial correlation matrix. But I don’t know how to calculate the anti-image Covariance and Anti-image Correlation. I see it in Image (Kaiser, 1963) (factor analysis algorithms)

“Factor Analysis of a Correlation Matrix

Eigenvalues and eigenvectors of S^(−1)RS^(−1) are found.

S^(2)= diag (1/r11,…,1/rnn) /” S^(2) and (S^(-1)RS^(-1)) –> What does it mean?/”

rii=i th diagonal element of R−1 /” Note: ^ –> power function in excel”/

The factor pattern matrix is

Fm=SΩm(Λm−Im)Λ−1/2m

where Ωm and Λm correspond to the m eigenvalues greater than 1.

If m=0, the procedure is terminated.

The communalities are

hi=mΣj=1(γj−1)2ω2ij/(γjrii)

The image covariance matrix is

R+S^(2)R^(−1)S^(2)−2*S^(2) /” it like cov() in your package,is it?”/

The anti-image covariance matrix is

S^(2)R^(−1)S^(2)

~~/” This ‘s my problem,which I met. I don’t know this formulas, and I can calculate this matrix, please answer me ,or give an example, as this topic”/~~Factor Analysis of a Covariance Matrix

We are using the covariance matrix Σ instead of the correlation matrix R. The calculation is similar to the correlation matrix case.

~~The rescaled factor pattern matrix is FmR=[diagΣ]−1/2Fm. The rescaled communality of variable i is hiR=σ−1iihi.~~Phu Hai,

For information on how to calculate the anti-image correlation matrix in Excel, please look at the webpage http://www.real-statistics.com/multivariate-statistics/factor-analysis/validity-of-correlation-matrix-and-sample-size/.

Charles

Hi Charles

You write below figure 1:”..The correlation matrix (range H4:K7) can be calculated as described in Multiple Regression Least Squares”. I can follow that post to a point, and created the covariance matrix (using my data), but do not see the example of how to create the correlation matrix (although I can create the pairwise correlation using the default tool in Excel). Am I missing something?

Thanks, Matteo

Matteo,

Perhaps this is what you are missing:

Let R1 be a k × n range which contains only numeric value, let R2 be a 1 × n range containing the means of the columns in R1 and let R3 be a 1 × n range containing the standard deviations of the columns in R1. Then you can use the following array formulas:

COV(R1) = MMULT(TRANSPOSE(R1-R2),R1-R2)/(ROWS(R1)–1) covariance matrix of R1

CORR(R1) = MMULT(TRANSPOSE((R1-R2)/R3),(R1-R2)/R3)/(ROWS(R1)–1) correlation matrix of R1

Let me know if this helps or confuses things more.

Charles

Hi Charles

Thanks for gathering the formulas next to one another and for generalizing the example. All clear. Formulas work like a charm.

Hi again

For my data, the coefficients in the correlation matrix calculated in your formula are the same as the pairwise correlation coefficients calculated using Excel’s correlation data analysis tool.

But if I compare yours they are different. For example, in this post, the correlation between Doctors and Crime is -0.117804, but in Figure 2 in here

http://www.real-statistics.com/correlation/multiple-correlation

the pairwise coefficient between Doctors and Crime is -0.09431. Can you explain the discrepancy?

Matteo,

The difference between the calculations on my website is that the data sets are different. In the case where the correlation is -0.117804 only the data for the first 15 states are being used. In the case where the correlation is -.09431 the data for all 50 states are being used.

Charles

Thanks for the clarification Charles.

Hi Charles

I think just above your definition 2, where you write:

“Observation: As described above, we can calculate using any of the following formulas to get the value .103292”

you are missing a formula image after “we can calculate”?

Hi Matteo,

Yes, you are correct. I have just added the missing expression. Thanks for catching this omission.

Charles

Hi Charles, there’s another omission in here:

Definition 2: Suppose that __ and X are as described in Definition 1

Matteo

Matteo,

Thanks for catching this. I have corrected the webpage adding the symbols that were inadvertently deleted. I appreciate your flagging the problem since it will make it easier for other readers to understand what is being said.

Charles

Another question Charles:

Is there a test that will tell me if there’s a lurking variable I have not considered in an experiment?

In practice: say I got a correlation coefficient of 0.6 for my multi-attribute prediction of production (from my previous comment). Is there any way to tell if that ‘unexplained’ 0.4 is due to just noise or lurking/missing variable?

I am thinking that if I looked at the residuals and they were random, I’d be tempted to attribute the low correlation coefficient to noise, and accept the results, but if I saw a structure, perhaps linear, in the residuals, this could be an indication of a lurking variable (whether it can be measured or not, it would still be useful insight).

What do you think?

Thanks again,

Matteo

Matteo,

I can’t think of such a test at this moment.

Charles

Thanks Charles. Would you take a linear structure in the residuals as at least a qualitative hint that there may be a missing variable?

Matteo,

Sorry for the delay in responding. I can see why you would view a linear structure in the residuals as indicating the potential for a missing variable, but I haven’t had time to consder it further. Sorry for such a limited response. I will think about this issue further when I revise the residuals portion of the website.

Charles

If some of my independent variables have high correlation >70%, would be a good idea to keep them in the regression model? If I keep all of them, how could I justify that and would the results be feasible?

Thank you very much.

Daniel

Daniel,

I would leave them in the model and see whether there is a significant difference if you take one (or more) of these independent variables out, as described in Significance of extra variables.

Charles

Hi Charles

There is a way to Rsquare work with LN ranges (i.e. =RSquare((LN(C4:E18),B4:B18)?

Thank you very much.

Benjamim

Benjamin,

I’m not sure that I understand your question. You can certainly calculate the formula =RSquare((LN(C4:E18),B4:B18), but how and why do you want to do this?

Charles

Dear sir

Greetings,

I am doing research work on marine biology, and collected animal samples (both male and female) from three different coastal habitat such as rocky, muddy and sandy shores, also quantified nutrient content of these animals.

All of my reference articles they have applied Paired sample mean ‘t’ test and they have collected data in two places only, but in my case which statistical analysis is applicable, please kindly give me suggestion with any example and how to interpret that result.

Assuming that you want to determine whether there are significant differences between quantified nutrient content of males and females and between animals that live in the three habitats, you can use Two Factor Anova.

Charles

Dear Sir

I have climate (temperature, rain fall) and plant growth data over time. how can I show that plant growth is correlated to the influence from both temp. and rain fall together, not each one separately.

You can calculate the correlation of plant growth (P) with temp (T) and rain fall (R) using the MCORREL function and compare this correlation with CORREL(P,T) and CORREL(P,R).

To determine whether this difference is significant you can use the approach described in Significance of extra variables in multiple regression.

Charles

Pingback: Links | Kaper's Excel

Hi Charles

I have two data sets of same variables (example time and depth) like time1= 0.2, 0.4, 0.6, 0.7…and depth1 = 2300, 2600,2519, ….. and time2 = 0, 0.1, 0.22, 0.34, 0.5 ….. and depth2 =2100,2400,2500, 2700 …….. Now I want to calculate how much percentage is increase or decrease from first dataset to second dataset and what is the relation so that I can apply the other datasets.

N.B. :I am assuming that my first data set is correct

Thanks in advance

Sorry, but I don’t completely understand the scenario.

Charles

Hi there I was wondering if you could explain more about how you got the figures in your inverse of correlation matrix as I’m struggling to understand it. I’m studying a level maths and I’m doing coursework on correlation. I have 3 sets of data and I’ve followed example 1 on this page and have made a correlation matrix but I’m struggling to understand how you got the figures for the inverse matrix and how you use that to get a correlation figure. I’m also not the best at excel so if you don’t mind could you explain the inverse matrix figures and the final correlation figure more simply for me.

I’d really appreciate any time given to help.

Conner,

For any n x n range R1, the Excel array formula =MINVERSE(R1) will return the inverse of the matrix in range R1. In the example the range R1 is the output from the Real Statistic array function CORR.

For the example, the data is contained in range B4:E18 and so =CORR(B4:E18) returns the correlation matrix for the data. This is shown in range H4:K7.

To get the inverse of the correlation matrix, we placed the array formula =MINVERSE(H4:K7) in range H11:K14.

Note that since we are using array formulas we need to press Ctrl-Shft-Enter and not simply Enter to get the correct output.

Charles

Thanks for the quick reply i have more questions though if you don’t mind.

Firstly is CORR the same as CORREL?

Secondly i have tried doing as you say by holding CTRL-SHIFT-ENTER but it only gives me a single number and i was wondering is this because i have done my correlation matrix manually rather than using the formula (which i did as i couldn’t understand still).

Also on a separate note and i know this is asking a lot but would it be possible for you to do a video of you doing example 1 so i can better follow your steps.

thanks again and for any time given to help.

Hi Conner,

CORR is not the same as CORREL. CORREL(R1,R2) returns one correlation coefficient for the data in ranges R1 and R2. CORR(R1) creates a matrix of correlation coefficients; if R1 has n columns, then the output from CORR is an n x n matrix containing the correlation coefficients for each pair of columns from range R1.

If the output from an array function is say a 3 x 5 range, you need to first highlight a 3 x 5 range, enter the formula containing the function and then press CTRL-SHIFT-ENTER. If you only highlight one cell, then the output you get will only be the upper-left cell from the 3 x 5 range. See the following webpage for details:

Array functions and formulas

I will eventually create videos of various topics, but for now I am focused on adding more content to the website. Next topic> time series analysis. This will appear in a few days.

Charles

Hello!!

I am working on a project which deals with the population (independent variable) and demand for housing loan (dependent variable) and housing loan disbursed (dependent variable). How can I correlate the housing loan disbursed amount with the housing demand and the population? looking forward for your reply

You need to have one dependent variable to use the approach described on the referenced webpage. It sounds like you should use the approach described in the referenced webpage, considering housing loan demand to be an independent variable.

Charles

Charles

I have been racking my brains trying to use your partial correlation matrix with the figures you give until I looked at the spreadsheet and realised that the matrix you show is probably from the full set and the figures you quote for the PCorrel are what comes from the partial set.

For the purposes of variable inclusion, I would find it easier if the results could be displayed as partial regression given y and independent variables included.

Great site –

Bryan,

I am pleased that you like the site and I appreciate your suggestion. I will look into how I can make the referenced webpage clearer.

Charles

Hi Charles,

To make my first question that I posted earlier, I am making it a bit finer. My question is: Can we use multiple correlation or multiple regression with four variables irrespective of being dependent or independent. These four variables contain further three categories from high to medium to lower. Thanks

Shabir,

Multiple regression is used to predict the value of a dependent variable. If there is no dependent variable, then there is no multiple regression.

Charles