Multiple Correlation

We can also calculate the correlation between more than two variables.

Definition 1: Given variables x, y and z, we define the multiple correlation coefficient

Multiple correlation coefficient

where rxz, ryz, rxy are as defined in Definition 2 of Basic Concepts of Correlation. Here x and y are viewed as the independent variables and z is the dependent variable.

We also define the multiple coefficient of determination to be the square of the multiple correlation coefficient.

Often the subscripts are dropped and the multiple correlation coefficient and multiple coefficient of determination are written simply as R and R2 respectively. These definitions may also be expanded to more than two independent variables. With just one independent variable the multiple correlation coefficient  is simply r.

Unfortunately R is not an unbiased estimate of the population multiple correlation coefficient, which is evident for small samples. A relatively unbiased version of R is given by R adjusted.

Definition 2: If R is Rz,xy as defined above (or similarly for more variables) then the adjusted multiple coefficient of determination is

where k = the number of independent variables and n = the number of data elements in the sample for z (which should be the same as the samples for x and y).

Excel Data Analysis Tools: In addition to the various correlation functions described elsewhere, Excel provides the Covariance and Correlation data analysis tools. The Covariance tool calculates the pairwise population covariances for all the variables in the data set. Similarly the Correlation tool calculates the various correlation coefficients as described in the following example.

Example 1: We expand the data in Example 2 of Correlation Testing via the t Test to include a number of other statistics. The data for the first few states are as described in the Figure 1:

US states statistics

Figure 1 – Data for Example 1

Using Excel’s Correlation data analysis tool we can compute the pairwise correlation coefficients for the various variables in the table in Figure 1. The results are shown in Figure 2.

Correlation coefficients array

Figure 2 – Correlation coefficients for data in Example 1

We can also single out the first three variables, poverty, infant mortality and white (i.e. the percentage of the population that is white) and calculate the multiple correlation coefficients, assuming poverty is the dependent variable, as defined in Definition 1 and 2. We use the data in Figure 2 to obtain the values r_{PW}r_{PI} and r_{WI}.

image1607 image1608

image7007

Definition 3: Given x, y and z as in Definition 1, the partial correlation of x and z holding y constant is defined as follows:

Partial correlation

In the semi-partial correlation, the correlation between x and y is eliminated, but not the correlation between x and z and y and z:

Semi-partial correlation coefficient

Observation: Suppose we look at the relationship between GPA (grade point average) and Salary 5 years after graduation and discover there is a high correlation between these two variables. As has been mentioned elsewhere, this is not to say that doing well in school causes a person to get a higher salary. In fact it is entirely possible that there is a third variable, say IQ, that correlates well with both GPA and Salary (although this would not necessarily imply that IQ is the cause of the higher GPA and higher salary).

In this case, it is possible that the correlation between GPA and Salary is a consequence of the correlation between IQ and GPA and between IQ and Salary. To test this we need to determine the correlation between GPA and Salary eliminating the influence of IQ from both variables, i.e. the partial correlation r_{(GS,I)}.

Property 1:
image1613 image1614

Proof: The first assertion follows since

image1615 image1616

The second assertion follows since:

image1617

Example 2: Calculate r_{(PW,I)} and r_{P(W,I)} for the data in Example 1.

image1620 image1621

We can see that Property 1 holds for this data since

image1622 image5041

Observation: Since the coefficients of determination is a measure of the portion of variance attributable to the variables involved, we can look at the meaning of the concepts defined above using the following Venn diagram, where the rectangular represents the total variance of the poverty variable.

Partitioning variance

Figure 3 – Breakdown of variance of poverty

Using the data from Example 1, we can calculate the breakdown of the variance for poverty in Figure 4:

Variance breakdown

Figure 4 – Breakdown of variance of poverty continued

Note that we can calculate B in a number of ways: (A + B –  A, (B + C) – C, (A + B + C) – (A + C), etc. and get the same answer in each case. Also note that

image5043 image5042

where D = 1 – (A + B + C).

Variance breakdown correlation

Figure 5 – Breakdown of variance of poverty continued

Property 2: From Property 1, it follows that:

image7248

If the independent variables are mutually independent, this reduces to

image5050

Real Statistics Functions: The Real Statistics Resource Pack contains the following supplemental functions:

CORREL_ADJ(R1, R2) = adjusted correlation coefficient for the data sets defined by ranges R1 and R2

MCORREL(R, R1, R2) = multiple correlation of dependent variable z with x and y where the samples for z, x and y are the ranges R, R1 and R2 respectively

Observation: Definition 1 defines the multiple correlation coefficient Rz,xy and corresponding multiple coefficient of determination for three variables x, y and z. These definitions can be extended to more than three variables as described in Advanced Multiple Correlation.

E.g. if R1 is an m × n data range containing the data for n variables then the supplemental function RSquare(R1, k) calculates the multiple coefficient of determination for the kth variable with respect to the other variables in R1. The multiple correlation coefficient for the kth variable with respect to the other variables in R1 can be calculated by the formula =SQRT(RSquare(R1, k)).

Thus if R1, R2 and R3 are the three columns of the m × 3 data range R, with R1 and R2 containing the samples for the independent variables x and y and R3 containing the sample data for dependent variable z, then =MCORREL(R3, R1, R2) yields the same result as =SQRT(RSquare(R, 3)).

Observation: Similarly the definition of the partial correlation coefficient (Definition 3) can be extended to more than three variables as described in Advanced Multiple Correlation.

123 Responses to Multiple Correlation

  1. Pankaj says:

    Hi all,
    How to find the correlation b/w the variables such as x,y,z in which x has only 0,1 values, y has continuous values and z has categorical values.
    Is there any correlation b/w them ??

    • Charles says:

      You should be able to use the technique shown on the referenced webpage. Whether the result is meaningful is another issue.
      Charles

  2. Baahr says:

    Hi Charles,
    I am trying to see if there is a correlation between three antibacterial medicine on three species (specifically on genes).
    Example:
    Gene SpeciesA SpeciesB SpeciesC
    aa 1 1 0
    ab 1 0 0
    ac 0 0 1
    1=sensitive, 0=resistant
    I wanted to check if SpeciesA response is dependent or independent of SpeciesB or C for the particular medicine. In other words, is there any correlation of sensitivity or resistance to different species?
    Thanks in advance.
    Baahr

    • Charles says:

      Sorry, but I don’t know how to define the concept of “correlation of sensitivity or resistance to different species”
      Charles

      • Baahr says:

        Hi Charles,
        Basically I wanted to see the correlation between 3 things (species) using binary data. I have 123 observations or data points. What are possible options for such data.
        Thanks.
        B

  3. Catarina says:

    Hi!
    I’m trying to see the correlation between a dependent variable and a vector that is defined by two independent variables (one is the x and the other is the y, when put together they define an exact point). How can I do that?

    • Charles says:

      Catarina,
      This certainly sounds like the situation where you use multiple correlation. In particular, you can use the Real Statistics MCORREL function.
      Charles

  4. semir says:

    Hi charles i have one question regarding to rank correlation coefficient .how i can compute Rs in having three rank like rank by A,B,and C and how to find the nearest approach. Thank you

    • Charles says:

      Sorry, but I don’t understand your question. Perhaps you are looking to use the RANK.AVG function to establish the ranks and then the Real Statistics MCORREL to calculate the correlation.
      Charles

  5. hemjay says:

    i have five independent variables and i want to determine the relationship between them please how can i?

  6. BGoertz says:

    Dr. Zaiontz,

    Good afternoon. I am hoping I can get your guidance on performing a forecast analysis.

    I have two independent variable: operating days and number of physicians

    I have two dependent variables: total revenue and total number of patients

    I have 36 months worth of data and am trying to predict 12 months out but want to be able to understand how much ‘weight’ each independent variable may have over the dependent variables. Obviously, the more physicians you have, the more patients you can see. Same for days to patients. There is a relationship between Total Revenue and Total Patients as well.

    I have calculated the correlation between the variables and can these individual correlations to predict Total Revenue 12 months out but how do I factor in ALL of the variables? Is that possible?

    • Charles says:

      You can use multivariate multiple linear regression. Unfortunately, the website doesn’t yet support this technique.
      Charles

  7. Aku says:

    Dear Dr. Zaiontz,
    I have one dependent variable (binary categorical) and 7 independent variables; 6 of them are binary categorical and the seventh independent variable is age. I want to use correlation to determine if the independent variables affect the dependent variable (falls). Can I use multiple correlation for this?
    Thank you,
    Akubue

  8. Abiola says:

    Hello sir,
    I have the following data
    independent variable- age (in groups e.g 31 to 40, 41 to 50, 51-60)
    Dependent variables are 25 different factors influencing turnover e.g. pay, tenure, job insecurity, job stress. Responses are on a 4-point Likert-type scale.
    All respondents scored each of the 25 dependent variables on the Likert scale.
    I need to find out if age has an effect on the factors affecting turnover. For instance is there a difference in the factors important to age group 31 to 40 and age group 41-50 and age group 51-60?
    What technique do I employ?

    Thank you

    • Charles says:

      Abiola,
      If I understand the situation properly, MANOVA might be a good technique to use, assuming that there is some correlation between the factors. You have one independent variable (Age) and 25 dependent variables.
      Charles

  9. Alexa Manabat says:

    Hi Charles!
    What statistical treatment would you recommend for us to use if we are trying to detemine the relationship between a single dependent variable and 10 independent variables. Thank you.

    • Charles says:

      Alexa,
      It really depends on what you mean by “the relationship between”. E.g., you could use multiple regression for this. If, however, you want the correlation coefficient, then you could use the value calculated as shown on the following webpage:
      Multiple Correlation – Advanced.
      Charles

  10. Sylvia says:

    Hi!

    First of all, thanks a lot for your help, as it is very useful for people like me, who are not very familiar with statistics.

    I have maybe an stupid question. I would like to see the correlation between two different cell populations (let’s see cells A and cells B, how the population of A change according to the population of B). In addition, I have two different kind of animals (WT and KO). I can see correlation between these cell population in KO but not in WT, probably because i dont have too many animals (4 in each condition).

    My question is: as i would like to know the global changes of these 2 cells populations, is it possible to merge WT and KO to do the correlation? When i do this, i observe very nice correlations, and within the graph i can observe how WT mice have for example less A cell populations and KO more…My boss considered that you can only do the correlation using one mice population, but not altogether, but i think this is possible to do it (maybe im wrong!), as i wonder the global changes of these 2 cells populations independently of these cells comes from WT or KO mice.

    I am not sure if i explain appropiartly. I look forward to hearing from you. Again, Thanks a lot for your help!

  11. faizalhakim says:

    Hi Sir

    Hope you in good health always.

    I’m working on the land use change affecting the climate changes.
    I have 9 parameters of climate for 3 periods and
    8 type of land uses for same 3 periods.

    I want to observe whether changes in the land use affected by the climate parameters,
    and I want to identify the most affected climate parameter to the land use changes.

    I’m inquire your advice and opinion sir cause I’m really poor in statistical analysis.

    • Charles says:

      Sorry, but you haven’t provide me with enough information to be able to offer any advice. Perhaps you could use ANOVA, MANOVA or regression, but I can’t really tell.
      Charles

  12. Dear Sir
    Namaste
    I have soil data(nitrogen,phosphorus,potassium,calcium,magnesium content in soil),Plant data(nitrogen,phosphorus,potassium,calcium,magnesium content in leaf) and yield data of ber plant.Sir How i calculate multiple correlation between soil data and ber yield and between leaf data and ber yield.
    With regards

  13. H says:

    Charles,
    I believe I found the answer to this question as I read along, but I would like confirmation before I move forward. I have test scores that I would like to correlate–simply looking for the strength of relationship between skills assessed. The scores are on different scales (standard score–average 100, scaled score–average 10, and raw scores–# correct). I’m a little rusty on my stats. I’m thinking that I do not need to convert these scores to the same scale before I calculate correlation. Can you confirm? My other option would be to convert all scores to raw scores then correlate, but I’m thinking I don’t have to do that.
    Thanks much,
    H

    • Charles says:

      H,
      If the conversion that you have in mind is to multiple the score by some constant and/or add some constant, then the conversion will have no affect on the correlation coefficient. You will get the same answer whether you make the conversion or not. Try it.
      Charles

  14. Parth says:

    hey i want to find a relation between 3 quantities x,y and z; x and y are independent and z depends on x and y, such that z is directly proportional to x raised to m and y raised to n. is there a way i can find m and n?

    • Charles says:

      Parth,
      If the relationship is z = ax^m + by^n, then you have a nonlinear regression problem. One way to find values for m and n (and a and b) is to use Solver. The approach is similar to that shown for exponential regression. See the following webpage for details:
      Exponential Regression using Solver
      Charles

  15. Pragati says:

    Hi Charles,
    I need to predict number of orders for next week for each weekday at different time intervals based on previous 6-7 weeks.I was trying to use the trend function in Excel but problem is that my output is dependent on 2 factors-weekday and time of the day.How do I go about it?

  16. Pingback: What Is Pairwise Correlation – Safetyask.com

  17. Ivan says:

    Hello Charles,

    Thanks for the explanation of this, but, as a novice i am still having some troubles understanding. You can say I am a learn-by-reallife-example kind of guy.

    Say i have to do debt purchase scoring, meaning what % of the nominal price i am willing to pay based on account characteristics. I have benchmarks including various types of data, including those characteristics. Lets say that each account in my benchmark has:
    – product type
    – debt amount
    – days past due
    – age of debtor
    – employment status of debtor
    – principal/interest ratio
    – monthly cash collection hitory

    From benchmarks, i should somehow get weight, or score, for each of the mentioned characteristics, so when i compare it to new accounts, i can say that if i am estimating a new account that is a housing loan, i will give lets say 2% of value, if the age is 35-45, additional 1%, if due is 1000-2000$, additional 4%, etc. I think you get the point.

    My questions are:
    – how to determine the weight(or % value) of each account characteristic
    – how to compensate the missing data (if i do not have age for example, i can not hive it 0%, but i can not also give it max %; should i use some average, or utilize some sort of compensation factor?)

    Thanks in advance for your answer, ot at least a point in the right direction

    Ivan

    • Charles says:

      Ivan,

      1. how to determine the weight(or % value) of each account characteristic
      I don’t have enough information to help you about this.

      2. how to compensate the missing data
      See the following webpage
      Missing Data

      Charles

      • Ivan says:

        Charles,

        For the 1st question, What kind of data would you need?
        What i have is historical data for benchmarking. So basicaly, like in the example list above, i have an account like:

        Housing Loan, Debt is 2456$, Days past due is 411 days, debtor is unemploced and 46 years old, ratio between principal and interest is 1,8, and he has payments (cash inflow) in the amount of 220$ since he is in my portfolio.

        I have data like that for approx 60 thousand accounts. And based on that, i have to weigh the value of each characteristic regarding cash inflow, to basicaly see how valuable is the new portfolio to me (to break even with my investment and gain profit after the initial period).

        So based on benchmark i should be able to know is the loan, or current account, or anything else, in newcoming portfolio worth 2% of the nominal value, od 3% or something else? And all that based on historical data.

        Thank you

        • Charles says:

          Ivan,
          Unless I am missing something, you need to determine the weights based on your knowledge of the problem that you are trying to solve. This doesn’t appear to be a statistical issue, but one that has to do with your knowledge of the real-world problem.
          Charles

          • Ivan says:

            Charles,

            Well, in my opinion, if the historical data suggests that, for instance, 20% of Housing Loans will pay approx 30% of their debt in 1st year after purchase, that should be considered as a statistical conclusion? Unless I am missing something in the very foundation of the matter.

            Thank You,

            Ivan

          • Charles says:

            Ivan,
            Ok, maybe you are right that this can be considered a statistical matter, but in any case I don’t know how to calculate the weights based on the information that you have provided.
            Charles

          • Ivan says:

            Charles,

            This is a quote from a web site i stumbled upon a few days ago but i seem to have lost the hyperlink. But basicaly, it is what i am looking for, but can not seem to get a grasp of it in excel.

            —quote—
            Statistical models function in much the same way as judgmental models. However, in choosing the factors to be scored and weighted they rely on statistical methods rather than the experience and judgment of a credit executive.

            Statistical models consider many factors simultaneously, a process that calculates and analyzes multivariate correlation to identify the relevant tradeoffs among factors, and assigns statistically derived weights used in the model. The key factors are generally captured from credit agency reports and the credit files of the client.

            Statistical models are often described as a scorecard, a pooled scorecard, and a custom scorecard. A scorecard uses data from one firm. A pooled scorecard uses data from many firms. A custom scorecard blends a statistical model with some of the factors used in a judgmental model.
            —end quote—

  18. Kristina Fassbender says:

    Charles,

    I am analyzing language production data for 18 individuals across 3 sampling times. The stimuli are the same each time and my variable is number of productive words. Would a multivariate correlation be appropriate for this type of data to examine the test retest stability (i.e., is the group stable over three times or are they significantly different)?

    Thank you for making this software available for all!

    Cheers,

    Kristina

    • Charles says:

      Kristina,
      To find out whether there are differences over the 3 times, you might use Repeated Measures ANOVA. This is described on the Real Statistics website.
      Charles

  19. Pingback: Factor Analysis in Marketing - Marketing on Data

  20. Pingback: Factor Analysis in Marketing Example - Marketing on Data

  21. Gollapinni Karthik says:

    This is well explained and it is really helpful. But I have two doubts for the extension of these.
    1) How can I get correlation from correlation coefficient when there are two independent variables(x,y)?
    2) How can I get correlation from correlation coefficient when there are three independent variables(x,y,z)?

    I am talking about reverse engineering I guess. I have the standard correlation coefficient values but I want to get correlation out of them.
    The problem is in some cases I have 2 independent variables or 3 independent variables and in some cases I 4 independent variables.

    So my doubt is how do I get correlation value from correlation coefficients when I have many independent variables

    • Charles says:

      The referenced webpage describes how to calculate the correlation coefficient with 1 dependent variable and 2 independent variables. With more than two independent variables, please see the following webpage:
      Advanced Multiple Correlation
      Charles

      • Gollapinni Karthik says:

        Hi Charles,

        Thank you. I have gone through them.
        I have the correlation coefficient already with me. I want to to calculate correlations from the coefficients.

        Thanks,
        Karthik

  22. Maaz says:

    Hello, I Have to make a Report. It is complex data with more than 3 variables. My boss needs a Report for the sales (more than 1500 types of model) of two years separately, with their respective customers, their locations, Promoters at those locations. Each customer has more than 35 branches all over the country to whom we supply our Goods.

    Please help me to arrange this complex data in one file.

    I am not an expert.

    • Charles says:

      Sorry Maaz, but you have not provided enough information for me to even give you a suggestion as to how to proceed.
      Charles

  23. Wytek says:

    Hi Charles,
    This page really has a lot of readers responding. I am looking at this page from the regression point of view. I am looking at the State Rankings data and wonder how to explain the variation in say Poverty in terms of the remaining data. In other words, how much does the variation in income or education contribute to the variation in poverty?

    So, compute the overall R^2 for the entire data set and then compute the R^2 for the data set but leave income data out. The difference between the two R^2 values is the semi-partial correlation coefficient and it account for the contribution made by income to the variation in poverty.

    semi-partial r^2 = R^2 – Ri^2
    where the i in the second term indicates the ith independent variable excluded.

    There is another quantity VIF (Variance Inflation Factor) where
    VIF = 1.0/(1-Ri^2)

    Interestingly, we can compute the partial r^2 like so
    partial r^2 = semi-partial r^2 * VIF = (R^2 – Ri^2)/(1 – Ri^2)

  24. Amaka says:

    Pls how can I compute for 4 variables without y variables

  25. Jeffrey says:

    Hello Charles,

    According to wikipedia, the multiple correlation coefficient should be between 0 and 1. Let’s generate 10 random numbers between -1 and 1 for each of r_xz, r_yz, and r_xy and then compute R.

    r_xz = -0.42484496 0.57661027 -0.18204616 0.76603481 0.88093457 -0.90888700 0.05621098 0.78483809 0.10287003 -0.08677053

    r_yz = 0.91366669 -0.09333169 0.35514127 0.14526680 -0.79415063 0.79964994 -0.50782453 -0.91588093 -0.34415856 0.90900730

    r_xy = 0.77907863 0.38560681 0.28101363 0.98853955 0.31141160 0.41706094 0.08813205 0.18828404 -0.42168053 -0.70577271

    Using your formula, I get:
    R = 2.0302762 0.6704749 0.4608397 4.1256652 1.4283935 1.5836626 0.5178415 1.3375018 0.3472992 1.1998119

    It is clear that there are values of which R is greater than 1. Am I missing something here?

    I would really appreciate your help!

    Cheers

    • Charles says:

      Jeffrey,
      You can’t simply generate 10 random numbers since the values of r_xz, r_yz and r_xy are not independent. You can generate random numbers for x, y and z and then compute the values of r_xz, r_yz and r_xy using CORREL. When you then calculate the values for R, you should get values less than or equal to 1.
      Charles

  26. Richa Dubey says:

    Can u plz send me interpretation of any table of multiple correlation

  27. Dakota Parker says:

    I have a rather large data set and need to know how to calculate a multitude of correlations, for example the data set I have is combined of demographic data and crime data and I need to know how each demographic data point correlates to each crime data point. Is there anyway to use EXCEL to calculate such data correlations or is it only possible to calculate a small data set at a time.

    • Charles says:

      You can use the CORREL function to calculate the correlation between two data sets even if they are large.

      If you want to calculate all pairwise correlations between a number of data sets, you can use the Excel Correlation data analysis tool or the Real Statistics CORR function.

      There are many other possibilities in Excel depending on the specific problem you are trying to solve and the format of your data.

      Charles

  28. Victor says:

    Good day sir pls can you recomend text books on statistics for higher learning here is my line 08160817081

    • Charles says:

      Victor,
      There are lots of textbooks available. It really depends on (1) which topics you are interested in, (2) are you interested in the theory or just how to conduct the tests, (3) how mathematical should it be, etc.
      Charles

  29. Anurag Singh says:

    Hi,
    I have one variable output temperature which is dependent on 6 variable which are length,depth,velocity,conductivity,time and diameter. how can i make correlation between them and how accurate it will be? Thanks in advance

    • Charles says:

      Hi,

      You can simply run the regression model and this will calculate the desired correlation (as well as the adjusted correlation, which is a less biased estimate).

      Alternatively, you can calculate the correlation directly using the Rsquare function, as described on the webpage
      Advanced Multiple Correlation

      Charles

  30. S. Siamai Kromah says:

    This Excel’s CORREL(variable1, variable2) function to run autocorrelation (ACF)for four series – et, variables A, B, & C, results not correct.

    Can anybody help?

    • Charles says:

      Excel’s CORREL is intended to calculate the correlation between two data sets. If instead you are looking for the correlation between 4 data sets, then you need a different function. You can use the approach described on the webpage Multiple Correlation – Advanced.
      Charles

  31. Kareem says:

    Hi,

    can i use the multiple correlation formula of definition 1 for 3 independent variables?
    Also for the adjusted formula, what is n? Thanks.

    • Charles says:

      Hi Kareem,
      You need to consider one of the variables to be the dependent variable, otherwise Definition 1 doesn’t make sense.
      n is the sample size of variable z (which is equal to the sample size for x or y).
      Charles

  32. Tanuj says:

    Hi Charles,

    I am looking for a relationship or a limit on correlation coefficient between x and z, given the corr. coefficients b/w x and y, and y and z. Could you please elaborate on this kind of problem?

    Thanks a lot!

  33. Matteo says:

    Hi Charles

    Can you use interchangeably the pairwise correlation coefficient between independent variable x and dependent variable z, and partial dependence plot of z on x?

    Thanks,
    Matteo

    • Charles says:

      Matteo,
      If you are asking whether the correlation of z on x and y is the same as the correlation of z on y and x, the answer is yes.
      Charles

  34. shaista says:

    I mean how to calculate regression coefficient (R) for 5 variables.

  35. shaista says:

    Hello,
    Hope you will be fine. I need your assistant about computing R formula if i have 5 independent and one dependent variable. please guide me how can i modify R formula according to my scenario

  36. cesare says:

    Dear Charles,

    thanks for the nice breakdown.

    I have a simple problem where i have 3 dependent variables (a, b and c) and i would like to isolate c to see how the a affect b independent of c.
    i assume i have to use the partial correlation formula in definition 3 right? (the first of the two…
    and i assume the r values are the r (correlation coefficients) i get from a pearson correlation.
    i insert those in my formula and i get my results, but they are not between 1 and 0.
    how is that possible?

    i checked the excel formula several times.
    have I made a mistake or is it possible that i get a value above 1? (1.19)

    thanks for your help
    c

    • Charles says:

      Dear Cesare,
      Yes, Definition 3 and Pearson’s are correct, but you should never get a value larger than 1. If you send me an Excel spreadsheet with your data I will try to figure out what has gone wrong.
      Charles

  37. Gorge says:

    How you, Charles?
    I was confused by the printout. I did not how to interpret it. Please help me. Thank you!
    A scholar was interested in determining if there were differences in correlations between anger and depression when removing the effects of self-esteem for both variables among students (Group 1), teachers (Group 2), and farmers (Group 3). Results are as follows. Please provide a complete conclusion and explanation:
    The partial correlation for group 1=0.34
    The F value= 5.99 with a probability=o.018
    The partial correlation for group 2=0.8018
    The F value=84.6 with a probability =.00
    The partial correlation for group 3=.3
    The F value=4.7 with a probability=0.03
    The global test for equality=18 and it has a probability 0.0001

    3. Step Differences
    1 st partial corr 2 ND partial corr Rstat Prob
    (1) .30 (3) .80 5.4 0.00
    2. Step differences
    1 st partial corr 2Nd partial corr Rstat Prob
    (1) .3 (2) .33 .25 1
    (2) .3 (3) .8 5.11 0.00

    • Charles says:

      Gorge,
      It sounds like you are referring to a printout from some other statistics tool (SPSS, SAS, etc.), which I don’t have, and so I am not able to comment.
      Charles

  38. LauHai says:

    Dear Charles,

    I have some set of data for data analysis and l seek your assistance in analyzing them using correlation and regression techniques

  39. mhymhy says:

    sir,

    i have a question, what statistical instrument can i used in my thesis if i have 4 variables????

  40. Simao says:

    Hi

    Thanks for your website. Very helpful.
    I am trying to do a multiple correlation but my independent variables were obtained from different sized populations.
    What adjustments should i make?
    regards

    • Charles says:

      Hi Simao,

      Do you mean different sized populations or different sized samples? The calculation of the multiple correlation coefficient described on the referenced webpae is based on the sample data used and not on the underlying populations. I will try to answer the question based on missing data. If this is not what you intended, please elaborate.

      On the referenced webpage the correlation coefficient is calculated from a sample of 3-tuples. If one or more of the data elements in the 3-tuples is missing, generally you have three choices: (1) drop that 3-tuple from the sample, (2) calculate the correlation using a pairwise approach whereby only non-missing pairs are used in calculating the correlation coeffcient for that pair of variables (note that the calculation of the multiple correlation coefficient is based on the values of the three pairwise correlation coefficients) and (3) impute the value(s) of the missing elements(s).

      The first two approaches are described in more detail on http://www.real-statistics.com/multiple-regression/least-squares-method-multiple-regression/ when I explain two approaches for calculating a covariance or correlation matrix. The third approach is described in http://www.real-statistics.com/handling-missing-data/.

      Charles

      • Simao says:

        Thanks for your reply
        Yes I meant different sized populations. Each population produce the same amount of samples that I pretend to correlate. But the populations are different in size. Should I do any adjustment for this?

        regards

        • Charles says:

          Simao,

          I don’t have a precise answer for you, but I can offer the following suggestion:

          If the finite populations are large, you probably don’t need to do anything. The usual correction factor for finite populations is to multiple the standard deviation by the square root of (N-n)/(N-1) where N = the population size and n = the sample size.

          Charles

  41. Rahul says:

    Are multiple correlation and multiple regression same?

    • Charles says:

      Rahul,
      Not exactly, although they are clearly interrelated. To calculate the multiple correlation coefficient you can use the results for R^2 from multiple regression.
      Charles

  42. Daniel says:

    Hi Charles,

    You give the multiple correlation coefficient in Definition 1, and an adjusted multiple correlation coefficient in Definition 2. I’m trying to find a correlation between 1 dependent variable and 2 independent variables, so do I have to use the adjusted multiple correlation coefficient (Definition 2) to accomplish this? Or do I just use the multiple correlation coefficient in Definition 1?

    Thank you in advance,
    Dan

    • Daniel says:

      Also, how can you determine the p-value of such correlation?

    • Charles says:

      Daniel,
      Generally R^2 is used (where R = the unadjusted multiple correlation coefficient), even though the adjusted multiple correlation coefficient is a less biased estimate of the population correlation coefficient, and so should be a better estimate.
      Charles

  43. Claire says:

    My daughter is trying to do a correlation assignment using three variables. She tried using tennis results (height, weight and age) but found no correlation. She also tried cricket results (age, height & success rate). I am despairing as I can not help her. Do you have any suggestions? She is in Year 12 and has chosen sport results but it could be anything!

    • Charles says:

      Hi Claire,
      Not sure what the problem is. Is she trying to find three variables that correlate? What is she trying to accomplish?
      Charles

  44. Tim says:

    “Unfortunately R is not an unbiased estimate of the population multiple correlation coefficient, which is evident for small samples”

    Golly, that really is unfortunate!

  45. Dana says:

    Hi Charles,

    I’m trying to compute a partial correlation but from the examples above I am still not clear on how to do it. Let’s say I have a 30 x 5 matrix (column 1 = subjects in an experiment, column 2 = variable 1, column 3 = variable 2, column 4 = variable 3, column 5 = variable 4). I have computed the correlation between variable 1 and 2 using the correlation function in Analysis ToolPak. But what I’m really interested in is computing the partial correlation between variable 1 and 2 while controlling for variable 3 and/or variable 4.

    Your help would be greatly appreciated.

    Best,
    Dana

  46. shailesh verma says:

    sir ,
    sir i request for ,
    in multiple co-relation of cofficient what meaning of

    b123X2 + b132X3=e123

    this equation where e is the error. i dont no which error
    please help me sir……………………………..

  47. Serge says:

    Sir,
    I’m new to this, so I may be asking a rhetorical question. Are we allowed to compare a simple correlation r of a criterion with a multiple correlation of R of the same criterion?

    I think that R in this case will be larger than r and hence, R is closer to a perfect correlation. Is this always true?

    • Charles says:

      Serge,

      Both R and r represent correlation. In fact with two data sets R and r are the same thing. Depending on what sort of assertion you are trying to make you can certainly compare them.

      As can be seen from Figures 3 and 4 of the referenced webpage, R^2_p,wi = A + B + C and r^2_p,w = A + B, and so it follows that R^2_p,wi >= r^2_p,w. Thus R_p,wi >= r_p,w. This is generally true.

      Charles

  48. mikebinz says:

    Hi Charles

    Great posting, which I am really struggling to understand properly, having forgotten all the statistics I ever did (long ago!)

    I’m wondering if you could briefly interpret the meaning of each Line of Figures 4 & 5 to gibve me a starting point

    eg A+B+C Rp,wi2 0.33105
    means that Proverty is well correlated to Infant Mortality, independent of being White, or whatever

    • Charles says:

      Mike,

      I have updated the webpage so that things are explained a little better. Here is a bit more info:

      A+B+C = R_p,wi^2 = .33105 means that 33.1% of the variance in P is explained from the association with W and I (and so 66.9% is not explained in this way). This is especially useful with linear regression since it gives you some idea about how good a linear regression model for P would be based on W and I. Here it is best to look at the webpages on linear regression and multiple regression.

      From Figure 3 and 4, we see that the value R_p,wi^2 = r_pw^2 + r_p(i,w)^2. This is because the left side of the equation is A+B+C and the first term on the right side is A+B while the last term is C. This also follows by Property 2.

      r_PI = .5644 (cell M5 of Figure 2) means there is a moderately high correlation between P (poverty) and I (infant mortality). If we factor out the effects of being white (W) from this correlation we get the partial correlation r_PI,W. From cell X38 of Figure 5, we see that the square of this number is .322505, and so taking the square root we get that r_PI,W = .5679, which is only slightly higher that the value for r_PI. This means that the being white (W) has virtually no impact on the correlation between P and I (and the little impact that it has is a dampening effect).

      I know that this is a little confusing, but I hope that this explanation helps, at least a little bit.

      Charles

  49. tnzal says:

    If at initial state I only have 3 variables : length, width and thickness.
    Can I create new independent variables by taking above variables,
    for example : Weight = length x width x thickness x density, Volume = length x width x thickness.
    Now I have 5 variables length, width, thickness, Weight and Volume to be tested.

    All variables give a significant correlation factor When I did the correlation factor by using excel data analysis.
    how to select the variables will be giving more representing model.

  50. Ila says:

    Sir,
    I downloaded your Realstats Resource Pak , I put the Realstats.xlam in a folder called Realstats in my folder on a server. I clicked on Realstats.xlam, clicked on File>Help|Options>Add-Ins, clicked Go. Add-ins available are AnalysisToolPak, Analysis ToolPak – VBA, Euro Currency Tools, Solver Add-In. It does not contain Realstats. Could you please help me. Appreciate that very much. Thank you very much, Ila

    • Charles says:

      Hi Ila,
      That Realstats does not appear is not surprising. You need to click on the Browse button and navigate to where you stored the file and then click on OK. RealStats should now appear. Click on Realstats and you should be able to use the supplemental functions and data analysis tools.
      Charles

  51. radhika vyas says:

    Im sorry sir, may be I cant explain what i want to ask…i meant that just like every statistical tool has a purpose, why were matrices devised in the first place? we can add, subtract, multiply numbers in any case, why we do we need to put them into a rectangular array and then do basic math on them like they do in matrices? Sir, if I am not making sense, please ignore this msg. Regards

    • KP says:

      Dear Radhika,
      I will try my best, what i understand from your questions.
      Matrix is a rectangular array of variables or (numbers, symbols, or expressions). This arrangement is for systematically analyze the large number of variable or computation of complex relationship of variable (play with multivariate relation) . For example it is very easy to solve equations with two variables but it would be complex for more then two variable, but it is easy to compute through matrix system. It is useful to solve and generalize mathematical relationship of various field i.e. Statistics, Biological Sciences Economics etc. Because of its practical usefulness Matrix become new discipline in Mathematics.
      Thank You
      Regards

  52. radhika vyas says:

    Hello 🙂

    In my syllabus, there’s a topic called ‘basic matrices and their uses’…i dont understand relevance of matrices…so if u could please guide…or give some reference…thank u.

  53. Rizky Tri Asmono says:

    Dear
    Did you have a reference book about multiple correlation and multiple regression?
    If you have some e-book or book information about multiple correlation and multiple regression, please contact me at my email.

    • Charles says:

      Dear Rizky,
      The Bibliography of the website contains a number of good references which covers these topics. The website itself should serve as a good reference for these topics. Are there some things that you would like to see covered in more detail or more clearly in the website?
      Charles

  54. Vikas Handa says:

    Sir
    I came across your site while browsing some information on Multiple correlation. It is described very nicely. It would be great if you could expand it to four variables also.

    Vikas Handa

    • Charles says:

      Hi Vikas,

      When using more than 3 variables it is best to use matrices. For example in the Multiple Correlation webpage, I describe the function Rz,xy which is the multiple correlation coefficient of z with x and y. If there are 4 (or more variables), I refer to a matrix (or range in Excel) R1 which contains the data for all 4 variables and then look for the function which gives the correlation coefficient of the kth variable in R1 with respect to the other variables in R1.

      This is better explored in the Multiple Regression webpage http://www.real-statistics.com/multiple-regression/multiple-regression-analysis/. At the end of that page you will find the supplemental function RSquare(R1,k) which calculates the square of this correlation coefficient. Thus the correlation coefficient of the kth variable with respect to all the other variables is given by the formula =SQRT(RSquare(R1,k)).

      It turns out that this can also be calculated by first finding the correlation matrix for R1, then taking its inverse and then selecting the kth element on the diagonal. If this has the value c then the desired correlation coefficient is the square root of 1-1/c.

      Thus the desired correlation coefficient can also be calculated by the formula =SQRT(1-1/INDEX(DIAG(MINVERSE(CORR(R1))),k)).

      I hope this helps.

      Charles

Leave a Reply

Your email address will not be published. Required fields are marked *