We can also calculate the correlation between more than two variables.

**Definition 1**: Given variables *x*, y and *z*, we define the **multiple correlation coefficient**

where *r _{xz}, r_{yz}, r_{xy} *are as defined in Definition 2 of Basic Concepts of Correlation. Here

*x*and y are viewed as the independent variables and

*z*is the dependent variable.

We also define the **multiple coefficient of determination** to be the square of the multiple correlation coefficient.

Often the subscripts are dropped and the multiple correlation coefficient and multiple coefficient of determination are written simply as *R* and *R ^{2}* respectively. These definitions may also be expanded to more than two independent variables. With just one independent variable the multiple correlation coefficient is simply

*r*.

Unfortunately *R* is not an unbiased estimate of the population multiple correlation coefficient, which is evident for small samples. A relatively unbiased version of *R* is given by *R* adjusted.

**Definition 2**: If *R* is *R _{z,xy} *as defined above (or similarly for more variables) then the

**adjusted**multiple coefficient of determination is

where *k* = the number of independent variables and *n* = the number of data elements in the sample for* z* (which should be the same as the samples for *x* and y).

**Excel Data Analysis Tools:** In addition to the various correlation functions described elsewhere, Excel provides the **Covariance** and **Correlation** data analysis tools. The **Covariance** tool calculates the pairwise population covariances for all the variables in the data set. Similarly the **Correlation** tool calculates the various correlation coefficients as described in the following example.

**Example 1**: We expand the data in Example 2 of Correlation Testing via the t Test to include a number of other statistics. The data for the first few states are as described in the Figure 1:

**Figure 1 – Data for Example 1**

Using Excel’s **Correlation** data analysis tool we can compute the pairwise correlation coefficients for the various variables in the table in Figure 1. The results are shown in Figure 2.

**Figure 2 – Correlation coefficients for data in Example 1**

We can also single out the first three variables, poverty, infant mortality and white (i.e. the percentage of the population that is white) and calculate the multiple correlation coefficients, assuming poverty is the dependent variable, as defined in Definition 1 and 2. We use the data in Figure 2 to obtain the values , and .

**Definition 3**: Given *x*, y and *z* as in Definition 1, the **partial correlation** of *x* and *z* holding y constant is defined as follows:

In the **semi-partial correlation**, the correlation between *x* and y is eliminated, but not the correlation between *x* and *z* and y and *z*:

**Observation**: Suppose we look at the relationship between GPA (grade point average) and Salary 5 years after graduation and discover there is a high correlation between these two variables. As has been mentioned elsewhere, this is not to say that doing well in school causes a person to get a higher salary. In fact it is entirely possible that there is a third variable, say IQ, that correlates well with both GPA and Salary (although this would not necessarily imply that IQ is the cause of the higher GPA and higher salary).

In this case, it is possible that the correlation between GPA and Salary is a consequence of the correlation between IQ and GPA and between IQ and Salary. To test this we need to determine the correlation between GPA and Salary eliminating the influence of IQ from both variables, i.e. the partial correlation .

Proof: The first assertion follows since

The second assertion follows since:

**Example 2**: Calculate and for the data in Example 1.

We can see that Property 1 holds for this data since

**Observation**: Since the coefficients of determination is a measure of the portion of variance attributable to the variables involved, we can look at the meaning of the concepts defined above using the following Venn diagram, where the rectangular represents the total variance of the poverty variable.

**Figure 3 – Breakdown of variance of poverty**

Using the data from Example 1, we can calculate the breakdown of the variance for poverty in Figure 4:

**Figure 4 – Breakdown of variance of poverty continued**

Note that we can calculate B in a number of ways: (*A + B – A*, (*B + C*) – *C*, (*A + B + C*) – (*A* + *C*), etc. and get the same answer in each case. Also note that

where *D* = 1 – (*A + B + C)*.

**Figure 5 – Breakdown of variance of poverty continued**

**Property 2**: From Property 1, it follows that:

If the independent variables are mutually independent, this reduces to

**Real Statistics Functions**: The Real Statistics Resource Pack contains the following supplemental functions:

**CORREL_ADJ**(R1, R2) = adjusted correlation coefficient for the data sets defined by ranges R1 and R2

**MCORREL**(R, R1, R2) = multiple correlation of dependent variable *z* with *x* and y where the samples for *z*, *x* and y are the ranges R, R1 and R2 respectively

**Observation**: Definition 1 defines the multiple correlation coefficient *R _{z,x}*

_{y}and corresponding multiple coefficient of determination for three variables

*x,*y and

*z*. These definitions can be extended to more than three variables as described in Advanced Multiple Correlation.

E.g. if R1 is an *m* × *n* data range containing the data for *n* variables then the supplemental function RSquare(R1, *k*) calculates the multiple coefficient of determination for the *k*th variable with respect to the other variables in R1. The multiple correlation coefficient for the *k*th variable with respect to the other variables in R1 can be calculated by the formula =SQRT(RSquare(R1, *k*)).

Thus if R1, R2 and R3 are the three columns of the *m* × 3 data range R, with R1 and R2 containing the samples for the independent variables *x* and y and R3 containing the sample data for dependent variable *z*, then =MCORREL(R3, R1, R2) yields the same result as =SQRT(RSquare(R, 3)).

**Observation**: Similarly the definition of the partial correlation coefficient (Definition 3) can be extended to more than three variables as described in Advanced Multiple Correlation.

9. To understand the work commitment level of employees at KEL

The following factors induces work commitment into me :

Factors

Highly Satisfied

Satisfied

Neutral

Dissatisfied Highly Dissatisfied

a. Working Hours

b. Security Measures

c. Job Security

d. Transportation Facility

e. Leave Policy

f. ESI , EPF Benefits

10. To assess thse motivation level of employees at KEL

The following factors induces motivation into me :

Factors

Highly Satisfied

Satisfied

Neutral

Dissatisfied Highly Dissatisfied

a. Training & Development

b. Social Status

c. Recognition & Reward for performance

d. Monetary Reward

e. Promotion Policy

f. Counseling

11. To assess the job satisfaction level of employees at KEL

The following factors induces job satisfaction factors into me :

Factors

Highly Satisfied

Satisfied

Neutral

Dissatisfied Highly Dissatisfied

a. I consider my job pleasant

b. I feel real enjoyment at my work

c. I am enthusiastic about my work

d. I feel to stay with my organization even if new opportunity is advantageous to me

e. I feel happy to spent rest of my career at this organization

f. I feel part of the family at my organization

12. To assess the Employee – Employer relationship at KEL

The following factors build Employee – Employer relationship :

Factors

Highly Satisfied

Satisfied

Neutral

Dissatisfied Highly Dissatisfied

a. Conflict Resolution

b. Grievance Handling Mechanism

c. Trade Unions Support

d. Healthy Communication

e. Good Leadership

f. Ethical work practices

sir i want to done the mutiple correlation for the above 4 questions according to my projct work.I am liitle wek in maths.can you help me sir

You can create the pairwise correlations using Excel’s CORREL function. You can get all such correlations using Excel’s Correlation data analysis tool.

To get the multiple correlation of any one of the questions against the other 3 questions you can perform a multiple linear regression using Excel’s Regression data analysis tool and use the Multiple R value.

How to perform each of these is described on the Real Statistics website.

Charles

Hi Charles

Please could you assist me. I am required to find correlation between 4 variables: time, temperature, rainfall and stream flow. Please could you advise a formula that I could use to find the correlation between these 4 variables. Thank you

Fatima,

See http://www.real-statistics.com/multiple-regression/multiple-correlation-advanced/

Charles

Thank you!

hey,

I have data of 600 people out of which only 15 got passed. I want to find the correlation between these two entries. How can I calculate??

You can use the CORREL function to calculate the correlation between two data sets.

Charles

Hi charles, I found your blog is so helpful. thank you very much. I am doing research in computer Science Engineering. I am not good at Math. i need to find out the correlation between the two variables with respect to third variable . For example i have 2 features like GPA, IQ ,SALARY and class labels(PROFESSIONAL, NON PROFESSIONAL). now i want to know the correlation between the GPA and SALARY with respect to Class label.what kind of Formula i should use. what if i refer i can find out my answer? please help me.

Elakkiya,

You can use the MCORREL function as described on the referenced webpage. Alternatively you can perform multiple regression and use the calculated correlation value.

Charles

Thank you very much

Hello Charles,

Basically i am not into math. But I need some help in correlation which I need to use in my Structural engineering research work. Hope you would help me out with this.

I have the data for different states in India, for the construction type of the building i.e., Material of wall and material of roof (Sheet 1). And I have the data for these two tables correlated (Sheet 2). But the problem is I am not able to find the methodology about how the values were executed in sheet 2. I am attaching the excel files link.

https://www.dropbox.com/s/n5wtcq4q83hjq1d/census%20state%20wise.xlsx?dl=0

Thank you in advance.

Ravi,

I was only able to access Sheet 1.

For simple correlations (between two samples), you can use Excel’s CORREL function. For correlations between one variable and multiple variables you can use regression. For more complicated analyses, perhaps Canonical Correlation would be useful.

Charles

Dear Charles,

First of all, thank you very much for the description above, this is really helpful. I have a question about one of my analysis and can’t find the answer, I hope you don’t mind asking you.

I am analysing biomechanics of hands. For this, I measured the pressure a hand can apply to a handle after different neurological issues. I measured both hands. Between the left and the right hand was no significant difference in none of the parameters I am using. Now, I am not really sure how to deal with it. Can I use the measured values from both hands or do I need to analyse left and right always separately? Is there a possibility to correct for the inter-hand correlation to use all values?

Thank you very much for your help in advance!

Marion,

How did you determine that there was no significant difference between the left and right hands? Did you use a paired t test (or Wilcoxon signed-ranks test)? You don’t need to consider correlations with these tests.

Charles

Dear Charles,

Thank you very much for your reply. Yes, I conducted a two-tailed paired t-test to get the statistical significance between both hand powers.

My question is, if I now want to use the parameter power to do multiple regression analysis, if I can use the power values from both hands as one sample or if I do need to split it (as they are not independent)?

Marion

Marion,

In the regression analysis you can use the variable “left hand” (or some parameter about the left hand) and the variable “right hand” or some parameter about the right hand. Thus for each element in the sample you will have a value for left and right hands. What you can’t do is have a variable “hand” (or some parameter about hand) and have a sample element for left hand and another sample element for right hand. This would violate the randomness assumption of your sample.

Charles

Dear Charles,

Thank you very much for your help!

Best wishes,

Daniela

Hello Charles, please advise. I am currently trying to run a regression analysis with 5 Independent variables, and one dependent. I am trying to determine if the factors( IV) influence the DV. The regression analysis continues to yield very high P-values and only one significant T-Stat. I believe that for this problem I should be finding the correlation but, I do not know how to find the correlation of data with regard to multiple IV’s. What is the best way to find the correlations and see if the IV’s indeed are influencing the DV? I thank you in advance!

John,

The regression analysis actually includes this result (i.e. the Multiple Correlation value on the report). If only one IV has a significant p-value, then only that variable has a significant influence on predicting the DV.

Charles

hi charles,

can u please help me with this. my thesis is MDCT PREDICTORS OF SURGICAL INTERVENTION IN ADHESIVE SMALL BOWEL OBSTRUCTION. so basically i am trying to predict whether a patient with intestinal obstruction needs surgery or not on the basis of CT scan findings. i have taken 75 retrospective cases and evaluated whether these findings (6) were present or absent in case of intestinal obstr. patient. and then i saw in the medical records, how many of them were operated. so i calculated the senisitivity, specificity, PPV and NPV of each of these findings/variables in predicting surgery in a case of intestinal obstruction. but now i want to apply these findings in 25 prospective cases to see whether or bot my hypothesis is right. please tell me if there is a way to do it and if yes how. thanks.

One potential approach is logistic regression. See the following webpage:

Logistic Regression

Charles

Hi Charles,

I think I’ve made a know in my head. My thesis is on ethical leadership, inclusion and employee voice. Basically, I hypothesis that if employees perceive their leadership to be ethical it will lead to higher perception of inclusion and employee voice. I also want to collect demographics as control variables.

I’m using 3 Likert scale questionnaires (ordinal data), to ask employees about 1. Ethical Leadership 2. Climate of inclusion 3. Employee voice (employees speaking up). On one hand this looks like a correlation, as there’s no cause (independent variable), though on the other hand wouldn’t Ethical leadership be an independent variable? In the sense that I hypothesis it will affect climate of inclusion and employee voice?

Oscar

Oscar,

You can certainly view ethical leadership as an independent variable. Yes, in this case you can test whether more or less “ethical leadership affects [i.e. correlates with] a climate of inclusion and employee voice”. You might use correlation, regression, t tests, ANOVA, etc. to test this.

I don’t know what you mean by “I also want to collect demographics as control variables.” In what sense will these be control variables?

Charles

Hello, there.

I am conducting a mini research project for my degree.

I am researching based on three variables which include depression, neuroticism, and procrastination. I am looking to find a relationship between all three variables. I am not sure of what kind of analyses to use. Could you give me any advice?

Elena,

Which analyses to use depends on a number of factors, including: (1) what sort of data you have, (2) what you mean by “relationship”, (3) what sort of hypotheses you might want to test, etc.

In any case, in these sort of situations, it would be common to use multiple correlation or multiple regression.

Charles

Hi,

I am doing my research on socio-economic determinants of child labor. I am really confused to use which method as i have more than 10 variables like poverty,parental education and unemployement, flaws in education,government ignorance,dowry,early marriages,parental death, migration, shocks. And are these factors effecting child labor if yes than to what extent. kindly guide me to use which method and explain a bit how?

Kiran,

It sounds like you want to figure out which variables to retain when conducting linear regression (this is related to multiple correlation). The variables which have a non-significant regression coefficient are candidates for removal.

In addition, I suggest that you look at the following webpages.

Testing significance of extra variables

Stepwise regression

Shapley-Owen decomposition

Charles

Hi! I am working on my dissertation in which I have to correlate students math test score with different variables like confidence, interest, perseverance etc. I have tried to get help from online tutorials but they are not very helpful. Can you please guide me how to perform correlation with all these around 5 variables. Within one variable I have framed around 5 questions. An immediate reply will be highly appreciated.

Nadra,

You can do this by performing a linear regression and using the R value, which is the correlation.

Charles

Hi I am working on a research project where i have to determine the co relation between Locus of control – Internal and External and stress and anxiety. Here there are two independent variables and two dependent variable. Which method of calculating co relation and which tool will be beneficial for that for a sample of 100.

Anuradha,

What hypothesis (or hypotheses) are you trying to test?

Charles

The hypothesis is that there is relationship between locus of control and stress and anxiety.

Now locus of control is both internal and external which here are independent variables while individually stress and anxiety are dependent variable.

Generally, correlation is defined when there is one dependent variable and one or more independent variables. This is related to linear regression. It is not clear what you mean by correlation when you have two dependent variables and two independent variables.

Charles

hello thanks for this forum..please i really need help here.. i am doing a master 2 research on physical education and sports..and my topic is ” A comparative study of the uchikomi shuttle run test, special judo fitness test and 20m shuttle run test in physical fitness for cameroon judoists”. i have three independent variables and one dependent variables..pls i need to know wch test i can use to analyse my data

Edmond,

You have not provided enough information for me to be able to give an answer. What hypothesis are you trying to test?

Charles

The link had really helped me. honestly, it was not a homework. i was just arguing with my peers. thank for your help. i will suggest my friends to look over the page.

Glad I could help.

Charles

How to understand, standard deviation of a sample is high or low? Calculated mean from the set is 80.2 & standard deviation is 13.4

Kanak,

There is no real assessment about whether a standard deviation is high or low. You can look at the ratio mean divided by standard deviation, but whether this is high or low depends on knowledge about the domain being studied.

Charles

I am planing to have research on multidimensional relationship of Socio-economic status(SES) and ethnicity on WASH (Water sanitation and Hygiene) and nutrition. Please would you suggest me the good statistical method for showing linkage or not within them.

It is difficult to answer your question without more information, but MANOVA is a possible approach.

Charles

Thanks. I have found it.

Using Data Analysis tab in excel, I tried multiple regression by selecting Regression Analysis. When I tried to select multiple columns for Input X Range, it rejects saying only 1 column to select. Can’t we use data analysis tool for multiple regression?

You should be able to do multiple regression with up to 16 independent variables using Excel’s Regression data analysis tool.

Charles

Hi all,

How to find the correlation b/w the variables such as x,y,z in which x has only 0,1 values, y has continuous values and z has categorical values.

Is there any correlation b/w them ??

You should be able to use the technique shown on the referenced webpage. Whether the result is meaningful is another issue.

Charles

Hi Charles,

I am trying to see if there is a correlation between three antibacterial medicine on three species (specifically on genes).

Example:

Gene SpeciesA SpeciesB SpeciesC

aa 1 1 0

ab 1 0 0

ac 0 0 1

1=sensitive, 0=resistant

I wanted to check if SpeciesA response is dependent or independent of SpeciesB or C for the particular medicine. In other words, is there any correlation of sensitivity or resistance to different species?

Thanks in advance.

Baahr

Sorry, but I don’t know how to define the concept of “correlation of sensitivity or resistance to different species”

Charles

Hi Charles,

Basically I wanted to see the correlation between 3 things (species) using binary data. I have 123 observations or data points. What are possible options for such data.

Thanks.

B

Baahr,

Probably you can use multiple correlation, e.g. by using the MCORREL Real Statistics function.

Charles

Hi!

I’m trying to see the correlation between a dependent variable and a vector that is defined by two independent variables (one is the x and the other is the y, when put together they define an exact point). How can I do that?

Catarina,

This certainly sounds like the situation where you use multiple correlation. In particular, you can use the Real Statistics MCORREL function.

Charles

Hi charles i have one question regarding to rank correlation coefficient .how i can compute Rs in having three rank like rank by A,B,and C and how to find the nearest approach. Thank you

Sorry, but I don’t understand your question. Perhaps you are looking to use the RANK.AVG function to establish the ranks and then the Real Statistics MCORREL to calculate the correlation.

Charles

i have five independent variables and i want to determine the relationship between them please how can i?

See Multiple Correlation – Advanced

Charles

Dr. Zaiontz,

Good afternoon. I am hoping I can get your guidance on performing a forecast analysis.

I have two independent variable: operating days and number of physicians

I have two dependent variables: total revenue and total number of patients

I have 36 months worth of data and am trying to predict 12 months out but want to be able to understand how much ‘weight’ each independent variable may have over the dependent variables. Obviously, the more physicians you have, the more patients you can see. Same for days to patients. There is a relationship between Total Revenue and Total Patients as well.

I have calculated the correlation between the variables and can these individual correlations to predict Total Revenue 12 months out but how do I factor in ALL of the variables? Is that possible?

You can use multivariate multiple linear regression. Unfortunately, the website doesn’t yet support this technique.

Charles

Dear Dr. Zaiontz,

I have one dependent variable (binary categorical) and 7 independent variables; 6 of them are binary categorical and the seventh independent variable is age. I want to use correlation to determine if the independent variables affect the dependent variable (falls). Can I use multiple correlation for this?

Thank you,

Akubue

Aku,

Perhaps. If so, you can use the approach described at

Multiple Correlation – Advanced.

Charles

Thank you so much Sir!

Hello sir,

I have the following data

independent variable- age (in groups e.g 31 to 40, 41 to 50, 51-60)

Dependent variables are 25 different factors influencing turnover e.g. pay, tenure, job insecurity, job stress. Responses are on a 4-point Likert-type scale.

All respondents scored each of the 25 dependent variables on the Likert scale.

I need to find out if age has an effect on the factors affecting turnover. For instance is there a difference in the factors important to age group 31 to 40 and age group 41-50 and age group 51-60?

What technique do I employ?

Thank you

Abiola,

If I understand the situation properly, MANOVA might be a good technique to use, assuming that there is some correlation between the factors. You have one independent variable (Age) and 25 dependent variables.

Charles

Hi Charles!

What statistical treatment would you recommend for us to use if we are trying to detemine the relationship between a single dependent variable and 10 independent variables. Thank you.

Alexa,

It really depends on what you mean by “the relationship between”. E.g., you could use multiple regression for this. If, however, you want the correlation coefficient, then you could use the value calculated as shown on the following webpage:

Multiple Correlation – Advanced.

Charles

Hi!

First of all, thanks a lot for your help, as it is very useful for people like me, who are not very familiar with statistics.

I have maybe an stupid question. I would like to see the correlation between two different cell populations (let’s see cells A and cells B, how the population of A change according to the population of B). In addition, I have two different kind of animals (WT and KO). I can see correlation between these cell population in KO but not in WT, probably because i dont have too many animals (4 in each condition).

My question is: as i would like to know the global changes of these 2 cells populations, is it possible to merge WT and KO to do the correlation? When i do this, i observe very nice correlations, and within the graph i can observe how WT mice have for example less A cell populations and KO more…My boss considered that you can only do the correlation using one mice population, but not altogether, but i think this is possible to do it (maybe im wrong!), as i wonder the global changes of these 2 cells populations independently of these cells comes from WT or KO mice.

I am not sure if i explain appropiartly. I look forward to hearing from you. Again, Thanks a lot for your help!

Sylvia,

Sorry, but I don’t understand the scenario you are describing or your questions.

Charles

Hi Sir

Hope you in good health always.

I’m working on the land use change affecting the climate changes.

I have 9 parameters of climate for 3 periods and

8 type of land uses for same 3 periods.

I want to observe whether changes in the land use affected by the climate parameters,

and I want to identify the most affected climate parameter to the land use changes.

I’m inquire your advice and opinion sir cause I’m really poor in statistical analysis.

Sorry, but you haven’t provide me with enough information to be able to offer any advice. Perhaps you could use ANOVA, MANOVA or regression, but I can’t really tell.

Charles

Dear Sir

Namaste

I have soil data(nitrogen,phosphorus,potassium,calcium,magnesium content in soil),Plant data(nitrogen,phosphorus,potassium,calcium,magnesium content in leaf) and yield data of ber plant.Sir How i calculate multiple correlation between soil data and ber yield and between leaf data and ber yield.

With regards

This really depends on the details, but perhaps you can use the approach described on the following webpage:

Advanced Multiple Correlation

Charles

Charles,

I believe I found the answer to this question as I read along, but I would like confirmation before I move forward. I have test scores that I would like to correlate–simply looking for the strength of relationship between skills assessed. The scores are on different scales (standard score–average 100, scaled score–average 10, and raw scores–# correct). I’m a little rusty on my stats. I’m thinking that I do not need to convert these scores to the same scale before I calculate correlation. Can you confirm? My other option would be to convert all scores to raw scores then correlate, but I’m thinking I don’t have to do that.

Thanks much,

H

H,

If the conversion that you have in mind is to multiple the score by some constant and/or add some constant, then the conversion will have no affect on the correlation coefficient. You will get the same answer whether you make the conversion or not. Try it.

Charles

hey i want to find a relation between 3 quantities x,y and z; x and y are independent and z depends on x and y, such that z is directly proportional to x raised to m and y raised to n. is there a way i can find m and n?

Parth,

If the relationship is z = ax^m + by^n, then you have a nonlinear regression problem. One way to find values for m and n (and a and b) is to use Solver. The approach is similar to that shown for exponential regression. See the following webpage for details:

Exponential Regression using Solver

Charles

Hi Charles,

I need to predict number of orders for next week for each weekday at different time intervals based on previous 6-7 weeks.I was trying to use the trend function in Excel but problem is that my output is dependent on 2 factors-weekday and time of the day.How do I go about it?

Pragati,

It looks like you have to deal with seasonality. See the following webpage for help:

Regression Analysis with Seasonality

Charles

Thank you Charles!

Pingback: What Is Pairwise Correlation – Safetyask.com

Hello Charles,

Thanks for the explanation of this, but, as a novice i am still having some troubles understanding. You can say I am a learn-by-reallife-example kind of guy.

Say i have to do debt purchase scoring, meaning what % of the nominal price i am willing to pay based on account characteristics. I have benchmarks including various types of data, including those characteristics. Lets say that each account in my benchmark has:

– product type

– debt amount

– days past due

– age of debtor

– employment status of debtor

– principal/interest ratio

– monthly cash collection hitory

From benchmarks, i should somehow get weight, or score, for each of the mentioned characteristics, so when i compare it to new accounts, i can say that if i am estimating a new account that is a housing loan, i will give lets say 2% of value, if the age is 35-45, additional 1%, if due is 1000-2000$, additional 4%, etc. I think you get the point.

My questions are:

– how to determine the weight(or % value) of each account characteristic

– how to compensate the missing data (if i do not have age for example, i can not hive it 0%, but i can not also give it max %; should i use some average, or utilize some sort of compensation factor?)

Thanks in advance for your answer, ot at least a point in the right direction

Ivan

Ivan,

1. how to determine the weight(or % value) of each account characteristic

I don’t have enough information to help you about this.

2. how to compensate the missing data

See the following webpage

Missing Data

Charles

Charles,

For the 1st question, What kind of data would you need?

What i have is historical data for benchmarking. So basicaly, like in the example list above, i have an account like:

Housing Loan, Debt is 2456$, Days past due is 411 days, debtor is unemploced and 46 years old, ratio between principal and interest is 1,8, and he has payments (cash inflow) in the amount of 220$ since he is in my portfolio.

I have data like that for approx 60 thousand accounts. And based on that, i have to weigh the value of each characteristic regarding cash inflow, to basicaly see how valuable is the new portfolio to me (to break even with my investment and gain profit after the initial period).

So based on benchmark i should be able to know is the loan, or current account, or anything else, in newcoming portfolio worth 2% of the nominal value, od 3% or something else? And all that based on historical data.

Thank you

Ivan,

Unless I am missing something, you need to determine the weights based on your knowledge of the problem that you are trying to solve. This doesn’t appear to be a statistical issue, but one that has to do with your knowledge of the real-world problem.

Charles

Charles,

Well, in my opinion, if the historical data suggests that, for instance, 20% of Housing Loans will pay approx 30% of their debt in 1st year after purchase, that should be considered as a statistical conclusion? Unless I am missing something in the very foundation of the matter.

Thank You,

Ivan

Ivan,

Ok, maybe you are right that this can be considered a statistical matter, but in any case I don’t know how to calculate the weights based on the information that you have provided.

Charles

Charles,

This is a quote from a web site i stumbled upon a few days ago but i seem to have lost the hyperlink. But basicaly, it is what i am looking for, but can not seem to get a grasp of it in excel.

—quote—

Statistical models function in much the same way as judgmental models. However, in choosing the factors to be scored and weighted they rely on statistical methods rather than the experience and judgment of a credit executive.

Statistical models consider many factors simultaneously, a process that calculates and analyzes multivariate correlation to identify the relevant tradeoffs among factors, and assigns statistically derived weights used in the model. The key factors are generally captured from credit agency reports and the credit files of the client.

Statistical models are often described as a scorecard, a pooled scorecard, and a custom scorecard. A scorecard uses data from one firm. A pooled scorecard uses data from many firms. A custom scorecard blends a statistical model with some of the factors used in a judgmental model.

—end quote—

Charles,

I am analyzing language production data for 18 individuals across 3 sampling times. The stimuli are the same each time and my variable is number of productive words. Would a multivariate correlation be appropriate for this type of data to examine the test retest stability (i.e., is the group stable over three times or are they significantly different)?

Thank you for making this software available for all!

Cheers,

Kristina

Kristina,

To find out whether there are differences over the 3 times, you might use Repeated Measures ANOVA. This is described on the Real Statistics website.

Charles

Pingback: Factor Analysis in Marketing - Marketing on Data

Pingback: Factor Analysis in Marketing Example - Marketing on Data

This is well explained and it is really helpful. But I have two doubts for the extension of these.

1) How can I get correlation from correlation coefficient when there are two independent variables(x,y)?

2) How can I get correlation from correlation coefficient when there are three independent variables(x,y,z)?

I am talking about reverse engineering I guess. I have the standard correlation coefficient values but I want to get correlation out of them.

The problem is in some cases I have 2 independent variables or 3 independent variables and in some cases I 4 independent variables.

So my doubt is how do I get correlation value from correlation coefficients when I have many independent variables

The referenced webpage describes how to calculate the correlation coefficient with 1 dependent variable and 2 independent variables. With more than two independent variables, please see the following webpage:

Advanced Multiple Correlation

Charles

Hi Charles,

Thank you. I have gone through them.

I have the correlation coefficient already with me. I want to to calculate correlations from the coefficients.

Thanks,

Karthik

Hi Charles,

I have mailed you the issue. Kindly have a look at it.

Thanks,

Karthik

Hello, I Have to make a Report. It is complex data with more than 3 variables. My boss needs a Report for the sales (more than 1500 types of model) of two years separately, with their respective customers, their locations, Promoters at those locations. Each customer has more than 35 branches all over the country to whom we supply our Goods.

Please help me to arrange this complex data in one file.

I am not an expert.

Sorry Maaz, but you have not provided enough information for me to even give you a suggestion as to how to proceed.

Charles

Hi Charles,

This page really has a lot of readers responding. I am looking at this page from the regression point of view. I am looking at the State Rankings data and wonder how to explain the variation in say Poverty in terms of the remaining data. In other words, how much does the variation in income or education contribute to the variation in poverty?

So, compute the overall R^2 for the entire data set and then compute the R^2 for the data set but leave income data out. The difference between the two R^2 values is the semi-partial correlation coefficient and it account for the contribution made by income to the variation in poverty.

semi-partial r^2 = R^2 – Ri^2

where the i in the second term indicates the ith independent variable excluded.

There is another quantity VIF (Variance Inflation Factor) where

VIF = 1.0/(1-Ri^2)

Interestingly, we can compute the partial r^2 like so

partial r^2 = semi-partial r^2 * VIF = (R^2 – Ri^2)/(1 – Ri^2)

Pls how can I compute for 4 variables without y variables

Amaka,

See the webpage Advanced Multiple Correlation

Charles

Hello Charles,

According to wikipedia, the multiple correlation coefficient should be between 0 and 1. Let’s generate 10 random numbers between -1 and 1 for each of r_xz, r_yz, and r_xy and then compute R.

r_xz = -0.42484496 0.57661027 -0.18204616 0.76603481 0.88093457 -0.90888700 0.05621098 0.78483809 0.10287003 -0.08677053

r_yz = 0.91366669 -0.09333169 0.35514127 0.14526680 -0.79415063 0.79964994 -0.50782453 -0.91588093 -0.34415856 0.90900730

r_xy = 0.77907863 0.38560681 0.28101363 0.98853955 0.31141160 0.41706094 0.08813205 0.18828404 -0.42168053 -0.70577271

Using your formula, I get:

R = 2.0302762 0.6704749 0.4608397 4.1256652 1.4283935 1.5836626 0.5178415 1.3375018 0.3472992 1.1998119

It is clear that there are values of which R is greater than 1. Am I missing something here?

I would really appreciate your help!

Cheers

Jeffrey,

You can’t simply generate 10 random numbers since the values of r_xz, r_yz and r_xy are not independent. You can generate random numbers for x, y and z and then compute the values of r_xz, r_yz and r_xy using CORREL. When you then calculate the values for R, you should get values less than or equal to 1.

Charles

Can u plz send me interpretation of any table of multiple correlation

What table of multiple correlations are you referring to?

Charles

I have a rather large data set and need to know how to calculate a multitude of correlations, for example the data set I have is combined of demographic data and crime data and I need to know how each demographic data point correlates to each crime data point. Is there anyway to use EXCEL to calculate such data correlations or is it only possible to calculate a small data set at a time.

You can use the CORREL function to calculate the correlation between two data sets even if they are large.

If you want to calculate all pairwise correlations between a number of data sets, you can use the Excel Correlation data analysis tool or the Real Statistics CORR function.

There are many other possibilities in Excel depending on the specific problem you are trying to solve and the format of your data.

Charles

Good day sir pls can you recomend text books on statistics for higher learning here is my line 08160817081

Victor,

There are lots of textbooks available. It really depends on (1) which topics you are interested in, (2) are you interested in the theory or just how to conduct the tests, (3) how mathematical should it be, etc.

Charles

Hi,

I have one variable output temperature which is dependent on 6 variable which are length,depth,velocity,conductivity,time and diameter. how can i make correlation between them and how accurate it will be? Thanks in advance

Hi,

You can simply run the regression model and this will calculate the desired correlation (as well as the adjusted correlation, which is a less biased estimate).

Alternatively, you can calculate the correlation directly using the

Rsquarefunction, as described on the webpageAdvanced Multiple Correlation

Charles

This Excel’s CORREL(variable1, variable2) function to run autocorrelation (ACF)for four series – et, variables A, B, & C, results not correct.

Can anybody help?

Excel’s CORREL is intended to calculate the correlation between two data sets. If instead you are looking for the correlation between 4 data sets, then you need a different function. You can use the approach described on the webpage Multiple Correlation – Advanced.

Charles

Hi,

can i use the multiple correlation formula of definition 1 for 3 independent variables?

Also for the adjusted formula, what is n? Thanks.

Hi Kareem,

You need to consider one of the variables to be the dependent variable, otherwise Definition 1 doesn’t make sense.

n is the sample size of variable z (which is equal to the sample size for x or y).

Charles

Hi Charles,

I am looking for a relationship or a limit on correlation coefficient between x and z, given the corr. coefficients b/w x and y, and y and z. Could you please elaborate on this kind of problem?

Thanks a lot!

Sorry, but >I don’t understand your question.

Charles

Hi Charles

Can you use interchangeably the pairwise correlation coefficient between independent variable x and dependent variable z, and partial dependence plot of z on x?

Thanks,

Matteo

Matteo,

If you are asking whether the correlation of z on x and y is the same as the correlation of z on y and x, the answer is yes.

Charles

I mean how to calculate regression coefficient (R) for 5 variables.

This is explained on the webpage Multiple Correlation – Advanced. Alternatively you can use multiple regression to calculate the value of R.

Charles

Hello,

Hope you will be fine. I need your assistant about computing R formula if i have 5 independent and one dependent variable. please guide me how can i modify R formula according to my scenario

Sorry, but I don’t use R. The site is about using Excel for statistics.

Charles

Dear Charles,

thanks for the nice breakdown.

I have a simple problem where i have 3 dependent variables (a, b and c) and i would like to isolate c to see how the a affect b independent of c.

i assume i have to use the partial correlation formula in definition 3 right? (the first of the two…

and i assume the r values are the r (correlation coefficients) i get from a pearson correlation.

i insert those in my formula and i get my results, but they are not between 1 and 0.

how is that possible?

i checked the excel formula several times.

have I made a mistake or is it possible that i get a value above 1? (1.19)

thanks for your help

c

Dear Cesare,

Yes, Definition 3 and Pearson’s are correct, but you should never get a value larger than 1. If you send me an Excel spreadsheet with your data I will try to figure out what has gone wrong.

Charles

How you, Charles?

I was confused by the printout. I did not how to interpret it. Please help me. Thank you!

A scholar was interested in determining if there were differences in correlations between anger and depression when removing the effects of self-esteem for both variables among students (Group 1), teachers (Group 2), and farmers (Group 3). Results are as follows. Please provide a complete conclusion and explanation:

The partial correlation for group 1=0.34

The F value= 5.99 with a probability=o.018

The partial correlation for group 2=0.8018

The F value=84.6 with a probability =.00

The partial correlation for group 3=.3

The F value=4.7 with a probability=0.03

The global test for equality=18 and it has a probability 0.0001

3. Step Differences

1 st partial corr 2 ND partial corr Rstat Prob

(1) .30 (3) .80 5.4 0.00

2. Step differences

1 st partial corr 2Nd partial corr Rstat Prob

(1) .3 (2) .33 .25 1

(2) .3 (3) .8 5.11 0.00

Gorge,

It sounds like you are referring to a printout from some other statistics tool (SPSS, SAS, etc.), which I don’t have, and so I am not able to comment.

Charles

Dear Charles,

I have some set of data for data analysis and l seek your assistance in analyzing them using correlation and regression techniques

What sort of assistance are you looking for? You are welcome to ask questions.

Charles

sir,

i have a question, what statistical instrument can i used in my thesis if i have 4 variables????

To calculate the correlation coefficient for more than 3 variables you need to use matrices as described on the webpage http://www.real-statistics.com/multiple-regression/multiple-correlation-advanced/

Charles

Hi

Thanks for your website. Very helpful.

I am trying to do a multiple correlation but my independent variables were obtained from different sized populations.

What adjustments should i make?

regards

Hi Simao,

Do you mean different sized populations or different sized samples? The calculation of the multiple correlation coefficient described on the referenced webpae is based on the sample data used and not on the underlying populations. I will try to answer the question based on missing data. If this is not what you intended, please elaborate.

On the referenced webpage the correlation coefficient is calculated from a sample of 3-tuples. If one or more of the data elements in the 3-tuples is missing, generally you have three choices: (1) drop that 3-tuple from the sample, (2) calculate the correlation using a pairwise approach whereby only non-missing pairs are used in calculating the correlation coeffcient for that pair of variables (note that the calculation of the multiple correlation coefficient is based on the values of the three pairwise correlation coefficients) and (3) impute the value(s) of the missing elements(s).

The first two approaches are described in more detail on http://www.real-statistics.com/multiple-regression/least-squares-method-multiple-regression/ when I explain two approaches for calculating a covariance or correlation matrix. The third approach is described in http://www.real-statistics.com/handling-missing-data/.

Charles

Thanks for your reply

Yes I meant different sized populations. Each population produce the same amount of samples that I pretend to correlate. But the populations are different in size. Should I do any adjustment for this?

regards

Simao,

I don’t have a precise answer for you, but I can offer the following suggestion:

If the finite populations are large, you probably don’t need to do anything. The usual correction factor for finite populations is to multiple the standard deviation by the square root of (N-n)/(N-1) where N = the population size and n = the sample size.

Charles

Are multiple correlation and multiple regression same?

Rahul,

Not exactly, although they are clearly interrelated. To calculate the multiple correlation coefficient you can use the results for R^2 from multiple regression.

Charles

Hi Charles,

You give the multiple correlation coefficient in Definition 1, and an adjusted multiple correlation coefficient in Definition 2. I’m trying to find a correlation between 1 dependent variable and 2 independent variables, so do I have to use the adjusted multiple correlation coefficient (Definition 2) to accomplish this? Or do I just use the multiple correlation coefficient in Definition 1?

Thank you in advance,

Dan

Also, how can you determine the p-value of such correlation?

Daniel,

The test that the multiple correlation coefficient R is zero is the same as the test that the multiple regression model is a good fit for the data as explained on the webpage http://www.real-statistics.com/multiple-regression/multiple-regression-analysis/.

The test statistic is F = (R^2/k)/((1-R^2)/df) where k = # of independent variables, n = the sample size and df = n – k – 1. p-value = FDIST(F,k,df).

Charles

Charles,

Can you please tell what is FDIST here? I am also stuck with the same problem of how to find p-value of such correlation.

Thank you in advance.

Arman,

Sorry, but I don’t see any reference to FDIST on the referenced webpage.

Charles

Daniel,

Generally R^2 is used (where R = the unadjusted multiple correlation coefficient), even though the adjusted multiple correlation coefficient is a less biased estimate of the population correlation coefficient, and so should be a better estimate.

Charles

My daughter is trying to do a correlation assignment using three variables. She tried using tennis results (height, weight and age) but found no correlation. She also tried cricket results (age, height & success rate). I am despairing as I can not help her. Do you have any suggestions? She is in Year 12 and has chosen sport results but it could be anything!

Hi Claire,

Not sure what the problem is. Is she trying to find three variables that correlate? What is she trying to accomplish?

Charles

“Unfortunately R is not an unbiased estimate of the population multiple correlation coefficient, which is evident for small samples”

Golly, that really is unfortunate!

Hi Charles,

I’m trying to compute a partial correlation but from the examples above I am still not clear on how to do it. Let’s say I have a 30 x 5 matrix (column 1 = subjects in an experiment, column 2 = variable 1, column 3 = variable 2, column 4 = variable 3, column 5 = variable 4). I have computed the correlation between variable 1 and 2 using the correlation function in Analysis ToolPak. But what I’m really interested in is computing the partial correlation between variable 1 and 2 while controlling for variable 3 and/or variable 4.

Your help would be greatly appreciated.

Best,

Dana

Hi Dana,

Because there are more than three variables you can’t use the formulas described in the referenced webpage. With any number of variables you can calculate the partial correlation using the approach described on the webpage http://www.real-statistics.com/multivariate-statistics/factor-analysis/validity-of-correlation-matrix-and-sample-size/, especially in the text around Figures 4 and 5.

I plan to revise the Multiple Correlation webpage shortly to show how to calculate the partial correlations with more than 3 variables. I also plan to add a function to do this to the Real Statistics Resource Pack. But for now please refer to the webpage described above.

Charles

Hi Dana,

I have now updated the website to better explain how to calculate the partial correlation coefficient when there are more than three variables. You can find this on the new webpage http://www.real-statistics.com/multiple-regression/multiple-correlation-advanced/.

Charles

sir ,

sir i request for ,

in multiple co-relation of cofficient what meaning of

b123X2 + b132X3=e123

this equation where e is the error. i dont no which error

please help me sir……………………………..

Shailesh, Sorry, but I don’t understand your question. Charles

Sir,

I’m new to this, so I may be asking a rhetorical question. Are we allowed to compare a simple correlation r of a criterion with a multiple correlation of R of the same criterion?

I think that R in this case will be larger than r and hence, R is closer to a perfect correlation. Is this always true?

Serge,

Both R and r represent correlation. In fact with two data sets R and r are the same thing. Depending on what sort of assertion you are trying to make you can certainly compare them.

As can be seen from Figures 3 and 4 of the referenced webpage, R^2_p,wi = A + B + C and r^2_p,w = A + B, and so it follows that R^2_p,wi >= r^2_p,w. Thus R_p,wi >= r_p,w. This is generally true.

Charles

Hi Charles

Great posting, which I am really struggling to understand properly, having forgotten all the statistics I ever did (long ago!)

I’m wondering if you could briefly interpret the meaning of each Line of Figures 4 & 5 to gibve me a starting point

eg A+B+C Rp,wi2 0.33105

means that Proverty is well correlated to Infant Mortality, independent of being White, or whatever

Mike,

I have updated the webpage so that things are explained a little better. Here is a bit more info:

A+B+C = R_p,wi^2 = .33105 means that 33.1% of the variance in P is explained from the association with W and I (and so 66.9% is not explained in this way). This is especially useful with linear regression since it gives you some idea about how good a linear regression model for P would be based on W and I. Here it is best to look at the webpages on linear regression and multiple regression.

From Figure 3 and 4, we see that the value R_p,wi^2 = r_pw^2 + r_p(i,w)^2. This is because the left side of the equation is A+B+C and the first term on the right side is A+B while the last term is C. This also follows by Property 2.

r_PI = .5644 (cell M5 of Figure 2) means there is a moderately high correlation between P (poverty) and I (infant mortality). If we factor out the effects of being white (W) from this correlation we get the partial correlation r_PI,W. From cell X38 of Figure 5, we see that the square of this number is .322505, and so taking the square root we get that r_PI,W = .5679, which is only slightly higher that the value for r_PI. This means that the being white (W) has virtually no impact on the correlation between P and I (and the little impact that it has is a dampening effect).

I know that this is a little confusing, but I hope that this explanation helps, at least a little bit.

Charles

If at initial state I only have 3 variables : length, width and thickness.

Can I create new independent variables by taking above variables,

for example : Weight = length x width x thickness x density, Volume = length x width x thickness.

Now I have 5 variables length, width, thickness, Weight and Volume to be tested.

All variables give a significant correlation factor When I did the correlation factor by using excel data analysis.

how to select the variables will be giving more representing model.

Tnzal,

With these number of variables, you should look at the webpages for multiple regression http://www.real-statistics.com/multiple-regression/testing-significance-extra-variables-regression-model/. In particular, the webpage http://www.real-statistics.com/multiple-regression/testing-significance-extra-variables-regression-model/ explains how to tell which variables make a significant contribution. To maintain a hierarchical model you also need to density as a variable (like length, width and thickness).

Charles

Sir,

I downloaded your Realstats Resource Pak , I put the Realstats.xlam in a folder called Realstats in my folder on a server. I clicked on Realstats.xlam, clicked on File>Help|Options>Add-Ins, clicked Go. Add-ins available are AnalysisToolPak, Analysis ToolPak – VBA, Euro Currency Tools, Solver Add-In. It does not contain Realstats. Could you please help me. Appreciate that very much. Thank you very much, Ila

Hi Ila,

That Realstats does not appear is not surprising. You need to click on the Browse button and navigate to where you stored the file and then click on OK. RealStats should now appear. Click on Realstats and you should be able to use the supplemental functions and data analysis tools.

Charles

Im sorry sir, may be I cant explain what i want to ask…i meant that just like every statistical tool has a purpose, why were matrices devised in the first place? we can add, subtract, multiply numbers in any case, why we do we need to put them into a rectangular array and then do basic math on them like they do in matrices? Sir, if I am not making sense, please ignore this msg. Regards

Dear Radhika,

I will try my best, what i understand from your questions.

Matrix is a rectangular array of variables or (numbers, symbols, or expressions). This arrangement is for systematically analyze the large number of variable or computation of complex relationship of variable (play with multivariate relation) . For example it is very easy to solve equations with two variables but it would be complex for more then two variable, but it is easy to compute through matrix system. It is useful to solve and generalize mathematical relationship of various field i.e. Statistics, Biological Sciences Economics etc. Because of its practical usefulness Matrix become new discipline in Mathematics.

Thank You

Regards

Hello 🙂

In my syllabus, there’s a topic called ‘basic matrices and their uses’…i dont understand relevance of matrices…so if u could please guide…or give some reference…thank u.

Radhika,

Matrices are used quite often in statistics. There are many examples throughout the Real Statistics website. E.g.

http://www.real-statistics.com/multiple-regression/least-squares-method-multiple-regression/

http://www.real-statistics.com/multivariate-statistics/

Charles

Thank you sir,

What i mean is like correlation is a tool to know the strength and direction of relation between two variables, regression is to predict a dependent variable’s value through independent variable/s, what is the relevance of ‘matrices’? what is the uniqueness of matrices as a statistical tool ? i know i am missing something really basic., if you could pls help.

Regards.

Radhika,

Sorry, but I still don’t understand your question. Matrices are a tool used in performing statistical analysis.

Charles

Dear

Did you have a reference book about multiple correlation and multiple regression?

If you have some e-book or book information about multiple correlation and multiple regression, please contact me at my email.

Dear Rizky,

The Bibliography of the website contains a number of good references which covers these topics. The website itself should serve as a good reference for these topics. Are there some things that you would like to see covered in more detail or more clearly in the website?

Charles

Sir

I came across your site while browsing some information on Multiple correlation. It is described very nicely. It would be great if you could expand it to four variables also.

Vikas Handa

Hi Vikas,

When using more than 3 variables it is best to use matrices. For example in the Multiple Correlation webpage, I describe the function R

_{z,xy}which is the multiple correlation coefficient of z with x and y. If there are 4 (or more variables), I refer to a matrix (or range in Excel) R1 which contains the data for all 4 variables and then look for the function which gives the correlation coefficient of the kth variable in R1 with respect to the other variables in R1.This is better explored in the Multiple Regression webpage http://www.real-statistics.com/multiple-regression/multiple-regression-analysis/. At the end of that page you will find the supplemental function RSquare(R1,k) which calculates the square of this correlation coefficient. Thus the correlation coefficient of the kth variable with respect to all the other variables is given by the formula =SQRT(RSquare(R1,k)).

It turns out that this can also be calculated by first finding the correlation matrix for R1, then taking its inverse and then selecting the kth element on the diagonal. If this has the value c then the desired correlation coefficient is the square root of 1-1/c.

Thus the desired correlation coefficient can also be calculated by the formula =SQRT(1-1/INDEX(DIAG(MINVERSE(CORR(R1))),k)).

I hope this helps.

Charles

hi charles. woud you mind helping me out with this

when two variables (x and y) are correlated, then define the four possible explanations of correlation and give your example for each?

Bekalu,

This sounds like a homework assignment, which you should do yourself. You can find information to help you at the following website:

http://www.real-statistics.com/correlation/basic-concepts-correlation/

Charles