Independence Testing

The method described in Goodness of Fit can also be used to determine whether two sets of data are independent of each other. Such data are organized in what are called contingency tables, as described in Example 1. In these cases df = (row count – 1) (column count – 1).

Excel Function: The CHITEST function described in Goodness of Fit can be extended to support ranges consisting of multiple rows and columns. In this case, we have:

CHITEST(R1, R2) = CHIDIST(x, df) where x is the chi-square statistic, R1 = the array of observed data, R2 = the array of expected values and df = (row count – 1) (column count – 1).

The ranges R1 and R2 must have the same size and shape and can only contain numeric values.
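
As a rough illustration of what CHITEST computes, the following Python sketch uses only the standard library. The function names chi2_sf and chitest and the 2 × 2 counts are hypothetical, chosen for illustration only; chi2_sf plays the role of CHIDIST, using the standard closed-form series for the chi-square upper tail with integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable (the role of CHIDIST)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term = total = 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + 2*phi(sqrt(x)) * sum_r x^(r - 1/2) / (1*3*...*(2r - 1))
    root = math.sqrt(x)
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + 2.0 * density * total

def chitest(observed, expected):
    """p-value of Pearson's chi-square statistic for two same-shaped tables."""
    rows, cols = len(observed), len(observed[0])
    if len(expected) != rows or any(len(r) != cols for r in list(observed) + list(expected)):
        raise ValueError("observed and expected must have the same size and shape")
    stat = sum((o - e) ** 2 / e
               for row_o, row_e in zip(observed, expected)
               for o, e in zip(row_o, row_e))
    df = (rows - 1) * (cols - 1)
    return chi2_sf(stat, df)

# Hypothetical 2 x 2 observed counts with their expected counts.
obs = [[30, 20], [20, 30]]
exp = [[25, 25], [25, 25]]
p_value = chitest(obs, exp)   # statistic is 4.0 with df = 1
print(round(p_value, 4))
```

The sketch assumes tables with at least two rows and two columns, matching the df formula above.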

Example 1: A survey is conducted of 175 young adults whose parents are classified either as wealthy, middle class or poor to determine their highest level of schooling (graduated from university, graduated from high school or neither). The results are summarized on the left side of Figure 1 (Observed Values). Based on the data collected, is a person’s level of schooling independent of their parents’ wealth?

Figure 1 – Observed data and expected values for Example 1

We set the null hypothesis to be

H0: Highest level of schooling attained is independent of parents’ wealth

We use the chi-square test, and so need to calculate the expected values that correspond to the observed values in the table above. To accomplish this we use the fact (by Definition 3 of Basic Probability Concepts) that if A and B are independent events then P(A ∩ B) = P(A) ∙ P(B). We also assume that the proportions for the sample are good estimates for the probabilities of the expected values.

We now show how to construct the table of expected values (i.e. the Expected Values in Figure 1). We know that 45 of the 175 people in the sample are from wealthy families, and so the probability that someone in the sample is from a wealthy family is 45/175 = 25.7%. Similarly, the probability that someone in the sample graduated from university is 68/175 = 38.9%. But based on the null hypothesis, the event of being from a wealthy family is independent of graduating from university, and so the expected probability of both events is simply the product of the two probabilities, or 25.7% ∙ 38.9% = 10.0%. Thus, based on the null hypothesis, we expect that 10.0% of 175 = 17.5 people are from a wealthy family and have graduated from university.
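
The arithmetic for this one cell can be checked with a few lines of Python (the variable names are illustrative only). Working with unrounded proportions gives about 17.49, which rounds to the 17.5 quoted above.

```python
# Totals taken from Example 1.
grand_total = 175
wealthy_total = 45        # respondents from wealthy families
university_total = 68     # respondents who graduated from university

p_wealthy = wealthy_total / grand_total          # about 0.257
p_university = university_total / grand_total    # about 0.389

# Under independence the joint probability is the product of the two.
p_both = p_wealthy * p_university                # about 0.100
expected_count = p_both * grand_total            # same as 45 * 68 / 175

print(round(expected_count, 1))
```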

In this way we can fill out the table for expected values. We start by setting all the totals in the Expected Values table to be the same as the corresponding total in the Observed Values table (e.g. cell K6 contains the formula =E6). We then set the value of every cell in the Expected Values table to be

         (row total ∙ col total) / grand total

E.g. cell H6 contains the formula =K6*H9/K9. An alternative approach for filling in all the cells in the Expected Values table is to place the following array formula in range H6:J8 (and then press Ctrl-Shift-Enter):

=MMULT(K6:K8,H9:J9)/K9
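
The array formula is an outer product of the column of row totals and the row of column totals, divided by the grand total. A minimal Python sketch of the same idea follows; the function name expected_table and the 2 × 3 counts are hypothetical and are not the data from Figure 1.

```python
def expected_table(observed):
    """Expected counts under independence: (row total * col total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Outer product of row totals and column totals, scaled by the grand total.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Hypothetical 2 x 3 table of observed counts, for illustration only.
observed = [[10, 20, 30],
            [20, 20, 20]]
expected = expected_table(observed)
for row in expected:
    print(row)
```

By construction, the row and column totals of the expected table match those of the observed table.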

See Matrix Operations for more information about the MMULT array function. We can now calculate the p-value for the chi-square test statistic as CHITEST(Obs, Exp) where Obs is the 3 × 3 array of observed values, Exp is the 3 × 3 array of expected values and df = (row count – 1) (column count – 1) = 2 ∙ 2 = 4. Since

CHITEST(B6:D8,H6:J8) = 0.003273 < .05 = α

we reject the null hypothesis and conclude that the level of schooling attained is not independent of parents’ wealth.

Example 2: A researcher wants to know whether there is a significant difference in two therapies for curing patients of cocaine dependence (defined as not taking cocaine for at least 6 months). She tests 150 patients and obtains the results in the upper left part of the table below (labeled Observed Values).

Figure 2 – Chi-square tests for independence

We establish the following null hypothesis:

H0: There is no difference between the two therapies’ ability to cure cocaine dependence

We next calculate the Expected Values from the Observed Values and then the p-value of the chi-square statistic as we did in Example 1. This time, however, we will use the approach employed in Example 2 of Goodness of Fit, namely calculating the Pearson’s chi-square test statistic directly (using Definition 2 of Goodness of Fit). The value of this statistic is 5.516 (cell D17 in Figure 2). Since we are dealing with a 2 × 2 table of observations, df = (2 – 1)(2 – 1) = 1. Finally we observe that

p-value = CHIDIST(χ2, df) = CHIDIST(5.516,1) = .0188 < .05 = α

χ2-crit = CHIINV(α, df) = CHIINV(.05,1) = 3.841 < 5.516 = χ2-obs

and so we reject the null hypothesis and conclude there is a significant difference in the cure rate between the two therapies.
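
Both decision rules can be reproduced outside Excel. The sketch below relies on the fact that, for df = 1, the chi-square upper tail equals erfc(√(x/2)); the statistic and critical value are the ones quoted above, and the variable names are illustrative.

```python
import math

chi_sq = 5.516   # Pearson statistic reported for Example 2
alpha = 0.05
crit = 3.841     # CHIINV(.05, 1) as quoted in the text

# For df = 1 the chi-square upper tail is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_sq / 2))

reject_by_p = p_value < alpha      # p-value approach
reject_by_crit = chi_sq > crit     # critical-value approach
print(round(p_value, 4), reject_by_p, reject_by_crit)
```

The two approaches always agree, since the critical value is simply the statistic whose p-value equals α.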

As was mentioned in Goodness of Fit, the maximum likelihood test is a more precise version of the chi-square test employed thus far. The lower right-hand side of the worksheet in Figure 2 shows how to calculate the maximum likelihood statistic (using Definition 1 of Goodness of Fit). The value of this statistic is 5.725, which is not much different from the test statistic we obtained using the Pearson’s test. Since this statistic is also approximately chi-square with one degree of freedom, the analysis is quite similar:

p-value = CHIDIST(χ2, df) = CHIDIST(5.725,1) = .0167 < .05 = α

χ2-crit = CHIINV(α, df) = CHIINV(.05,1) = 3.841 < 5.725 = χ2-obs

and so once again, we reject the null hypothesis and conclude there is a significant difference in the results for the two therapies.
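
To compare the two statistics side by side, here is a minimal Python sketch. It assumes the maximum likelihood statistic takes the usual likelihood-ratio form 2 ∙ Σ obs ∙ ln(obs/exp); the counts in Figure 2 are not reproduced here, so the 2 × 2 table below is hypothetical, as are the function names.

```python
import math

def pearson_stat(observed, expected):
    """Pearson's chi-square statistic: sum of (obs - exp)^2 / exp."""
    return sum((o - e) ** 2 / e
               for row_o, row_e in zip(observed, expected)
               for o, e in zip(row_o, row_e))

def max_likelihood_stat(observed, expected):
    """Likelihood-ratio statistic: 2 * sum of obs * ln(obs / exp)."""
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e)
                   for row_o, row_e in zip(observed, expected)
                   for o, e in zip(row_o, row_e) if o > 0)

# Hypothetical 2 x 2 observed counts and their expected counts.
obs = [[30, 20], [20, 30]]
exp = [[25.0, 25.0], [25.0, 25.0]]

pearson = pearson_stat(obs, exp)
max_lik = max_likelihood_stat(obs, exp)
print(round(pearson, 3), round(max_lik, 3))
```

As in Example 2, the two values are close, and both are compared against the same chi-square distribution with one degree of freedom.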

Observation: It is very important to include all observations in the test. E.g. if in Example 2 we only test Cured vs. Therapy 1 and 2, we will get erroneous results. We need to include Not Cured as well as Cured.

Real Statistics Excel Functions: The following supplemental functions are provided in the Real Statistics Resource Pack:

CHI_STAT2(R1, R2) = Pearson’s chi-square statistic for observation values in range R1 and expectation values in range R2

CHI_MAX2(R1, R2) = Maximum likelihood chi-square statistic for observation values in range R1 and expectation values in range R2

CHI_STAT(R1) = Pearson’s chi-square statistic for observation values in range R1. This is CHI_STAT2(R1, R2) where R2 is the expectation values calculated from R1.

CHI_MAX(R1) = Maximum likelihood chi-square statistic for observation values in range R1. This is CHI_MAX2(R1, R2) where R2 is the expectation values calculated from R1.

CHI_TEST(R1) = p-value for Pearson’s chi-square statistic for observation values in range R1. This is CHITEST(R1, R2) where R2 is the expectation values calculated from R1.

CHI_MAX_TEST(R1) = p-value for Maximum likelihood chi-square statistic for observation values in range R1

The ranges R1 and R2 must contain only numeric values.

Real Statistics Data Analysis Tool: In addition, the Real Statistics Resource Pack provides a supplemental Chi-Square Test data analysis tool. To use this tool for Example 1 enter Ctrl-m and select the Chi-square Test option. A dialog box as in Figure 3 appears.

Figure 3 – Dialog box for Chi-square Test

Insert the observation data into the Input Range (excluding the totals, but optionally including the row and column headings; i.e. range A5:D8) and click on the Excel format radio button. Leave the Fisher Exact Test option unchecked (although see Fisher Exact Test for use of this option) and press the OK button.

The data analysis tool builds an array with the expected values and performs both the Pearson’s and maximum likelihood chi-square tests. The Cramer effect size and, for 2 × 2 contingency tables, the Odds Ratio effect size (as described in Effect Size for Chi-square) are also calculated. The output from the data analysis tool for the data in Example 1 is shown in Figure 4.

Figure 4 – Chi-Square data analysis tool output for Example 1

Observation: As described in Goodness of Fit, the expected frequency for any cell in the contingency table should generally be at least 5. With small tables (especially 2 × 2 tables), cells with expected frequencies of at least 10 would be preferable.

For large contingency tables, a small percentage of cells with expected frequency of less than 5 can be acceptable. Even for smaller contingency tables having one cell with expected frequency of less than 5 may not cause big problems, but it is probably a better choice to use Fisher’s Exact Test in this case. In any event, you should avoid using the chi-square test where there is an expected frequency of less than 1 in any cell.

If the expected frequency for one or more cells is less than 5, it may be beneficial to combine cells (e.g. by merging rows or columns) so that this condition can be met, although this must be done in such a way as to not bias the results.
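
These rules of thumb are easy to check once the expected values are available. The following Python sketch simply summarizes a table of expected counts against the thresholds of 5 and 1 mentioned above; the function name and the expected counts are hypothetical.

```python
def expected_count_report(expected):
    """Summarize expected counts against the thresholds of 5 and 1."""
    cells = [e for row in expected for e in row]
    below_5 = sum(1 for e in cells if e < 5)
    return {
        "min_expected": min(cells),
        "share_below_5": below_5 / len(cells),    # fraction of cells under 5
        "any_below_1": any(e < 1 for e in cells), # chi-square should be avoided
    }

# Hypothetical expected counts for a 2 x 3 table.
report = expected_count_report([[12.0, 4.5, 8.5],
                                [6.0, 0.8, 9.2]])
print(report)
```

In this made-up table a third of the cells fall below 5 and one falls below 1, so combining categories or using an exact test would be the safer route.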

Observation: In addition to the usual Excel input data format, the Real Statistics Chi Square Test data analysis tool supports another input data format called standard format. This format is similar to that used by SPSS and other statistical analysis programs.

Example 3: A survey is conducted of 38 young adults whose parents are classified either as wealthy, middle class or poor to determine whether they will graduate from university or not. The results are summarized in the table on the left side of Figure 5 (only the first 13 of 38 rows of data are shown). Based on the data collected, is a person’s level of schooling independent of their parents’ wealth?

Figure 5 – Data and chi-square tests for Example 3

Once again enter Ctrl-m and select the Chi-square data analysis tool. When the dialog box shown in Figure 3 appears, insert A3:B41 into the Input Range, click on the Standard format radio button and press the OK button.

The data analysis tool first builds a contingency table (range D5:F8 of Figure 5) and then performs the same type of analysis as for Examples 1 and 2. Since sig = no (cell R11 or R12) we cannot reject the null hypothesis that a student’s graduating from university is independent of his/her parents’ level of income.
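
The conversion from standard format to a contingency table amounts to counting how often each pair of categories occurs. A minimal Python sketch follows; the function name and the six records are hypothetical and are not the data from Figure 5.

```python
from collections import Counter

def contingency_from_pairs(pairs):
    """Build a contingency table from (row category, column category) records."""
    counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    # Counter returns 0 for category pairs that never occur.
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

# Hypothetical records in standard format: (parents' wealth, graduated?).
records = [("wealthy", "yes"), ("poor", "no"), ("middle", "yes"),
           ("wealthy", "yes"), ("poor", "yes"), ("middle", "no")]
rows, cols, table = contingency_from_pairs(records)
print(rows, cols, table)
```

The resulting table can then be analyzed in the same way as a table entered directly in Excel format.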

47 Responses to Independence Testing

  1. rajesh bansal says:

    p value is 3.28678E-14 . WHAT DOES IT MEAN. PLZ TELL

    • Charles says:

      Rajesh,
      This is a number written in scientific notation, i.e. 3.28678 x 10^(-14). This is a very small number, almost zero.
      Charles

  2. Rosshan Yadav says:

    how to calculate likert data in chi-square test ?
    suppose we take standard likert scale 1-5.
    plz show me with example.

  3. Tanya says:

    Hello: I am trying to analyze whether patient feedback influences a doctor to recommend a certain treatment.
    The data I have is “are you likely to recommend treatment X” (3 rows: yes, somewhat, no) and “did you receive positive feedback about treatment X from your patients” (3 columns: yes, no, don’t know). The data is below. I did a chi-test analysis with a p-value of 0.43.
    5 1 2
    8 0 5
    2 1 4

    The other data I have is “are you likely to recommend treatment X” (3 rows: yes, somewhat, no) and “did you receive negative feedback about treatment X” (3 columns: yes, no, don’t know). Data below. The chi-test p value I got is 0.07.
    1 5 2
    6 4 3
    1 1 5

    I would conclude that neither positive or negative feedback influences docs to recommend treatment X, but want to make sure the chi-test is the right one to use?

  4. Inês says:

    Hi Charles,

    So I have a problem. I need to cross the variable “field of profession” with a yes/no question in SPSS. The problem is that when i do cross this variables I get that 42.9% have expected count less than 5, which means the test isn’t valid. Since my table is 7×2 I can´t read the results from Fisher’s Test. What can I do? Is it ok to combine some fields of profession (rows)..for instance..is it ok to combine phisician with nurse in the same row? Doesn’t this change the results?

    Thanks

  5. Yan Win Soe says:

    Hello,

    May I know Chi-square test for homogeneity.
    e.g. Null Hypothesis : P1=N1 and P2=N2 and P3=N3
    Alternative Hypothesis:
    P1 is not equal N1 or
    P2 is not equal N2 or
    P3 is not equal N3
    If we reject Null Hypothesis. There we have to find for 95% CI for each proportion so that we can prove which pair is not equal in reality. For this case how can we calculate by using Excel. Could you please explain me? Thanks

    • Charles says:

      Yan Win Soe,
      Are you referring to three-way contingency tables? If so, please see the following webpage:
      Log-Linear Regression
      Charles

      • christine says:

        Hello Charles,
        Kindly clarify this for me;
        How do you treat statistical significance tests using age ranges or years of experience.
        say you have one group made up of managers and you want to establish each of their opinion on a process by their years of experience (7 point LIKERT).
        manager = 8
        years of experience: 0-3 (4), 4-6 (3), 10 + (1)
        Which tests is most appropriate?
        Do you take an average of each range?

        Thank you.

  6. Shri says:

    Hi Charles,

    Can I used Chi square test to understand if Age or tenure has any relation to employee quitting.

    So the table would have Age range in row and Resigned or Active in column

    Kind regards

    Shri

    • Charles says:

      Hi Shri,
      Yes, you can create such a 2 x 2 contingency table and use the chi-square test for independence. You didn’t seem to factor tenure into this table, though.
      Charles

      • Shri says:

        Hi Charles,

        Thanks.

        Sorry to bother with these questions. Trying to use stats for decision making in HR.

        I was planning to use tenure in a separate table, comparing tenure and turnover, does that make sense.
        OR

        Can I use tenure and age in one table but it would not include turnover in it.

  7. Able Yeung says:

    Hello Sir/Madam,

    Can you please help give me some advices on how to conduct a Chi-Square test to understand whether the gender data from my sample dataset has any deviations from the census dataset? p>0.05 or not?

    My Gender Dataset is
    Male 307
    Female 330

    The Census Dataset is
    Male 3303015
    Female 3768561

    It is because the SAV or Excel file in SPSS failed to add as large as over 3303015 and 3768561 and grid cells, so I am facing a headache in doing the aforesaid analysis.

    Thank you very much.

    • Charles says:

      I can’t comment on SPSS, but if you use the Real Statistics Chi-square data analysis tool, you will get a result, even for such large data elements.
      Charles

  8. sarah says:

    hello!
    I am trying to do a chi test on a 5*5 table but the 5th line is full of 0. the test comes invalid but when i eliminate the last line which is only 0 (it becomes a 4*5 table) it works and i get results. How should I proceed??

    • Charles says:

      Sarah,
      You shouldn’t use chi-square if you have cells with zero values (or even a lot of cells with values less than 5). Either you should eliminate the last row or combine it with the row above.
      Charles

  9. Robert Dalton says:

    Charles,
    I am running a 5×5 Chi square test in Excel format. I am getting my Expected Values and Summary and they look to be fine. However, I am getting an error in my chi-sq, p-value, , sig, and Cramer V for both Pearson’s and Max likelihood. The only thing showing up is my x-crit. I am running 4.4 Excel 2007.
    I should mention that I have several cells that the number is 0 in my 5×5.

    Thanks!!!

    • Robert Dalton says:

      I had two full rows that had zeros in my 5×5. I removed them and ran it again as a 3 rows x 5 columns. This time it worked. Will this cause a bias the results or since they are zero does it matter?

      • Charles says:

        Robert,
        Eliminating full rows or columns with zeros is fine, but if the table that remains contain one or more cells containing a zero, then the test is not considered to be valid. You can get around this by combining rows or columns.
        Charles

    • Charles says:

      The test is not valid if there are cells which contain a zero.
      In this case you might use the Fisher Exact Test, although the usual version of this test is for a 2 x 2 table.
      Charles

  10. Fereshteh says:

    Hi,I have a gene expression file which contains numerical data and has high dimension.how can I use chi test in my data for dimension reduction.thanks

  11. shiv says:

    hi sir, please explain how i feed the data to calculate chi-square test with the help of real statistics data analysis tool?
    sir, please explain with any new example of share any video for that.
    please sir,
    compile error in hidden module: froinput is shown again and again

    • Charles says:

      The Real Statistics Chi-square data analysis tool accepts input data in either (a) Excel format (i.e. contingency table format) as shown in range A5:D8 of Figure 1 of the referenced webpage or (b) standard format (also called stacked format) as shown in Figure 5.

      Regarding the compile error, please let me know the following information:

      1. What do you see when you enter the formula =VER() in any cell?
      2. What release of Excel and Windows are you using?
      3. What language are you using (English, French, etc.)?

      Charles

  12. Iddy says:

    I like this page, it is meangfull to not only staticians but also to all researchers….

  13. Anthony says:

    Hi Charles, you did a great job here. I would like to clarify something. Below are my data set:
    134 8 3
    310 282 99
    1404 127 267
    700 7 1
    874 83 53
    20 238 262
    130 18 74
    161 68 132
    41 0 3

    A total of 9 rows and 3 columns. Why is my p-value = 0. What does this mean? I hope i calculated it correctly. My p-values is zero, which makes my χ2-crit equally 0. Please help!

    • Charles says:

      Anthony,

      I calculated that chi-square stat = 2159.44 and p-value = CHIDIST(2159.44,16) = 0. This means that the test is highly significant. Thus the two variables that you are testing are not independent.

      It does not mean that χ2-crit = 0. In fact, χ2-crit = CHIINV(.05,16) = 26.30 (assuming an alpha value of .05). Since χ2 > χ2-crit, once again we conclude there is a significant result.

      Charles

  14. karishma says:

    hey..
    i tried this module but it shows me an error saying “compile error in hidden module: chiSquare” can u please tell me how do i solve tht error….

    thanks..

    • Charles says:

      When you enter the formula =VER() in any worksheet cell what result do you get? Also which version of Excel are you running?
      Charles

  15. Dick Colby says:

    All much too complex for my needs, which are very simple: do 40 male and 60 female fruit flies fit a 1:1 expectation at P<0.05? How do I do it in Excel? My contingency table has four columns (o, Ho, e and Chi-square) and three rows (male, female) and sum. How do I do it in Excel?

    • Charles says:

      Dick,

      You seem to be conducting a goodness of fit test (which is related to an independence test, but slightly different). The approach is similar to that shown in Example 3 of http://www.real-statistics.com/chi-square-and-f-distributions/goodness-of-fit/.

      You create the following table in Excel (which is similar to the table you described):
      row 1: Gender, Obs, Exp, Chi-sq (in columns A, B, C, D)
      row 2: Male, 40, 50, blank
      row 3: Female, 60, 50, blank
      row 4: Sum, =SUM(B2:B3), =SUM(C2:C3),=CHITEST(B2:B3,C2:C3)

      Charles

      • Dick Colby says:

        Charles: Except that the answer you get is incorrect: 0.046. Should be 10 squared, divided by 50, then multiplied by 2 = 4.0. What has gone wrong?

        • Charles says:

          Dick,
          I don’t see any example on the referenced webpage where I get an answer of 0.046. Which example are you referring to?
          Charles

          • Dick Colby says:

            The example I gave you above:
            row 1: Gender, Obs, Exp, Chi-sq (in columns A, B, C, D)
            row 2: Male, 40, 50, blank
            row 3: Female, 60, 50, blank
            row 4: Sum, =SUM(B2:B3), =SUM(C2:C3),=CHITEST(B2:B3,C2:C3)

          • Charles says:

            Dick,

            Sorry, the formula =CHITEST(B2:B3,C2:C3) is not correct. It should be =FIT_TEST(B2:B3,C2:C3). This results in p-value = .0455.

            An alternative approach is to use

            row 1: Gender, Obs, Exp, Chi-sq (in columns A, B, C, D)
            row 2: Male, 40, 50, =(B2-C2)^2/C2
            row 3: Female, 60, 50, =(B3-C3)^2/C3
            row 4: Sum, =SUM(B2:B3), =SUM(C2:C3),=SUM(D2:D3)

            Thus cell D4 will contain the value chi-sq = 4 and the value of the test is given by the formula =CHIDIST(D4,1), which has value .0455.

            Charles

  16. Trang says:

    Hello Charles!
    I am looking for one statistic test for my data to find the significance or independence of variables or association between them. I made the table of cross tabulation of variables for frequency. Definitely, they are categorical variable. However, there are the values of actual (observed) counts which are equal to 0; so some values of expected counts are less than 1. How should I do in this case?

    • Charles says:

      Unfortunately Trang I don’t fully understand the situation you are describing (especially the part about the observed counts being equal to 0). Can you provide a more complete description?
      Charles

  17. Kate Ge says:

    How can I test those in the following by chi-square test
    Say: people have two staged choice. In the first stage, he either choose A or B. After his choice in stage 1 , he face another choice C or D. His two-staged choice could lead to two kind of results, either true or false. I want test with the combination of A+ C would lead to true results, significantly.
    I’ve done a 2*4 chi-square test by listed all the combination : A+C, A+D, B+C, B+D. But it could only explain different combination have different influence toward the results. How could I know if A+C have significant influence?

    Thanks

    • Charles says:

      Kate,
      I don’t quite understand what you are trying to accomplish. From what I understand it doesn’t seem like you are testing for independence, which is what the chi-square test for independence is designed to accomplish. How what you are testing fit with a test for independence?
      Charles

  18. John Levec says:

    I’m a little confused about when to use which of the formulas. What is the difference between CHITEST, CHIDIST and CHIINV? And which one would I use to find the p-value?

    Thanks.

    • Charles says:

      John,

      CHITEST and CHIDIST can be used to calculate the p-value. CHIDIST is used to calculate the p-value when you know the value of the statistic and the df. When you have a set of data and can calculate the expected values from that data then CHITEST can be used (see the website for a description of how to calculate the expected values or use one of the supplemental functions provided by the Real Statistics Resource Pack to do this).

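      CHITEST(R1, R2) = CHIDIST(χ^2, df) where R1 = the array of observed data, R2 = the array of expected values, χ^2 is calculated from R1 and R2 and df = (row count – 1)(column count – 1) when R1 has more than one row and more than one column, or the number of elements in R1 (or R2) minus 1 when R1 is a single row or column.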

      CHIINV is the inverse function. It tells you what value of the statistic will produce a p-value of a certain size.

      I suggest that you read the first four topics on http://www.real-statistics.com/chi-square-and-f-distributions/ for a more complete explanation.

      Charles

  19. Machelle Wilson says:

    How to I call the Chi Square data analysis tool? I’ve installed the tools from this website, but nothing additional shows up in the ‘data analysis’ menu. The additional chi square functions from your resource pack show up in the function list, but nothing additional in the data analysis menu.

    Thanks.
