The method described in Goodness of Fit can also be used to determine whether two sets of data are independent of each other. Such data are organized in what are called **contingency tables**, as described in Example 1. In these cases *df* = (row count – 1) (column count – 1).

**Excel Function**: The CHITEST function described in Goodness of Fit can be extended to support ranges consisting of multiple rows and columns. In this case, we have:

**CHITEST**(R1, R2) = CHIDIST(*x, df*) where *x* is the chi-square statistic, R1 = the array of observed data, R2 = the array of expected values and *df* = (row count – 1) (column count – 1).

The ranges R1 and R2 must have the same size and shape and can only contain numeric values.

**Example 1**: A survey is conducted of 175 young adults whose parents are classified as wealthy, middle class or poor to determine their highest level of schooling (graduated from university, graduated from high school or neither). The results are summarized on the left side of Figure 1 (Observed Values). Based on the data collected, is a person’s level of schooling independent of their parents’ wealth?

**Figure 1 – Observed data and expected values for Example 1**

We set the null hypothesis to be

H_{0}: Highest level of schooling attained is independent of parents’ wealth

We use the chi-square test, and so need to calculate the expected values that correspond to the observed values in the table above. To accomplish this, we use the fact (by Definition 3 of Basic Probability Concepts) that if *A* and *B* are independent events then *P*(*A* ∩ *B*) = *P*(*A*) ∙ *P*(*B*). We also assume that the sample proportions are good estimates of the corresponding probabilities.

We now show how to construct the table of expected values (i.e. the Expected Values in Figure 1). We know that 45 of the 175 people in the sample are from wealthy families, and so the probability that someone in the sample is from a wealthy family is 45/175 = 25.7%. Similarly, the probability that someone in the sample graduated from university is 68/175 = 38.9%. But based on the null hypothesis, the event of being from a wealthy family is independent of graduating from university, and so the expected probability of both events is simply the product of the two probabilities, or 25.7% ∙ 38.9% = 10.0%. Thus, based on the null hypothesis, we expect that 10.0% of 175 = 17.5 people are from a wealthy family and have graduated from university.

In this way we can fill out the table for expected values. We start by setting all the totals in the Expected Values table to be the same as the corresponding total in the Observed Values table (e.g. cell K6 contains the formula =E6). We then set the value of every cell in the Expected Values table to be

(row total ∙ col total) / grand total

E.g. cell H6 contains the formula =K6*H9/K9. An alternative approach for filling in all the cells in the Expected Values table is to place the following array formula in range H6:J8 (and then press **Ctrl-Shift-Enter**):

=MMULT(K6:K8,H9:J9)/K9
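The MMULT formula above is an outer product of the row totals and column totals divided by the grand total. As a cross-check outside Excel, the same idea can be sketched in Python. This is only an illustration: the function name `expected_values` and the 2 × 2 `sample` table are made up for the example, and only the totals 45, 68 and 175 come from Example 1.

```python
def expected_values(observed):
    """Expected counts under independence: (row total * col total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Single cell from Example 1: wealthy (45) and university (68) out of 175 people.
wealthy_university = 45 * 68 / 175

# Hypothetical 2 x 2 table, used only to exercise the function.
sample = [[10, 20], [30, 40]]
sample_expected = expected_values(sample)

print(round(wealthy_university, 1))
print(sample_expected)
```

The margins of the expected table match the margins of the observed table by construction, which is why the totals can simply be copied across in the worksheet.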

See Matrix Operations for more information about the MMULT array function. We can now calculate the p-value for the chi-square test statistic as CHITEST(*Obs*, *Exp*), where *Obs* is the 3 × 3 array of observed values and *Exp* is the 3 × 3 array of expected values. CHITEST determines the degrees of freedom from the shape of the ranges: *df* = (row count – 1) (column count – 1) = 2 ∙ 2 = 4. Since

CHITEST(B6:D8,H6:J8) = 0.003273 < .05 = *α*

we reject the null hypothesis and conclude that the level of schooling attained is not independent of parents’ wealth.

**Example 2**: A researcher wants to know whether there is a significant difference in two therapies for curing patients of cocaine dependence (defined as not taking cocaine for at least 6 months). She tests 150 patients and obtains the results in the upper left part of the table below (labeled Observed Values).

**Figure 2 – Chi-square tests for independence**

We establish the following null hypothesis:

H_{0}: There is no difference between the two therapies’ ability to cure cocaine dependence

We next calculate the Expected Values from the Observed Values and then the p-value of the chi-square statistic as we did in Example 1. This time, however, we will use the approach employed in Example 2 of Goodness of Fit, namely calculating the Pearson’s chi-square test statistic directly (using Definition 2 of Goodness of Fit). The value of this statistic is 5.516 (cell D17 in Figure 2). Since we are dealing with a 2 × 2 table of observations, *df* = (2 – 1)(2 – 1) = 1. Finally we observe that

p-value = CHIDIST(χ^{2}, *df*) = CHIDIST(5.516,1) = .0188 < .05 = *α*

χ^{2}-crit = CHIINV(*α, df*) = CHIINV(.05,1) = 3.841 < 5.516 = χ^{2}-obs

and so we reject the null hypothesis and conclude there is a significant difference in the cure rate between the two therapies.
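For readers who want to verify the CHIDIST value without Excel, here is a minimal Python sketch. It relies on the fact that a chi-square variable with one degree of freedom is a squared standard normal, so the upper-tail probability can be written with the complementary error function. The helper name `chi2_pvalue_df1` is hypothetical, and the sketch only applies when *df* = 1.

```python
import math

def chi2_pvalue_df1(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # With df = 1 the statistic is a squared standard normal, so
    # P(X > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))

alpha = 0.05
p_value = chi2_pvalue_df1(5.516)        # Pearson statistic from Example 2
p_at_critical = chi2_pvalue_df1(3.841)  # critical value; should be close to alpha

print(f"{p_value:.4f}", f"{p_at_critical:.4f}", p_value < alpha)
```

Comparing the p-value with *α* and comparing the statistic with the critical value are two views of the same decision, so they always agree.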

As was mentioned in Goodness of Fit, the maximum likelihood test is a more precise version of the chi-square test employed thus far. The lower right-hand side of the worksheet in Figure 2 shows how to calculate the maximum likelihood statistic (using Definition 1 of Goodness of Fit). The value of this statistic is 5.725, which is not much different from the test statistic we obtained using the Pearson’s test. Since this statistic is also approximately chi-square with one degree of freedom, the analysis is quite similar:

p-value = CHIDIST(χ^{2}, *df*) = CHIDIST(5.725,1) = .0167 < .05 = *α*

χ^{2}-crit = CHIINV(*α, df*) = CHIINV(.05,1) = 3.841 < 5.725 = χ^{2}-obs

and so once again, we reject the null hypothesis and conclude there is a significant difference in the results for the two therapies.
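The two statistics can be compared side by side in a short Python sketch. It assumes the usual forms, namely Σ (obs – exp)² / exp for Pearson’s statistic and 2 Σ obs ∙ ln(obs / exp) for the maximum likelihood statistic; since Definitions 1 and 2 of Goodness of Fit are not reproduced on this page, treat these forms as assumptions. The 2 × 2 table is hypothetical and is not the data from Figure 2.

```python
import math

def pearson_stat(observed, expected):
    """Pearson's chi-square statistic: sum of (obs - exp)^2 / exp over all cells."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

def max_likelihood_stat(observed, expected):
    """Maximum likelihood statistic: 2 * sum of obs * ln(obs / exp)."""
    # Cells with an observed count of 0 are skipped, since o * ln(o) -> 0 as o -> 0.
    return 2 * sum(o * math.log(o / e)
                   for obs_row, exp_row in zip(observed, expected)
                   for o, e in zip(obs_row, exp_row)
                   if o > 0)

# Hypothetical observed counts and their expected values under independence.
observed = [[10, 20], [30, 40]]
expected = [[12.0, 18.0], [28.0, 42.0]]

chi_sq = pearson_stat(observed, expected)
g_stat = max_likelihood_stat(observed, expected)
print(round(chi_sq, 3), round(g_stat, 3))
```

As in Example 2, the two values are close to each other when the observed counts do not stray far from the expected ones.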

**Observation**: It is very important to include all observations in the test. E.g. if in Example 2 we only test Cured vs. Therapy 1 and 2, we will get erroneous results. We need to include Not Cured as well as Cured.

**Real Statistics Excel Functions**: The following supplemental functions are provided in the Real Statistics Resource Pack:

**CHI_STAT2**(R1, R2) = Pearson’s chi-square statistic for observation values in range R1 and expectation values in range R2

**CHI_MAX2**(R1, R2) = Maximum likelihood chi-square statistic for observation values in range R1 and expectation values in range R2

**CHI_STAT**(R1) = Pearson’s chi-square statistic for observation values in range R1. This is CHI_STAT2(R1, R2) where R2 is the expectation values calculated from R1.

**CHI_MAX**(R1) = Maximum likelihood chi-square statistic for observation values in range R1. This is CHI_MAX2(R1, R2) where R2 is the expectation values calculated from R1.

**CHI_TEST**(R1) = p-value for Pearson’s chi-square statistic for observation values in range R1. This is CHITEST(R1, R2) where R2 is the expectation values calculated from R1.

**CHI_MAX_TEST**(R1) = p-value for Maximum likelihood chi-square statistic for observation values in range R1

The ranges R1 and R2 must contain only numeric values.

**Real Statistics Data Analysis Tool**: In addition, the Real Statistics Resource Pack provides a supplemental **Chi-Square Test** data analysis tool. To use this tool for Example 1, press **Ctrl-m** and select the **Chi-square Test** option. A dialog box as in Figure 3 appears.

**Figure 3 – Dialog box for Chi-square Test**

Insert the observation data into the **Input Range** (excluding the totals, but optionally including the row and column headings; i.e. range A5:D8), click on the **Excel format** radio button and press the **OK** button. Leave the **Fisher Exact Test** option unchecked (although, see Fisher Exact Test for use of this option).

The data analysis tool builds an array with the expected values and performs both the Pearson’s and maximum likelihood chi-square tests. The Cramer’s V effect size and, for 2 × 2 contingency tables, the Odds Ratio effect size are also calculated, as described in Effect Size for Chi-square. The output from the data analysis tool for the data in Example 1 is shown in Figure 4.

**Figure 4 – Chi-Square data analysis tool output for Example 1**

**Observation**: As described in Goodness of Fit, the expected frequency for any cell in the contingency table should generally be at least 5. With small tables (especially 2 × 2 tables), cells with expected frequencies of at least 10 would be preferable.

For large contingency tables, a small percentage of cells with expected frequency of less than 5 can be acceptable. Even for smaller contingency tables having one cell with expected frequency of less than 5 may not cause big problems, but it is probably a better choice to use Fisher’s Exact Test in this case. In any event, you should avoid using the chi-square test where there is an expected frequency of less than 1 in any cell.

If the expected frequency for one or more cells is less than 5, it may be beneficial to combine cells (by merging rows or columns) so that this condition can be met, although this must be done in such a way as to not bias the results.
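The guidelines above can be checked mechanically. The following Python sketch computes the share of cells whose expected frequency falls below a threshold and shows the effect of merging two sparse rows. The helper names (`share_below`, `combine_rows`) and the 4 × 2 table are hypothetical and serve only as an illustration.

```python
def expected_values(observed):
    """Expected counts under independence: (row total * col total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def share_below(expected, threshold=5):
    """Fraction of cells whose expected frequency is below the threshold."""
    cells = [e for row in expected for e in row]
    return sum(e < threshold for e in cells) / len(cells)

def combine_rows(observed, i, j):
    """Return a new table in which rows i and j are replaced by their sum."""
    merged = [a + b for a, b in zip(observed[i], observed[j])]
    return [row for k, row in enumerate(observed) if k not in (i, j)] + [merged]

# Hypothetical 4 x 2 table with two sparse rows.
table = [[12, 15], [4, 3], [5, 2], [20, 18]]
share_before = share_below(expected_values(table))

combined = combine_rows(table, 1, 2)
share_after = share_below(expected_values(combined))
print(share_before, share_after)
```

Which rows to merge is a substantive decision (the merged categories should make sense together); the code only shows the arithmetic.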

**Observation**: In addition to the usual Excel input data format, the Real Statistics **Chi-square Test** data analysis tool supports another input data format called **standard format**. This format is similar to that used by SPSS and other statistical analysis programs.

**Example 3**: A survey is conducted of 38 young adults whose parents are classified as wealthy, middle class or poor to determine whether they will graduate from university or not. The results are summarized in the table on the left side of Figure 5 (only the first 13 of 38 rows of data are shown). Based on the data collected, is a person’s level of schooling independent of their parents’ wealth?

**Figure 5 – Data and chi-square tests for Example 3**

Once again press **Ctrl-m** and select the **Chi-square Test** data analysis tool. When the dialog box shown in Figure 3 appears, insert A3:B41 into the **Input Range**, click on the **Standard format** radio button and press the **OK** button.

The data analysis tool first builds a contingency table (range D5:F8 of Figure 5) and then performs the same type of analysis as for Examples 1 and 2. Since *sig* = no (cell R11 or R12), we cannot reject the null hypothesis that a student’s graduating from university is independent of his/her parents’ level of income.
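The conversion from standard (stacked) format to a contingency table is a simple tally of the pairs. The Python sketch below illustrates the idea; it is not the Real Statistics implementation, the function name `contingency_table` is hypothetical, and the seven records are made up rather than taken from the 38 rows of Example 3.

```python
from collections import Counter

def contingency_table(pairs):
    """Tally (row category, column category) pairs into a table of counts."""
    counts = Counter(pairs)
    row_labels = sorted({r for r, _ in pairs})
    col_labels = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur in the data.
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

# Hypothetical stacked data: (parents' wealth, graduated from university).
records = [
    ("wealthy", "yes"), ("wealthy", "yes"), ("middle", "no"), ("poor", "no"),
    ("middle", "yes"), ("poor", "no"), ("wealthy", "no"),
]

rows, cols, table = contingency_table(records)
print(rows, cols, table)
```

Once the table of counts is built, the analysis proceeds exactly as for data entered in Excel format.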

Hi Charles,

So I have a problem. I need to cross the variable “field of profession” with a yes/no question in SPSS. The problem is that when I cross these variables I get that 42.9% of cells have an expected count less than 5, which means the test isn’t valid. Since my table is 7×2 I can’t read the results from Fisher’s Test. What can I do? Is it OK to combine some fields of profession (rows)? For instance, is it OK to combine physician with nurse in the same row? Doesn’t this change the results?

Thanks

Yes, you can combine rows or columns of cells to get a cell count which is sufficiently large.

Charles

Hello,

May I know Chi-square test for homogeneity.

e.g. Null Hypothesis : P1=N1 and P2=N2 and P3=N3

Alternative Hypothesis:

P1 is not equal N1 or

P2 is not equal N2 or

P3 is not equal N3

If we reject Null Hypothesis. There we have to find for 95% CI for each proportion so that we can prove which pair is not equal in reality. For this case how can we calculate by using Excel. Could you please explain me? Thanks

Yan Win Soe,

Are you referring to three-way contingency tables? If so, please see the following webpage:

Log-Linear Regression

Charles

Hello Charles,

Kindly clarify this for me;

How do you treat statistical significance tests using age ranges or years of experience.

say you have one group made up of managers and you want to establish each of their opinion on a process by their years of experience (7 point LIKERT).

manager = 8

years of experience: 0-3 (4), 4-6 (3), 10 + (1)

Which tests is most appropriate?

Do you take an average of each range?

Thank you.

Christine,

Sorry, but I don’t understand the situation that you are describing.

Charles

Hi Charles,

Can I use the chi-square test to understand if age or tenure has any relation to employees quitting?

So the table would have Age range in row and Resigned or Active in column

Kind regards

Shri

Hi Shri,

Yes, you can create such a 2 x 2 contingency table and use the chi-square test for independence. You didn’t seem to factor tenure into this table, though.

Charles

Hi Charles,

Thanks.

Sorry to bother with these questions. Trying to use stats for decision making in HR.

I was planning to use tenure in a separate table, comparing tenure and turnover, does that make sense.

OR

Can I use tenure and age in one table but it would not include turnover in it.

Shri,

Yes, you can do this. It all depends on what you want to test.

Charles

Hello Sir/Madam,

Can you please give me some advice on how to conduct a chi-square test to understand whether the gender data from my sample dataset deviates from the census dataset? Is p > 0.05 or not?

My Gender Dataset is

Male 307

Female 330

The Census Dataset is

Male 3303015

Female 3768561

It is because the SAV or Excel file in SPSS failed to add as large as over 3303015 and 3768561 and grid cells, so I am facing a headache in doing the aforesaid analysis.

Thank you very much.

I can’t comment on SPSS, but if you use the Real Statistics Chi-square data analysis tool, you will get a result, even for such large data elements.

Charles

hello!

I am trying to do a chi test on a 5*5 table but the 5th line is full of 0. the test comes invalid but when i eliminate the last line which is only 0 (it becomes a 4*5 table) it works and i get results. How should I proceed??

Sarah,

You shouldn’t use chi-square if you have cells with zero values (or even a lot of cells with values less than 5). Either you should eliminate the last row or combine it with the row above.

Charles

Charles,

I am running a 5×5 chi-square test in Excel format. I am getting my Expected Values and Summary and they look to be fine. However, I am getting an error in my chi-sq, p-value, sig, and Cramer V for both Pearson’s and Max likelihood. The only thing showing up is my x-crit. I am running 4.4 Excel 2007.

I should mention that I have several cells that the number is 0 in my 5×5.

Thanks!!!

I had two full rows that had zeros in my 5×5. I removed them and ran it again as 3 rows x 5 columns. This time it worked. Will this bias the results, or since they are zero does it not matter?

Robert,

Eliminating full rows or columns with zeros is fine, but if the table that remains contains one or more cells with a zero, then the test is not considered to be valid. You can get around this by combining rows or columns.

Charles

The test is not valid if there are cells which contain a zero.

In this case you might use the Fisher Exact Test, although the usual version of this test is for a 2 x 2 table.

Charles

Hi, I have a gene expression file which contains numerical data and has high dimension. How can I use the chi-square test on my data for dimension reduction? Thanks

I don’t yet address this subject. Here are two articles that do.

http://www.sciencedirect.com/science/article/pii/S0047259X03000563

http://www.kdnuggets.com/2015/05/7-methods-data-dimensionality-reduction.html

Charles

hi sir, please explain how i feed the data to calculate chi-square test with the help of real statistics data analysis tool?

sir, please explain with any new example of share any video for that.

please sir,

compile error in hidden module: froinput is shown again and again

The Real Statistics Chi-square data analysis tool accepts input data in either (a) Excel format (i.e. contingency table format) as shown in range A5:D8 of Figure 1 of the referenced webpage or (b) standard format (also called stacked format) as shown in Figure 5.

Regarding the compile error, please let me know the following information:

1. What do you see when you enter the formula =VER() in any cell?

2. What release of Excel and Windows are you using?

3. What language are you using (English, French, etc.)?

Charles

I like this page, it is meaningful not only to statisticians but also to all researchers.

Hi Charles, you did a great job here. I would like to clarify something. Below are my data set:

134 8 3

310 282 99

1404 127 267

700 7 1

874 83 53

20 238 262

130 18 74

161 68 132

41 0 3

A total of 9 rows and 3 columns. Why is my p-value = 0. What does this mean? I hope i calculated it correctly. My p-values is zero, which makes my χ2-crit equally 0. Please help!

Anthony,

I calculated that chi-square stat = 2159.44 and p-value = CHIDIST(2159.44,16) = 0. This means that the test is highly significant. Thus the two variables that you are testing are not independent.

It does not mean that χ2-crit = 0. In fact, χ2-crit = CHIINV(.05,16) = 26.30 (assuming an alpha value of .05). Since χ2 > χ2-crit, once again we conclude there is a significant result.

Charles
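For readers who want to reproduce the figures in this reply outside Excel, here is a minimal Python sketch using the 9 × 3 table posted above. The closed-form tail probability in `chi2_sf_even` is valid only for an even number of degrees of freedom (here *df* = 16), and the helper name is hypothetical. For a statistic this large the tail probability underflows to 0 in floating point, which is why Excel also reports 0.

```python
import math

def chi2_sf_even(x, df):
    """Upper-tail chi-square probability for even df, via the closed-form series."""
    if df % 2 != 0:
        raise ValueError("this sketch only handles an even number of degrees of freedom")
    term, total = 1.0, 1.0
    for r in range(1, df // 2):
        term *= x / (2 * r)  # (x/2)^r / r!
        total += term
    return math.exp(-x / 2) * total

# Observed counts from the comment above (9 rows, 3 columns).
observed = [
    [134, 8, 3], [310, 282, 99], [1404, 127, 267],
    [700, 7, 1], [874, 83, 53], [20, 238, 262],
    [130, 18, 74], [161, 68, 132], [41, 0, 3],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Pearson statistic: sum of (obs - exp)^2 / exp, with exp = row total * col total / grand total.
stat = sum((o - r * c / grand_total) ** 2 / (r * c / grand_total)
           for row, r in zip(observed, row_totals)
           for o, c in zip(row, col_totals))
df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = chi2_sf_even(stat, df)

print(round(stat, 2), df, p_value)
```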

hey..

i tried this module but it shows me an error saying “compile error in hidden module: chiSquare” can u please tell me how do i solve tht error….

thanks..

When you enter the formula =VER() in any worksheet cell what result do you get? Also which version of Excel are you running?

Charles

All much too complex for my needs, which are very simple: do 40 male and 60 female fruit flies fit a 1:1 expectation at P<0.05? How do I do it in Excel? My contingency table has four columns (o, Ho, e and Chi-square) and three rows (male, female) and sum. How do I do it in Excel?

Dick,

You seem to be conducting a goodness of fit test (which is related to an independence test, but slightly different). The approach is similar to that shown in Example 3 of http://www.real-statistics.com/chi-square-and-f-distributions/goodness-of-fit/.

You create the following table in Excel (which is similar to the table you described):

row 1: Gender, Obs, Exp, Chi-sq (in columns A, B, C, D)

row 2: Male, 40, 50, blank

row 3: Female, 60, 50, blank

row 4: Sum, =SUM(B2:B3), =SUM(C2:C3),=CHITEST(B2:B3,C2:C3)

Charles

Charles: Except that the answer you get is incorrect: 0.046. Should be 10 squared, divided by 50, then multiplied by 2 = 4.0. What has gone wrong?

Dick,

I don’t see any example on the referenced webpage where I get an answer of 0.046. Which example are you referring to?

Charles

The example I gave you above:

row 1: Gender, Obs, Exp, Chi-sq (in columns A, B, C, D)

row 2: Male, 40, 50, blank

row 3: Female, 60, 50, blank

row 4: Sum, =SUM(B2:B3), =SUM(C2:C3),=CHITEST(B2:B3,C2:C3)

Dick,

Sorry, the formula =CHITEST(B2:B3,C2:C3) is not correct. It should be =FIT_TEST(B2:B3,C2:C3). This results in p-value = .0455.

An alternative approach is to use

row 1: Gender, Obs, Exp, Chi-sq (in columns A, B, C, D)

row 2: Male, 40, 50, =(B2-C2)^2/C2

row 3: Female, 60, 50, =(B3-C3)^2/C3

row 4: Sum, =SUM(B2:B3), =SUM(C2:C3),=SUM(D2:D3)

Thus cell D4 will contain the value chi-sq = 4 and the value of the test is given by the formula =CHIDIST(D4,1), which has value .0455.

Charles
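The worksheet layout in this reply can be mirrored in a few lines of Python as a sanity check. This is only a sketch: it uses the 40/60 counts and the 50/50 expectation from the question, and it relies on the one-degree-of-freedom identity P(X > x) = erfc(√(x/2)), so it does not generalize to more than two categories.

```python
import math

observed = [40, 60]   # male, female counts from the question
expected = [50, 50]   # 1:1 expectation for 100 flies

# Column D of the worksheet: (obs - exp)^2 / exp for each row, then the sum.
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Upper-tail chi-square probability; this closed form is valid only for df = 1.
p_value = math.erfc(math.sqrt(stat / 2))

print(stat, df, round(p_value, 4))
```

The statistic (4) and the p-value (about .0455) are two different quantities, which is the source of the confusion earlier in this thread.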

Hello Charles!

I am looking for one statistic test for my data to find the significance or independence of variables or association between them. I made the table of cross tabulation of variables for frequency. Definitely, they are categorical variable. However, there are the values of actual (observed) counts which are equal to 0; so some values of expected counts are less than 1. How should I do in this case?

Unfortunately Trang I don’t fully understand the situation you are describing (especially the part about the observed counts being equal to 0). Can you provide a more complete description?

Charles

How can I test those in the following by chi-square test

Say: people have two staged choice. In the first stage, he either choose A or B. After his choice in stage 1 , he face another choice C or D. His two-staged choice could lead to two kind of results, either true or false. I want test with the combination of A+ C would lead to true results, significantly.

I’ve done a 2*4 chi-square test by listed all the combination : A+C, A+D, B+C, B+D. But it could only explain different combination have different influence toward the results. How could I know if A+C have significant influence?

Thanks

Kate,

I don’t quite understand what you are trying to accomplish. From what I understand, it doesn’t seem like you are testing for independence, which is what the chi-square test for independence is designed to do. How does what you are testing fit with a test for independence?

Charles

I’m a little confused about when to use which of the formulas. What is the difference between CHITEST, CHIDIST and CHIINV? And which one would I use to find the p-value?

Thanks.

John,

CHITEST and CHIDIST can be used to calculate the p-value. CHIDIST is used to calculate the p-value when you know the value of the statistic and the df. When you have a set of data and can calculate the expected values from that data then CHITEST can be used (see the website for a description of how to calculate the expected values or use one of the supplemental functions provided by the Real Statistics Resource Pack to do this).

CHITEST(R1, R2) = CHIDIST(χ^2, df) where R1 = the array of observed data, R2 = the array of expected values, χ^2 is calculated from R1 and R2, and df = (row count – 1) (column count – 1); when R1 consists of a single row or column, df = the number of elements in R1 (or R2) minus 1.

CHIINV is the inverse function. It tells you what value of the statistic will produce a p-value of a certain size.

I suggest that you read the first four topics on http://www.real-statistics.com/chi-square-and-f-distributions/ for a more complete explanation.

Charles

How do I call the Chi Square data analysis tool? I’ve installed the tools from this website, but nothing additional shows up in the ‘data analysis’ menu. The additional chi square functions from your resource pack show up in the function list, but nothing additional in the data analysis menu.

Thanks.

Hi Machelle,

To access the Real Statistics data analysis tools just press Ctrl-m. A dialog box will appear listing the available tools. The supplemental tools are not available through the data analysis menu. You can also add a menu to the Data ribbon right next to the data analysis menu which will give you access to the supplemental data analysis tools. Instructions for how to do this are available on the webpage http://www.real-statistics.com/excel-capabilities/supplemental-data-analysis-tools/accessing-supplemental-data-analysis-tools/.

Charles