# Independence Testing

The method described in Goodness of Fit can also be used to determine whether two sets of data are independent of each other. Such data are organized in what are called contingency tables, as described in Example 1. In these cases df = (row count – 1) (column count – 1).

Excel Function: The CHISQ.TEST function described in Goodness of Fit can be extended to support ranges consisting of multiple rows and columns. For R1 = the array of observed data and R2 = the array of expected values, we have

CHISQ.TEST(R1, R2) = CHISQ.DIST.RT(x, df) where x is calculated from R1 and R2 as in Definition 2 of Goodness of Fit and df = (row count – 1) (column count – 1).

The ranges R1 and R2 must have the same size and shape and can only contain numeric values.

For versions of Excel prior to Excel 2010, the CHISQ.TEST function doesn’t exist. Instead you need to use the equivalent function, CHITEST.

Example 1: A survey is conducted of 175 young adults whose parents are classified either as wealthy, middle class or poor to determine their highest level of schooling (graduated from university, graduated from high school or neither). The results are summarized on the left side of Figure 1 (Observed Values). Based on the data collected, is a person’s level of schooling independent of their parents’ wealth?

Figure 1 – Observed data and expected values for Example 1

We set the null hypothesis to be

H0: Highest level of schooling attained is independent of parents’ wealth

We use the chi-square test, and so need to calculate the expected values that correspond to the observed values in the table above. To accomplish this we use the fact (by Definition 3 of Basic Probability Concepts) that if A and B are independent events then P(A ∩ B) = P(A) ∙ P(B). We also assume that the proportions for the sample are good estimates for the probabilities of the expected values.

We now show how to construct the table of expected values (i.e. the Expected Values in Figure 1). We know that 45 of the 175 people in the sample are from wealthy families, and so the probability that someone in the sample is from a wealthy family is 45/175 = 25.7%. Similarly the probability that someone in the sample graduated from university is 68/175 = 38.9%. But based on the null hypothesis, the event of being from a wealthy family is independent of graduating from university, and so the expected probability of both events is simply the product of the two events, or 25.7% ∙ 38.9% = 10.0%. Thus, based on the null hypothesis, we expect that 10.0% of 175 = 17.5 people are from a wealthy family and have graduated from university.

In this way we can fill out the table for expected values. We start by setting all the totals in the Expected Values table to be the same as the corresponding total in the Observed Values table (e.g. cell K6 contains the formula =E6). We then set the value of every cell in the Expected Values table to be

(row total ∙ col total) / grand total

E.g. cell H6 contains the formula =K6*H9/K9. An alternative approach for filling in all the cells in the Expected Values table is to place the following array formula in range H6:J8 (and then press Ctrl-Shift-Enter):

=MMULT(K6:K8,H9:J9)/K9
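The same construction can be sketched outside of Excel. The Python snippet below is only an illustration: the function name `expected_counts` and the 2 × 3 table are hypothetical (the observed counts of Figure 1 are not reproduced here), while the single cell 45 ∙ 68 / 175 uses the totals quoted in the text.

```python
def expected_counts(observed):
    """Expected cell counts under independence:
    (row total * column total) / grand total for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# The cell worked out in the text: 45 people from wealthy families,
# 68 university graduates, 175 people in total.
wealthy_and_university = 45 * 68 / 175  # about 17.5

# Hypothetical 2 x 3 table of observed counts, used only to exercise the function.
observed = [[10, 20, 30],
            [20, 10, 10]]
expected = expected_counts(observed)
```

Like the MMULT formula, this takes the product of the column of row totals and the row of column totals and divides by the grand total.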

See Matrix Operations for more information about the MMULT array function. We can now calculate the p-value for the chi-square test statistic as CHITEST(Obs, Exp) where Obs is the 3 × 3 array of observed values and Exp is the 3 × 3 array of expected values. Excel takes df = (row count – 1) (column count – 1) = 2 ∙ 2 = 4 from the shape of the ranges. Since

CHITEST(B6:D8,H6:J8) = 0.003273 < .05 = α

we reject the null hypothesis and conclude that the level of schooling attained is not independent of parents’ wealth.

Example 2: A researcher wants to know whether there is a significant difference in two therapies for curing patients of cocaine dependence (defined as not taking cocaine for at least 6 months). She tests 150 patients and obtains the results in the upper left part of the table below (labeled Observed Values).

Figure 2 – Chi-square tests for independence

We establish the following null hypothesis:

H0: There is no difference between the two therapies’ ability to cure cocaine dependence

We next calculate the Expected Values from the Observed Values and then the p-value of the chi-square statistic as we did in Example 1. This time, however, we will use the approach employed in Example 2 of Goodness of Fit, namely calculating the Pearson’s chi-square test statistic directly (using Definition 2 of Goodness of Fit). The value of this statistic is 5.516 (cell D17 in Figure 2). Since we are dealing with a 2 × 2 table of observations, df = (2 – 1)(2 – 1) = 1. Finally we observe that

p-value = CHIDIST(χ2, df) = CHIDIST(5.516,1) = .0188 < .05 = α

χ2-crit = CHIINV(α, df) = CHIINV(.05,1) = 3.841 < 5.516 = χ2-obs

and so we reject the null hypothesis and conclude there is a significant difference in the cure rate between the two therapies.
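As a cross-check of the CHIDIST value, the right-tail chi-square probability can be sketched in Python using only the standard library. The function name `chi2_sf` is our own, and the closed-form sums used here hold only for integer degrees of freedom; this is an illustration, not how Excel computes the value internally.

```python
import math

def chi2_sf(x, df):
    """Right-tail probability P(X > x) for a chi-square variable
    with integer df >= 1 (the quantity CHIDIST(x, df) returns)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term = 1.0
        total = 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of
    # x^(r-1) / (1*3*...*(2r-1)) for r = 1 .. (df-1)/2
    total = 0.0
    term = 1.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    tail = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total
    return math.erfc(math.sqrt(x / 2)) + tail

alpha = 0.05
p_value = chi2_sf(5.516, 1)   # statistic and df from Example 2
reject_null = p_value < alpha
```

Evaluating `chi2_sf(3.841, 1)` gives approximately .05, which is the critical-value form of the same decision.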

As was mentioned in Goodness of Fit, the maximum likelihood test is a more precise version of the chi-square test employed thus far. The lower right-hand side of the worksheet in Figure 2 shows how to calculate the maximum likelihood statistic (using Definition 1 of Goodness of Fit). The value of this statistic is 5.725, which is not much different from the test statistic we obtained using the Pearson’s test. Since this statistic is also approximately chi-square with one degree of freedom, the analysis is quite similar:

p-value = CHIDIST(χ2, df) = CHIDIST(5.725,1) = .015 < .05 = α

χ2-crit = CHIINV(α, df) = CHIINV(.05,1) = 3.841 < 5.725 = χ2-obs

and so once again, we reject the null hypothesis and conclude there is a significant difference in the results for the two therapies.
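To make the comparison between the two statistics concrete, here is a short Python sketch. It assumes that the maximum likelihood statistic is the usual likelihood-ratio form 2 ∙ Σ obs ∙ ln(obs/exp); the function names and the 2 × 2 table are hypothetical and are not the data of Figure 2.

```python
import math

def expected_counts(observed):
    """(row total * column total) / grand total for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def pearson_chi2(observed, expected):
    """Pearson's statistic: sum of (obs - exp)^2 / exp over all cells."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

def max_likelihood_chi2(observed, expected):
    """Likelihood-ratio statistic: 2 * sum of obs * ln(obs / exp).
    Cells with an observed count of 0 contribute nothing to the sum."""
    return 2 * sum(o * math.log(o / e)
                   for obs_row, exp_row in zip(observed, expected)
                   for o, e in zip(obs_row, exp_row) if o > 0)

# Hypothetical 2 x 2 table (rows: therapies, columns: cured / not cured).
observed = [[30, 20],
            [20, 30]]
expected = expected_counts(observed)
pearson = pearson_chi2(observed, expected)
max_lik = max_likelihood_chi2(observed, expected)
```

For this made-up table the two statistics come out close to each other, as they do in Example 2.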

Observation: It is very important to include all observations in the test. E.g. if in Example 2 we only test Cured vs. Therapy 1 and 2, we will get erroneous results. We need to include Not Cured as well as Cured.

Real Statistics Excel Functions: The following supplemental functions are provided in the Real Statistics Resource Pack:

CHI_STAT2(R1, R2) = Pearson’s chi-square statistic for observation values in range R1 and expectation values in range R2

CHI_MAX2(R1, R2) = Maximum likelihood chi-square statistic for observation values in range R1 and expectation values in range R2

CHI_STAT(R1) = Pearson’s chi-square statistic for observation values in range R1. This is CHI_STAT2(R1, R2) where R2 is the expectation values calculated from R1.

CHI_MAX(R1) = Maximum likelihood chi-square statistic for observation values in range R1. This is CHI_MAX2(R1, R2) where R2 is the expectation values calculated from R1.

CHI_TEST(R1) = p-value for Pearson’s chi-square statistic for observation values in range R1. This is CHITEST(R1, R2) where R2 is the expectation values calculated from R1.

CHI_MAX_TEST(R1) = p-value for Maximum likelihood chi-square statistic for observation values in range R1

The ranges R1 and R2 must contain only numeric values.

Real Statistics Data Analysis Tool: In addition, the Real Statistics Resource Pack provides a supplemental Chi-Square Test data analysis tool. To use this tool for Example 1 enter Ctrl-m and select the Chi-square Test option. A dialog box as in Figure 3 appears.

Figure 3 – Dialog box for Chi-square Test

Insert the observation data into the Input Range (excluding the totals, but optionally including the row and column headings; i.e. range A5:D8), click on the Excel format radio button and press the OK button. Leave the Fisher Exact Test option unchecked (see Fisher Exact Test for use of this option).

The data analysis tool builds an array with the expected values and performs both the Pearson’s and maximum likelihood chi-square tests. The Cramer effect size, and for 2 × 2 contingency tables the Odds Ratio effect size, as described in Effect Size for Chi-square, are also calculated. The output from the data analysis tool for the data in Example 1 is shown in Figure 4.

Figure 4 – Chi-Square data analysis tool output for Example 1

Observation: As described in Goodness of Fit, the expected frequency for any cell in the contingency table should generally be at least 5. With small tables (especially 2 × 2 tables), cells with expected frequencies of at least 10 would be preferable.

For large contingency tables, a small percentage of cells with expected frequency of less than 5 can be acceptable. Even for smaller contingency tables having one cell with expected frequency of less than 5 may not cause big problems, but it is probably a better choice to use Fisher’s Exact Test in this case. In any event, you should avoid using the chi-square test where there is an expected frequency of less than 1 in any cell.

If the expected frequency for one or more cells is less than 5, it may be beneficial to combine one or more cells so that this condition can be met, although this must be done in such a way as to not bias the results.
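These rules of thumb are easy to check once the expected values are available. The following Python sketch summarizes a table of expected values; the function name, the keys of the returned dictionary and the sample values are all hypothetical.

```python
def check_expected_counts(expected):
    """Summarize a table of expected values against the rules of thumb:
    cells below 5 should be rare and no cell should be below 1."""
    cells = [e for row in expected for e in row]
    below_5 = sum(1 for e in cells if e < 5)
    return {
        "min_expected": min(cells),
        "share_below_5": below_5 / len(cells),
        "any_below_1": any(e < 1 for e in cells),
    }

# Hypothetical expected values for a 2 x 2 table.
report = check_expected_counts([[12.0, 8.0],
                                [4.5, 3.0]])
```

Here half of the cells fall below 5, so for a table this small Fisher’s Exact Test would probably be the better choice.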

Observation: In addition to the usual Excel input data format, the Real Statistics Chi Square Test data analysis tool supports another input data format called standard format. This format is similar to that used by SPSS and other statistical analysis programs.

Example 3: A survey is conducted of 38 young adults whose parents are classified either as wealthy, middle class or poor to determine whether they will graduate from university or not. The results are summarized in the table on the left side of Figure 5 (only the first 13 of 38 rows of data are shown). Based on the data collected, is a person’s level of schooling independent of their parents’ wealth?

Figure 5 – Data and chi-square tests for Example 3

Once again enter Ctrl-m and select the Chi-square data analysis tool. When the dialog box shown in Figure 3 appears, insert A3:B41 into the Input Range, click on the Standard format radio button and press the OK button.

The data analysis tool first builds a contingency table (range D5:F8 of Figure 5) and performs the same type of analysis as for Examples 1 and 2. Since sig = no (cells R11 and R12), we cannot reject the null hypothesis that a student’s graduating from university is independent of his/her parents’ level of income.

Observation: Example 3 uses the two column version of the standard format. There is also a three column version, which is a frequency table version of the other standard format. This is demonstrated in Figure 6 where A4:C9 is inserted in the Input Range (or A3:C9 if Column/row headings included with data is checked). The output is identical to that shown in Figure 5.

Figure 6 – Standard format
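The conversion from standard format to a contingency table amounts to counting how often each pair of categories occurs. The Python sketch below shows one way to do this for both the two column and three column versions; the function names and the sample records are hypothetical and are not the data of Figure 5 or Figure 6.

```python
from collections import Counter

def contingency_from_frequencies(triples):
    """Three column version: each record is (row label, column label, count)."""
    counts = Counter()
    for row_label, col_label, n in triples:
        counts[(row_label, col_label)] += n
    row_labels = sorted({r for r, _ in counts})
    col_labels = sorted({c for _, c in counts})
    # Counter returns 0 for combinations that never occur.
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table

def contingency_from_pairs(pairs):
    """Two column version: each record is (row label, column label)."""
    return contingency_from_frequencies((r, c, 1) for r, c in pairs)

# Hypothetical records: parents' wealth and whether the person graduated.
pairs = [("Wealthy", "Yes"), ("Wealthy", "Yes"), ("Poor", "No"),
         ("Middle", "Yes"), ("Poor", "No"), ("Middle", "No")]
triples = [("Wealthy", "Yes", 2), ("Poor", "No", 2),
           ("Middle", "Yes", 1), ("Middle", "No", 1)]

rows, cols, table = contingency_from_pairs(pairs)
rows3, cols3, table3 = contingency_from_frequencies(triples)
```

Both calls produce the same table, which mirrors the statement that the two versions of the standard format give identical output.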

### 70 Responses to Independence Testing

1. Machelle Wilson says:

How do I call the Chi Square data analysis tool? I’ve installed the tools from this website, but nothing additional shows up in the ‘data analysis’ menu. The additional chi square functions from your resource pack show up in the function list, but nothing additional in the data analysis menu.

Thanks.

2. John Levec says:

I’m a little confused about when to use which of the formulas. What is the difference between CHITEST, CHIDIST and CHIINV? And which one would I use to find the p-value?

Thanks.

• Charles says:

John,

CHITEST and CHIDIST can be used to calculate the p-value. CHIDIST is used to calculate the p-value when you know the value of the statistic and the df. When you have a set of data and can calculate the expected values from that data then CHITEST can be used (see the website for a description of how to calculate the expected values or use one of the supplemental functions provided by the Real Statistics Resource Pack to do this).

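CHITEST(R1, R2) = CHIDIST(χ^2, df) where R1 = the array of observed data, R2 = the array of expected values, χ^2 is calculated from R1 and R2 and df = the number of elements in R1 (or R2) minus 1 when R1 is a single row or column, or (row count – 1)(column count – 1) when R1 has multiple rows and columns.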

CHIINV is the inverse function. It tells you what value of the statistic will produce a p-value of a certain size.

I suggest that you read the first four topics on http://www.real-statistics.com/chi-square-and-f-distributions/ for a more complete explanation.

Charles

3. Kate Ge says:

How can I test those in the following by chi-square test
Say: people have two staged choice. In the first stage, he either choose A or B. After his choice in stage 1 , he face another choice C or D. His two-staged choice could lead to two kind of results, either true or false. I want test with the combination of A+ C would lead to true results, significantly.
I’ve done a 2*4 chi-square test by listed all the combination : A+C, A+D, B+C, B+D. But it could only explain different combination have different influence toward the results. How could I know if A+C have significant influence?

Thanks

• Charles says:

Kate,
I don’t quite understand what you are trying to accomplish. From what I understand it doesn’t seem like you are testing for independence, which is what the chi-square test for independence is designed to accomplish. How does what you are testing fit with a test for independence?
Charles

4. Trang says:

Hello Charles!
I am looking for one statistic test for my data to find the significance or independence of variables or association between them. I made the table of cross tabulation of variables for frequency. Definitely, they are categorical variable. However, there are the values of actual (observed) counts which are equal to 0; so some values of expected counts are less than 1. How should I do in this case?

• Charles says:

Unfortunately Trang I don’t fully understand the situation you are describing (especially the part about the observed counts being equal to 0). Can you provide a more complete description?
Charles

5. Dick Colby says:

All much too complex for my needs, which are very simple: do 40 male and 60 female fruit flies fit a 1:1 expectation at P<0.05? How do I do it in Excel? My contingency table has four columns (o, Ho, e and Chi-square) and three rows (male, female) and sum. How do I do it in Excel?

• Charles says:

Dick,

You seem to be conducting a goodness of fit test (which is related to an independence test, but slightly different). The approach is similar to that shown in Example 3 of http://www.real-statistics.com/chi-square-and-f-distributions/goodness-of-fit/.

You create the following table in Excel (which is similar to the table you described):
row 1: Gender, Obs, Exp, Chi-sq (in columns A, B, C, D)
row 2: Male, 40, 50, blank
row 3: Female, 60, 50, blank
row 4: Sum, =SUM(B2:B3), =SUM(C2:C3),=CHITEST(B2:B3,C2:C3)

Charles

• Dick Colby says:

Charles: Except that the answer you get is incorrect: 0.046. Should be 10 squared, divided by 50, then multiplied by 2 = 4.0. What has gone wrong?

• Charles says:

Dick,
I don’t see any example on the referenced webpage where I get an answer of 0.046. Which example are you referring to?
Charles

• Dick Colby says:

The example I gave you above:
row 1: Gender, Obs, Exp, Chi-sq (in columns A, B, C, D)
row 2: Male, 40, 50, blank
row 3: Female, 60, 50, blank
row 4: Sum, =SUM(B2:B3), =SUM(C2:C3),=CHITEST(B2:B3,C2:C3)

• Charles says:

Dick,

Sorry, the formula =CHITEST(B2:B3,C2:C3) is not correct. It should be =FIT_TEST(B2:B3,C2:C3). This results in p-value = .0455.

An alternative approach is to use

row 1: Gender, Obs, Exp, Chi-sq (in columns A, B, C, D)
row 2: Male, 40, 50, =(B2-C2)^2/C2
row 3: Female, 60, 50, =(B3-C3)^2/C3
row 4: Sum, =SUM(B2:B3), =SUM(C2:C3),=SUM(D2:D3)

Thus cell D4 will contain the value chi-sq = 4 and the value of the test is given by the formula =CHIDIST(D4,1), which has value .0455.

Charles
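The alternative approach in the reply above can be sketched in Python as follows. It uses the 40/60 observed and 50/50 expected counts from the question, and relies on the fact that for one degree of freedom the right-tail chi-square probability equals erfc(sqrt(x/2)); the variable names are our own.

```python
import math

observed = [40, 60]   # males, females
expected = [50, 50]   # 1:1 expectation for 100 flies

# Column D of the worksheet: (obs - exp)^2 / exp for each row, then the sum.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_sq = sum(contributions)

# Two categories give df = 1, where the right-tail probability
# of the chi-square distribution is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_sq / 2))
```

This gives chi-sq = 4 and a p-value of about .0455, matching the values quoted in the reply.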

6. karishma says:

hey..
i tried this module but it shows me an error saying “compile error in hidden module: chiSquare” can u please tell me how do i solve tht error….

thanks..

• Charles says:

When you enter the formula =VER() in any worksheet cell what result do you get? Also which version of Excel are you running?
Charles

7. Anthony says:

Hi Charles, you did a great job here. I would like to clarify something. Below are my data set:
134 8 3
310 282 99
1404 127 267
700 7 1
874 83 53
20 238 262
130 18 74
161 68 132
41 0 3

A total of 9 rows and 3 columns. Why is my p-value = 0. What does this mean? I hope i calculated it correctly. My p-values is zero, which makes my χ2-crit equally 0. Please help!

• Charles says:

Anthony,

I calculated that chi-square stat = 2159.44 and p-value = CHIDIST(2159.44,16) = 0. This means that the test is highly significant. Thus the two variables that you are testing are not independent.

It does not mean that χ2-crit = 0. In fact, χ2-crit = CHIINV(.05,16) = 26.30 (assuming an alpha value of .05). Since χ2 > χ2-crit, once again we conclude there is a significant result.

Charles

8. Iddy says:

I like this page, it is meaningful not only to statisticians but also to all researchers….

9. shiv says:

hi sir, please explain how i feed the data to calculate chi-square test with the help of real statistics data analysis tool?
sir, please explain with any new example of share any video for that.
compile error in hidden module: froinput is shown again and again

• Charles says:

The Real Statistics Chi-square data analysis tool accepts input data in either (a) Excel format (i.e. contingency table format) as shown in range A5:D8 of Figure 1 of the referenced webpage or (b) standard format (also called stacked format) as shown in Figure 5.

Regarding the compile error, please let me know the following information:

1. What do you see when you enter the formula =VER() in any cell?
2. What release of Excel and Windows are you using?
3. What language are you using (English, French, etc.)?

Charles

10. Fereshteh says:

Hi,I have a gene expression file which contains numerical data and has high dimension.how can I use chi test in my data for dimension reduction.thanks

• Charles says:

I don’t yet address this subject. Here are two articles that do.

Charles

11. Robert Dalton says:

Charles,
I am running a 5×5 Chi square test in Excel format. I am getting my Expected Values and Summary and they look to be fine. However, I am getting an error in my chi-sq, p-value, , sig, and Cramer V for both Pearson’s and Max likelihood. The only thing showing up is my x-crit. I am running 4.4 Excel 2007.
I should mention that I have several cells that the number is 0 in my 5×5.

Thanks!!!

• Robert Dalton says:

I had two full rows that had zeros in my 5×5. I removed them and ran it again as a 3 rows x 5 columns. This time it worked. Will this bias the results or, since they are zero, does it not matter?

• Charles says:

Robert,
Eliminating full rows or columns with zeros is fine, but if the table that remains contains one or more cells containing a zero, then the test is not considered to be valid. You can get around this by combining rows or columns.
Charles

• Charles says:

The test is not valid if there are cells which contain a zero.
In this case you might use the Fisher Exact Test, although the usual version of this test is for a 2 x 2 table.
Charles

12. sarah says:

hello!
I am trying to do a chi test on a 5*5 table but the 5th line is full of 0. the test comes invalid but when i eliminate the last line which is only 0 (it becomes a 4*5 table) it works and i get results. How should I proceed??

• Charles says:

Sarah,
You shouldn’t use chi-square if you have cells with zero values (or even a lot of cells with values less than 5). Either you should eliminate the last row or combine it with the row above.
Charles

13. Able Yeung says:

Can you please help give me some advices on how to conduct a Chi-Square test to understand whether the gender data from my sample dataset has any deviations from the census dataset? p>0.05 or not?

My Gender Dataset is
Male 307
Female 330

The Census Dataset is
Male 3303015
Female 3768561

It is because the SAV or Excel file in SPSS failed to add as large as over 3303015 and 3768561 and grid cells, so I am facing a headache in doing the aforesaid analysis.

Thank you very much.

• Charles says:

I can’t comment on SPSS, but if you use the Real Statistics Chi-square data analysis tool, you will get a result, even for such large data elements.
Charles

14. Shri says:

Hi Charles,

Can I used Chi square test to understand if Age or tenure has any relation to employee quitting.

So the table would have Age range in row and Resigned or Active in column

Kind regards

Shri

• Charles says:

Hi Shri,
Yes, you can create such a 2 x 2 contingency table and use the chi-square test for independence. You didn’t seem to factor tenure into this table, though.
Charles

• Shri says:

Hi Charles,

Thanks.

Sorry to bother with these questions. Trying to use stats for decision making in HR.

I was planning to use tenure in a separate table, comparing tenure and turnover, does that make sense.
OR

Can I use tenure and age in one table but it would not include turnover in it.

• Charles says:

Shri,
Yes, you can do this. It all depends on what you want to test.
Charles

15. Yan Win Soe says:

Hello,

May I know Chi-square test for homogeneity.
e.g. Null Hypothesis : P1=N1 and P2=N2 and P3=N3
Alternative Hypothesis:
P1 is not equal N1 or
P2 is not equal N2 or
P3 is not equal N3
If we reject Null Hypothesis. There we have to find for 95% CI for each proportion so that we can prove which pair is not equal in reality. For this case how can we calculate by using Excel. Could you please explain me? Thanks

• Charles says:

Yan Win Soe,
Are you referring to three-way contingency tables? If so, please see the following webpage:
Log-Linear Regression
Charles

• christine says:

Hello Charles,
Kindly clarify this for me;
How do you treat statistical significance tests using age ranges or years of experience.
say you have one group made up of managers and you want to establish each of their opinion on a process by their years of experience (7 point LIKERT).
manager = 8
years of experience: 0-3 (4), 4-6 (3), 10 + (1)
Which tests is most appropriate?
Do you take an average of each range?

Thank you.

• Charles says:

Christine,
Sorry, but I don’t understand the situation that you are describing.
Charles

16. Inês says:

Hi Charles,

So I have a problem. I need to cross the variable “field of profession” with a yes/no question in SPSS. The problem is that when i do cross this variables I get that 42.9% have expected count less than 5, which means the test isn’t valid. Since my table is 7×2 I can´t read the results from Fisher’s Test. What can I do? Is it ok to combine some fields of profession (rows)..for instance..is it ok to combine phisician with nurse in the same row? Doesn’t this change the results?

Thanks

• Charles says:

Yes, you can combine rows or columns of cells to get a cell count which is sufficiently large.
Charles

17. Tanya says:

Hello: I am trying to analyze whether patient feedback influences a doctor to recommend a certain treatment.
The data I have is “are you likely to recommend treatment X” (3 rows: yes, somewhat, no) and “did you receive positive feedback about treatment X from your patients” (3 columns: yes, no, don’t know). The data is below. I did a chi-test analysis with a p-value of 0.43.
5 1 2
8 0 5
2 1 4

The other data I have is “are you likely to recommend treatment X” (3 rows: yes, somewhat, no) and “did you receive negative feedback about treatment X” (3 columns: yes, no, don’t know). Data below. The chi-test p value I got is 0.07.
1 5 2
6 4 3
1 1 5

I would conclude that neither positive or negative feedback influences docs to recommend treatment X, but want to make sure the chi-test is the right one to use?

• Charles says:

Tanya,
Each chi-square test seems reasonable.
Charles

how to calculate likert data in chi-square test ?
suppose we take standard likert scale 1-5.
plz show me with example.

• Charles says:

Rosshan,
Please be more specific. Perhaps you could provide an example that you are trying to solve.
Charles

19. rajesh bansal says:

p value is 3.28678E-14 . WHAT DOES IT MEAN. PLZ TELL

• Charles says:

Rajesh,
This is a number written in scientific notation, i.e. 3.28678 x 10^(-14). This is a very small number, almost zero.
Charles

20. Mike Egan says:

I have a simple question. I’m using the CHISQ.TEST function for a Chi Square test. The explanation claims that the function returns the Chi Square statistic and the degrees of freedom. But the only output I get is the P value.

How do I get the rest of the output? Or, how do I translate the P value into a Chi Square value and degrees of freedom?

• Charles says:

Mike,
CHISQ.TEST only calculates the p-value. The webpage says that CHISQ.TEST(R1, R2) = CHISQ.DIST.RT(x, df), and the right hand side is a p-value.
To get the chi-square statistic and degrees of freedom:
df = (# of rows in R1 – 1)(# of columns in R1 – 1)
chi-sq stat can be calculated manually as described on the Goodness of Fit webpage or by using the Real Statistics function CHI_STAT2(R1,R2) or CHI_STAT(R1).
Charles

• Brittany says:

You’ll need to calculate the df by hand (see Charles comment) but to get the chi sq test statistic use CHISQ.INV.RT(p-value, df). I don’t know why excel made it backwards but I checked it against hand calculations and it’s correct (because I did the hand calculations first and then realized there was an inverse function, d’oh).
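The inverse lookup described in this comment can be sketched in Python. The snippet below is restricted to df = 1, where the right-tail probability is erfc(sqrt(x/2)), and it inverts that function by bisection; the function names and the search bounds are our own assumptions, and this is not how Excel implements CHISQ.INV.RT.

```python
import math

def chi2_sf_df1(x):
    """Right-tail chi-square probability for df = 1."""
    return math.erfc(math.sqrt(x / 2)) if x > 0 else 1.0

def chi2_isf_df1(p, lo=0.0, hi=1000.0, iterations=200):
    """Find x with chi2_sf_df1(x) = p by bisection.
    The right-tail probability decreases as x grows, so the search
    moves right while the tail is still larger than p."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if chi2_sf_df1(mid) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

critical_value = chi2_isf_df1(0.05)                 # about 3.841
recovered_stat = chi2_isf_df1(chi2_sf_df1(5.516))   # round trip back to 5.516
```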

21. Simone says:

Hi Charles,
Thanks so much for the helpful website. I am comparing two methods of treatment for acid mine drainage and have already come to the conclusion that results from the two methods are from the same population using Mann-Whitney testing. One aspect of the data I am keen on knowing though is there a difference in pass/fail rates of the resulting water quality values for the two methods. As I have already determined the methods produced results from the same population, is it then unreasonable to use this method which indicates that they produce significantly different pass/fail results?
Thank you.

• Charles says:

Simone,
Without more information, I can’t say for sure, but Mann-Whitney is commonly used for these sorts of problems.
Charles

22. shweta says:

hi…i want to calculate chi sq. for bmi categories among males and females
how shall i proceed?

• Charles says:

Shweta,
This is explained on the referenced webpage. Do you have a specific question?
Charles

• shweta says:

ohh okay….thanks for the help.

23. shyali says:

i’m doing a educational reasearch called how significally change the trained and untrained teacher’s attitude relavent to the technology intergration.but nw i cant think how i used the chi squre test for items.do i check the chi square for each statements with likerts scale?

• Charles says:

Shyali,
Charles

• shyali says:

15 statements used.each statement has 5 likert scales strongly agree to strongly disagree.used sample 120 teachers,70 are well trained,50 was not trained,via the statements attitude for tecnology usage in classroom both trained and untrained plan to check whether significant or not

• Charles says:

Shyali,
With a chi-square test for independence, you are trying to determine whether two variables are independent. For your problem, what are the two variables?
Charles

24. Fred says:

Very interesting! I had never understood why (obs-exp)^2/exp had a z^2 distribution!
In example 2, could we also use the test for the difference between two proportions?
What are the benefits of each method?
Thanks!

25. Rod Moore says:

Hi Charles

Does Excel have frequency limits in cells when calculating Chi2.test? I have some cells with several hundred thousands and just under 3 million.

• Charles says:

Rod,
You should be able to perform chi-square test of independence even with large cell values. Just try it!
Charles

26. afzal says:

i am doing a research.
i have used likert scale from strongly agree to strongly disagree for each of my questions.
therefore, my degree of freedom is 16. i am trying to show chi-square test of independence for my variables from survey data. i am using SPSS 20. my chi square test values are coming like 5.18, 10.466 , 15.34 like this. now , i would like to know.what should be the interpretation of this. is it dependent or independent. for 95% confidence level for degree of freedom 16 x square is 7.962. so, below 7.962 is independent or what? i just want to know.

27. Tong Sin Keong says:

Charles,

My understanding of chi-square is that the distribution of the population needs to be Gaussian and therefore rules out categorical data. In the problem that I am working on, the population distribution is categorical and the mean and Standard deviations can be calculated. I need to compare it with the distribution of a number of datasets. Is the approach to calculate the z-score of the dataset based on known mean and standard deviation to establish the confidence interval? Are there other probabilistic approaches?

• Charles says:

Tong,
The chi-square test of independence deals with categorical data, and so I am not sure I understand your concerns.
What specifically are you trying to accomplish? Are you testing independence or are you trying to see whether some data fits a specific distribution or something else?
Charles

28. Tong Sin Keong says:

Hi Charles,

Thank you so much for your reply. I apologise for the vagueness, My understanding is that the chi-square curve is based on the sum square of independent normal variables. Therefore, I cannot apply it to categorical data. For instance, if the X is the variable with values of 1,2,3,4,5 for Excellent, Good, Fair, Unsatisfactory, Poor, The distribution of the values of X is not a normal distribution.

In my case, I am doing a thesis which in part deals with the partition of an integer into positive summands for natural sequences such as pi and robotic sequences generated by computers. The number 6 can be decomposed into 11 partitions: 6, 5+1, 4+2, 4+1+1, 3+3, 3+2+1, 3+1+1+1, 2+2+2, 2+2+1+1, 2+1+1+1+1, 1+1+1+1+1+1. It is possible to calculate the probability for each partition. I would like to analyse the observed probabilities against the actual. At the moment, I am choosing to analyse it by confidence interval using Z score. Your comments would be very welcomed.

By the way, I found your documents on Q-Q plot, chi-square and Kurtosis very helpful indeed.

Thank you again