Cohen’s kappa is a measure of the agreement between two raters, where agreement due to chance is factored out. We now extend Cohen’s kappa to the case where the number of raters can be more than two. This extension is called **Fleiss’ kappa**. As with Cohen’s kappa, no weighting is used and the categories are considered to be unordered.

Let *n* = the number of subjects, *k* = the number of evaluation categories and *m* = the number of judges for each subject. E.g. for Example 1 of Cohen’s Kappa, *n* = 50, *k* = 3 and *m* = 2. While for Cohen’s kappa both judges evaluate every subject, in the case of Fleiss’ kappa, there may be many more than *m* judges and not every judge needs to evaluate each subject; what is important is that each subject is evaluated *m* times.

For every subject *i* = 1, 2, …, *n* and evaluation category *j* = 1, 2, …, *k*, let $x_{ij}$ = the number of judges that assign category *j* to subject *i*. Thus

$$\sum_{j=1}^{k} x_{ij} = m$$

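The proportion of pairs of judges that agree in their evaluation of subject *i* is given by

$$p_i = \frac{\sum_{j=1}^{k} x_{ij}(x_{ij}-1)}{m(m-1)}$$

The mean of the $p_i$ is therefore

$$p_a = \frac{1}{n}\sum_{i=1}^{n} p_i$$

We use the following measure for the error term

$$p_e = \sum_{j=1}^{k} q_j^2 \qquad \text{where} \qquad q_j = \frac{1}{nm}\sum_{i=1}^{n} x_{ij}$$

**Definition 1**: **Fleiss’ Kappa** is defined to be

$$\kappa = \frac{p_a - p_e}{1 - p_e}$$

We can also define kappa for the *j*th category by

$$\kappa_j = 1 - \frac{\sum_{i=1}^{n} x_{ij}(m - x_{ij})}{mn(m-1)\,q_j(1-q_j)}$$

The standard error for $\kappa_j$ is given by the formula

$$s.e.(\kappa_j) = \sqrt{\frac{2}{mn(m-1)}}$$

The standard error for $\kappa$ is given by the formula

$$s.e.(\kappa) = s.e.(\kappa_j)\cdot\frac{\sqrt{\left(\sum_{j=1}^{k} q_j(1-q_j)\right)^2 - \sum_{j=1}^{k} q_j(1-q_j)(1-2q_j)}}{\sum_{j=1}^{k} q_j(1-q_j)}$$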

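There is an alternative calculation of the standard error provided in Fleiss’ original paper, namely the square root of the following:

$$\frac{2}{mn(m-1)}\cdot\frac{p_e - (2m-3)\,p_e^2 + 2(m-2)\sum_{j=1}^{k} q_j^3}{(1-p_e)^2}$$

Here $q_j$ is the proportion of all ratings assigned to category *j* and $p_e = \sum_j q_j^2$.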

The test statistics $z_j = \kappa_j / s.e.(\kappa_j)$ and $z = \kappa / s.e.(\kappa)$ are generally approximated by a standard normal distribution, which allows us to calculate a p-value and confidence interval. E.g. the 1 – *α* confidence interval for kappa is therefore approximated as

*κ* ± NORMSINV(1 – *α/2*) * *s.e.*
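The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the Real Statistics implementation: the function name `fleiss_kappa`, the keys of the returned dictionary and the small 8-subject, 3-judge, 3-category table are assumptions made for the example (the table is not the data of Example 1). The per-category kappa and the standard error of kappa follow the Excel formulas shown for cells B19 and H9 in Figure 2; the standard error of the per-category kappa is assumed to be the large-sample value $\sqrt{2/(mn(m-1))}$, and the p-value is two-tailed.

```python
from math import sqrt
from statistics import NormalDist


def fleiss_kappa(counts, alpha=0.05):
    """Fleiss' kappa for counts[i][j] = number of judges who put subject i in category j."""
    n = len(counts)          # number of subjects
    k = len(counts[0])       # number of categories
    m = sum(counts[0])       # ratings per subject
    if any(sum(row) != m for row in counts):
        raise ValueError("every subject must be rated the same number of times")

    # q[j]: share of all n*m ratings that fall in category j
    q = [sum(row[j] for row in counts) / (n * m) for j in range(k)]
    # p_a: mean over subjects of the proportion of agreeing judge pairs
    p_a = sum(sum(x * (x - 1) for x in row) / (m * (m - 1)) for row in counts) / n
    p_e = sum(qj ** 2 for qj in q)
    if p_e == 1.0:
        raise ZeroDivisionError("all ratings are in one category; kappa is undefined")
    kappa = (p_a - p_e) / (1 - p_e)

    # b[j] = q_j(1 - q_j), the quantity held in row 18 of the worksheet
    b = [qj * (1 - qj) for qj in q]
    kappa_j = []
    for j in range(k):
        if b[j] == 0:
            kappa_j.append(None)  # category never used or always used
        else:
            disagree = sum(row[j] * (m - row[j]) for row in counts)
            kappa_j.append(1 - disagree / (m * n * (m - 1) * b[j]))

    # Assumption: large-sample standard errors, matching the H9 formula
    se_j = sqrt(2 / (m * n * (m - 1)))
    sb = sum(b)
    se = se_j * sqrt(sb ** 2 - sum(bj * (1 - 2 * qj) for bj, qj in zip(b, q))) / sb

    z = kappa / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-tailed
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # plays the role of NORMSINV
    ci = (kappa - z_crit * se, kappa + z_crit * se)
    return {"kappa": kappa, "kappa_j": kappa_j, "se_j": se_j, "se": se,
            "z": z, "p_value": p_value, "ci": ci}


# Illustrative table: 8 subjects (rows), 3 categories (columns), 3 judges each
table = [
    [2, 1, 0], [0, 0, 3], [3, 0, 0], [0, 3, 0],
    [0, 0, 3], [0, 0, 3], [1, 2, 0], [2, 1, 0],
]
result = fleiss_kappa(table)
print(round(result["kappa"], 4), [round(v, 4) for v in result["kappa_j"]])
```

For this table $p_a$ = 0.75 and $p_e$ = 194/576, so the overall kappa is 119/191 ≈ 0.623, and the third category, on which the judges never split, has a per-category kappa of exactly 1.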

**Example 1**: Six psychologists (judges) evaluate 12 patients as to whether they are psychotic, borderline, bipolar or none of these. The ratings are summarized in range A3:E15 of Figure 1. Determine the overall agreement between the psychologists, subtracting out agreement due to chance, using Fleiss’ kappa. Also find Fleiss’ kappa for each disorder.

**Figure 1 – Calculation of Fleiss’ Kappa**

For example, we see that 4 of the psychologists rated subject 1 to have psychosis and 2 rated subject 1 to have borderline syndrome, while no psychologist rated subject 1 as bipolar or none.

We use the formulas described above to calculate Fleiss’ kappa in the worksheet shown in Figure 1. The formulas in the ranges H4:H15 and B17:B22 are displayed in text format in column J, except that the formulas in cells H9 and B19 are not displayed in the figure since they are rather long. These formulas are:

| Cell | Entity | Formula |
|---|---|---|
| H9 | s.e. | `=B20*SQRT(SUM(B18:E18)^2-SUMPRODUCT(B18:E18,1-2*B17:E17))/SUM(B18:E18)` |
| B19 | $\kappa_1$ | `=1-SUMPRODUCT(B4:B15,$H$4-B4:B15)/($H$4*$H$5*($H$4-1)*B17*(1-B17))` |

**Figure 2 – Long formulas in worksheet of Figure 1**

Note too that row 18 (labelled b) contains the formulas for $q_j(1-q_j)$.

The p-values (and confidence intervals) show us that all of the kappa values are significantly different from zero.

**Real Statistics Function**: The Real Statistics Resource Pack contains the following supplemental function:

**KAPPA**(R1, *j, lab, alpha, tails, orig*): if *lab* = FALSE (default) returns a 6 × 1 range consisting of *κ* if *j* = 0 (default) or $\kappa_j$ if *j* > 0 for the data in R1 (where R1 is formatted as in range B4:E15 of Figure 1), plus the standard error, z-stat, z-crit, p-value and lower and upper bound of the 1 – *alpha* confidence interval, where *alpha* = *α* (default .05) and *tails* = 1 or 2 (default). If *lab* = TRUE then an extra column of labels is included in the output. If *orig* = TRUE then the original calculation for the standard error is used; default is FALSE.

For Example 1, KAPPA(B4:E15) = .2968 and KAPPA(B4:E15,2) = .28. The complete output for KAPPA(B4:E15,,TRUE) is shown in Figure 3.

**Figure 3 – Output from KAPPA function**

**Real Statistics Data Analysis Tool**: The **Reliability** data analysis tool supplied in the Real Statistics Resource Pack can also be used to calculate Fleiss’ kappa.

To calculate Fleiss’ kappa for Example 1, press **Ctrl-m** and choose the **Reliability** option from the menu that appears. Fill in the dialog box that appears (see Figure 7 of Cronbach’s Alpha) by inserting B4:E15 in the **Input Range**, choosing the **Fleiss’ kappa** option and clicking on the **OK** button.

The output is shown in Figure 4.

**Figure 4 – Output from Fleiss’ Kappa analysis tool**

Note that if you change the values for alpha (cell C26) and/or tails (cell C27) the output in Figure 4 will change automatically.

Hello Charles,

so I have a question. I have an open card sorting with 10 raters, 23 items and 2 categories.

It’s 230 ratings in total, and in 225 out of the 230 cases, the raters assigned the same category.

Using the add-in i get a Fleiss Kappa of 0,92067698.

However, the p-value always amounts to 0. How is that possible?

I also tried changing the numbers so there would be more volatility (e.g. setting 225 down to 180), and the Fleiss kappa shrank, but the p-value is still 0.

Maybe it’s a problem since I am using a German Excel version?

The regular Cohen’s Kappa calculation also doesn’t work for me.

Tyvm!

Patrick,

If you send me an Excel file with your data and calculations I will try to figure what is happening.

The German version shouldn’t give a different answer from the US version (unless there is a problem with the decimal symbol).

You can find my email address at Contact Us.

Charles

Hi Charles,

Thank you so much for this website.

I have analyzed the inter-rater reliability for 5 raters on 34 items. The variables were dichotomous (correct, wrong). The average pairwise percent agreement was 82.4%. However, the Fleiss kappa was 0.32 and Krippendorff’s Alpha was 0.32 as well. Actually, I am frustrated, and I do not know how to report my data, or if there is a way to correct the kappa.

your help is much appreciated.

Enas

Enas,

It looks like it is easy to report your results, although it seems like you don’t like the result.

The reason for doing analysis is to find out whether or not your hypotheses are correct. A negative result is still a result.

Charles

Hi Charles,

This has been very useful so far. Thank you.

My work is similar to your example but there are the following differences:

-instead of 4 psychiatric diagnoses, there are 2 columns: Absence and Presence.

-instead of 12 patients, there are 6 patients.

-instead of 6 raters, there are 15 raters.

I have two queries:

1. In one part of my research, all 15 raters agree on ‘Presence’ for every patient. However, Fleiss’ kappa can not be calculated because (if you follow your manual method) you end up dividing by 0. This is not possible. What should I do?

2. In another part, 15 raters agree on ‘Presence’ for the first 5 patients. For the final patient, 14 agree on ‘Presence’ with the remaining rater stating ‘Absence’. I would expect the kappa value to be relatively high. However, it is negative.

This is very confusing and I would be very grateful if you could help.

Thank you.

Joseph,

One of the problems with Fleiss’ Kappa is that you can get some counter-intuitive results, especially in extreme cases. My understanding is that Gwet’s AC2 measurement addresses some of these problems. The next release of the Real Statistics software will include support for Gwet’s AC2. Regarding your specific questions:

1. If all raters agree, simply report 100% agreement (don’t bother with Fleiss’ Kappa)

2. Once again with this level of agreement, you should simply report that you have 100% agreement except in one case.

Charles

Thank you.

Joseph
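Both of the cases described in this exchange can be reproduced from the definitions in the article. The sketch below is only an illustration under those definitions; the helper name `simple_fleiss_kappa` and the choice to return `None` when kappa is undefined are assumptions made here. The columns are ordered Absence, Presence, with 15 raters and 6 patients as in the question.

```python
def simple_fleiss_kappa(counts):
    """Overall Fleiss' kappa, or None when the chance-agreement term equals 1."""
    n = len(counts)        # subjects
    k = len(counts[0])     # categories
    m = sum(counts[0])     # raters per subject
    q = [sum(row[j] for row in counts) / (n * m) for j in range(k)]
    p_a = sum(sum(x * (x - 1) for x in row) / (m * (m - 1)) for row in counts) / n
    p_e = sum(qj ** 2 for qj in q)
    if p_e == 1.0:
        return None        # (p_a - p_e) / (1 - p_e) would divide by zero
    return (p_a - p_e) / (1 - p_e)


# Case 1: all 15 raters choose Presence for all 6 patients
all_agree = [[0, 15]] * 6
# Case 2: same, except one rater chooses Absence for the last patient
one_dissent = [[0, 15]] * 5 + [[1, 14]]

kappa_case_1 = simple_fleiss_kappa(all_agree)
kappa_case_2 = simple_fleiss_kappa(one_dissent)
print(kappa_case_1, kappa_case_2)
```

In the first case every rating falls in one category, so $p_e = 1$ and the denominator of kappa is zero. In the second case $p_a$ = 7920/8100 while $p_e$ = 7922/8100, so the observed agreement is marginally below the chance term and kappa = −2/178 ≈ −0.011, even though the raters agree almost everywhere.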

Hi Sir,

I would like to ask how can I assess the agreement between an isolated rater (master rater) VS a group of say 10 raters. The variables are ordinal, like No Disease, Mild Disease, Moderate Disease, Severe Disease.

Thanks in advance for your reply!

Angela,

The fact that the ratings are ordinal and that you want a comparison with a master rater makes Fleiss’ kappa a poor choice. Perhaps one of the following references will help you find the right choice:

https://en.wikipedia.org/wiki/Krippendorff's_alpha

http://john-uebersax.com/stat/agree.htm

https://orbi.ulg.ac.be/bitstream/2268/39575/1/vanbelle-thesis-5-5-2009.pdf

Charles

Thank you very much for your reply! I was reading the last link you gave and I am torn between which index of agreement to use. Do you have any idea which index (i.e. Vanbelle and Albert novel agreement index, consensus index or Schouten index) is more widely used and more suitable for my case? Also, should I use weighted or unweighted?

Thanks again! 🙂

Angela,

I don’t have an opinion regarding which one is more widely used or which is better for you.

Since your data is ordered, you would normally choose weighted kappa over unweighted Cohen’s kappa.

Charles

Hi Sir,

Could you assist me with the following?

My response outputs from my 4 raters are non-categorical, e.g. “52 years, 53 years”. Initially, I manually grouped them into “Yes” and “No” before using SPSS to calculate the kappa scores. But how do I go about calculating the kappa score without manually grouping the responses (just by putting the raw data into SPSS)?

Thank you!

Sorry Xinru, but I don’t completely understand your question. If you have non-categorical data, then Fleiss’ kappa is probably not the right tool to use. Perhaps ICC would be better.

Charles

Hi Charles,

Many thanks for producing this, it’s very helpful. I wonder if you could help me with a question. I’m trying to calculate the degree of interrater reliability between three raters when screening research papers for inclusion in a systematic review. The three raters have the option to include or exclude each of the 20 papers. I think I’ve followed your instructions for formatting correctly. The raters all agreed on 18 papers. On the other two there was a 2 to 1 split (i.e. two raters said include, the other exclude).

When I input that, I get the following result:

Kappa = 0.86 (p <0.05), 95% CI (0.599, 1.118)

I noticed that the upper CI value is greater than 1. I assumed that the value would be bounded at 1 because that would represent perfect agreement. Have I misunderstood something or made an error?

Many thanks

Rob

Rob,

Yes, the upper bound becomes 1.

Charles

Hi Charles

Thanks for this – so would you recommend just reporting the CI as (0.599, 1)?

Rob,

Yes.

Charles

Brilliant, thanks for your help.
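The adjustment agreed on in this exchange is a one-line operation: because the interval comes from a normal approximation it can extend past 1, while kappa itself cannot, so the upper limit is capped at 1. A minimal sketch using the figures quoted above (the helper name `clamp_upper` is hypothetical):

```python
def clamp_upper(ci, ceiling=1.0):
    """Cap the upper confidence limit at the largest value kappa can take."""
    lower, upper = ci
    return (lower, min(upper, ceiling))


reported_ci = clamp_upper((0.599, 1.118))
print(reported_ci)
```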

Hello, thank you very much for sharing them.

Could you do me a favor?

1. As you know, Cohen’s kappa has set of criteria, it is: less than 0% no agreement, 0-20% poor, 20-40% fair, 40-60% moderate, 60-80% good, 80% or higher very good. How about Fleiss’s kappa, are they the same?

2. What is the difference between Cohen’s kappa and Weighted Cohen’s kappa?

3. I need to measure the agreement between 8 raters. Is it possible if I use Cohen’s kappa for each couple of rates and then I measure the medium of all the kappa?

4. Is there any software or application to measure Fleiss’s kappa (e.g. something like SPSS or MedCalc)?

Thank you so much for helping me!

Linh,

1. You can use the same criteria as for Cohen’s kappa, although there isn’t universal agreement about these criteria.

2. See Weighted Kappa

3. You can do this, but it is not clear how you would interpret the result

4. The Real Statistics software calculates Fleiss’ kappa. I would think that SPSS and MedCalc do as well, but I can’t say for sure since I don’t use these tools.

Charles

Thanks a million!!!

You’ve been a big help.
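The bands listed in the question can be written as a small lookup. This is only a sketch of that one convention, which, as noted in the reply, is not universally agreed on. The function name `interpret_kappa` is hypothetical, and assigning each boundary value (0.2, 0.4, …) to the higher band is an assumption, since the question does not say which side the boundaries belong to.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the descriptive bands quoted in the question."""
    if kappa < 0:
        return "no agreement"
    bands = [(0.2, "poor"), (0.4, "fair"), (0.6, "moderate"), (0.8, "good")]
    for upper, label in bands:
        if kappa < upper:      # boundary values fall into the next band up
            return label
    return "very good"


labels = [interpret_kappa(v) for v in (-0.01, 0.32, 0.65, 0.92)]
print(labels)
```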

Hi

I have a problem taking results from the reliability testing. I have 3 columns with marks (1,2,3) from 3 raters and when i run the Fleiss’s kappa on these values

rater1 rater2 rater3

1 1 2

3 3 3

1 1 1

2 2 2

3 3 3

3 3 3

2 2 1

1 1 2

I get this table

Fleiss’s Kappa

alpha 0,05

tails 2

Total rater1 rater2 rater3

#N/A #N/A #N/A #N/A #N/A

#N/A #N/A #N/A #N/A #N/A

#N/A #N/A #N/A #N/A #N/A

#N/A #N/A #N/A #N/A #N/A

#N/A #N/A #N/A #N/A #N/A

#N/A #N/A #N/A #N/A #N/A

please help if possible!

This is not the correct format. Each row should be a different subject and each column a different category, with each cell holding the number of raters who assigned that category to that subject. The sum of the values in each row must be the same across all the rows.

Charles

thank you very much for your reply!

I am sorry got it wrong! I had to sum the same marks and make a new table like that. (Please correct me if i am wrong!)

mark1 mark2 mark3

2 1

3

3

3

3

3

2 1

2 1

Thank you for the tools anyway!

Kostas,

This looks good, although when I see it on my computer all the columns are shifted left — in particular, there are no values for Mark3. You would have had to insert zeros for blank values, but this should not be a problem on the Excel spreadsheet.

Charles

yes the empty cells are ignored when i paste. anyway it works fine on excel. thanks again!
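The conversion discussed in this exchange, from one column per rater to the layout the article expects (one row per subject, one column per category, as in range B4:E15 of Figure 1), can be sketched as follows. The ratings are the eight rows posted above; the function name `to_count_table` is hypothetical. Categories that nobody chose get an explicit 0 rather than a blank.

```python
from collections import Counter

# One row per subject, one entry per rater (marks 1, 2 or 3)
ratings = [
    (1, 1, 2), (3, 3, 3), (1, 1, 1), (2, 2, 2),
    (3, 3, 3), (3, 3, 3), (2, 2, 1), (1, 1, 2),
]
categories = [1, 2, 3]


def to_count_table(ratings, categories):
    """Count, for each subject, how many raters chose each category."""
    table = []
    for row in ratings:
        tally = Counter(row)                       # missing marks count as 0
        table.append([tally[c] for c in categories])
    return table


counts = to_count_table(ratings, categories)
for row in counts:
    print(row)
```

Every row of the result sums to 3, the number of raters, which is the condition the count table must satisfy.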

Data I sent earlier may be confusing. Please calculate Fleiss kappa and provide worksheet for following data-

A 70 3 2 4

B 2 44 5 0

C 3 6 29 1

D 5 1 1 24

Sorry, but I don’t know what the data means. What does each row and each column mean?

Charles

i do not know to calculate Fleiss Kappa for following data. Please help

A 70 3 2 4

B 2 44 5 0

C 3 6 29 1

D 5 1 1 24

Hello Charles,

Unbelievably helpful…thank you! However, I am getting very low kappa’s for several surveys that we are making and I have read about this paradox when there is very high agreement (skewed distribution). Can you take a peek at our spreadsheet to ensure it is correct. It reflects 10 dichotomous ratings by 10 people about the understanding of survey items (first column). They are in very high agreement. Is there a way to handle this and way to report the kappa (or ranges) that is meaningful? Blue rows were not included in the analyses. I truly appreciate any insight that you can provide.

Renee,

If you send me an Excel file with your data and calculations I will look at it.

See Contact us for my email address.

Charles

Hi, Charles

Thanks for the great article. I think it is very helpful for analysing multiple-rater kappa (Fleiss’ Kappa). I have some problems, such as a negative kappa, and I’m not sure that I calculated the data correctly. Would you mind rechecking my data calculation?

https://drive.google.com/open?id=0B9MaApJTSiHwTTYzM0JnNXdvMXc

Regards,

Pick

Pick,

No this isn’t the correct calculation. The main problem is that your columns correspond to the raters instead of the rating categories. Also the sum of the values in each row should be the same.

Charles

Hi Charles.

My research consists of 6 categories (Strongly Agree, Agree, Slightly Agree, Slightly Disagree, Disagree and Strongly Disagree) , 27 items, and 9 raters. Below are the results based on this article:

m=9

n=27

Pa=0.322016461

Pe= 25.28172394

kappa=1.027921557

se=0.018256452

z = 56.30456314

p-value = 0

The kappa value is 1 which means excellent agreement but the z-value is so high. Is there something wrong based on the results? Thank you very much.

Rose

Rose,

Without seeing the raw data, it is hard for me to comment. More importantly though, Fleiss’ kappa is designed for categorical ratings. You are using ordered ratings and so the order will not be taken into account in Fleiss’ kappa. You might be able to use the intraclass correlation (ICC) instead or some form of weighted Fleiss’ kappa. Here are two articles about this latter approach:

https://www.researchgate.net/publication/24033178_Weighted_kappa_for_multiple_raters

http://www.sciencedirect.com/science/article/pii/S1572312711001171

Charles

Dear Charles,

Thank you very much for providing this useful and informative website.

I have been asked to evaluate multiple readers’ treatment response assessments of 12 subjects based on impression of a change in lesion uptake of a radiopharmaceutical on medical images acquired on three different dates (there are four choices: progression, stable, partial response or complete response). However, the readers have not necessarily evaluated all the same lesions; they were simply told to select those lesions from the images that appeared to have the most uptake. Their response assessment was based on the appearance of the single hottest lesion at each time point, even if it was a different lesion from the baseline choice. In most cases, based on quantitative measures, I can see that they all selected the same lesion at a particular time point. In a few cases, it is obvious that they must have chosen different lesions. The Fleiss kappa value is 0.65, which the general literature appears to regard as ‘good’ agreement but I understand that this descriptive terminology is arbitrary. My questions are:

1. Given the fact that the readers may occasionally have selected different lesions at a particular time point, based on subjective impression of uptake, is a Fleiss kappa evaluation of response assessment valid? Please keep in mind that the object of the study was to show that the response assessments were consistent among multiple readers.

2. The previous question notwithstanding, is a kappa value of 0.65 really ‘good’?

John

John,

1. I am not able to answer your question without a clearer and more detailed description of the situation.

2. There isn’t complete agreement on what is a good value. The value of .65 is only a guideline.

Charles

Dear Charles,

Thanks for sharing this! I’m having trouble with dichotomous variables and 4 raters. Is there any option to calculate an ICC or Fleiss’ Kappa with dichotomous variables?

Tim,

You should be able to use Fleiss’s kappa with a dichotomous variable. What sort of problem are you having?

Charles

I’m having a lot of zeros in my data. For example, a column could contain 300 zeros and only one 1.

So there is no use in calculating mean and se?

Thanks!

I can’t say for sure without seeing the data, but with such data it may not be useful to compute Fleiss’ Kappa or any other similar measurement.

Charles

Is there a possibility that I send you an example of the dataset?

Thanks in advance.

Yes, please send me the data.

Charles

Thank you for this information on Fleiss Kappa.

I am still having difficulties in conducting the Fleiss Kappa. I have 2 raters that evaluated over 1000 events for 3 different evaluation categories. The raters can only use 1 category out of the 3 to rate an event. I cannot do a Cohen Kappa because more than 3 evaluation category so I was doing an agreement score between the 2 raters but I have been told to do instead a Fleiss Kappa. I have organized my Excel table where for each event the evaluation category is 0 if no agreement and 1 if the 2 juges agree. So for each event you will have either three scores as 1 (if agreement) or two scores as 0 and one score as 1 (if disagreement). I wanted to do a Fleiss Kappa in SPSS or even Excel but I either way I cannot find how. Thank you in advance for your help with this.

Josee,

If you only have two judges, you can use Cohen’s kappa, even with 3 or even more categories.

Charles

Good morning Charles,

In an experiment that we’re doing, we found that in some cases there is perfect agreement between the 10 judges, but Fleiss’ kappa nevertheless comes out as invalid or negative.

Can you help us to analyse these answers? Here I copy the link of the experiment.

https://drive.google.com/open?id=0B5lxuC4qxrbmSWo4YUhNdHhKZ3M

Paola,

If you are getting a negative value for kappa, this indicates that the agreement among the judges is lower than would be expected by chance.

Charles

Hello Charles, Thanks for answering!

We checked your comment, but we don’t understand why the Kappa is negative or equals 0, when the whole group of Judges answer with the same evaluation category.

Here I send you the cases I just mentioned. Help me please 🙂

https://drive.google.com/open?id=0B5lxuC4qxrbmbDRSRDNOd2g2V1k

Thanks in Advance

Paola,

Your calculations seem correct. You are right that this is a strange result. With complete agreement the result is negative. Apparently Fleiss’s kappa is prone to this sort of problem with unbalanced data. See the following webpage for more information:

http://stats.stackexchange.com/questions/47973/strange-values-of-cohens-kappa

Charles

Thank you very much Charles

hello,

I need your help please regarding the calculation of kappa for a study. I have 5 raters for 13 articles. the rate is mainly based on scores from 1 to 5. I want to check the reliability by calculating kappa for each rater and overall.

Simone,

Kappa is a measurement of the overall agreement among raters. You don’t calculate a kappa value for an individual rater.

The way to calculate Fleiss’s kappa is shown on the referenced webpage.

Charles

Charles,

I am conducting a controlled test in which I have 30 evaluators and 10 pairs of samples. I am providing a sample pair to an evaluator to review and determine if the 2 cards are the same or different. Within the 10 pairs I have the 6 pairs that are different and 4 are the same. I know the standards and controls.

My null hypothesis is that the cards are the same.

My alternate hypothesis is that the cards are different.

How would you recommend analyzing the data? One set? Separated?

In Minitab using 1 set I am obtaining a p value of 1.0 in the Fleiss’ Kappa Statistics section and negative Kappa values.

Fleiss’ Kappa Statistics

Response Kappa SE Kappa Z P(vs > 0)

Different -0.310712 0.05 -6.21425 1.0000

Same -0.310712 0.05 -6.21425 1.0000

Thank you in advance!

Ben, regarding whether you create one Fleiss’ Kappa or two (one for the cards that are different and another for the cards that are the same), this really depends on what you are trying to show. Either can be useful.

I don’t understand why you need to set null/alternative hypotheses or why you have chosen these hypotheses.

Charles

Hi, I am needing some advice please.

I have a database of about 4000 plants and have information on each plant species from various sources. The information was obtained from about 30 different sources (raters). However, many sources (raters) have only provided information for about 10% of the plant database. Hence I have a lot of data on each plant but it is not from each rater. This implies that there are large gaps in the data (but it is sufficient for me to conclude certain answers). The information has been captured in Excel. This also implies that there are large gaps between data in each column as well as in the rows. It has been recommended that I use Cohen’s Kappa (to test the agreement between raters); however, from my reading I think that Fleiss’ Kappa is more suited. Can you advise on this? Also, can you advise me of a website or YouTube site that will explain how I set up a data set such as this correctly and what process and formula I use (preferably a step-by-step guide)?

Leslie,

Fleiss’s kappa can be used when you have many raters. It is not necessary that each rater rate each subject. The main criteria are that (1) each subject is rated by the same number of raters and (2) the ratings are categorical (i.e. not ordered, such as in a Likert scale or with a decimal value). If the ratings use a Likert scale or a numerical value then the ICC might be a better way to go.

Before setting up your data, you need to be clear about what sort of data you have (i.e. number of raters, types of ratings, etc.).

Charles

Dear Charles Zaiontz,

Thank you for this great explanation. I would really appreciate it if you could help me with the following:

I have 76 raters who had to listen to two audio recordings (one native language & one foreign language) and transcribe what they heard. Some of the raters were primed. I categorized their transcripts as 1 = correct transcription, 2 = biased transcription and 3 = other.

Now I would like to analyse if the agreement is higher in the primed group than in the control condition and if this agreement is even stronger if the raters were transcribing a non-native language.

Would Fleiss Kappa be the right choice in this case?

Many thanks in advance!

Kind regards,

Maike

Maike,

You can calculate Fleiss’s kappa for the group of primed raters and then calculate another Fleiss’s kappa for the group of unprimed raters and then compare the results. The measurement that you get will be quite limited since you have only two subjects (i.e. the two audio recordings).

When you try to compare agreement in transcribing native vs. non-native languages, you reduce the number of subjects down to 1, which will violate the assumptions of Fleiss’s kappa. In this case you might as well simply compare the variances of the ratings.

The following is something I found on the Internet which may be useful

http://stats.stackexchange.com/questions/132609/comparing-inter-rater-agreement-between-classes-of-raters

Charles

Thank you for your wonderful explanation. According to my data set I am getting a Z value of 17.8. Similarly, each category’s Z value is above 25. What should I infer from this?

K= 0.437138277

var= 0.000593

se= 0.024351591

k/se =17.95111749

for each category

k1 0.722940023 0.301151045 0.270887021 0.154798991 0.392763376 0.497409713 0.401404805 0.473193742 0.296002606

var(k1) 0.000390278 0.0003871 0.000595739 0.001114339 0.000594559 0.000777897 0.000547207 0.000653159 0.000385771

se(k1) 0.019755455 0.01967485 0.02440776 0.033381712 0.024383589 0.027890808 0.023392452 0.025556984 0.019641059

z 36.59445027 15.30639563 11.09839766 4.637239447 16.10769338 17.83418051 17.15958671 18.51524193 15.0706028

Sreeja,

It is very difficult for me to answer your questions without some context. It would be better if you sent me an Excel file with your raw data and the analysis.

Charles

Sreeja,

Thanks for sending me the Excel spreadsheet. This makes everything much clearer.

I am also getting z values which are high, even higher than yours. The reason for the difference is that I am calculating a different value for the standard errors. The standard errors I am using come from the following paper.

Fleiss, J. L., J. C. M. Nee, and J. R. Landis. 1979. Large sample variance of kappa in the case of different sets of raters. Psychological Bulletin

What is the source of the standard errors that you are using?

Charles

Thank you. It’s very useful to understand the procedure of statistic.

In this page, There’s only explanation for ‘Within or between Appraisers’

Could you add some explanation for ‘Appraisers vs standard’ ?

It will be a lot help for me. Thank you

John,

This depends on your specific requirements. One approach is to use a Bland-Altman analysis. See the following webpage:

Bland Altman

Charles

Thank you, Sir

Dear Sir.

This website helped me a lot in understanding statistics. Thank you.

Can you help me with the following?

I want to measure the interobserver agreement of three observers(raters) for evaluation of gastric tumors.

There are 445 subjects (n) and only two categories (whether benign or malignant).

It seems that Fleiss’ Kappa calculates the overall agreement of the three observers.

Can the Fleiss’ Kappa be used to calculate agreement between two observers?

For example, between observer 1 vs observer 2, observer 2 vs observer 3, and observer 3 vs observer 1? (I’m guessing not. Then should I use weighted Cohen’s kappa three times instead?)

Thank you.

M.C.

For pairwise agreement you can use Cohen’s Kappa (or Weighted Kappa).

Charles
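The pairwise approach suggested in the reply can be sketched as one unweighted Cohen’s kappa per pair of observers. The ratings below are hypothetical (8 subjects rated B = benign or M = malignant by three observers) and serve only to make the example runnable; the function name `cohen_kappa` is likewise an assumption. With only two categories there is nothing for weights to act on, so the unweighted form is used here.

```python
from itertools import combinations


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equally long lists of category labels."""
    n = len(a)
    categories = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n                        # observed agreement
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)  # chance agreement
    return (p_o - p_e) / (1 - p_e)   # undefined if both raters use a single category


# Hypothetical ratings, for illustration only
observers = {
    "obs1": ["B", "B", "M", "M", "B", "M", "B", "B"],
    "obs2": ["B", "B", "M", "B", "B", "M", "B", "M"],
    "obs3": ["B", "M", "M", "M", "B", "M", "B", "B"],
}

pairwise = {
    (r, s): cohen_kappa(observers[r], observers[s])
    for r, s in combinations(observers, 2)
}
for pair, value in pairwise.items():
    print(pair, round(value, 3))
```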

Thank you for your explanation! I think I know why my previous attempts to calculate a kappa came out with nonsense, but I am not sure how to set up my data.

I have three raters, each watches two videos for each of 50 subjects. On each of the two videos, they score two different aspects on different scales. On video one, they score 6 items for aspect “A” on a 1-5 scale, and another for aspect “B” on a 0-14 scale. On the second video, they again score 6 items for aspect “C” on a 1-5 scale, and another for aspect “D” on a 0-12 scale. So I really have 14 evaluations per subject per rater, and two of them have a large number of possible scores (13-15). We were adding all the “A” items to get a score out of 30, and similarly all the “C” items to get a score out of 30, to only have 4 evaluations per subject per rater. I am afraid this is artificially lowering our inter-rater reliability, because it seems much less likely that any 2 raters will all agree on a score out of 15 or 30 than a score out of 5.

Do I need to run 14 kappa tests? And then, how do I get an overall inter-rater reliability rating from them? Or can the kappa test be set up to handle large numbers of possible scores? Should I be doing something totally different?

Rose,

From your description, you can calculate 14 ICC values (or 4 ICC values based on the combined scores). If you have a way of calculating a combined rating, you can use this to compute one ICC.

If you have 14 ICC values, you can create a combined value based on what you plan to use the ICC for. E.g. you could use the minimum ICC value.

Charles

Hi Dr. Zaiontz,

Great tutorial! Straight to the point and instantly gratifying! I have a 22 raters evaluating 50 different scenarios. They then categorize the scenarios into 6 possible groups (nominal variable) according to what they believe is the correct category. I have calculated the interrater reliability using fleiss kappa according to your methods but I am also interested in overall rater reliability of all of the judges compared to a gold standard. The gold standard rating was done through the consensus of three independent raters using a tool. What test would you run to compare rater reliability to a gold standard?

thank you!

Hello Felix,

The Real Statistics website doesn’t cover this issue. The following paper may be useful in identifying an approach which meets your needs.

https://orbi.ulg.ac.be/bitstream/2268/39575/1/vanbelle-thesis-5-5-2009.pdf

Charles


Dear

Thank you for your explanation, I have a question

I’m following the formula for kj, but the values I get are not the same as in your table. Could you help by reviewing whether the values in your table are correct?

Thank you

Fernando,

I will look at your results if you send me an Excel file with your data and results. You can get my email address at Contact Us.

Charles

Good day,

Would it be possible for you to send me the practice Excel file?

You can download the example files at

Examples Workbook

Charles

Dear Mr Zaiontz,

Thank you for excellent explanations of the Kappa statistics.

I have a few questions:

Regarding the z-value, it seems to always compare the result to K=zero. You say so on this web-site, and my statistical software has the same default which I cannot change. I find that a bit strange, it is not always good enough to be just a little better than tossing a coin. To introduce a new method you might want to prove it to be significantly better than fair Kappa ( which often is set at K=0.40, but I am aware that definitions vary between different authors).

If I want to compare Kappa values between two different sample populations, with the hypothesis that the Kappa of the test population is significant higher than Kappa of the reference population, I believe I have to calculate the z-score manually by subtracting the mean of the reference population from the test population and divide by SE of the test population. Then I calculate the p-value choosing a one-sided or two-sided test.

Do you agree on this, or are you of another opinion?

(I have a material with 3 observers/judges and 106 samples at two different departments with different training, and the size is chosen by a power calculation based on previous publications on the topic.)

Best regards,

Malin

Dear Malin,

I don’t know any tests to determine whether kappa is higher than some value p. I am not sure whether any simple approaches will work as the case for testing the correlation coefficient reveals. The approach for determining whether a correlation coefficient is significantly different from zero is quite different from that used to determine whether the correlation coefficient exceeds some value. This is because in the first case, you can assume a normal distribution (at least under the null hypothesis) and use a t distribution, while in the second case you can’t assume a normal or t distribution.

I also don’t know any methods for comparing two values of kappa based on two different measurements. The approach that you are considering may be appropriate if you can figure out what a suitable pooled s.e. should be. Again we can look at the correlation coefficient where the pooled s.e. is not so obvious.

I have found the following research paper, which may be useful, although I have not read it myself.

McKenzie DP. Mackinnon AJ. Peladeau N. Onghena P. Bruce PC. Clarke DM. Harrigan S. McGorry PD. Comparing correlated kappas by resampling: is one level of agreement significantly different from another?. Journal of Psychiatric Research. 30(6):483-92, 1996 Nov-Dec.

Charles

Hi Charles

Thank you for a very useful website.

I followed the instructions to calculate Fleiss’ kappa, including the s.e. and CIs, but I ran into trouble with the p-value. My Excel does not seem to like the command with NORMDIST, and I tried following the steps to install your software but no luck so far. Therefore, is there a formula to calculate the p-value? Or how can I ‘unpack’ the command you suggested above? Is there a way to ‘tell’ Excel what to do in a different way?

Many thanks

Paula,

The website uses the function NORMSDIST not NORMDIST. NORMSDIST(x) is equivalent to NORM.S.DIST(x,TRUE) on newer versions of Excel (although the older version still works).

Charles
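For readers working outside Excel, the same standard normal quantity is available in most languages. The following is a minimal Python sketch, not part of the Real Statistics software. It assumes the two-tailed p-value is taken as 2·(1 − Φ(|z|)), and the function names are made up for illustration.

```python
from statistics import NormalDist

def normsdist(x):
    """Standard normal CDF, the counterpart of Excel's NORMSDIST(x)."""
    return NormalDist().cdf(x)

def two_tailed_p(z):
    """Two-tailed p-value for a z statistic (assumed form: 2 * (1 - Phi(|z|)))."""
    return 2 * (1 - normsdist(abs(z)))

print(round(normsdist(1.96), 4))     # 0.975
print(round(two_tailed_p(1.96), 4))  # 0.05
```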

Thank you for your reply, Charles.

And thank you for picking up my mistake – all sorted now! My p-value came up with 0 – by reading the previous comments on here, I’m guessing that this is acceptable. With regards to reporting the inter-rater reliability, would you say that it is better to report Kappa and CIs, or should I report the p-value as well? Sorry, I’m quite new to this.

Many thanks again.

Paula

Hi,

first of all i want to thank you for the article.

I have a question concerning Fleiss’ Kappa. You wrote: “[…] in the case of Fleiss’ kappa, there may be many more than m judges and not every judge needs to evaluate each subject; what is important is that each subject is evaluated m times.”

So my problem is this: we have around n=350 subjects, k=7 categories and m=6 judges, but (a) not every judge evaluated each subject and (b) every judge can assign each subject to 2 categories (it’s a pre-evaluation; after the pre-evaluation we want to meet, discuss our results and finalize our categorization).

Is there any way I can still work with Fleiss’ Kappa? Is m=12 in this case, because each judge can assign each (problematic) subject to 2 categories?

Thanks in advance!

One remark concerning (b): a subject can be evaluated into 2 categories, but doesn’t have to.

David,

I don’t completely understand the scenario. You say that there are 7 categories but you also say there are 2 categories. Which is it? Perhaps it would be helpful to have an example with some data.

Charles

I got it .. thank you very much .. I will read it and study it … if there any question I will send it to you .. thank you very much again.

Thank you very much … I need help

what do you recommend to carry out a content validity index

Sorry, but I haven’t addressed this topic on the website yet. The following document may be useful:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.460.9380&rep=rep1&type=pdf

Charles

Thank you very much

but I failed to open the link or download the file.. can you please send the pdf as attachment to my e-mail .. elsayedamr@yahoo.com

Just enter the link on your browser and you should be able to access it. It is not from my site, but from another site on the Internet.

Charles

I failed again … If you please what is the title of the paper and I will manage getting it.

The title of the paper is A Quantitative Approach to Content Validity by C. H. Lawshe of Purdue University.

Charles

Hi Charles.

Thanks for the wonderful tools and explanations. I am trying to use the Real statistics free download to perform a Fleiss’ kappa analysis and running into some issues. When I copy the sample data from Figure 1 into Excel and use the Fleiss’s kappa option under the reliability procedures menu the output field comes back with every cell filled in #N/A. Can you suggest some solutions to this issue?

Thanks,

Amanda

Amanda,

Perhaps that is because you also used column A in the Input Range. The input should not include this column.

Charles

Sir,

Can Cohen’s Kappa or Fleiss’ Kappa be used for any other kind of agreement data, or is the ‘rater’ here restricted to human respondents? Can the ‘rater’ also mean a previous study?

Thank you.

A rater doesn’t have to be human. It can be a previous study.

Charles

Thank you so much Charles!!

Dear Dr. Leinder

Thank you very much for the article. I am working on biotic indices. I want to see the agreement of the indices to evaluate the ecological status. The indices are classified into 5 categories (Bad, poor, Moderate, Good and high). I have about 610 samples.

I would like to know which test to apply. Kappa or ICC.

The categories used are i.e Bad=1, Poor= 2 Moderate= 3 etc.

In general, how do i arrange the data. I have arranged the data as following

Example 1

Stn BI1 BI2 BI3 BI4 BI6 BI 7 BI8

Stn1 1 2 3 2 3 4 3 3

Stn 2 1 2 4 2 3 4 3 4

Example 2

Bad Poor Mod Good High

Stn 1 1 2 4 1 0

Stn 2 1 2 2 3 0

Which is the correct method? If Example 2 is correct, is there an easy method to do it? I did a few samples manually, but 610 samples is too many.

Thanks!

Hi,

Looking at the equation to compute kappa for the jth category, I can see that the denominator includes the number of subjects assigned to that category. I would like to know your opinion about the following conclusions I loosely deduce from this equation:

Categories that are bigger in the data and get a higher number of subjects assigned to them, might probably have higher values of kappa than those much smaller categories (with very few subjects belonging to them) just due to the fact that they are bigger (i.e., kappa is biased towards size of categories in the data itself)

Additionally, can you suggest any other inter-rater agreement measure that’s less sensitive to categories size (assuming 3 annotators)?

As I said in my previous comment, I am not certain whether Fleiss’ Kappa is indeed sensitive to category size. In any case, the bigger issue is whether you have categorical data or ordered data. In the first case, Fleiss’ kappa is a reasonable choice. In the second case, the intraclass correlation is a reasonable choice.

Charles

Hi,

I haven’t had time to really look at the formulas to see whether your observation is true or not, but I noted that for Example 1, the exact opposite is true. The two categories with the highest number of subjects have the lowest kappa values.

Charles

Hi,

Thanks for the great tutorial and tool. I am working with thousands of samples, 2 categories (yes and no) and 3 annotators. When I computed Fleiss Kappa using the Excel tool, the results I get show that the per-category kappa values are equal for both categories (e.g., “yes” kappa = “no” kappa = 0.47). I repeated that computation over a completely different set of samples but with same setting (two categories, 3 annotators) and I also get the same result (i.e., per-category kappa values are equal across the categories). Is this normal? Could you please explain why this is happening?

Just to add more context, in both sets of data, I realized that the standard deviation of annotations counts is equal across the categories, but the mean is different.

If you send me an Excel file with your data and results, I will try to figure out what is happening. You can find my email address at Contact Us.

Charles

Just did. Thanks for the great help!

Maram,

Since you only have two categories, the values for the two categories will be equal. This won’t be the case with more than two categories.

Charles
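The equality follows from the per-category formula. With two categories, x_i2 = m − x_i1 for every subject, so both the sum of x_ij(m − x_ij) and the term q_j(1 − q_j) are identical for the two categories. Below is a small Python sketch with made-up counts, assuming the usual per-category definition κ_j = 1 − Σ_i x_ij(m − x_ij) / (n·m·(m − 1)·q_j·(1 − q_j)); the function name is hypothetical.

```python
def category_kappas(counts):
    """Per-category Fleiss' kappa for an n x k table of rating counts.

    counts[i][j] = number of raters assigning category j to subject i;
    every row must sum to the same number of ratings m.
    """
    n = len(counts)
    m = sum(counts[0])
    k = len(counts[0])
    kappas = []
    for j in range(k):
        q = sum(row[j] for row in counts) / (n * m)   # share of all ratings in category j
        disagreement = sum(row[j] * (m - row[j]) for row in counts)
        kappas.append(1 - disagreement / (n * m * (m - 1) * q * (1 - q)))
    return kappas

# Hypothetical data: 6 subjects, 3 annotators, categories (yes, no)
yes_no = [[3, 0], [2, 1], [0, 3], [1, 2], [3, 0], [0, 3]]
print(category_kappas(yes_no))   # the two values coincide
```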

Hello. I need help figuring out how to get the Fleiss’ kappa value if I have 5 expert evaluations and the scale used for each item is 1 to 10.

example

item expert 1 expert 2 expert 3

1 8 9 10

2 8 8 8

3 9 9 9

4 10 10 10

.

.

.

81 9 9 9

tq

I don’t recommend using Fleiss’ kappa with ordinal data. Better to use Intraclass Correlation Coefficient. See the webpage

Intraclass Correlation Coefficient

Charles

Sir

I have a question in the case of reference value known.

How should the data matrix be handled if we need to evaluate the overall Fleiss’ kappa when the reference value is known for each sample? Suppose we already know the actual illness of each of these 12 patients, and the 6 judges now give their judgment for each patient.

This situation is essentially an Attribute MSA, and MINITAB gives an answer for it, but I worked through the formulas several times and was not able to reproduce its exact answer in this case. However, my derivation for the case where the reference value is unknown (like your case here) agrees with MINITAB’s.

Appreciate your reply.

Thanks.

Vincent,

I believe that you are looking for an implementation of Attribute Gage R&R. I plan to look into doing this in one of the next releases of the software.

Charles

Our P value in the row is as mentioned below:

m 3

n 1000

pa 0.4815

pe 0.253083444

kappa 0.305812683

s.e 0.010177511

z 30.04788446

p-value 0.0000

alpha 0.05

lower 0.285865127

upper 0.325760238

The P value in the column is as indicated below:

q 0.074 0.206333333 0.334333333 0.290333333 0.094666667

b 0.068524 0.163759889 0.222554556 0.206039889 0.085704889

k 0.708131458 0.395456356 0.167245385 0.221833528 0.377709556

s.e 0.018257419 0.018257419 0.018257419 0.018257419 0.018257419

z 38.7859573 21.66003666 9.160406974 12.15032275 20.68800442

p 0.00000 0.00000 0.00000 0.00000 0.00000

question:

Is there any issue if our P value is a null element?

Please reply as soon as possible.

Thanks…

The p-value can be zero (probably a very small number). I have not checked your calculations, but you can get a zero answer. You can check your answer by using Real Statistics’ KAPPA function.

Charles
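The posted figures can be checked directly, and the check also shows why the p-value displays as zero. The sketch below takes pa, pe and s.e. exactly as reported in the comment and assumes a two-tailed normal test. It is an independent check in Python, not the Real Statistics KAPPA function.

```python
from math import erfc, sqrt
from statistics import NormalDist

pa, pe, se, alpha = 0.4815, 0.253083444, 0.010177511, 0.05  # values as posted above

kappa = (pa - pe) / (1 - pe)                   # chance-corrected agreement
z = kappa / se
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # counterpart of NORMSINV(1 - alpha/2)
lower, upper = kappa - z_crit * se, kappa + z_crit * se

p_naive = 2 * (1 - NormalDist().cdf(z))  # Phi(z) rounds to exactly 1.0, so this is 0.0
p_tail = erfc(z / sqrt(2))               # same two-tailed p-value, computed from the tail

print(round(kappa, 6), round(z, 3))      # 0.305813 30.048
print(round(lower, 4), round(upper, 4))  # 0.2859 0.3258
print(p_naive, p_tail > 0)               # 0.0 True
```

With z around 30 the true p-value is positive but far too small to show in any reasonable number of decimals, so a displayed value of 0 is expected.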

Thanks a lot, sir. I have sent u an e-mail with the excel sheets for calculation. Can you please have a look and give the solution?

Hi! Thanks for this page.

I have two questions:

1. I would like to compare Fleiss kappa values, for example, 0.43 (95% CI 0.40, 0.46) and 0.53 (0.43, 0.56). The confidence intervals overlap – is this enough to say that the two Fleiss kappa values are not statistically significantly different from each other? I have read that this may not be the case, link: http://www.cscu.cornell.edu/news/statnews/stnews73.pdf

2. I often see % agreement reported in articles when using Cohens kappa (2 observers). Is this something that is done with Fleiss kappa? If so, how is it calculated (is it number of cases where all observers rate the same/total number of cases)? I have noticed that the % agreement I obtain when calculating Light’s Kappa instead of Fleiss kappa is always higher…

Trina,

1. The article you referenced states that just because the confidence intervals overlap doesn’t mean that the two Fleiss kappa values are not statistically significantly different from each other.

Here is a web reference regarding comparing two Fleiss kappa values. Perhaps it is useful to you.

https://www.researchgate.net/post/Can_anyone_assist_with_Fleiss_kappa_values_comparison

2. You can get information about Cohen’s kappa at the webpage

Cohen’s Kappa

Charles
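One rough check, which goes beyond what is described on this page, is a z-test on the difference between the two kappas. It assumes that the two estimates come from independent samples and are approximately normal. The numbers below are hypothetical (they are not the values from the question) and are chosen so that the 95% intervals overlap while the difference is still significant at the 5% level. The function names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def compare_kappas(k1, se1, k2, se2):
    """z-test for the difference of two kappas.

    Assumption: the two estimates are independent and approximately normal,
    so the standard error of the difference is sqrt(se1^2 + se2^2).
    """
    z = (k2 - k1) / sqrt(se1 ** 2 + se2 ** 2)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

def ci95(k, se):
    """Approximate 95% confidence interval k +/- 1.96 * se."""
    h = NormalDist().inv_cdf(0.975) * se
    return k - h, k + h

# Hypothetical values for illustration only
k1, se1 = 0.43, 0.015
k2, se2 = 0.48, 0.015

print(ci95(k1, se1), ci95(k2, se2))      # the two intervals overlap
print(compare_kappas(k1, se1, k2, se2))  # yet p is below 0.05
```

If the two kappas are computed on the same subjects, the independence assumption fails and a resampling approach is more appropriate.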

Hi Charles,

I am getting unusual results following the guide when the raters are at almost perfect agreement.

I have 9 cases, each with 4 categorical ratings by 8 raters.

Cases\Ratings 1 2 3 4

1 0 0 0 8

2 0 0 0 8

3 0 0 0 8

4 0 1 0 7

5 0 0 0 8

6 0 0 0 8

7 0 0 0 8

8 0 0 0 8

9 0 0 0 8

The kappa is -0.014 and the p-value is 1.177. Will you be able to advise as I am uncertain what’s going on?

Thanks!

Cheers,

BC

BC,

Unfortunately, many of these measures give strange results in the extreme cases. The best way to look at these situations is that when you have almost perfect agreement you don’t really need to use Fleiss’s Kappa. Not very satisfying answer, but it is true.

Charles
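BC's figures can be reproduced, which suggests they are not a spreadsheet error. The Python sketch below assumes the commonly used large-sample standard error for Fleiss' kappa under the null hypothesis. It also assumes the p-value was obtained as 2·(1 − Φ(z)) without taking the absolute value of z, which is how a value above 1 can arise when kappa is negative. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def fleiss_kappa(counts):
    """Overall Fleiss' kappa and its standard error for an n x k count table."""
    n, k = len(counts), len(counts[0])
    m = sum(counts[0])                                   # ratings per subject
    q = [sum(row[j] for row in counts) / (n * m) for j in range(k)]
    pa = sum(x * (x - 1) for row in counts for x in row) / (n * m * (m - 1))
    pe = sum(qj ** 2 for qj in q)
    kappa = (pa - pe) / (1 - pe)
    b = [qj * (1 - qj) for qj in q]
    # Assumed large-sample s.e. under the null hypothesis kappa = 0
    se = (sqrt(2 / (n * m * (m - 1))) / sum(b)
          * sqrt(sum(b) ** 2 - sum(bj * (1 - 2 * qj) for bj, qj in zip(b, q))))
    return kappa, se

# BC's table: 9 cases, 8 raters, one rater deviates on case 4
table = [[0, 0, 0, 8]] * 3 + [[0, 1, 0, 7]] + [[0, 0, 0, 8]] * 5
kappa, se = fleiss_kappa(table)
z = kappa / se
p_signed = 2 * (1 - NormalDist().cdf(z))          # exceeds 1 when z is negative
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))  # stays between 0 and 1

print(round(kappa, 3), round(p_signed, 3), round(p_two_sided, 3))  # -0.014 1.177 0.823
```

Here the observed agreement (about 0.9722) is just below the chance agreement (about 0.9726), because almost all ratings fall into a single category.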

Thanks Charles for the clarification. It’s a good answer to clarify my doubts as I was sifting through all my equations in Excel and found no errors.

In a scenario like this, will analysing by ICC for inter-rater reliability (multiple rater) be better? What’s your recommendation?

You would use the ICC with rankings that are quantitative. Fleiss’s kappa is used with categorical data. In situations like this you would simply note that there is obvious agreement and not use any of the usual measures.

Charles

Thanks Charles for the further explanation, much appreciated!

Thank you for this page. It has helped but I am now wondering if Fleiss’ kappa is the correct statistical test for our purposes. I am running a test-retest reliability study on a biological response to a stimulus. There are 4 response pattern categories. Six subjects underwent 4 repeated tests. We want to know if the same subject will attain the same test response to the same stimulus under the same conditions.

Currently, my subjects are the rows, and the response pattern categories are the columns, with each test week as the ‘rater’.

In your opinion, should we use Fleiss’ kappa or ICC?

Thanks for this page, I am a stats novice, so am very grateful for having a clear ‘recipe’ to follow.

Catherine

Catherine,

If the data is ordered (especially with values such as 41.7, etc.) ICC will take the order into account. If the data is categorical (no order) then Fleiss’ is the approach to use.

Charles

Data is categorical so we will stick with Fleiss’ kappa. Thank you so much for your assistance and your prompt reply.

Catherine

Ok Charles thank you very much.

Dear Charles,

I computed the standard error and confidence interval according to your instructions. May I ask where you found the formula for the confidence interval? Or did you come up with it on your own?

Thanks a lot for providing this information!

Cheers,

Henk

Henk,

I think I got it from the following paper

http://www.bwgriffin.com/gsu/courses/edur9131/content/Banerjee-BeyondKappa.pdf

Charles

Thank you very much for your reply, Charles.

I am not quite sure what I need to use, though.

I will give an example to make the understanding of the case easier.

I have 10 criteria that are hypothesised to be important in establishing a diagnosis. 30 experts will rate the importance of each criterion on a likert scale from 1 (not important) to 5 (very important).

What I want to know, is what statistic (besides CV) I can use to evaluate the agreement between all the 30 experts on each single criterion.

Many thanks!

Mattia,

If you want to rate the agreement between all 30 experts across all 10 criteria then you could use the intraclass coefficient (ICC) as described on the webpage Intraclass Coefficient.

For your situation, instead of Judges A, B, C and D in Example 1, you would have the 30 experts (these are the columns), and instead of 8 wines you would have the 10 criteria. The table would be filled with ratings 1 to 5.

If instead you want to evaluate agreement one criterion at a time, then I don’t have any specific advice to offer. I have come across the following paper which may be helpful. I have only read the Abstract and I am not sure about the paper’s quality or applicability to your problem.

http://www.amsciepub.com/doi/abs/10.2466/pr0.1998.82.3c.1321?journalCode=pr0

Charles

Dear Charles,

I am trying to identify the appropriate measure to use for my inter-rater agreement test. After much research, I believe it may be the Fleiss’ kappa measure, but am not sure as I have not been able to locate literature which suggests it can be used in my specific case presented below.

I have developed a multiple choice test with 58 items based on teaching scenarios and what teachers might do in a particular classroom situation. Then I had 5 expert teachers take the test so that I could attain an agreed upon answer key. Now I want to run a test which provides me with a value which justifies the answer key as having substantial agreement (0.61 – 0.80), perfect agreement (0.81 – 1.00), etc.

I cannot seem to find literature specific to obtaining inter-rater agreement of an answer key to a multiple choice test. In short, my question is what index is appropriate to test the level of agreement for this situation and what research is there to back that decision?

Any advice would be greatly appreciated.

David

Hi Charles,

I would also like to clarify that the 58 multiple choice questions each have three choices, which only 1 is correct. These three choices are nominal or categorical in nature.

David,

Generally, in situations where it is clear what the correct answers to a multiple choice test are, you might run Cronbach’s alpha to check the internal consistency of the test, but it seems that you are testing for agreement among experts as to what the correct answers are, and the categories are nominal. This seems to be a good fit for Fleiss’ kappa.

Charles

Thank you Charles.

I have forged ahead on your recommendation. The kappa value that was returned using the program provided on the domain https://mlnl.net/jg/software/ira/ (Geertzen, 2012) was 0.77, which according to Fleiss (2003) represents excellent agreement beyond chance. Would you agree with this assessment of the returned k value? Have you seen the on-line program I refer to above?

Thank you, David

David,

I am not familiar with the program you referenced.

You could use the Fleiss’ Kappa option of the Real Statistics Reliability data analysis tool to calculate Fleiss’ kappa.

Charles

Dear Charles,

I am working on a study proposal for my dissertation which will need to get approval from an Ethics Committee, therefore I need to define the analysis of data in advance.

I will have n criteria, and raters will rate on a 1-5 Likert scale the importance of each criterion in establishing a diagnosis.

In similar studies I have seen that percentage agreement, coefficient of variation and Kendall’s W are commonly used.

From what I have been able to understand, however, Kendall’s W tests the level of agreement between n raters across ALL items, which means it is a measure of the overall agreement.

However, what I want is to identify the items on which consensus has been established; in other words, I need a test that assesses agreement for each individual item (besides descriptive stats such as CV and percentage agreement).

I have found that in a previous study kappa was used. However, I am not quite sure this is appropriate. Would you please shed some light on this?

Thank you for your time!

Mattia

Comment added:

I have also found a study where ICC was used to assess agreement across all items and then kappa was used to assess agreement within each item.

Is this appropriate?

Why ICC for overall agreement and kappa for within item agreement?

If you also had a reference for this that would be great.

Many thanks.

Mattia,

ICC is another way to measure agreement across all subjects. In Fleiss’s kappa the ratings can be thought of as categories (Biology, Math, Reading). Any ordering between the categories is not taken into account. Thus Fleiss’s kappa would generally not be used if the rankings were something like a Likert scale (1, 2, 3, 4, 5). ICC can be used in the case where the rankings are ordered or when the rankings are continuous numbers (34.5, 12.7, etc.).

As I explained in my previous response, both of these measurements assess agreement across all the subjects.

Charles

Mattia,

Cohen’s Kappa is used to assess the level of agreement between two raters of many subjects based on one rating criterion.

Fleiss’s Kappa extends Cohen’s Kappa to the case where you have more than two raters (still only one criterion).

Both of these tests measure agreement across all the subjects.

If by item you mean subject it is pretty easy to assess agreement: for two raters, either they agree or they don’t, and even for more than two raters you don’t need any fancy measurements to assess agreement.

Charles

Hi Charles,

I’m working on a content analysis project and would like to compute a Fleiss’ kappa but I’m not sure how to organize the data. There are five coders, and they’ve coded 20 Tweets for content in 19 categories. Coding values are nominal categories, one through three.

Can you toss me a bone?

Thanks,

Rob

Robert,

Since the 5 raters are rating 20 Tweets, the Tweets are the rows and the sum of numbers in each row is 5.

Since there are 3 nominal ratings 1 through 3, there are 3 columns. The cell in the ith row and jth column contains the number of the raters who have given the ith Tweet a rating of j.

The problem is that you have 19 such ratings, one for each category (if I have understood the problem correctly), which means that you have to calculate Fleiss’ Kappa 19 times, once per category. There may be a multivariate Fleiss’ Kappa that does the job you want, but I am not familiar with such a measure.

If instead you

You do indeed understand the problem correctly. Currently I’m organizing the data manually — converting raw scores into categorical agreement figures (0-5-0, 1-4-0, and so on). It’ll take a while but I’ll give it a shot. (Isn’t there some way to get Excel to do this sort of computation so I can avoid human error?)

Robert,

There may be some easier way of doing this reorganization of the data in Excel, but I’d have to see how the data is formatted to really provide a good answer. But by the time I figured it out you will have probably reorganized the data manually. Sorry about that.

Charles
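The reorganization Robert describes can be automated. In Excel, a COUNTIF over each Tweet's row of codes would produce the counts, and the same idea is sketched in Python below. The layout of the raw data (one list of five codes per Tweet) and the function name are assumptions made for illustration.

```python
def to_count_table(ratings, categories):
    """Turn raw ratings into the count table Fleiss' kappa expects.

    ratings[i] holds the codes given to subject i, one per rater;
    the result has one row per subject and one column per category.
    """
    return [[row.count(c) for c in categories] for row in ratings]

# Hypothetical raw codes: 3 Tweets, 5 coders; 1 = present, 2 = not present, 3 = can't tell
raw = [
    [2, 2, 2, 2, 2],
    [1, 2, 2, 2, 2],
    [2, 3, 2, 1, 2],
]
for row in to_count_table(raw, categories=[1, 2, 3]):
    print(row)
# [0, 5, 0]
# [1, 4, 0]
# [1, 3, 1]
```

The conversion has to be repeated for each of the 19 content categories, since each one yields its own count table.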

Charles,

Thanks for your mindful replies. Let me run one more thing past your eyes and then I’ll stop.

My five coders coded several categories for variables we didn’t expect to see in very many of the 20 tweets, so all of the coders made substantially the same decisions of “not present” (or a code of “2”). Nearly every tweet was coded the same way by every coder, so the frequency grid looks like this for 15 of the Tweets:

0-5-0

There are five cases of a coder that coded “present” (or a “1”), so five of the data lines look like this:

1-4-0

What puzzles me is that the reported kappa is -0.053, and I don’t know how to interpret this result. In another category there is a single disagreement of this type, and the result is a kappa of -0.010.

I’ve computed kappa for the data sets using two templates and I get the same thing so I think I’m computing it correctly. What does a kappa of -0.010 mean?

Thanks in advance for your help!

Rob

When most of the responses are the same, kappa will sometimes give strange results. To me it is a deficiency in the kappa statistic, but if, for example, there is only a single disagreement, there really isn’t much reason to use kappa.

Charles
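The negative values can be traced by hand. With almost every rating in one category, the expected agreement pe is already very close to 1, and the observed agreement pa ends up slightly below it. Below is a short Python check of the two cases Rob describes (the helper name is made up):

```python
def agreement(counts):
    """Return (pa, pe, kappa) for an n x k table of rating counts."""
    n, m = len(counts), sum(counts[0])
    pa = sum(x * (x - 1) for row in counts for x in row) / (n * m * (m - 1))
    pe = sum((sum(col) / (n * m)) ** 2 for col in zip(*counts))
    return pa, pe, (pa - pe) / (1 - pe)

five_disagreements = [[0, 5, 0]] * 15 + [[1, 4, 0]] * 5
one_disagreement = [[0, 5, 0]] * 19 + [[1, 4, 0]]

for table in (five_disagreements, one_disagreement):
    pa, pe, kappa = agreement(table)
    print(round(pa, 4), round(pe, 4), round(kappa, 3))
# 0.9 0.905 -0.053
# 0.98 0.9802 -0.01
```

A slightly negative kappa here only says that agreement is marginally below what the very lopsided category frequencies would produce by chance; it does not indicate a computational mistake.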

Hi Charles,

Need some advice. I don’t know how to organize data so I can compute a Fleiss’ kappa. Here’s what I’ve got:

Five coders have coded 20 Tweets for content in 19 categories with a data range of 3 (1=present, 2=not present, 3=can’t tell). Right now the data exist in an Excel workbook with each coder’s data on separate sheets. I get this much: k=categories (there are three), and m=number of judges (there are five). What is n? The number of Tweets? The number of content categories in the Tweets? How do I convert the five spreadsheets into a grid that expresses agreements in each category?

I can’t get my mind around it. Any tip you might have would be very much appreciated!

Rob

Hi,

In my study, each subject is to be rated on a nominal scale from 1 to 5. If each subject was rated by 2 raters, but the 2 raters were drawn from a pool of 6 raters, I can apply Fleiss’s kappa in assessing the inter-rater reliability, right? Is there any requirement on the minimum times that each rater participated, i.e. should I exclude the rater that rated, say extremely, one subject only?

Moreover, if each subject was rated by different number of raters, what method can I use instead? I find that you have mentioned an extension of Fleiss’s kappa as linked, http://conservancy.umn.edu/bitstream/99941/1/v03n4p537.pdf, but seems that it can be applied for dichotomous variable only and thus not suitable for me.

Would Krippendorff’s alpha be appropriate? But I am not sure if I can treat those raters that do not rate the subject as missing data, since this type of “missing” is not random in nature.

Thanks a lot in advance.

Lam

Lam,

If I understand the problem correctly, you could use the intraclass correlation (ICC), but this measurement requires that all the subjects have the same number of raters. If the number of raters is not too unbalanced and the smallest number of raters is not too small, perhaps you can randomly eliminate some ratings to create a balanced model (alternatively you might be able to use some multiple imputation approach).

I am not familiar with Krippendorff’s alpha, but I do believe there are techniques for handling missing data.

Charles

Thanks Charles,

For my own experience, I used to use ICC for continuous variables and kappa for discrete/ nominal variables. As the data this time is in nominal scale, may I confirm if ICC can still be applied?

In addition, I could now probably fix the number of raters to be 3 each time (not the same raters). Would you suggest whether ICC or Fleiss’s kappa would be more appropriate?

Sorry that I have got so many questions. Thanks again!

Lam

Lam,

To use ICC the data doesn’t need to be continuous, but it does need to be ordinal. E.g. you could use it with a Likert scale (say 1 to 7). If the data has no order then you shouldn’t use ICC. Fleiss’ kappa is only used with nominal data and order is not taken into account. If your data is unordered then you should use Fleiss’ kappa. If it is ordered you should use ICC.

Charles

Hi,

1. I have a set of data with 38 raters rating the content of videos with regards to 8 items* that are all based on the same ordinal scale (very good, good, adequate, insufficient). My results are puzzling for me, because when I’m calculating Fleiss’ Kappa for each item individually (matrix: 1*38) the coefficients (mostly highly significant) are all smaller than when I’m calculating Kappa for all items together (matrix: 8*38). Shouldn’t the overall coefficient (K=0,31) be within the range of individual coefficients (K=-0,13 to 0,28)?

2. Dr. Leidner wrote above that “There are no “criteria” for how to interpret Kappa – you can compare inter-annotator agreement, but to judge its absolute value depends on the task’s inherent difficulty. There are some “guidelines” in the literature, but they are bogus.” My problem is that I have to interpret the effect sizes and I have no idea if my effects are strong or weak. From the info I gave you, can you say if my coefficients are high enough to speak of a good rater agreement? And do you know of any helpful literature on this topic.

Thanks for your help,

Philipp

*Their task is to assess the driving competency of driving license applicants that we filmed during a simulated test drive.

Philipp,

Interpretation of Fleiss’ Kappa seems to be very difficult and depends on many factors (sample size, bias, etc.). See the following article for a description of some of the problems. http://digital-activism.org/2013/05/picking-the-best-intercoder-reliability-statistic-for-your-digital-activism-content-analysis/

Unfortunately, as a result, I don’t have much advice for you regarding the interpretation of Fleiss’ Kappa.

I am about to look into Krippendorff’s Alpha-Reliability as an alternative measure. The following is an article that explains this measure:

http://repository.upenn.edu/cgi/viewcontent.cgi?article=1043&context=asc_papers

Charles

Hi Charles,

Thanks a lot for the tip. That has been very helpful.

I just calculated the agreement using Krippendorff’s Alpha and they are the same as for Fleiss’ Kappa for the nominal part of my data. However, Krippendorff’s coefficient also has the option to consider rank information and can handle missing values, so I trust my results more now. Krippendorff is also giving guidelines on interpretation of his Kappa. One downside: No CI’s can be computed.

Cheers

Philipp

Sorry Charles, I didn’t notice that your email was available. Thanks

Raphaela,

I just checked the spreadsheets you sent me and see that the problem is that the row sums are not all the same. If these sums are not the same you will see errors in all the cells.

Charles
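A quick pre-check along these lines can catch the problem before running the analysis. This is a small Python sketch with hypothetical data and a made-up function name, not a feature of the Real Statistics tool:

```python
def unequal_rows(counts):
    """Return (row number, row sum) for rows whose sum differs from the first row's."""
    m = sum(counts[0])
    return [(i + 1, sum(row)) for i, row in enumerate(counts) if sum(row) != m]

# Hypothetical count table for 15 judges and 2 categories; the third row is one rating short
table = [[10, 5], [15, 0], [9, 5], [7, 8]]
print(unequal_rows(table))   # [(3, 14)]
```

An empty list means every subject received the same number of ratings, which is what Fleiss' kappa requires.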

Good morning,

Can I download RealStat again to calculate Fleiss’ kappa (to calculate the agreement among multiple judges and multiple variables)? Please let me know. Thank you.

Yes. The new version of this function is now available. You can also use the new Reliability data analysis tool.

Charles

Charles, I have a problem. I have a 0-1 matrix with 11 variables and 15 judges. I applied Fleiss’ Kappa (Reliability data analysis tool); the output shows alpha 0.05, but the table below it contains errors. Any idea? Thanks

Raphaela,

Without seeing the spreadsheet I can’t tell what went wrong. The most common problem is that the input data is not formatted correctly. If you send me the spreadsheet I will be able to help you better.

Charles

I really appreciate it! Could you please send me an email, and I will reply with the spreadsheet attached. Thanks

Raphaela,

My email address is given on the Contact Us page (see Main Menu)

Charles

ok. Charles, Thank you 🙂

Professor, how do I calculate Fleiss’ kappa with Real Statistics? Click on the data analysis tool and then what?

thx

Raphaela

Raphaela,

Currently you need to use the KAPPA function as described on the referenced webpage.

Tomorrow I will issue a new release of the Real Statistics Resource Pack. This release has an improved version of the KAPPA function. In addition there will be a new Reliability data analysis tool which will calculate Fleiss’s Kappa.

Charles

OK. Thank you so much.

So, do I download Real Statistics again or only the addition?

Raphaela,

There is no addition, only a full release. I expect to issue the new release later today.

Charles

Hi,

In Figure 2, B17 and B18 rows should be interchanged.

Have a nice day.

Hi Ali,

Great catch. I have now corrected Figure 2 as you have suggested. Thanks for identifying this error.

Charles

Hello,

I am interested in determining the degree of agreement between 4 raters on an observation instrument that involves a rating system of 1 – 4 (levels of proficiency with a target attribute). I believe that Fleiss would be the appropriate measure in this case. Is this correct?

Thanks,

Stacy

Stacy,

Yes, Fleiss’s kappa can be used for this purpose.

Charles

Dear Dr. Leinder,

I want to test the reliability of a newly constructed diagnostic interview. Therefore, raters watched videos of diagnosticians interviewing patients and answered for most of the items on a one-to-five likert scale and for some on a yes/no-basis (as the interviewers did). The agreement between the raters, measured by Fleiss Kappa, is thought to be my estimation of the interrater-reliability. Does that sound right so far?

The problem is that some of the items are multiple choice items. Is it still possible to use Fleiss’ kappa? It seems the statistics programs can’t handle that… If not, could you give me a hint as to what kind of calculation I could use instead? Intraclass correlations (some items are based on categorical scales…)?

And one last (general) question: Is there a relation between interrater reliability and criterion validity, other than reliability being a condition for validity? My prof is always talking about a paper called “study into criterium validity…” (based on the data I described). I didn’t challenge him so far because I’m not sure I’m right…

Thanks so much in advance.

Best regards

Kathi

Dear Kathi,

You have addressed your question to Dr. Leinder. Since I am not Dr. Leinder, perhaps this comment was not intended for me.

Please give me an idea of some of the questions, especially the multiple choice questions.

I don’t know which type of relationship your professor was talking about between interrater reliability and criterion validity.

Charles

Hi,

I am comparing the reliability of 3 different imaging tests for classification of a condition that has 10 different classes. There are 4 raters and 123 cases. Each reader reviews each case 3 separate times, each time using one of the three different models. Each model will have a Fleiss’ kappa. The reliability of the models can then be assessed comparing the three Fleiss’ kappas?

There is no good gold standard for the condition.

thanks

yvonne

Hi Yvonne,

If I understand your situation correctly, Fleiss’s kappa will measure the degree to which the 4 raters agree on their classification for each imaging test. If your goal is to select which of the three imaging tests provides the most agreement between the raters, then it may be worthwhile comparing the kappa measurements (you still wouldn’t know whether any differences are statistically significant). I wouldn’t necessarily call this a measure of the reliability of the models.

Charles

Hi,

Thanks so much for the timely response!

I am attempting to assess the reliability of the assessment tool, not the observers (nurses), though I understand that I may need to assess the observers separately.

What I understand you to say is that I can rate the symptoms as y/n and the rated questions (1 through 9) as Likert-style items. In that case, Cronbach’s alpha is the best tool to test all questions together, given that the goal is to test the tool.

Thanks for your reply,

greatly appreciated.

dove.

Hi;

So grateful for your website and resources. If I might test my idea:

I am attempting to test the reliability of an assessment tool.

The tool has nine (9) questions with responses ranging from 0 to 3, and three (3) fields where the evaluator can select any combination of symptoms, or none of them, from a symptom checklist (8-10 items available to choose from).

As I understand it, I should separate the responses into two data fields, measuring the reliability of the 9 questions using the ICC (intraclass correlation) formula and testing the second half (reported symptoms) using Fleiss’ kappa.

Is this correct?

I greatly appreciate your response.

Hi,

It really depends on what you are trying to measure. E.g. you could use Cronbach’s alpha for the nine questions to determine the internal consistency of the questions. Are you evaluating the reliability of the questionnaire or the degree of agreement between the evaluators (I assume these are the nurses)?

For the symptoms you could treat each symptom as a True/False question (True = patient assessed to have that symptom, False = assessed not to have that symptom). The analysis can then proceed as for the first group of questions (with my same question to you, namely what are you really trying to assess?).

Charles

I did the observation for a process in the operating room. I have 2 sets of observers: assigned observers and participant observers. The results for 14 categories were as follows:

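| | 1 | 2 |
|---|---|---|
| 1 | 230 | 250 |
| 2 | 259 | 260 |
| 3 | 260 | 260 |
| 4 | 249 | 251 |
| 5 | 260 | 260 |
| 6 | 238 | 256 |
| 7 | 250 | 252 |
| 8 | 217 | 229 |
| 9 | 212 | 218 |
| 10 | 258 | 256 |
| 11 | 189 | 197 |
| 12 | 245 | 248 |
| 13 | 245 | 248 |
| 14 | 254 | 258 |
| 15 | 260 | 260 |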

and the results of Fleiss’s Kappa was:

Fleiss Kappa for 480 raters = 0.0402 SE = 0.0008

95%CI = 0.0387 to 0.0417.

So, could you please explain what the results mean? Is there agreement between the 2 observers?

Mohammad,

There is no clear-cut agreement as to what is a good enough value for Fleiss’s kappa, but a value of 0.0402 is quite low, indicating that there is not much agreement.

You mention in your email that there are 14 categories and 480 raters, but I don’t understand the results for the 14 categories you listed, starting with the fact that there are 15 categories in your list.

Charles

Dear Sir,

I’m having some trouble calculating the kappa for my sample. I have 189 patients, 3 evaluators and 2 possible diagnoses, but I am not able to apply the formulas you suggest, maybe because of the different translations of the formulas themselves (I’m Italian). I am really having many problems and I need this kappa for my final project at university, so I was wondering if you could help me or if you have an Excel spreadsheet into which I can put my numbers, please. Thank you very much

Dear Alessandra,

If I am understanding the problem correctly, your input should be similar to Figure 1 in the referenced webpage with two data columns (one for each diagnosis) and 189 rows (one for each patient). The sum of the values in each row should be 3 (for the 3 evaluators).

You can get a copy of the worksheet I used to create Figure 1 by downloading the Worksheet Examples file at http://www.real-statistics.com/free-download/real-statistics-examples-workbook/. Once you have this you can modify it to fit your specific problem.
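For readers who prefer to check the layout outside Excel, here is a minimal Python sketch of the same calculation, assuming the input is a table of counts arranged as described above (one row per patient, one column per diagnosis, each row summing to the number of evaluators). The function name `fleiss_kappa` and the four-patient sample data are illustrative only and are not part of the Real Statistics Resource Pack.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    judges who assigned category j to subject i."""
    n = len(counts)        # number of subjects
    m = sum(counts[0])     # number of judges per subject
    if any(sum(row) != m for row in counts):
        raise ValueError("every row must sum to the same number of judges")
    k = len(counts[0])     # number of categories

    # Mean over subjects of the proportion of agreeing judge pairs.
    p_bar = sum(
        sum(x * (x - 1) for x in row) / (m * (m - 1)) for row in counts
    ) / n

    # q[j] is the overall proportion of ratings that fall in category j.
    q = [sum(row[j] for row in counts) / (n * m) for j in range(k)]
    p_e = sum(qj * qj for qj in q)   # agreement expected by chance

    # Note: this divides by zero if every rating is in a single category.
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 4 patients, 2 diagnoses, 3 evaluators per patient.
sample = [[3, 0], [2, 1], [1, 2], [0, 3]]
print(round(fleiss_kappa(sample), 4))
```

With the real data the table would simply have 189 rows instead of 4.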

Charles

Dear Sir,

Could the number of judges (m) be different for each subject?

For example:

test subject #1 is evaluated by 5 psychologists

test subject #2 is evaluated by 6 psychologists

…

…

and so on.

Dear Nara,

As far as I am aware Fleiss’s kappa requires the same number of judges for every subject. There are extensions where this requirement is relaxed. See for example http://conservancy.umn.edu/bitstream/99941/1/v03n4p537.pdf.

Charles

Is it possible to use Fleiss’ kappa for 45 respondents rating 5 different interventions, each on a scale of ten?

Sure. This should work. Just try it.

Charles

Hi, I am having problems with Excel whenever I type “,”

So instead I am typing “;”, but I don’t think it is the same, since at some point the numeric result becomes #VALUE!

Please help. Thanks

Hi Eloi,

I am not sure what is happening, but I know that for languages that use a comma as the decimal point, Excel uses a semi-colon instead of a comma to separate arguments in a function. This may be what is causing the problem you have identified. In this case, depending on what you are trying to do you may need to use a semi-colon instead of a comma. Alternatively you can change the number/currency defaults in Windows (via the Control Panel) so that a period and not a comma is used as the decimal point.

Charles

Hi

I should correct that: the no. of subjects is what I need to calculate. 4 is what we’re currently planning with, but this might not be enough to demonstrate a rater agreement (Fleiss’ kappa) of .7 or higher at a .95 confidence level.

Thanks

Philipp

Hi

Please excuse if I’m not using the right terms in English.

For a study that I’m conducting I need to calculate the minimum sample size for Fleiss’ Kappa for the following parameters:

no. of raters: 50

no. of subjects: 4

Fleiss’ Kappa >= .7

alpha = .05

power = .80

Is there a formula for this, as there is for other statistical measures? Does the standard error decrease with the no. of raters and with the no. of subjects? I’ve been researching the web on this for a while but cannot find the information I need.

Thanks for your help,

Philipp

Philipp,

I’m afraid that I don’t have any information about sample size requirements for Fleiss’s kappa. I have come across websites re sample size requirements for Cohen’s kappa and ICC, including the following, but not for Fleiss’s kappa.

http://www.ncbi.nlm.nih.gov/pubmed/12111881

http://ptjournal.apta.org/content/85/3/257.full

Charles

Sir

The formula for kappa for the jth category needs a small change. I think the summation should run from i = 1 to i = n.

Sir

Colin,

You are correct. I have just made the change that you have suggested. Thanks as always for identifying this change.

Charles

hi there,

I’m having a little problem and hope you can help me out. I want to calculate the inter-rater reliability, so I have theree experts with 10 question that examined the test, so I need 3 rows and 10 columns, right? But the question is, can I use Fleiss’s kappa? Thank you.

Alvin,

I am not sure what you mean by “theree experts with 10 question that examined the test”, and so I can’t tell for sure whether this is the right tool. In any case the tool would be used differently than the approach that you described.

The implementation of Fleiss’s kappa that I have included in the Real Statistics Resource Pack has one row for each subject being evaluated (perhaps these are the questions in your case) and one column for each rating (e.g. if the questions are answered using a Likert scale of 1 to 5, then there would be 5 columns). Since there are 3 experts, the sum of the values in each row would be 3.
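As an illustration of this layout, the following Python sketch converts raw ratings (one list of expert ratings per subject) into the count table described above. The helper name `ratings_to_counts` and the sample ratings are hypothetical; the sketch only assumes that every subject is rated by the same number of experts.

```python
def ratings_to_counts(ratings, categories):
    """Turn raw ratings into a subject-by-category count table.

    ratings[i] lists the category each expert gave to subject i;
    categories lists the possible ratings in column order.
    """
    column = {c: j for j, c in enumerate(categories)}
    counts = []
    for subject_ratings in ratings:
        row = [0] * len(categories)
        for rating in subject_ratings:
            row[column[rating]] += 1   # one more expert chose this rating
        counts.append(row)
    return counts


# Hypothetical data: 3 experts rate 3 questions on a Likert scale of 1 to 5.
raw = [[1, 1, 2], [5, 5, 5], [3, 4, 3]]
table = ratings_to_counts(raw, categories=[1, 2, 3, 4, 5])
for row in table:
    print(row, "row sum =", sum(row))
```

Each printed row sums to 3, the number of experts, which is a quick check that the table is in the expected form.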

You could also use the intraclass coefficient to evaluate ratings. See the webpage http://www.real-statistics.com/reliability/intraclass-correlation/. It all depends on what you are trying to accomplish.

Charles

Hi

I have a questionnaire which different respondents fill in to provide a person with feedback. Let’s say I have a questionnaire with 20 questions and 7 people answer the questionnaire using a scale from 1 to 6. Can I use Fleiss’ Kappa to show the consistency in their rating among the respondents? I thought in my example the rows would be the questions and the columns would be my scale (1 to 6).

Does Fleiss’ Kappa only show agreement (i.e. same rating or different rating) or is it also able to show the quality of agreement (i.e. will agreement be higher if the ratings are different but closer together than if they are from opposite ends of the scale)?

thanks!

Sylvia

Sylvia,

If I understand correctly you will have one column for each question. I.e. 7 rows and 20 columns. Fleiss’s kappa does not take into account the level of agreement/disagreement between the raters. If this is important you might want to use the ICC.

Charles

Dear Dr Leinder

Thank you so much for the article.

My work is on pain reactions in preterm babies. I have 4 raters who have rated the same 20 babies across 20 variables with a yes or no (whether they see the particular pain reaction or not).

May I use Fleiss kappa for my scenario as you have explained? If I can, then may I interpret it the same way and give an overall kappa and kappa for each variable as you have done? Your answer would be greatly appreciated.

Best,

Kiki

Kiki,

Based on my understanding of your question, the answer is yes.

Charles

Kiki,

Did you figure out how to adapt the provided Fleiss Kappa spreadsheet to your needs? If I understand the structure of your study correctly, I have a similar structure in a job analysis study I am conducting. I have three raters. Each rater will rate 40 interview transcripts for the presence or absence of 50 employee characteristics. The characteristics are not mutually exclusive. All 50 characteristics will receive a rating of 1 (characteristic is present) or 0 (characteristic is not present) from all three raters. I think this means I have to run 50 Fleiss Kappa calculations to get rater agreement on each characteristic and then average the Fleiss Kappas to get an overall rater agreement. I’m not sure how to do this most efficiently in Excel.
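Outside Excel, the per-characteristic approach described in this comment could be sketched in Python as follows. This is only an illustration under stated assumptions: the nested-list layout `ratings[rater][transcript][characteristic]`, both function names, and the tiny data set are hypothetical, and whether averaging the resulting kappas is an appropriate overall summary is a separate question that the sketch does not settle.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = judges assigning category j to subject i."""
    n, m = len(counts), sum(counts[0])
    p_bar = sum(sum(x * (x - 1) for x in row) for row in counts) / (n * m * (m - 1))
    p_e = sum((sum(col) / (n * m)) ** 2 for col in zip(*counts))
    # Fails with ZeroDivisionError if a characteristic is always (or never) rated present.
    return (p_bar - p_e) / (1 - p_e)


def kappa_per_characteristic(ratings):
    """ratings[r][t][c] is 1 if rater r marked characteristic c as present
    in transcript t, and 0 otherwise. Returns one kappa per characteristic."""
    n_raters = len(ratings)
    n_transcripts = len(ratings[0])
    n_characteristics = len(ratings[0][0])
    kappas = []
    for c in range(n_characteristics):
        counts = []
        for t in range(n_transcripts):
            present = sum(ratings[r][t][c] for r in range(n_raters))
            counts.append([present, n_raters - present])  # [present, absent]
        kappas.append(fleiss_kappa(counts))
    return kappas


# Hypothetical data: 3 raters, 4 transcripts, 2 characteristics.
ratings = [
    [[1, 1], [0, 1], [1, 1], [0, 0]],  # rater 1
    [[1, 1], [0, 1], [1, 0], [0, 0]],  # rater 2
    [[1, 1], [0, 0], [1, 0], [0, 0]],  # rater 3
]
print([round(k, 4) for k in kappa_per_characteristic(ratings)])
```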

Hi

Thank you for a great example of how to use Fleiss Kappa in Excel!

I would like to estimate standard error of Fleiss Kappa and then the 95 % CI. How can I do that in Excel?

Hi Mette,

I will update the webpage explaining how to estimate the standard error and confidence interval. You should see this within the next day or two.

Charles

Mette,

If you look at the referenced page you will see how to calculate the standard error and 95% CI. The example on that page has also now been added to the Examples Workbook (which you can download for free).
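For reference, the confidence interval formula on the page, κ ± NORMSINV(1 – α/2) * s.e., can be reproduced in a few lines of Python once kappa and its standard error are known. The function name `kappa_confidence_interval` and the input values below are illustrative assumptions; `NormalDist().inv_cdf` from the standard library plays the role of NORMSINV.

```python
from statistics import NormalDist


def kappa_confidence_interval(kappa, se, alpha=0.05):
    """Normal-approximation 1 - alpha confidence interval for kappa."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # same role as NORMSINV(1 - alpha/2)
    return kappa - z * se, kappa + z * se


# Hypothetical values for kappa and its standard error.
lower, upper = kappa_confidence_interval(kappa=0.5, se=0.1)
print(round(lower, 3), round(upper, 3))
```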

Charles

Colin:

There are no “criteria” for how to interpret Kappa – you can compare inter-annotator agreement, but to judge its absolute value depends on the task’s inherent difficulty. There are some “guidelines” in the literature, but they are bogus.

It is absolutely correct.

May you please give me an specific reference for such an statement. I do agree with what you have mentioned, however, I do not have access to the reference of your statement. I did search for “Colin” but I did not find anything useful!

Thanks.

…[a] specific…*

…such [a] statement…*

Sir

What are the criteria for Fleiss’ kappa? Are they the same as for Cohen’s kappa: less than 0% no agreement, 0-20% poor, 20-40% fair, 40-60% moderate, 60-80% good, 80% or higher very good?