Fleiss’ Kappa

Cohen’s kappa is a measure of the agreement between two raters, where agreement due to chance is factored out. We now extend Cohen’s kappa to the case where the number of raters can be more than two. This extension is called Fleiss’ kappa. As with Cohen’s kappa, no weighting is used and the categories are considered to be unordered.

Let n = the number of subjects, k = the number of evaluation categories and m = the number of judges for each subject. E.g. for Example 1 of Cohen’s Kappa, n = 50, k = 3 and m = 2. While for Cohen’s kappa both judges evaluate every subject, in the case of Fleiss’ kappa, there may be many more than m judges and not every judge needs to evaluate each subject; what is important is that each subject is evaluated m times.

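For every subject i = 1, 2, …, n and evaluation category j = 1, 2, …, k, let $x_{ij}$ = the number of judges that assign category j to subject i. Thus

$$0 \le x_{ij} \le m \qquad \text{and} \qquad \sum_{j=1}^{k} x_{ij} = m$$

The proportion of pairs of judges that agree in their evaluation on subject i is given by

$$p_i = \frac{1}{m(m-1)} \sum_{j=1}^{k} x_{ij}(x_{ij}-1) = \frac{\sum_{j=1}^{k} x_{ij}^2 - m}{m(m-1)}$$

The mean of the $p_i$ is therefore

$$\bar{p} = \frac{1}{n} \sum_{i=1}^{n} p_i$$

We use the following measure for the error term

$$\bar{p}_e = \sum_{j=1}^{k} q_j^2 \qquad \text{where} \qquad q_j = \frac{1}{nm} \sum_{i=1}^{n} x_{ij}$$

Definition 1: Fleiss’ Kappa is defined to be

$$\kappa = \frac{\bar{p} - \bar{p}_e}{1 - \bar{p}_e}$$

We can also define kappa for the jth category by

$$\kappa_j = 1 - \frac{\sum_{i=1}^{n} x_{ij}(m - x_{ij})}{nm(m-1)\,q_j(1-q_j)}$$

The standard error for κj is given by the formula

$$s.e.(\kappa_j) = \sqrt{\frac{2}{nm(m-1)}}$$

The standard error for κ is given by the formula

$$s.e.(\kappa) = s.e.(\kappa_j) \cdot \frac{\sqrt{\left(\sum_{j=1}^{k} b_j\right)^2 - \sum_{j=1}^{k} b_j(1-2q_j)}}{\sum_{j=1}^{k} b_j} \qquad \text{where} \qquad b_j = q_j(1-q_j)$$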

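There is an alternative calculation of the standard error provided in Fleiss’ original paper, namely the square root of the following:

$$\frac{2}{nm(m-1)} \cdot \frac{\sum_{j=1}^{k} q_j^2 - (2m-3)\left(\sum_{j=1}^{k} q_j^2\right)^2 + 2(m-2)\sum_{j=1}^{k} q_j^3}{\left(1 - \sum_{j=1}^{k} q_j^2\right)^2}$$

Here $q_j$ is the proportion of all nm ratings that fall in category j.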

The test statistics zj = κj/s.e.(κj) and z = κ/s.e. are generally approximated by a standard normal distribution, which allows us to calculate a p-value and confidence interval. E.g. the 1 – α confidence interval for kappa is therefore approximated as

κ ± NORMSINV(1 – α/2) * s.e.
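
As a rough cross-check of the definitions in this section, the calculation can be sketched in Python. This is not the Real Statistics implementation: the function name `fleiss_kappa_stats`, the dictionary keys and the toy data are assumptions made for illustration, and `NormalDist().inv_cdf` from the standard library stands in for NORMSINV. The sketch also assumes that at least two categories are actually used, since otherwise the chance term equals 1 and kappa is undefined.

```python
import math
from statistics import NormalDist


def fleiss_kappa_stats(counts, alpha=0.05):
    """counts[i][j] = number of judges who assigned category j to subject i."""
    n = len(counts)       # number of subjects
    k = len(counts[0])    # number of categories
    m = sum(counts[0])    # ratings per subject
    if any(sum(row) != m for row in counts):
        raise ValueError("every row must sum to the same number of ratings m")

    # q_j: proportion of all n*m ratings that fall in category j
    q = [sum(row[j] for row in counts) / (n * m) for j in range(k)]

    # p_i: proportion of agreeing pairs of judges for subject i
    p = [(sum(x * x for x in row) - m) / (m * (m - 1)) for row in counts]
    p_bar = sum(p) / n
    p_e = sum(qj * qj for qj in q)   # agreement expected by chance
    kappa = (p_bar - p_e) / (1 - p_e)

    # kappa for each category (NaN if a category is never used or always used)
    kappa_j = []
    for j in range(k):
        denom = n * m * (m - 1) * q[j] * (1 - q[j])
        disagree = sum(row[j] * (m - row[j]) for row in counts)
        kappa_j.append(1 - disagree / denom if denom > 0 else math.nan)

    # standard errors, with b_j = q_j(1 - q_j)
    se_j = math.sqrt(2 / (n * m * (m - 1)))
    b = [qj * (1 - qj) for qj in q]
    sum_b = sum(b)
    inner = sum_b ** 2 - sum(bj * (1 - 2 * qj) for bj, qj in zip(b, q))
    se = se_j * math.sqrt(inner) / sum_b

    # normal approximation: z statistic, two-tailed p-value, 1 - alpha interval
    z = kappa / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "kappa": kappa, "kappa_j": kappa_j, "se": se, "se_j": se_j,
        "z": z, "p_value": p_value,
        "ci": (kappa - z_crit * se, kappa + z_crit * se),
    }


# Hypothetical toy data: 4 subjects, 3 judges, 2 categories (not the data of Example 1)
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
result = fleiss_kappa_stats(toy)
print(round(result["kappa"], 4), round(result["se"], 4))
```

For this toy table κ works out to 1/3 by hand, and because there are only two categories each κj equals κ, which gives a quick sanity check on the per-category formula.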

Example 1: Six psychologists (judges) evaluate 12 patients as to whether they are psychotic, borderline, bipolar or none of these. The ratings are summarized in range A3:E15 of Figure 1. Determine the overall agreement between the psychologists, subtracting out agreement due to chance, using Fleiss’ kappa. Also find Fleiss’ kappa for each disorder.

Figure 1 – Calculation of Fleiss’ Kappa

For example, we see that 4 of the psychologists rated subject 1 to have psychosis and 2 rated subject 1 to have borderline syndrome, while no psychologist rated subject 1 as bipolar or none.

We use the formulas described above to calculate Fleiss’ kappa in the worksheet shown in Figure 1. The formulas in the ranges H4:H15 and B17:B22 are displayed in text format in column J, except that the formulas in cells H9 and B19 are not displayed in the figure since they are rather long. These formulas are:

Cell | Entity | Formula
H9   | s.e.   | =B20*SQRT(SUM(B18:E18)^2-SUMPRODUCT(B18:E18,1-2*B17:E17))/SUM(B18:E18)
B19  | κ1     | =1-SUMPRODUCT(B4:B15,$H$4-B4:B15)/($H$4*$H$5*($H$4-1)*B17*(1-B17))

Figure 2 – Long formulas in worksheet of Figure 1

Note too that row 18 (labelled b) contains the formulas for qj(1–qj).

The p-values (and confidence intervals) show us that all of the kappa values are significantly different from zero.

Real Statistics Function: The Real Statistics Resource Pack contains the following supplemental function:

KAPPA(R1, j, lab, alpha, tails, orig): if lab = FALSE (default) returns a 6 × 1 range consisting of κ if j = 0 (default) or κj if j > 0 for the data in R1 (where R1 is formatted as in range B4:E15 of Figure 1), plus the standard error, z-stat, z-crit, p-value and lower and upper bound of the 1 – alpha confidence interval, where alpha = α (default .05) and tails = 1 or 2 (default). If lab = TRUE then an extra column of labels is included in the output. If orig = TRUE then the original calculation for the standard error is used; default is FALSE.

For Example 1, KAPPA(B4:E15) = .2968 and KAPPA(B4:E15,2) = .28. The complete output for KAPPA(B4:E15,,TRUE) is shown in Figure 3.

Figure 3 – Output from KAPPA function

Real Statistics Data Analysis Tool: The Reliability data analysis tool supplied in the Real Statistics Resource Pack can also be used to calculate Fleiss’ kappa.

To calculate Fleiss’ kappa for Example 1, press Ctrl-m and choose the Reliability option from the menu that appears. Fill in the dialog box that appears (see Figure 7 of Cronbach’s Alpha) by inserting B4:E15 in the Input Range, choosing the Fleiss’ kappa option and clicking on the OK button.

The output is shown in Figure 4.

Figure 4 – Output from Fleiss’ Kappa analysis tool

Note that if you change the values for alpha (cell C26) and/or tails (cell C27) the output in Figure 4 will change automatically.

230 Responses to Fleiss’ Kappa

  1. Colin says:

    What are the criteria for Fleiss’ Kappa? Are they the same as for Cohen’s Kappa: less than 0% no agreement, 0-20% poor, 20-40% fair, 40-60% moderate, 60-80% good, 80% or higher very good?

  2. Colin:

    There are no “criteria” for how to interpret Kappa – you can compare inter-annotator agreement, but to judge its absolute value depends on the task’s inherent difficulty. There are some “guidelines” in the literature, but they are bogus.

    • Pedram says:

      It is absolutely correct.
      Could you please give me a specific reference for such a statement? I do agree with what you have mentioned; however, I do not have access to the reference for your statement. I did search for “Colin” but I did not find anything useful!

  3. Mette Johansen says:

    Thank you for a great example of how to use Fleiss Kappa in Excel!
    I would like to estimate standard error of Fleiss Kappa and then the 95 % CI. How can I do that in Excel?

    • Charles says:

      Hi Mette,
      I will update the webpage explaining how to estimate the standard error and confidence interval. You should see this within the next day or two.

    • Charles says:

      If you look at the referenced page you will see how to calculate the standard error and 95% CI. The example on that page has also now been added to the Examples Workbook (which you can download for free).

  4. Kiki says:

    Dear Dr Leinder
    Thank you so much for the article.
    My work is on pain reactions in preterm babies. I have 4 raters who have rated the same 20 babies across 20 variables with a yes or no (whether they see the particular pain reaction or not).
    May I use Fleiss kappa for my scenario as you have explained? If I can, then may I interpret it the same way and give an overall kappa and kappa for each variable as you have done? Your answer would be greatly appreciated.

    • Charles says:

      Based on my understanding of your question, the answer is yes.

    • Arnie says:


      Did you figure out how to adapt the provided Fleiss Kappa spreadsheet to your needs? If I understand the structure of your study correctly, I have a similar structure in a job analysis study I am conducting. I have three raters. Each rater will rate 40 interview transcripts for the presence or absence of 50 employee characteristics. The characteristics are not mutually exclusive. All 50 characteristics will receive a rating of 1 (characteristic is present) or 0 (characteristic is not present) from all three raters. I think this means I have to run 50 Fleiss Kappa calculations to get rater agreement on each characteristic and then average the Fleiss Kappas to get an overall rater agreement. I’m not sure how to do this most efficiently in Excel.

  5. Sylvia says:

    I have a questionnaire which different respondents fill in to provide a person with feedback. Let’s say I have a questionnaire with 20 questions and 7 people answer the questionnaire using a scale from 1 to 6. Can I use Fleiss’ Kappa to show the consistency in their rating among the respondents? I thought in my example the rows would be the questions and the columns would be my scale (1 to 6).
    Does Fleiss’ Kappa only show agreement (i.e. same rating or different rating) or is it also able to show the quality of agreement (i.e. will agreement be higher if the ratings are different but closer together than if they are from opposite ends of the scale)?

    • Charles says:

      If I understand correctly you will have one column for each question. I.e. 7 rows and 20 columns. Fleiss’s kappa does not take into account the level of agreement/disagreement between the raters. If this is important you might want to use the ICC.
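
The point that Fleiss’ kappa ignores how far apart two ratings are can be checked numerically. The following is a minimal sketch with made-up counts (rows are rated items, columns are the scale values 1 to 6, and each row sums to 7 respondents); the function name `fleiss_kappa` and the data are assumptions for illustration only. Moving the disagreeing ratings from the adjacent value 2 to the opposite end of the scale, 6, just permutes the columns, so kappa does not change.

```python
import math


def fleiss_kappa(counts):
    """Overall Fleiss' kappa for a table of rating counts (rows sum to m)."""
    n, m = len(counts), sum(counts[0])
    totals = [sum(col) for col in zip(*counts)]
    p_e = sum((t / (n * m)) ** 2 for t in totals)
    p_bar = sum((sum(x * x for x in row) - m) / (m * (m - 1)) for row in counts) / n
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical counts: 4 items, 7 respondents, scale 1 to 6.
# In `near`, the respondents split between the adjacent ratings 1 and 2.
near = [[4, 3, 0, 0, 0, 0],
        [7, 0, 0, 0, 0, 0],
        [2, 5, 0, 0, 0, 0],
        [0, 7, 0, 0, 0, 0]]
# In `far`, the same splits are between the extreme ratings 1 and 6.
far = [[r[0], r[5], r[2], r[3], r[4], r[1]] for r in near]

print(math.isclose(fleiss_kappa(near), fleiss_kappa(far)))  # True
```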

  6. alvin says:

    hi there,
    Im havin some little problem hope u can help me out, i wan to calculate the inter rater reliability so i have theree experts with 10 question that examined the test so i need 3 rows and 10 colum rite? but the question is can i use the fleiss’s kappa? tq

    • Charles says:


      I am not sure what you mean by “theree experts with 10 question that examined the test”, and so I can’t tell for sure whether this is the right tool. In any case the tool would be used differently than the approach that you described.

      The implementation of Fleiss’s kappa that I have included in the Real Statistics Resource Pack has one row for each subject being evaluated (perhaps these are the questions in your case) and one column for each rating (e.g. if the questions are answered using a Likert scale of 1 to 5 then there would be 5 columns). Since there are 3 experts the sum of the values in each row would be 3.

      You could also use the intraclass coefficient to evaluate ratings. See the webpage http://www.real-statistics.com/reliability/intraclass-correlation/. It all depends on what you are trying to accomplish.


  7. Colin says:


    The formula of kappa for the jth category needs a small change. I think the summation should run from i = 1 to i = n.


    • Charles says:

      You are correct. I have just made the change that you have suggested. Thanks as always for identifying this change.

  8. Philipp says:

    Please excuse if I’m not using the right terms in English.
    For a study that I’m conducting I need to calculate the minimum sample size for Fleiss’ Kappa for the following parameters:
    no. of raters: 50
    no. of subjects: 4
    Fleiss’ Kappa >= .7
    alpha = .05
    power = .80

    Is there a formula for this like for other statistical measures? Does the standard error decrease with the no. of raters and with the no. of subjects? I’ve been researching the web on this for a while but cannot find the information I need.

    Thanks for your help,

  9. Philipp says:

    I should correct that the no. of subjects is what I need to calculate. 4 is what we’re currently planning with but this might not be enough to prove a rater agreement (Fleiss’ Kappa) of .7 or higher on a .95 confidence level.


  10. Eloi says:

    Hi, I am having problems with the excel whenever I write the “,”
    So instead I am writing “;” but I don’t think it is the same since I get to a point the numeric result becomes #VALUE!

    Please help. Thanks

    • Charles says:

      Hi Eloi,
      I am not sure what is happening, but I know that for languages that use a comma as the decimal point, Excel uses a semi-colon instead of a comma to separate arguments in a function. This may be what is causing the problem you have identified. In this case, depending on what you are trying to do you may need to use a semi-colon instead of a comma. Alternatively you can change the number/currency defaults in Windows (via the Control Panel) so that a period and not a comma is used as the decimal point.

  11. lena says:

    is it possible to use Fleiss Kappa for 45 respondents rating 5 different interventions each on a total of ten??

  12. Nara says:

    Dear Sir,
    Could the number of judges (m) be different for each subject?
    For example:
    test subject #1 is evaluated by 5 psychologists
    test subject #2 is evaluated by 6 psychologists

    and so on.

  13. Alessandra Maggioni says:

    Dear Sir,
    I’m having some troubles in calculating the kappa for my sample. I have 189 patients, 3 evaluators and 2 possible diagnoses but I am not able to apply the formulas you suggest, maybe because of the different translations of the formulas themselves (I’m Italian). I am really having many problems and I need this kappa for my final project at university so I was wondering if you could help me or if you have an excel spreadsheet which I can put my numbers in, please. Thank you very much

    • Charles says:

      Dear Alessandra,

      If I am understanding the problem correctly, your input should be similar to Figure 1 in the referenced webpage with two data columns (one for each diagnosis) and 189 rows (one for each patient). The sum of the values in each row should be 3 (for the 3 evaluators).

      You can get a copy of the worksheet I used to create Figure 1 by downloading the Worksheet Examples file at http://www.real-statistics.com/free-download/real-statistics-examples-workbook/. Once you have this you can modify it to fit your specific problem.


  14. I did the observation for a process in an operating room. I have 2 sets of observers: assigned observers and participant observers. The results for 14 categories were as follows:
    1 2
    1 230 250
    2 259 260
    3 260 260
    4 249 251
    5 260 260
    6 238 256
    7 250 252
    8 217 229
    9 212 218
    10 258 256
    11 189 197
    12 245 248
    13 245 248
    14 254 258
    15 260 260

    and the result of Fleiss’s Kappa was:
    Fleiss Kappa for 480 raters = 0.0402 SE = 0.0008
    95%CI = 0.0387 to 0.0417.

    So, could you please explain what these results mean? Is there agreement between the 2 observers?

    • Charles says:


      There is no clear cut agreement as to what is a good enough value for Fleiss’s kappa, but a value of 0.0402 is quite low, indicating that there is not much agreement.

      You mention in your email that there are 14 categories and 480 raters, but I don’t understand the results for the 14 categories you listed, starting with the fact that there are 15 categories in your list.


  15. dove says:


    So grateful for your web-site and resources. If I might test my idea:

    I am attempting to test the reliability of an assessment tool.
    The tool has nine (9) questions with responses range from 0-3 and
    Three (3) fields where the evaluator can select any combination of symptoms or select none of the symptoms from items on the symptom check list (8-10 items available to choose).

    As I understand it, I should separate out the responses into two data fields and measure the reliability of the 9 questions using the ICC (IntraClass Correlation) formula and test the second half (reported symptoms) using the Fleiss Kappa analysis formula.

    Is this correct?
    I greatly appreciate your response.

    • Charles says:

      It really depends on what you are trying to measure. E.g. you could use Cronbach’s alpha for the nine questions to determine the internal consistency of the questions. Are you evaluating the reliability of the questionnaire or the degree of agreement between the evaluators (I assume these are the nurses)?

      For the symptoms you could treat each symptom as a True/False question (True = patient assessed to have that symptom, False = assessed not to have that symptom). The analysis can then proceed as for the first group of questions (with my same question to you, namely what are you really trying to assess?).


  16. dove says:

    Thanks so much for the timely response!

    I am attempting to assess the reliability of the assessment tool, not the observers (nurses), though I understand that I may need to assess the observers separately.

    What I understand you to say is that I can rate the symptoms as y/n and the rated questions (1 through 9) as Likert style. In doing so, Cronbach’s Alpha is the best tool to test all questions together, given that the goal is to test the tool.

    Thanks for your reply,
    greatly appreciated.


  17. yvonne says:

    I am comparing the reliability of 3 different imaging tests for classification of a condition that has 10 different classes. There are 4 raters and 123 cases. Each reader reviews each case 3 separate times, each time using one of the three different models. Each model will have a Fleiss’ kappa. The reliability of the models can then be assessed comparing the three Fleiss’ kappas?
    There is no good gold standard for the condition.


    • Charles says:

      Hi Yvonne,
      If I understand your situation correctly, Fleiss’s Kappa will measure the degree to which the 4 raters agree on their classification of each imaging test. If your goal is to select which of the three imaging tests provides the most agreement between the raters then it may be worthwhile comparing the kappa measurements (you still wouldn’t know whether any differences are statistically significant). I wouldn’t necessarily call this a measure of the reliability of the models.

  18. Katharina says:

    Dear Dr. Leinder,

    I want to test the reliability of a newly constructed diagnostic interview. Therefore, raters watched videos of diagnosticians interviewing patients and answered for most of the items on a one-to-five likert scale and for some on a yes/no-basis (as the interviewers did). The agreement between the raters, measured by Fleiss Kappa, is thought to be my estimation of the interrater-reliability. Does that sound right so far?

    The problem is that some of the items are multiple choice items. Is it still possible to use Fleiss kappa? It seems the statistics programs can’t handle that… If not, could you give me a hint what kind of calculation I could use instead? Intraclass correlations (some items are based on categorical scales…)?

    And one last (general) question: Is there a relation between interrater reliability and criterion validity, other than reliability being a condition for validity? My prof is always talking about a paper called “study into criterium validity…” (based on the data I described). I didn’t challenge him so far because I’m not sure I’m right…

    Thanks so much in advance.

    Best regards

    • Charles says:

      Dear Kathi,

      You have addressed your question to Dr. Leinder. Since I am not Dr. Leinder perhaps this comment was not intended for me.

      Please give me an idea of some of the questions, especially the multiple choice questions.

      I don’t know which type of relationship your professor was talking about between interrater reliability and criterion validity.


  19. Stacy says:


    I am interested in determining the degree of agreement between 4 raters on an observation instrument that involves a rating system of 1 – 4 (levels of proficiency with a target attribute). I believe that Fleiss would be the appropriate measure in this case. Is this correct?


  20. Ali Baykuş says:

    In Figure 2, B17 and B18 rows should be interchanged.
    Have a nice day.

  21. Raphaela says:

    Professor, how do I calculate Fleiss’ Kappa with Real Statistics? Click on the data analysis tool and then?


    • Charles says:


      Currently you need to use the KAPPA function as described on the referenced webpage.

      Tomorrow I will issue a new release of the Real Statistics Resource Pack. This release has an improved version of the KAPPA function. In addition there will be a new Reliability data analysis tool which will calculate Fleiss’s Kappa.


  22. Raphaela says:

    ok. Charles, Thank you 🙂

  23. Raphaela says:

    Good morning,
    Can I download RealStat again to calculate Fleiss’ Kappa (calculate the agreement among multiple judges and multiple variables)? Please let me know. Thank you

    • Charles says:

      Yes. The new version of this function is now available. You can also use the new Reliability data analysis tool.

      • Raphaela says:

        Charles, I got a problem. I have a 0-1 matrix, 11 variables and 15 judges. I applied Fleiss Kappa (Reliability data analysis tool) and the output shows alpha 0,05 and the table below it has errors. Any idea? Thanks

        • Charles says:

          Without seeing the spreadsheet I can’t tell what went wrong. The most common problem is that the input data is not formatted correctly. If you send me the spreadsheet I will be able to help you better.

  24. Raphaela says:

    Sorry Charles, I didn’t notice that your email was available. Thanks

    • Charles says:


      I just checked the spreadsheets you sent me and see that the problem is that the sums of the scores in the rows are not all the same. If these sums are not the same you will see errors in all the cells.
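
A check of this kind is easy to run before computing kappa. The sketch below is not part of the Real Statistics Resource Pack; the helper name `rows_with_unequal_totals` and the sample table are hypothetical, and the helper assumes that the most common row total is the intended number of raters.

```python
from collections import Counter


def rows_with_unequal_totals(counts):
    """Return (row_number, total) for rows whose total differs from the most common total."""
    totals = [sum(row) for row in counts]
    expected = Counter(totals).most_common(1)[0][0]
    return [(i + 1, t) for i, t in enumerate(totals) if t != expected]


# Hypothetical table for 3 raters; the third row is missing one rating
table = [[3, 0], [2, 1], [1, 1], [0, 3]]
print(rows_with_unequal_totals(table))  # [(3, 2)]
```

An empty list means every row has the same total, which is what the input format for Fleiss’ kappa requires.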


  25. Philipp Lenarz says:


    1. I have a set of data with 38 raters rating the content of videos with regards to 8 items* that are all based on the same ordinal scale (very good, good, adequate, insufficient). My results are puzzling for me, because when I’m calculating Fleiss’ Kappa for each item individually (matrix: 1*38) the coefficients (mostly highly significant) are all smaller than when I’m calculating Kappa for all items together (matrix: 8*38). Shouldn’t the overall coefficient (K=0,31) be within the range of individual coefficients (K=-0,13 to 0,28)?

    2. Dr. Leidner wrote above that “There are no “criteria” for how to interpret Kappa – you can compare inter-annotator agreement, but to judge its absolute value depends on the task’s inherent difficulty. There are some “guidelines” in the literature, but they are bogus.” My problem is that I have to interpret the effect sizes and I have no idea if my effects are strong or weak. From the info I gave you, can you say if my coefficients are high enough to speak of a good rater agreement? And do you know of any helpful literature on this topic.

    Thanks for your help,

    *Their task is to assess the driving competency of driving license applicants that we filmed during a simulated test drive.

  26. Lam says:


    In my study, each subject is to be rated on a nominal scale from 1 to 5. If each subject was rated by 2 raters, but the 2 raters were drawn from a pool of 6 raters, I can apply Fleiss’s kappa in assessing the inter-rater reliability, right? Is there any requirement on the minimum times that each rater participated, i.e. should I exclude the rater that rated, say extremely, one subject only?

    Moreover, if each subject was rated by different number of raters, what method can I use instead? I find that you have mentioned an extension of Fleiss’s kappa as linked, http://conservancy.umn.edu/bitstream/99941/1/v03n4p537.pdf, but seems that it can be applied for dichotomous variable only and thus not suitable for me.

    Would Krippendorff’s alpha be appropriate? But I am not sure if I can treat those raters that do not rate the subject as missing data, since this type of “missing” is not random in nature.

    Thanks a lot in advance.


    • Charles says:

      If I understand the problem correctly, you could use the intraclass correlation (ICC), but this measurement requires that all the subjects have the same number of raters. If the number of raters is not too unbalanced and the smallest number of raters is not too small, perhaps you can randomly eliminate some ratings to create a balanced model (alternatively you might be able to use some multiple imputation approach).
      I am not familiar with Krippendorff’s alpha, but I do believe there are techniques for handling missing data.

      • Lam says:

        Thanks Charles,

        In my own experience, I used to use ICC for continuous variables and kappa for discrete/nominal variables. As the data this time is on a nominal scale, may I confirm whether ICC can still be applied?

        In addition, I could now probably fix the number of raters to be 3 each time (not the same raters). Would you suggest that ICC or Fleiss’s kappa is more appropriate?

        Sorry that I have got so many questions. Thanks again!


        • Charles says:

          To use ICC the data doesn’t need to be continuous, but it does need to be ordinal. E.g. you could use it with a Likert scale (say 1 to 7). If the data has no order then you shouldn’t use ICC. Fleiss’ kappa is only used with nominal data and order is not taken into account. If your data is unordered then you should use Fleiss’ kappa. If it is ordered you should use ICC.

  27. Robert says:

    Hi Charles,

    Need some advice. I don’t know how to organize data so I can compute a Fleiss’ kappa. Here’s what I’ve got:

    Five coders have coded 20 Tweets for content in 19 categories with a data range of 3 (1=present, 2=not present, 3=can’t tell). Right now the data exist in an Excel workbook with each coder’s data on separate sheets. I get this much: k=categories (there are three), and m=number of judges (there are five). What is n? The number of Tweets? The number of content categories in the Tweets? How do I convert the five spreadsheets into a grid that expresses agreements in each category?

    I can’t get my mind around it. Any tip you might have would be very much appreciated!


  28. Robert says:

    Hi Charles,

    I’m working on a content analysis project and would like to compute a Fleiss’ kappa but I’m not sure how to organize the data. There are five coders, and they’ve coded 20 Tweets for content in 19 categories. Coding values are nominal categories, one through three.

    Can you toss me a bone?



    • Charles says:

      Since the 5 raters are rating 20 Tweets, the Tweets are the rows and the sum of numbers in each row is 5.
      Since there are 3 nominal ratings 1 through 3, there are 3 columns. The cell in the ith row and jth column contains the number of the raters who have given the ith Tweet a rating of j.
      The problem is that you have 19 such ratings, one for each category (if I have understood the problem correctly), which means that you have to calculate Fleiss’ Kappa 19 times, once per category. There may be a multivariate Fleiss’ Kappa that does the job you want, but I am not familiar with such a measure.
      If instead you

      • Robert says:

        You do indeed understand the problem correctly. Currently I’m organizing the data manually — converting raw scores into categorical agreement figures (0-5-0, 1-4-0, and so on). It’ll take a while but I’ll give it a shot. (Isn’t there some way to get Excel to do this sort of computation so I can avoid human error?)

        • Charles says:

          There may be some easier way of doing this reorganization of the data in Excel, but I’d have to see how the data is formatted to really provide a good answer. But by the time I figured it out you will have probably reorganized the data manually. Sorry about that.
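
Outside Excel, the reorganization described above (one row per Tweet, one column per code, each cell holding the number of coders who gave that code) takes only a few lines. The following Python sketch uses a hypothetical helper `to_count_table` and made-up codes; it assumes the raw data for one content category is available as one list of coder codes per Tweet.

```python
def to_count_table(ratings, categories):
    """ratings[i][r] = code given by rater r to subject i; returns counts per category."""
    column = {c: j for j, c in enumerate(categories)}
    table = []
    for row in ratings:
        counts = [0] * len(categories)
        for code in row:
            counts[column[code]] += 1   # a KeyError flags a code outside `categories`
        table.append(counts)
    return table


# Hypothetical codes from 5 coders for 3 Tweets (1 = present, 2 = not present, 3 = can't tell)
raw = [[2, 2, 2, 2, 2],
       [1, 2, 2, 2, 2],
       [3, 3, 1, 2, 3]]
print(to_count_table(raw, categories=[1, 2, 3]))
# [[0, 5, 0], [1, 4, 0], [1, 1, 3]]
```

With 19 content categories, the same call would be repeated once per category, giving 19 count tables and hence 19 kappa values.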

          • Robert says:


            Thanks for your mindful replies. Let me run one more thing past your eyes and then I’ll stop.

            My five coders coded several categories for variables we didn’t expect to see in very many of the 20 tweets, so all of the coders made substantially the same decisions of “not present” (or a code of “2”). Nearly every tweet was coded the same way by every coder, so the frequency grid looks like this for 15 of the Tweets:


            There are five cases of a coder that coded “present” (or a “1”), so five of the data lines look like this:


            What puzzles me is that the reported kappa is -0.053, and I don’t know how to interpret this result. In another category there is a single disagreement of this type, and the result is a kappa of -0.010.

            I’ve computed kappa for the data sets using two templates and I get the same thing so I think I’m computing it correctly. What does a kappa of -0.010 mean?

            Thanks in advance for your help!


          • Charles says:

            When most of the responses are the same, kappa will sometimes give strange results. To me it is a deficiency in the kappa statistic, but if, for example, there is only a single disagreement, there really isn’t much reason to use kappa.
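
The negative values can be reproduced with a short Python sketch. The frequency grids are not shown in the comment, so the ones below are an assumption: 15 Tweets coded 0-5-0 and 5 Tweets coded 1-4-0 for the first case, and 19 Tweets coded 0-5-0 with a single 1-4-0 for the second. The function name `agreement_terms` is likewise hypothetical.

```python
def agreement_terms(counts):
    """Return (observed agreement, chance agreement, Fleiss' kappa) for a count table."""
    n, m = len(counts), sum(counts[0])
    totals = [sum(col) for col in zip(*counts)]
    p_e = sum((t / (n * m)) ** 2 for t in totals)
    p_bar = sum((sum(x * x for x in row) - m) / (m * (m - 1)) for row in counts) / n
    return p_bar, p_e, (p_bar - p_e) / (1 - p_e)


# Assumed grids (columns: present, not present, can't tell), 5 coders per Tweet
five_disagreements = [[0, 5, 0]] * 15 + [[1, 4, 0]] * 5
one_disagreement = [[0, 5, 0]] * 19 + [[1, 4, 0]]

for grid in (five_disagreements, one_disagreement):
    p_bar, p_e, kappa = agreement_terms(grid)
    print(f"observed={p_bar:.4f} chance={p_e:.4f} kappa={kappa:.3f}")
```

Under these assumptions the observed agreement is 0.9000 against a chance agreement of 0.9050 in the first case, and 0.9800 against 0.9802 in the second, giving kappa values of -0.053 and -0.010. Because almost every rating falls in one category, the agreement expected by chance is already very high, and a handful of lone disagreements is enough to push the observed agreement just below it.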

  29. Mattia says:

    Dear Charles,

    I am working on a study proposal for my dissertation which will need to get approval from an Ethics Committee, therefore I need to define the analysis of data in advance.

    I will have n criteria, and raters will rate on a 1-5 Likert scale the importance of each criterion in establishing a diagnosis.

    In similar studies I have seen that percentage agreement, coefficient of variation and Kendall’s W are commonly used.

    From what I have been able to understand, however, Kendall’s W tests the level of agreement between n raters across ALL items, which means it is a measure of the overall agreement.

    However, what I want is to identify the items where consensus has been established; in other words, I need a test that assesses agreement for each individual item (besides descriptive stats such as CV and percentage agreement).

    I have found that in a previous study kappa was used. However I am not quite sure this is appropriate, would you please shed a light on this?

    Thank you for your time!


    • Mattia says:

      Comment added:

      I have also found a study where ICC was used to assess agreement across all items and then kappa was used to assess agreement within each item.

      Is this appropriate?

      Why ICC for overall agreement and kappa for within item agreement?

      If you also had a reference for this that would be great.

      Many thanks.

      • Charles says:


        ICC is another way to measure agreement across all subjects. In Fleiss’s kappa the ratings can be thought of as categories (Biology, Math, Reading). Any ordering between the categories is not taken into account. Thus Fleiss’s kappa would generally not be used if the rankings were something like a Likert scale (1, 2, 3, 4, 5). ICC can be used in the case where the rankings are ordered or when the rankings are continuous numbers (34.5, 12.7, etc.).

        As I explained in my previous response, both of these measurements assess agreement across all the subjects.


    • Charles says:


      Cohen’s Kappa is used to assess the level of agreement between two raters of many subjects based on one rating criterion.
      Fleiss’s Kappa extends Cohen’s Kappa to the case where you have more than two raters (still only one criterion).

      Both of these tests measure agreement across all the subjects.

      If by item you mean subject it is pretty easy to assess agreement: for two raters, either they agree or they don’t, and even for more than two raters you don’t need any fancy measurements to assess agreement.


  30. David says:

    Dear Charles,
    I am trying to identify the appropriate measure to use for my inter-rater agreement test. After much research, I believe it may be the Fleiss’ kappa measure, but am not sure as I have not been able to locate literature which suggests it can be used in my specific case presented below.

    I have developed a multiple choice test with 58 items based on teaching scenarios and what teachers might do in a particular classroom situation. Then I had 5 expert teachers take the test so that I could attain an agreed upon answer key. Now I want to run a test which provides me with a value which justifies the answer key as having substantial agreement (0.61 – 0.80), perfect agreement (0.81 – 1.00), etc.

    I cannot seem to find literature specific to obtaining inter-rater agreement of an answer key to a multiple choice test. In short, my question is what index is appropriate to test the level of agreement for this situation and what research is there to back that decision?

    Any advice would be greatly appreciated.

    • David says:

      Hi Charles,
      I would also like to clarify that the 58 multiple choice questions each have three choices, which only 1 is correct. These three choices are nominal or categorical in nature.

    • Charles says:

      Generally in situations where it is clear what the correct answers to a multiple choice test are, you might run Cronbach’s alpha to check for the internal consistency of the test, but it seems that you are testing for agreement among experts as to what the correct answers are and the categories are nominal. This seems to be a good fit for Fleiss’ kappa.

      • David says:

        Thank you Charles.

        I have forged ahead on your recommendation. The kappa value that was returned using the program provided on the domain https://mlnl.net/jg/software/ira/ (Geertzen, 2012)
        was 0.77, which according to Fleiss (2003) represents excellent agreement beyond chance. Would you agree with this assessment of the returned k value? Have you seen the on-line program I refer to above?

        Thank you, David

        • Charles says:

          I am not familiar with the program you referenced.
          You could use the Fleiss’ Kappa option of the Real Statistics Reliability data analysis tool to calculate Fleiss’ kappa.

  31. Mattia says:

    Thank you very much for your reply, Charles.

    I am not quite sure what I need to use, though.

    I will give an example to make the understanding of the case easier.

    I have 10 criteria that are hypothesised to be important in establishing a diagnosis. 30 experts will rate the importance of each criterion on a Likert scale from 1 (not important) to 5 (very important).

    What I want to know, is what statistic (besides CV) I can use to evaluate the agreement between all the 30 experts on each single criterion.

    Many thanks!

    • Charles says:


      If you want to rate the agreement between all 30 experts across all 10 criteria, then you could use the intraclass correlation coefficient (ICC) as described on the webpage Intraclass Coefficient.

      For your situation, instead of Judges A, B, C and D in Example 1, you would have the 30 experts (these are the columns), and instead of 8 wines you would have the 10 criteria. The table would be filled with ratings 1 to 5.

      If instead you want to evaluate agreement one criterion at a time, then I don’t have any specific advice to offer. I have come across the following paper which may be helpful. I have only read the Abstract and I am not sure about the paper’s quality or applicability to your problem.



  32. Henk says:

    Dear Charles,

    I computed the standard error and confidence interval according to your instructions. May I ask where you found the formula for the confidence interval? Or did you come up with it on your own?

    Thanks a lot for providing this information!

  33. Mattia says:

    Ok Charles thank you very much.

  34. Catherine says:

    Thank you for this page. It has helped, but I am now wondering if Fleiss’ kappa is the correct statistical test for our purposes. I am running a test-retest reliability study on a biological response to a stimulus. There are 4 response pattern categories. Six subjects underwent 4 repeated tests. We want to know if the same subject will attain the same test response to the same stimulus under the same conditions.

    Currently, my subjects are the rows, and the response pattern categories are the columns, with each test week as the ‘rater’.

    In your opinion, should we use Fleiss’ kappa or ICC?

    Thanks for this page, I am a stats novice, so am very grateful for having a clear ‘recipe’ to follow.


    • Charles says:

      If the data is ordered (especially with values such as 41.7, etc.) ICC will take the order into account. If the data is categorical (no order) then Fleiss’ is the approach to use.

      • Catherine Crofts says:

        Data is categorical so we will stick with Fleiss’ kappa. Thank you so much for your assistance and your prompt reply.


  35. BC says:

    Hi Charles,

    I am getting unusual results following the guide when the raters are at almost perfect agreement.

    I have 9 cases, each with 4 categorical ratings by 8 raters.

    Cases\Ratings 1 2 3 4
    1 0 0 0 8
    2 0 0 0 8
    3 0 0 0 8
    4 0 1 0 7
    5 0 0 0 8
    6 0 0 0 8
    7 0 0 0 8
    8 0 0 0 8
    9 0 0 0 8

    The kappa is -0.014 and the p-value is 1.177. Will you be able to advise as I am uncertain what’s going on?



    • Charles says:

      Unfortunately, many of these measures give strange results in extreme cases. The best way to look at these situations is that when you have almost perfect agreement you don’t really need to use Fleiss’s Kappa. Not a very satisfying answer, but it is true.
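
      BC's kappa value can be reproduced with a short script. What follows is a minimal Python sketch, not the Real Statistics implementation; the function name fleiss_kappa is chosen here for illustration. It assumes the usual definitions p_i = (Σ_j x_ij² − m) / (m(m − 1)), q_j = the share of all ratings in category j, p_e = Σ_j q_j², and κ = (p̄ − p_e) / (1 − p_e).

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table with one row per subject, one column per category.

    counts[i][j] is the number of raters who assigned category j to subject i.
    Every row must sum to the same number of ratings m.
    """
    n = len(counts)
    m = sum(counts[0])
    if any(sum(row) != m for row in counts):
        raise ValueError("each subject must be rated the same number of times")
    k = len(counts[0])
    # Proportion of agreeing rater pairs for each subject
    p = [(sum(x * x for x in row) - m) / (m * (m - 1)) for row in counts]
    p_bar = sum(p) / n
    # Share of all ratings that fall in each category
    q = [sum(row[j] for row in counts) / (n * m) for j in range(k)]
    p_e = sum(qj * qj for qj in q)
    return (p_bar - p_e) / (1 - p_e)


# BC's table: 9 cases, 8 raters, 4 categories
bc_counts = [[0, 0, 0, 8]] * 3 + [[0, 1, 0, 7]] + [[0, 0, 0, 8]] * 5
kappa = fleiss_kappa(bc_counts)
print(round(kappa, 3))  # -0.014
```

      Because nearly every rating falls in category 4, the agreement expected by chance (about 0.973) is almost identical to the observed agreement (about 0.972), so kappa lands near zero even though the raters almost always agree.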

      • BC says:

        Thanks Charles for the clarification. It’s a good answer that clears up my doubts, as I was sifting through all my equations in Excel and found no errors.

        In a scenario like this, will analysing by ICC for inter-rater reliability (multiple rater) be better? What’s your recommendation?

        • Charles says:

          You would use the ICC with rankings that are quantitative. Fleiss’s kappa is used with categorical data. In situations like this you would simply note that there is obvious agreement and not use any of the usual measures.

  36. Trina says:

    Hi! Thanks for this page.
    I have two questions:
    1. I would like to compare Fleiss kappa values, for example, 0.43 (95% CI 0.40, 0.46) and 0.53 (0.43, 0.56). The confidence intervals overlap – is this enough to say that the two Fleiss kappa values are not statistically significantly different from each other? I have read that this may not be the case, link: http://www.cscu.cornell.edu/news/statnews/stnews73.pdf
    2. I often see % agreement reported in articles when using Cohen’s kappa (2 observers). Is this something that is done with Fleiss’ kappa? If so, how is it calculated (is it the number of cases where all observers rate the same divided by the total number of cases)? I have noticed that the % agreement I obtain when calculating Light’s Kappa instead of Fleiss’ kappa is always higher…

  37. Pitambar Behera says:

    Our P value in the row is as mentioned below:
    m 3
    n 1000
    pa 0.4815
    pe 0.253083444
    kappa 0.305812683
    s.e 0.010177511
    z 30.04788446
    p-value 0.0000

    alpha 0.05
    lower 0.285865127
    upper 0.325760238
    The P value in the column is as indicated below:
    q 0.074 0.206333333 0.334333333 0.290333333 0.094666667
    b 0.068524 0.163759889 0.222554556 0.206039889 0.085704889
    k 0.708131458 0.395456356 0.167245385 0.221833528 0.377709556
    s.e 0.018257419 0.018257419 0.018257419 0.018257419 0.018257419
    z 38.7859573 21.66003666 9.160406974 12.15032275 20.68800442
    p 0.00000 0.00000 0.00000 0.00000 0.00000

    Is there any issue if our P value is a null element?
    Please reply as soon as possible.

    • Charles says:

      The p-value can be zero (probably a very small number). I have not checked your calculations, but you can get a zero answer. You can check your answer by using Real Statistics’ KAPPA function.
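
      As a rough check of the figures quoted above, the following Python sketch recomputes kappa, z and the p-value from the reported pa, pe and s.e. It assumes a two-tailed test against the standard normal distribution; the helper name two_tailed_p is hypothetical.

```python
import math


def two_tailed_p(z):
    """Two-tailed p-value for a standard normal test statistic."""
    # erfc keeps precision in the far tail, unlike 2 * (1 - cdf(z))
    return math.erfc(abs(z) / math.sqrt(2))


pa, pe, se = 0.4815, 0.253083444, 0.010177511  # values reported in the comment
kappa = (pa - pe) / (1 - pe)
z = kappa / se
p = two_tailed_p(z)
print(kappa, z, p)
```

      With z around 30 the p-value is positive but far smaller than anything a worksheet formatted to four decimals can show, which is why it appears as 0.0000.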

      • Pitambar Behera says:

        Thanks a lot, sir. I have sent you an e-mail with the Excel sheets for the calculation. Can you please have a look and give the solution?

  38. Vincent says:


    I have a question about the case where the reference value is known.

    How should we deal with the data matrix if we need to evaluate the overall Fleiss Kappa when the reference value is known for each sample? Just assume we already know the actual kind of illness for these 12 patients, and now 6 judges give their judgment on each patient.

    This situation is essentially an Attribute MSA, and MINITAB gives answers for it, but I worked through the formulas several times and was not able to reproduce the exact answer in this case. However, my derivation for the case where the reference value is unknown matches MINITAB’s (as in your case here).

    Appreciate your reply.

    • Charles says:

      I believe that you are looking for an implementation of Attribute Gage R&R. I plan to look into doing this in one of the next releases of the software.

  39. azhar stapa says:

    Hello. I need help figuring out how to get the Fleiss kappa value if I have 5 expert evaluations and the scale used for each item is 1 to 10.

    item expert 1 expert 2 expert 3
    1 8 9 10
    2 8 8 8
    3 9 9 9
    4 10 10 10
    81 9 9 9


  40. Maram says:


    Thanks for the great tutorial and tool. I am working with thousands of samples, 2 categories (yes and no) and 3 annotators. When I computed Fleiss Kappa using the Excel tool, the results I get show that the per-category kappa values are equal for both categories (e.g., “yes” kappa = “no” kappa = 0.47). I repeated that computation over a completely different set of samples but with same setting (two categories, 3 annotators) and I also get the same result (i.e., per-category kappa values are equal across the categories). Is this normal? Could you please explain why this is happening?

  41. Maram says:


    Looking at the equation to compute kappa for the jth category, I can see that the denominator includes the number of subjects assigned to that category. I would like to know your opinion about the following conclusions I loosely deduce from this equation:
    Categories that are bigger in the data and get a higher number of subjects assigned to them, might probably have higher values of kappa than those much smaller categories (with very few subjects belonging to them) just due to the fact that they are bigger (i.e., kappa is biased towards size of categories in the data itself)

    • Maram says:

      Additionally, can you suggest any other inter-rater agreement measure that’s less sensitive to category size (assuming 3 annotators)?

      • Charles says:

        As I said in my previous comment, I am not certain whether Fleiss’ Kappa is indeed sensitive to category size. In any case, the bigger issue is whether you have ordered data or categorical data. In the first case, the intraclass correlation is a reasonable choice. In the second case, Fleiss’ kappa is a reasonable choice.

    • Charles says:

      I haven’t had time to really look at the formulas to see whether your observation is true or not, but I noted that for Example 1, the exact opposite is true. The two categories with the highest number of subjects have the lowest kappa values.
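
      For anyone who wants to explore how the per-category kappa behaves, here is a small Python sketch. It assumes the usual per-category formula κ_j = 1 − Σ_i x_ij(m − x_ij) / (n m (m − 1) q_j (1 − q_j)), where q_j is the share of all ratings in category j. The function name and the data are hypothetical, and the sketch does not guard against categories that are never used or always used (q_j equal to 0 or 1).

```python
def category_kappas(counts):
    """Per-category kappa for a subjects-by-categories table of rating counts."""
    n = len(counts)
    m = sum(counts[0])
    kappas = []
    for j in range(len(counts[0])):
        column = [row[j] for row in counts]
        q = sum(column) / (n * m)  # share of all ratings in category j
        disagreement = sum(x * (m - x) for x in column)
        kappas.append(1 - disagreement / (n * m * (m - 1) * q * (1 - q)))
    return kappas


# Hypothetical data: 4 subjects, 3 raters, 2 categories
two_cat = [[3, 0], [2, 1], [0, 3], [1, 2]]
k_yes, k_no = category_kappas(two_cat)
print(k_yes, k_no)
```

      With only two categories, x_i1 = m − x_i2 and q_1 = 1 − q_2, so both categories produce exactly the same terms and the two per-category kappas always coincide, as reported in comment 40.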

  42. sanitha says:

    Dear Dr. Leinder
    Thank you very much for the article. I am working on biotic indices. I want to see the agreement of the indices to evaluate the ecological status. The indices are classified into 5 categories (Bad, poor, Moderate, Good and high). I have about 610 samples.
    I would like to know which test to apply. Kappa or ICC.

    The categories used are i.e Bad=1, Poor= 2 Moderate= 3 etc.

    In general, how do I arrange the data? I have arranged the data as follows:
    Example 1
    Stn BI1 BI2 BI3 BI4 BI6 BI 7 BI8
    Stn1 1 2 3 2 3 4 3 3
    Stn 2 1 2 4 2 3 4 3 4

    Example 2
    Bad Poor Mod Good High
    Stn 1 1 2 4 1 0
    Stn 2 1 2 2 3 0

    Which is the correct method? If example 2 is correct, is there an easy method to do it? I did a few samples manually, but 610 samples is too many.


  43. fariha says:


    Can Cohen’s Kappa or Fleiss Kappa be used for any other statistical agreement data? Or is the ‘rater’ here only for human respondents? Can the ‘rater’ also mean a previous study?

    Thank you.

  44. Amanda says:

    Hi Charles.

    Thanks for the wonderful tools and explanations. I am trying to use the Real statistics free download to perform a Fleiss’ kappa analysis and running into some issues. When I copy the sample data from Figure 1 into Excel and use the Fleiss’s kappa option under the reliability procedures menu the output field comes back with every cell filled in #N/A. Can you suggest some solutions to this issue?


    • Charles says:

      Perhaps that is because you also used column A in the Input Range. The input should not include this column.

  45. elsayedamr says:

    Thank you very much … I need help
    what do you recommend to carry out a content validity index

  46. elsayedamr says:

    I got it .. thank you very much .. I will read it and study it … if there any question I will send it to you .. thank you very much again.

  47. David says:


    First of all, I want to thank you for the article.

    I have a question concerning Fleiss’ Kappa. You wrote: “[…] in the case of Fleiss’ kappa, there may be many more than m judges and not every judge needs to evaluate each subject; what is important is that each subject is evaluated m times.”

    So my problem is this: we have around n=350 subjects, k=7 categories and m=6 judges. The difficulties are that (a) not every judge evaluated each subject and (b) every judge can assign each subject to 2 categories (it’s a pre-evaluation; after the pre-evaluation we want to meet and discuss our results and finalize our categorization).

    Is there any way I can still work with Fleiss’ Kappa? Is m=12 in this case, because each judge can assign each (problematic) subject to 2 categories?

    Thanks in advance!

    • David says:

      One remark concerning (b): a subject can be evaluated into 2 categories, but doesn’t have to.

    • Charles says:

      I don’t completely understand the scenario. You say that there are 7 categories but you also say there are 2 categories. Which is it? Perhaps it would be helpful to have an example with some data.

  48. Paula says:

    Hi Charles
    Thank you for a very useful website.
    I followed the instructions to calculate Fleiss Kappa, including the s.e. and CIs, but I ran into trouble with the p-value. My Excel does not seem to like the command with NORMDIST, and I tried following the steps to install your software but no luck so far. Therefore, is there a formula to calculate the p-value? Or how can I ‘unpack’ the command you suggested above? Is there a way to ‘tell’ Excel what to do in a different way?
    Many thanks

    • Charles says:

      The website uses the function NORMSDIST not NORMDIST. NORMSDIST(x) is equivalent to NORM.S.DIST(x,TRUE) on newer versions of Excel (although the older version still works).
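
      For readers working outside Excel, the same quantities can be obtained from Python's standard library. This is a minimal sketch that assumes a two-tailed p-value and the confidence interval κ ± NORMSINV(1 – α/2) * s.e. given on the page. The wrapper names normsdist, normsinv and kappa_summary are chosen here for illustration, and the kappa and s.e. values are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def normsdist(x):
    """Counterpart of Excel's NORMSDIST(x), i.e. NORM.S.DIST(x, TRUE)."""
    return std_normal.cdf(x)


def normsinv(p):
    """Counterpart of Excel's NORMSINV(p)."""
    return std_normal.inv_cdf(p)


def kappa_summary(kappa, se, alpha=0.05):
    """Return z, the two-tailed p-value and the 1 - alpha confidence interval."""
    z = kappa / se
    p_value = 2 * (1 - normsdist(abs(z)))
    half_width = normsinv(1 - alpha / 2) * se
    return z, p_value, (kappa - half_width, kappa + half_width)


# Hypothetical values, for illustration only
z, p_value, (lower, upper) = kappa_summary(0.40, 0.10)
print(z, p_value, lower, upper)
```

      For very large z the expression 2 * (1 - normsdist(z)) evaluates to exactly 0 in floating point, which matches the zero p-values several commenters have reported.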

      • Paula says:

        Thank you for your reply, Charles.
        And thank you for picking up my mistake – all sorted now! My p-value came up with 0 – by reading the previous comments on here, I’m guessing that this is acceptable. With regards to reporting the inter-rater reliability, would you say that it is better to report Kappa and CIs, or should I report the p-value as well? Sorry, I’m quite new to this.
        Many thanks again.

  49. Malin says:

    Dear Mr Zaiontz,
    Thank you for excellent explanations of the Kappa statistics.
    I have a few questions:
    Regarding the z-value, it seems to always compare the result to K=zero. You say so on this web-site, and my statistical software has the same default which I cannot change. I find that a bit strange, it is not always good enough to be just a little better than tossing a coin. To introduce a new method you might want to prove it to be significantly better than fair Kappa ( which often is set at K=0.40, but I am aware that definitions vary between different authors).
    If I want to compare Kappa values between two different sample populations, with the hypothesis that the Kappa of the test population is significantly higher than the Kappa of the reference population, I believe I have to calculate the z-score manually by subtracting the mean of the reference population from that of the test population and dividing by the SE of the test population. Then I calculate the p-value choosing a one-sided or two-sided test.
    Do you agree on this, or are you of another opinion?
    (I have a material with 3 observers/judges and 106 samples at two different departments with different training, and the size is chosen by a power calculation based on previous publications on the topic.)
    Best regards,

    • Charles says:

      Dear Malin,

      I don’t know any tests to determine whether kappa is higher than some value p. I am not sure whether any simple approaches will work as the case for testing the correlation coefficient reveals. The approach for determining whether a correlation coefficient is significantly different from zero is quite different from that used to determine whether the correlation coefficient exceeds some value. This is because in the first case, you can assume a normal distribution (at least under the null hypothesis) and use a t distribution, while in the second case you can’t assume a normal or t distribution.

      I also don’t know any methods for comparing two values of kappa based on two different measurements. The approach that you are considering may be appropriate if you can figure out what a suitable pooled s.e. should be. Again we can look at the correlation coefficient where the pooled s.e. is not so obvious.

      I have found the following research paper, which may be useful, although I have not read it myself.

      McKenzie DP. Mackinnon AJ. Peladeau N. Onghena P. Bruce PC. Clarke DM. Harrigan S. McGorry PD. Comparing correlated kappas by resampling: is one level of agreement significantly different from another?. Journal of Psychiatric Research. 30(6):483-92, 1996 Nov-Dec.


  50. Fernando says:

    Thank you for your explanation, I have a question
    I’m following the formula for kj, but the values I get are not the same as your table, could help reviewing whether the values in your table are correct?
    Thank you

  51. Pingback: Interrater reliability or Kappa Statistic in excel - Page 2

  52. Felix says:

    Hi Dr. Zaiontz,

    Great tutorial! Straight to the point and instantly gratifying! I have 22 raters evaluating 50 different scenarios. They then categorize the scenarios into 6 possible groups (nominal variable) according to what they believe is the correct category. I have calculated the interrater reliability using Fleiss’ kappa according to your methods, but I am also interested in the overall reliability of all of the judges compared to a gold standard. The gold standard rating was done through the consensus of three independent raters using a tool. What test would you run to compare rater reliability to a gold standard?

    thank you!

  53. Rose Callahan says:

    Thank you for your explanation! I think I know why my previous attempts to calculate a kappa came out with nonsense, but I am not sure how to set up my data.

    I have three raters, each watches two videos for each of 50 subjects. On each of the two videos, they score two different aspects on different scales. On video one, they score 6 items for aspect “A” on a 1-5 scale, and another for aspect “B” on a 0-14 scale. On the second video, they again score 6 items for aspect “C” on a 1-5 scale, and another for aspect “D” on a 0-12 scale. So I really have 14 evaluations per subject per rater, and two of them have a large number of possible scores (13-15). We were adding all the “A” items to get a score out of 30, and similarly all the “C” items to get a score out of 30, to only have 4 evaluations per subject per rater. I am afraid this is artificially lowering our inter-rater reliability, because it seems much less likely that any 2 raters will all agree on a score out of 15 or 30 than a score out of 5.

    Do I need to run 14 kappa tests? And then, how do I get an overall inter-rater reliability rating from them? Or can the kappa test be set up to handle large numbers of possible scores? Should I be doing something totally different?

    • Charles says:

      From your description, you can calculate 14 ICC values (or 4 ICC values based on the combined scores). If you have a way of calculating a combined rating, you can use this to compute one ICC.
      If you have 14 ICC values, you can create a combined value based on what you plan to use the ICC for. E.g. you could use the minimum ICC value.

  54. M.C. says:

    Dear Sir.

    This website helped me a lot in understanding statistics. Thank you.
    Can you help me with the following?
    I want to measure the interobserver agreement of three observers(raters) for evaluation of gastric tumors.
    There are 445 subjects (n) and only two categories (whether benign or malignant).
    It seems that Fleiss’ Kappa calculates the overall agreement of the three observers.
    Can the Fleiss’ Kappa be used to calculate agreement between two observers?
    For example, between observer 1 vs observer 2, observer 2 vs observer 3, and observer 3 vs observer 1? (I’m guessing not. Then should I use weighted Cohen’s kappa three times instead?)

    Thank you.

  55. John Song says:

    Thank you. This is very useful for understanding the statistical procedure.

    On this page, there is only an explanation for ‘Within or between Appraisers’.

    Could you add some explanation for ‘Appraisers vs standard’?

    It would be a lot of help to me. Thank you

  56. SREEJA PS says:

    Thank you for your wonderful explanation. For my data set I am getting a Z value of 17.8. Similarly, the Z value for each category is above 25. What should I infer from this?
    K= 0.437138277
    var= 0.000593
    se= 0.024351591
    k/se =17.95111749

    for each category

    k1 0.722940023 0.301151045 0.270887021 0.154798991 0.392763376 0.497409713 0.401404805 0.473193742 0.296002606
    var(k1) 0.000390278 0.0003871 0.000595739 0.001114339 0.000594559 0.000777897 0.000547207 0.000653159 0.000385771
    se(k1) 0.019755455 0.01967485 0.02440776 0.033381712 0.024383589 0.027890808 0.023392452 0.025556984 0.019641059
    z 36.59445027 15.30639563 11.09839766 4.637239447 16.10769338 17.83418051 17.15958671 18.51524193 15.0706028

    • Charles says:

      It is very difficult for me to answer your questions without some context. It would be better if you sent me an Excel file with your raw data and the analysis.

    • Charles says:


      Thanks for sending me the Excel spreadsheet. This makes everything much clearer.

      I am also getting z values which are high, even higher than yours. The reason for the difference is that I am calculating a different value for the standard errors. The standard errors I am using come from the following paper.

      Fleiss, J. L., J. C. M. Nee, and J. R. Landis. 1979. Large sample variance of kappa in the case of different sets of raters. Psychological Bulletin

      What is the source of the standard errors that you are using?


  57. Maike says:

    Dear Charles Zaiontz,

    Thank you for this great explanation. I would really appreciate it if you could help me with the following:

    I have 76 raters who had to listen to two audio recordings (one native language & one foreign language) and transcribe what they heard. Some of the raters were primed. I categorized their transcripts as 1 = correct transcription, 2 = biased transcription and 3 = other.

    Now I would like to analyse if the agreement is higher in the primed group than in the control condition and if this agreement is even stronger if the raters were transcribing a non-native language.
    Would Fleiss Kappa be the right choice in this case?

    Many thanks in advance!

    Kind regards,

    • Charles says:


      You can calculate Fleiss’s kappa for the group of primed raters and then calculate another Fleiss’s kappa for the group of unprimed raters and then compare the results. The measurement that you get will be quite limited since you have only two subjects (i.e. the two audio recordings).

      When you try to compare agreement in transcribing native vs. non-native languages, you reduce the number of subjects down to 1, which will violate the assumptions of Fleiss’s kappa. In this case you might as well simply compare the variances of the ratings.

      The following is something I found on the Internet which may be useful


  58. Leslie Hoy says:

    Hi, I need some advice please.
    I have a database of about 4000 plants and have information on each plant species from various sources. The information was obtained from about 30 different sources (raters). However, many sources (raters) have only provided information for about 10% of the plant database. Hence I have a lot of data on each plant, but it is not from each rater. This implies that there are large gaps in the data (but it is sufficient for me to conclude certain answers). The information has been captured in Excel, so there are large gaps between data in each column as well as in the rows. It has been recommended that I use Cohen’s Kappa (to test the agreement between raters); however, from my reading I think that Fleiss’ Kappa is more suited. Can you advise on this? Also, can you advise me of a website or YouTube channel that explains how to set up a data set such as this correctly and what process and formula to use (preferably a step-by-step guide)?

    • Charles says:

      Fleiss’s kappa can be used when you have many raters. It is not necessary that each rater rate each subject. The main criteria are that (1) each subject is rated by the same number of raters and (2) the ratings are categorical (i.e. not ordered, such as in a Likert scale or with a decimal value). If the ratings use a Likert scale or a numerical value then the ICC might be a better way to go.
      Before setting up your data, you need to be clear about what sort of data you have (i.e. number of raters, types of ratings, etc.).
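
      Criterion (1) is easy to check before running the analysis. The following Python sketch assumes the data has already been arranged with one row per subject and one column per rating category; the function name and the table are hypothetical.

```python
from collections import Counter


def check_rating_counts(counts):
    """Return (m, bad_rows): the most common row total and the rows that differ from it."""
    totals = [sum(row) for row in counts]
    m = Counter(totals).most_common(1)[0][0]
    bad_rows = [i for i, total in enumerate(totals) if total != m]
    return m, bad_rows


# Hypothetical table: 4 subjects, 3 categories, 5 ratings intended per subject
table = [[5, 0, 0], [2, 3, 0], [1, 1, 2], [0, 4, 1]]
m, bad_rows = check_rating_counts(table)
print(m, bad_rows)  # 5 [2]
```

      Any row listed in bad_rows was rated a different number of times from the rest and would need to be completed or left out before Fleiss’s kappa is calculated.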

  59. Ben M says:

    I am conducting a controlled test in which I have 30 evaluators and 10 pairs of samples. I am providing a sample pair to an evaluator to review and determine if the 2 cards are the same or different. Within the 10 pairs, 6 pairs are different and 4 are the same. I know the standards and controls.
    My null hypothesis is that the cards are the same.
    My alternate hypothesis is that the cards are different.

    How would you recommend analyzing the data? One set? Separated?
    In Minitab using 1 set I am obtaining a p value of 1.0 in the Fleiss’ Kappa Statistics section and negative Kappa values.

    Fleiss’ Kappa Statistics

    Response Kappa SE Kappa Z P(vs > 0)
    Different -0.310712 0.05 -6.21425 1.0000
    Same -0.310712 0.05 -6.21425 1.0000
    Thank you in advance!

    • Charles says:

      Ben, regarding whether you create one Fleiss’ Kappa or two (one for the cards that are different and another for the cards that are the same), this really depends on what you are trying to show. Either can be useful.
      I don’t understand why you need to set null/alternative hypotheses or why you have chosen these hypotheses.

  60. Simone says:

    I need your help please regarding the calculation of kappa for a study. I have 5 raters for 13 articles. The rating is mainly based on scores from 1 to 5. I want to check the reliability by calculating kappa for each rater and overall.

    • Charles says:

      Kappa is a measurement of the overall differences between raters. You don’t calculate a kappa value for an individual rater.
      The way to calculate Fleiss’s kappa is shown on the referenced webpage.

  61. Paola Jimenez says:

    Good morning Charles,

    In an experiment that we’re doing, we found that in some cases we have perfect agreement between the 10 judges, but Fleiss’ kappa nevertheless seems to be invalid or comes out negative.
    Can you help us analyse these answers? Here I copy the link of the experiment.

  62. Josee says:

    Thank you for this information on Fleiss Kappa.
    I am still having difficulties in conducting the Fleiss Kappa. I have 2 raters that evaluated over 1000 events for 3 different evaluation categories. The raters can only use 1 category out of the 3 to rate an event. I cannot do a Cohen Kappa because there are more than 3 evaluation categories, so I was doing an agreement score between the 2 raters, but I have been told to do a Fleiss Kappa instead. I have organized my Excel table so that for each event the evaluation category is 0 if there is no agreement and 1 if the 2 judges agree. So for each event you will have either three scores as 1 (if agreement) or two scores as 0 and one score as 1 (if disagreement). I wanted to do a Fleiss Kappa in SPSS or even Excel, but either way I cannot find out how. Thank you in advance for your help with this.

  63. Tim says:

    Dear Charles,

    Thanks for sharing this! I’m having trouble with dichotomous variables and 4 raters. Is there any option to calculate an ICC or Fleiss’ Kappa with dichotomous variables?

  64. John says:

    Dear Charles,

    Thank you very much for providing this useful and informative website.
    I have been asked to evaluate multiple readers’ treatment response assessments of 12 subjects based on impression of a change in lesion uptake of a radiopharmaceutical on medical images acquired on three different dates (there are four choices: progression, stable, partial response or complete response). However, the readers have not necessarily evaluated all the same lesions; they were simply told to select those lesions from the images that appeared to have the most uptake. Their response assessment was based on the appearance of the single hottest lesion at each time point, even if it was a different lesion from the baseline choice. In most cases, based on quantitative measures, I can see that they all selected the same lesion at a particular time point. In a few cases, it is obvious that they must have chosen different lesions. The Fleiss kappa value is 0.65, which the general literature appears to regard as ‘good’ agreement but I understand that this descriptive terminology is arbitrary. My questions are:
    1. Given the fact that the readers may occasionally have selected different lesions at a particular time point, based on subjective impression of uptake, is a Fleiss kappa evaluation of response assessment valid? Please keep in mind that the object of the study was to show that the response assessments were consistent among multiple readers.
    2. The previous question notwithstanding, is a kappa value of 0.65 really ‘good’?


    • Charles says:

      1. I am not able to answer your question without a clearer and more detailed description of the situation.
      2. There isn’t complete agreement on what is a good value. The value of .65 is only a guideline.

  65. Rose Amielle says:

    Hi Charles.

    My research consists of 6 categories (Strongly Agree, Agree, Slightly Agree, Slightly Disagree, Disagree and Strongly Disagree) , 27 items, and 9 raters. Below are the results based on this article:
    Pe= 25.28172394
    z = 56.30456314
    p-value = 0

    The kappa value is 1 which means excellent agreement but the z-value is so high. Is there something wrong based on the results? Thank you very much.


  66. Pick says:

    Hi, Charles

    Thanks for the great article. I think it is very useful for analysing multiple-rater kappa (Fleiss’ Kappa). I have some problems, such as a negative kappa, and I am not sure that I calculated the data correctly. Would you mind rechecking my data calculation?




    • Charles says:

      No this isn’t the correct calculation. The main problem is that your columns correspond to the raters instead of the rating categories. Also the sum of the values in each row should be the same.

  67. Renee says:

    Hello Charles,

    Unbelievably helpful…thank you! However, I am getting very low kappas for several surveys that we are making, and I have read about this paradox when there is very high agreement (skewed distribution). Can you take a peek at our spreadsheet to ensure it is correct? It reflects 10 dichotomous ratings by 10 people about the understanding of survey items (first column). They are in very high agreement. Is there a way to handle this and a way to report the kappa (or ranges) that is meaningful? Blue rows were not included in the analyses. I truly appreciate any insight that you can provide.

  68. s k sinha says:

    I do not know how to calculate Fleiss Kappa for the following data. Please help.

    A 70 3 2 4
    B 2 44 5 0
    C 3 6 29 1
    D 5 1 1 24

  69. s k sinha says:

    Data I sent earlier may be confusing. Please calculate Fleiss kappa and provide worksheet for following data-
    A 70 3 2 4

    B 2 44 5 0

    C 3 6 29 1

    D 5 1 1 24

  70. kostas says:

    I have a problem getting results from the reliability testing. I have 3 columns with marks (1,2,3) from 3 raters, and when I run the Fleiss’s kappa on these values
    rater1 rater2 rater3
    1 1 2
    3 3 3
    1 1 1
    2 2 2
    3 3 3
    3 3 3
    2 2 1
    1 1 2

    I get this table

    Fleiss’s Kappa
    alpha 0,05
    tails 2

    Total rater1 rater2 rater3
    #N/A #N/A #N/A #N/A #N/A
    #N/A #N/A #N/A #N/A #N/A
    #N/A #N/A #N/A #N/A #N/A
    #N/A #N/A #N/A #N/A #N/A
    #N/A #N/A #N/A #N/A #N/A
    #N/A #N/A #N/A #N/A #N/A

    please help if possible!

    • Charles says:

      This is not the correct format. Each row is a different subject and each column is a different category; each cell contains the number of raters who assigned that category to that subject. The sum of the values in each row must be the same across all the rows.

    • kostas says:

      I am sorry, I got it wrong! I had to count the identical marks in each row and make a new table like this. (Please correct me if I am wrong!)
      mark1 mark2 mark3
      2 1
      2 1
      2 1
      Thank you for the tools anyway!

      • Charles says:

        This looks good, although when I see it on my computer all the columns are shifted left — in particular, there are no values for Mark3. You would have had to insert zeros for blank values, but this should not be a problem on the Excel spreadsheet.
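        The conversion discussed in this thread can be sketched in Python. This is only an illustration, not part of the Real Statistics software: the function name `ratings_to_counts` is hypothetical, and it assumes the raw data has one row per subject and one column per rater, as in the marks quoted above. It tallies each row into counts per category and writes an explicit zero for any category nobody chose.

```python
from collections import Counter

def ratings_to_counts(ratings, categories):
    """Turn raw ratings (one row per subject, one value per rater)
    into a subjects x categories table of counts."""
    table = []
    for subject_ratings in ratings:
        tally = Counter(subject_ratings)
        # A Counter returns 0 for missing keys, so unused categories become 0.
        table.append([tally[c] for c in categories])
    return table

# Raw marks quoted in the comment above: 8 subjects, 3 raters.
raw = [
    (1, 1, 2),
    (3, 3, 3),
    (1, 1, 1),
    (2, 2, 2),
    (3, 3, 3),
    (3, 3, 3),
    (2, 2, 1),
    (1, 1, 2),
]

counts = ratings_to_counts(raw, categories=[1, 2, 3])
for row in counts:
    print(row)
```

        Every row of the resulting table sums to 3, the number of raters, which is the condition the analysis tool expects.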

  71. Linh says:

    Hello, thank you very much for sharing them.
    Could you do me a favor?
    1. As you know, Cohen’s kappa has a set of criteria: less than 0% no agreement, 0-20% poor, 20-40% fair, 40-60% moderate, 60-80% good, 80% or higher very good. How about Fleiss’s kappa, are they the same?
    2. What is the difference between Cohen’s kappa and Weighted Cohen’s kappa?
    3. I need to measure the agreement between 8 raters. Is it possible to use Cohen’s kappa for each pair of raters and then take the average of all the kappas?
    4. Is there any software or application to measure Fleiss’s kappa (e.g. something like SPSS or MedCalc)?
    Thank you so much for helping me!

    • Charles says:

      1. You can use the same criteria as for Cohen’s kappa, although there isn’t universal agreement about these criteria.
      2. See Weighted Kappa
      3. You can do this, but it is not clear how you would interpret the result
      4. The Real Statistics software calculates Fleiss’ kappa. I would think that SPSS and MedCalc do as well, but I can’t say for sure since I don’t use these tools.

  72. Rob says:

    Hi Charles,

    Many thanks for producing this, it’s very helpful. I wonder if you could help me with a question. I’m trying to calculate the degree of interrater reliability between three raters when screening research papers for inclusion in a systematic review. The three raters have the option to include or exclude each of the 20 papers. I think I’ve followed your instructions for formatting correctly. The raters all agreed on 18 papers. On the other two there was a 2 to 1 split (i.e. two raters said include, the other exclude).

    When I input that, I get the following result:

    Kappa = 0.86 (p <0.05), 95% CI (0.599, 1.118)

    I noticed that the upper CI value is greater than 1. I assumed that the value would be bounded at 1 because that would represent perfect agreement. Have I misunderstood something or made an error?

    Many thanks
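
    The interval described in the article is the normal approximation κ ± NORMSINV(1 – α/2) * s.e., which is symmetric around κ and has no built-in upper bound. When κ is close to 1 and the standard error is not small, the upper limit can exceed 1. A minimal sketch of that arithmetic follows. The inputs are assumptions: κ is taken as the midpoint of the interval quoted above and s.e. is back-calculated from its width, since the underlying data are not shown. Clipping the upper limit at 1 is only one possible way to report the result, not something the article prescribes.

```python
from statistics import NormalDist

def kappa_confidence_interval(kappa, se, alpha=0.05):
    """Normal-approximation interval: kappa +/- z_crit * s.e."""
    # inv_cdf plays the role of Excel's NORMSINV(1 - alpha/2).
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return kappa - z_crit * se, kappa + z_crit * se

# Assumed inputs, back-calculated from the interval (0.599, 1.118) quoted above.
kappa = 0.8585
se = 0.1324

lower, upper = kappa_confidence_interval(kappa, se)
print(round(lower, 3), round(upper, 3))

# Optional reporting choice (an assumption): cap the upper limit at 1.
upper_capped = min(upper, 1.0)
print(round(lower, 3), upper_capped)
```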

  73. xinru says:

    Hi Sir,

    Could you assist me with the following?

    My response outputs from my 4 raters are non-categorical, e.g. “52 years, 53 years”. Initially, I manually grouped them into “Yes” and “No” before using SPSS to calculate the kappa scores. But how do I go about calculating the kappa score for them without manually grouping the responses (just by putting the raw data into SPSS)?

    Thank you!

    • Charles says:

      Sorry Xinru, but I don’t completely understand your question. If you have non-categorical data, then Fleiss’ kappa is probably not the right tool to use. Perhaps ICC would be better.

  74. Angela says:

    Hi Sir,

    I would like to ask how I can assess the agreement between an isolated rater (master rater) and a group of, say, 10 raters. The variables are ordinal, like No Disease, Mild Disease, Moderate Disease, Severe Disease.

    Thanks in advance for your reply!

  75. Joseph says:

    Hi Charles,
    This has been very useful so far. Thank you.

    My work is similar to your example but there are the following differences:
    -instead of 4 psychiatric diagnoses, there are 2 columns: Absence and Presence.
    -instead of 12 patients, there are 6 patients.
    -instead of 6 raters, there are 15 raters.

    I have two queries:
    1. In one part of my research, all 15 raters agree on ‘Presence’ for every patient. However, Fleiss’ kappa can not be calculated because (if you follow your manual method) you end up dividing by 0. This is not possible. What should I do?

    2. In another part, 15 raters agree on ‘Presence’ for the first 5 patients. For the final patient, 14 agree on ‘Presence’ with the remaining rater stating ‘Absence’. I would expect the kappa value to be relatively high. However, it is negative.

    This is very confusing and I would be very grateful if you could help.

    Thank you.

    • Charles says:

      One of the problems with Fleiss’ Kappa is that you can get some counter-intuitive results, especially in extreme cases. My understanding is that Gwet’s AC2 measurement addresses some of these problems. The next release of the Real Statistics software will include support for Gwet’s AC2. Regarding your specific questions:
      1. If all raters agree, simply report 100% agreement (don’t bother with Fleiss’ Kappa)
      2. Once again with this level of agreement, you should simply report that you have 100% agreement except in one case.
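
      Both situations can be reproduced with a short Python sketch of the calculation defined in the article: the mean pairwise agreement, the chance agreement computed from the category proportions, and κ as their chance-corrected ratio. The function name `fleiss_kappa` and the choice to return None when the denominator is zero are assumptions made for illustration, not the behaviour of the Real Statistics software. The two tables encode the scenarios described in the question, with columns Absence and Presence.

```python
import math

def fleiss_kappa(table):
    """Fleiss' kappa for a subjects x categories table of rating counts.

    Returns (kappa, p_bar, p_e); kappa is None when p_e equals 1,
    because the formula would then divide by zero.
    """
    n = len(table)        # number of subjects
    m = sum(table[0])     # number of ratings per subject
    if any(sum(row) != m for row in table):
        raise ValueError("every row must sum to the same number of ratings")

    # p_i is the proportion of agreeing rater pairs on subject i; p_bar is the mean.
    p_bar = sum(
        sum(x * (x - 1) for x in row) / (m * (m - 1)) for row in table
    ) / n

    # q_j is the overall share of ratings in category j; p_e is chance agreement.
    k = len(table[0])
    q = [sum(row[j] for row in table) / (n * m) for j in range(k)]
    p_e = sum(qj * qj for qj in q)

    if math.isclose(p_e, 1.0):
        return None, p_bar, p_e
    return (p_bar - p_e) / (1 - p_e), p_bar, p_e

# Columns: Absence, Presence. Six patients, 15 raters each.
all_agree = [[0, 15]] * 6
one_dissent = [[0, 15]] * 5 + [[1, 14]]

print(fleiss_kappa(all_agree))
print(fleiss_kappa(one_dissent))
```

      In the second table the mean pairwise agreement is about 0.978, but the chance agreement is marginally higher because almost every rating falls in one category, so κ works out to -2/178, roughly -0.011. In the first table the chance agreement is exactly 1 and κ is undefined.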

  76. Enas says:

    Hi Charles,
    Thank you so much for this website.
    I have analyzed the inter-rater reliability for 5 raters on 34 items. The variables were dichotomous (correct, wrong). The average pairwise percent agreement was 82.4%. However, the Fleiss’ kappa was 0.32 and Krippendorff’s alpha was 0.32 as well. Actually, I am frustrated, and I do not know how to report my data, or whether there is a way to correct the kappa.
    Your help is much appreciated.

    • Charles says:

      It looks like it is easy to report your results, although it seems like you don’t like the result.
      The reason for doing analysis is to find out whether or not your hypotheses are correct. A negative result is still a result.

  77. Patrick D says:

    Hello Charles,

    so I have a question. I have an open card sorting with 10 raters, 23 items and 2 categories.
    It’s 230 ratings in total, and in 225 out of the 230 cases, the raters assigned the same category.
    Using the add-in i get a Fleiss Kappa of 0,92067698.

    However, the p-value always amounts to 0. How is that possible?

    I also tried changing the numbers so there would be more volatility (e.g. setting 225 down to 180); the Fleiss kappa shrank but the p-value is still 0.

    Maybe it’s a problem since I am using a German Excel version?
    The regular Cohen’s Kappa calculation also doesn’t work for me.

    • Charles says:

      If you send me an Excel file with your data and calculations I will try to figure what is happening.
      The German version shouldn’t give a different answer from the US version (unless there is a problem with the decimal symbol).
      You can find my email address at Contact Us.
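
      One possible explanation, offered as a sketch rather than a diagnosis of this particular file: the p-value comes from z = κ/s.e. via the standard normal distribution, and for a large z it is far smaller than anything a spreadsheet cell will display, so it appears as 0. The snippet below uses Python’s standard library; the function name `two_tailed_p` is hypothetical, and z = 56.30456314 is the value reported in an earlier comment in this section.

```python
import math

def two_tailed_p(z):
    """Two-tailed p-value for a standard normal test statistic z."""
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)); erfc stays accurate in the tail.
    return math.erfc(abs(z) / math.sqrt(2))

for z in (1.96, 10.0, 56.30456314):
    print(z, two_tailed_p(z))
```

      At z = 10 the p-value is already below 1e-20, and at z ≈ 56 it underflows to exactly 0.0 in double precision, so a reported p-value of 0 does not by itself indicate a software fault.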

  78. Sid says:

    Hi, thanks for this info.
    I have a query. I am conducting a reliability study by 3 observers. The observers are given 3 videos from a pool of videos and asked to observe an interaction of the individual in the video with the environment. The unit of observation is behavior. The observers were instructed to describe their observations. the investigator then analysed the descriptions and developed codes. How do I report the reliability? Right now I am only reporting commonality and discrepancies. However, I would request you to guide me in using a statistical measure to assess reliability.

    • Charles says:

      It sounds like you need an interrater type of reliability measure. The Real Statistics website describes quite a number of these measures, including Fleiss’ Kappa. Which one to use depends on the details of your situation.

  79. Bipin Manezhi says:

    Great article. I tried Fleiss’ kappa and it worked. An excellent and easy-to-use software as well.
    Just a question- I am doing IRR for a Clinical Audit Tool with 9 assessors. Each of them will assess if the clinical records done by clinicians are good, average or bad- so three categories. Will I be using one tail or 2 tail in this case?

    • Charles says:

      Glad that you found the article and software useful.
      Fleiss Kappa always uses a two-tailed test.

      • Bipin says:

        Thanks Charles,

        I did the analysis, and it worked well for 9 out of 10 items I was assessing. it did not work for one item though: I have the data below:
        Good Average Poor Non Response
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0
        9 0 0 0

        As all nine assessors rated good for all samples, it should be showing as +1?
        But the result was as shown below:
        Fleiss’s Kappa

        alpha 0.05
        tails 2

        Total Good Average Poor Non Response

        Could you please help?

        • Charles says:

          This is an oddity of many of these measurements. Since the variance for your data is zero, the calculation of the statistic probably involves division by zero and so the statistic is undefined. In any case, you can assume that you have complete agreement and don’t really need to calculate kappa. You can treat it as if the value was +1.

  80. Susan says:

    Hi Sir,
    Thank you for the website. I’ve downloaded the resource pack and tried to do the Fleiss’ kappa analysis following the instructions, but got an error message saying “compile error in hidden module:correlation”. Could you please help me to solve this problem? Thanks a bunch!

    • Charles says:


      Here are some steps for diagnosing or potentially resolving this problem.

      1. Try using one of the Real Statistics functions. In particular, enter the formula =VER() and see whether you get an error, in which case, the software has not been installed correctly
      2. Check whether Solver has been installed. When you choose the Add-Ins option from the Tools menu, do you see RealStats and Solver in the list of add-ins with a check mark next to them?
      3. See Hint 2 at http://www.real-statistics.com/appendix/faqs/disappearing-addin
      4. The problem might be with the Trust Center settings. Click on Options from the File ribbon and then choose the Trust Center option on the left side. Next click on Trust Center Settings …. Next click on the Macro Settings option on the left side and make sure that it is Disable all Macros with Notification. Also click on the Trusted Locations option on the left side and click on the Add New Location… button to add the folder that contains RealStats-2007 folder as a trusted location.
      5. Try opening a blank Excel worksheet and press Alt-TI. Uncheck the RealStats addin and close Excel. Now open a blank Excel worksheet and press Alt-TI. This time check the RealStats addin and now try to use the Fleiss’ Kappa analysis tool.


      • Susan says:

        Thank you so much for your quick reply! Sorry I wasn’t clear: I’m using a MacBook and Excel 2011. Does that make any difference? I don’t see “real statistics functions”, only the “data analysis tool”, and I was trying to use the “interrater reliability” function in the “correlation” tab.
        Anyway, I will also try PC, and see if that will work. Thank you so much for your help!

        • Charles says:

          Try pressing Ctrl-m or Cmd-m.
          Also make sure that you get a valid value from the formula =VER()

          • Susan says:

            Hi Charles,
            Thank you so much for your help. =VER() returned “5.0 Excel Mac”. I also got the Kappa function to work, which will return a single number. However, when I tried to input all the parameters to get an output like Figure 3, it only returned a word “kappa”. On the bottom of the window where you put in the parameters, I can see part of the result (Kappa, s.e., but then got truncated, so I could not get the Z, p etc). Could you think of a reason for this? I feel like I’m very close to getting it to work properly, just not quite yet. Thank you, I really appreciate your help!

          • Charles says:

            The KAPPA function is an array function and so you can’t simply press Enter to get all the output. See the following webpage for how to use an array function:
            Array Formulas and Functions

          • Susan says:

            Sorry, forgot to say:

            The parameters I put in were like this:
            rg=E2:G18 (range of my data)
            col=3 (I have 3 categories)
            lab=TRUE (I’m not sure about this, I just followed the example, which said that if I want to see a result like Figure 3, lab=TRUE. Am I right?)
            orig=TRUE (I’m not sure what this is either, or what else I could potentially choose to put in).
            I tried to define my output area as 1 cell, a 1×6 area, and a 2×6 area. None of them worked; all I saw was the word “kappa” in the first cell.
            Any suggestion would be greatly appreciated.

          • Susan says:

            Hi Charles,

            Just want to let you know, I downloaded a PC version to my old old PC, it’s working. Thank you very much for all your help.

          • Charles says:

            This is good to hear.

  81. Lona Trulove says:

    I am conducting a content analysis pilot study on 23 novels. There are five coders and 16 personality traits. The traits are not mutually exclusive and the coders are marking them as either yes the trait was exhibited by the character or no it was not.

    What is the best way to test inter coder reliability?

    • Charles says:

      I don’t have a definitive answer for you, just some ideas.
      Fleiss’ Kappa handles the case where each rater assigns a rating for one of the traits. For your problem you have multiple traits and so possible approaches are: (1) simply report one Fleiss’ kappa measurement for each of the 16 traits, (2) combine the Fleiss’ kappa measurements in some way (average, max, etc.), (3) use some complicated coding that handles all 16 traits (e.g. a vector of 16 elements consisting of zeros and ones).
      Approach 1 is simple and may be the best, but it doesn’t give ratings that handle all the traits simultaneously. Approach 2 seems pretty arbitrary, but is probably the one most often used. Approach 3 seems reasonable, but then you need to come up with some way of ordering the vectors or determining distance between vectors that captures similarity of ratings; in this case you wouldn’t use Fleiss’ kappa (which is restricted to categorical data) but Gwet’s AC2 or Krippendorff’s alpha or something similar.
      I don’t know whether this issue is covered in the following book, but it covers interrater reliability:
      Handbook of Inter-Rater Reliability, 4th Edition by Gwet

  82. Joan says:

    Hi Charles,
    I have completed a survey where the intention is to measure level of agreement on the diagnosis of 4 cases by 57 raters.
    In each case, there are 4-5 possible answers given, but the aim is only to score them as 1 if rater 1 and rater 2 agree on the same diagnosis.
    I understand Cohen’s kappa would only work for 2 raters. Would Fleiss’ kappa work in this instance, and how might I compute this (Is there a calculator I can use online)?
    Thanks a lot.

    • Charles says:

      What criterion would you use if say you had 4 raters? 1 if all 4 agree and 0 otherwise?

      • Joan says:

        Hi Charles,
        I tried using Cohen’s kappa for a few cases manually at first and averaged out the kappa for each case. For 4 raters, the pairs would be as follows:
        1 vs 2
        1 vs 3
        1 vs 4
        2 vs 3
        2 vs 4
        3 vs 4
        That would work out to be 6 pairs.

        If the diagnosis was the same, I scored it as 1; if it was discordant I scored it as 0.

        For each case, I would then score the count as a fraction. E.g. if there was perfect agreement for 5 pairs, the count would be 5/6.
        I would then add the fractions from each of the 4 cases and divide by 4.

        However, I realized this wasn’t going to be possible for 56 raters!
        Also, the values I worked out seemed to grossly underestimate the level of agreement.

        From your explanation, Fleiss kappa seems to make more sense to me for my data. But it doesn’t seem to require a rater 1 vs 2 agreement, hence I am not quite sure how best to approach this and would greatly appreciate any advice.


        • Charles says:

          The typical approach is to use as input data the ratings made by each rater on each subject (independently of the other raters). The interrater measurement then calculates the degree of agreement between the raters. This makes things less complicated than the approach you are envisioning. You can use Fleiss’ kappa with 4 raters (or in fact any number of raters > 1) provided the ratings are categorical, i.e. not Likert scores (e.g. ratings from 1 to 5) or numeric values (ratings of 2.67 or 3.63, etc.). If you need non-categorical ratings, then there are other measurements available, e.g. intraclass correlation (ICC), Gwet’s AC2, Krippendorff’s alpha or Kendall’s W. All of these are described on the Real Statistics website.
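
          The pair-by-pair scoring described in the question does not need to be done by hand. The article defines, for each subject, the proportion of pairs of judges that agree, and that proportion can be obtained directly from the category counts. A small Python sketch, with hypothetical function names and made-up ratings used purely for illustration, shows that enumerating every pair and using the counts give the same number.

```python
from collections import Counter
from itertools import combinations

def agreement_by_pairs(ratings):
    """Share of rater pairs giving the same rating to one subject (brute force)."""
    pairs = list(combinations(ratings, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def agreement_by_counts(ratings):
    """Same quantity from category counts: sum of x_j (x_j - 1) over m (m - 1)."""
    m = len(ratings)
    return sum(x * (x - 1) for x in Counter(ratings).values()) / (m * (m - 1))

# Hypothetical diagnoses from 4 raters for a single case: 6 pairs, 3 of which agree.
case = ["A", "A", "B", "A"]
print(agreement_by_pairs(case), agreement_by_counts(case))

# Hypothetical 57-rater case: 40 raters choose "A" and 17 choose "B".
big_case = ["A"] * 40 + ["B"] * 17
n_pairs = len(list(combinations(big_case, 2)))
print(n_pairs, agreement_by_counts(big_case))
```

          With 57 raters there are 1596 pairs per case, which is why the count-based form is the practical one. Fleiss’ kappa then averages these per-subject proportions and corrects for chance agreement.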

  83. Joan says:

    Thank you, Charles. Will it be possible for me to send you my data and for you to have a look at it on how best to represent my data?

  84. Jeremy Hamm says:

    Hi Charles,
    I could be wrong but it appears your formula for SE in figure 2 (cell H9) is missing an array term in the sumproduct function. In the equation in the text, you write the second term in the numerator of the square root as:


    yet the sumproduct in figure 2 is written as:


    where row 17 is q_j and row 18 is 1-q_j

    without this extra term, the radicand can become negative and the SE incalculable.

    Hoping I’m right on this.
