Fleiss’ Kappa

Cohen’s kappa is a measure of the agreement between two raters, where agreement due to chance is factored out. We now extend Cohen’s kappa to the case where the number of raters can be more than two. This extension is called Fleiss’ kappa. As with Cohen’s kappa, no weighting is used and the categories are considered to be unordered.

Let n = the number of subjects, k = the number of evaluation categories and m = the number of judges for each subject. E.g. for Example 1 of Cohen’s Kappa, n = 50, k = 3 and m = 2. While for Cohen’s kappa both judges evaluate every subject, in the case of Fleiss’ kappa, there may be many more than m judges and not every judge needs to evaluate each subject; what is important is that each subject is evaluated m times.

For every subject i = 1, 2, …, n and evaluation category j = 1, 2, …, k, let xij = the number of judges that assign category j to subject i. Thus

\sum_{j=1}^{k} x_{ij} = m

The proportion of pairs of judges that agree in their evaluation of subject i is given by

p_i = \frac{1}{m(m-1)} \sum_{j=1}^{k} x_{ij}(x_{ij} - 1)

The mean of the pi is therefore

\bar{p} = \frac{1}{n} \sum_{i=1}^{n} p_i

We use the following measure for the error (chance agreement) term, where qj is the proportion of all assignments made to category j:

q_j = \frac{1}{nm} \sum_{i=1}^{n} x_{ij} \qquad\qquad \bar{p}_e = \sum_{j=1}^{k} q_j^2

Definition 1: Fleiss’ Kappa is defined to be

\kappa = \frac{\bar{p} - \bar{p}_e}{1 - \bar{p}_e}

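To make the computation concrete, here is a minimal Python sketch (not part of the original worksheet; it assumes a NumPy counts matrix x with one row per subject, one column per category, and every row summing to m) that implements the formulas above:

import numpy as np

def fleiss_kappa(x):
    # x[i, j] = number of judges assigning category j to subject i;
    # every row must sum to the same number of judges m
    x = np.asarray(x, dtype=float)
    n, k = x.shape                                   # subjects, categories
    m = x[0].sum()                                   # judges per subject
    p_i = (x * (x - 1)).sum(axis=1) / (m * (m - 1))  # agreement per subject
    p_bar = p_i.mean()                               # mean observed agreement
    q = x.sum(axis=0) / (n * m)                      # q_j, category proportions
    p_e = (q ** 2).sum()                             # expected chance agreement
    return (p_bar - p_e) / (1 - p_e)

# hypothetical usage (not the Figure 1 data):
# fleiss_kappa(np.array([[4, 2, 0, 0], [0, 6, 0, 0], [1, 1, 2, 2]]))
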
We can also define kappa for the jth category by

\kappa_j = 1 - \frac{\sum_{i=1}^{n} x_{ij}(m - x_{ij})}{nm(m-1)\,q_j(1 - q_j)}

The standard error for κj is given by the formula

s.e.(\kappa_j) = \sqrt{\frac{2}{nm(m-1)}}

The standard error for κ is given by the formula (where bj = qj(1 – qj))

s.e.(\kappa) = \frac{\sqrt{2}}{\sum_{j=1}^{k} b_j \,\sqrt{nm(m-1)}} \sqrt{\Big(\sum_{j=1}^{k} b_j\Big)^{2} - \sum_{j=1}^{k} b_j (1 - 2q_j)}

There is an alternative calculation of the standard error provided in Fleiss’ original paper, namely the square root of the following:

\frac{2}{nm(m-1)} \cdot \frac{\bar{p}_e - (2m-3)\bar{p}_e^{\,2} + 2(m-2)\sum_{j=1}^{k} q_j^3}{(1 - \bar{p}_e)^2}

The test statistics zj = κj/s.e.(κj) and z = κ/s.e. approximately follow a standard normal distribution, which allows us to calculate a p-value and confidence interval. E.g. the 1 – α confidence interval for kappa is therefore approximated as

κ ± NORMSINV(1 – α/2) * s.e.
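The following self-contained Python sketch (again my own illustration, not Real Statistics code) extends the sketch above to the per-category kappas, both standard errors, the z statistic, the two-tailed p-value and the confidence interval:

import numpy as np
from statistics import NormalDist

def fleiss_kappa_stats(x, alpha=0.05):
    # x is the counts matrix (subjects x categories), every row summing to m;
    # assumes every category is used at least once (so q_j is not 0 or 1)
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    m = x[0].sum()
    q = x.sum(axis=0) / (n * m)                      # q_j
    b = q * (1 - q)                                  # b_j = q_j(1 - q_j)
    p_i = (x * (x - 1)).sum(axis=1) / (m * (m - 1))
    p_bar, p_e = p_i.mean(), (q ** 2).sum()
    kappa = (p_bar - p_e) / (1 - p_e)

    kappa_j = 1 - (x * (m - x)).sum(axis=0) / (n * m * (m - 1) * b)
    se_j = np.sqrt(2 / (n * m * (m - 1)))            # s.e. of each kappa_j
    se = se_j * np.sqrt(b.sum() ** 2 - (b * (1 - 2 * q)).sum()) / b.sum()

    z = kappa / se                                   # test statistic for kappa
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-tailed p-value
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)     # = NORMSINV(1 - alpha/2)
    ci = (kappa - z_crit * se, kappa + z_crit * se)
    return kappa, se, z, p_value, ci, kappa_j, se_j

For the Example 1 counts this should reproduce the κ ≈ .2968 reported below for the KAPPA function, although the worksheet in Figure 1 remains the reference calculation.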

Example 1: Six psychologists (judges) evaluate 12 patients as to whether they are psychotic, borderline, bipolar or none of these. The ratings are summarized in range A3:E15 of Figure 1. Determine the overall agreement between the psychologists, subtracting out agreement due to chance, using Fleiss’ kappa. Also find Fleiss’ kappa for each disorder.


Figure 1 – Calculation of Fleiss’ Kappa

For example, we see that 4 of the psychologists rated subject 1 to have psychosis and 2 rated subject 1 to have borderline syndrome, while no psychologist rated subject 1 as bipolar or none of these.
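Applying the formula for pi to subject 1, whose counts are 4, 2, 0, 0 with m = 6 judges, gives p1 = [4(3) + 2(1)]/[6(5)] = 14/30 ≈ .467; that is, about 47% of the pairs of psychologists agree on subject 1.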

We use the formulas described above to calculate Fleiss’ kappa in the worksheet shown in Figure 1. The formulas in the ranges H4:H15 and B17:B22 are displayed in text format in column J, except that the formulas in cells H9 and B19 are not displayed in the figure since they are rather long. These formulas are:

Cell Entity Formula
H9 s.e. =B20*SQRT(SUM(B18:E18)^2-SUMPRODUCT(B18:E18,1-2*B17:E17))/SUM(B18:E18)
B19 κ1 =1-SUMPRODUCT(B4:B15,$H$4-B4:B15)/($H$4*$H$5*($H$4-1)*B17*(1-B17))

Figure 2 – Long formulas in worksheet of Figure 1

Note too that row 18 (labelled b) contains the formulas for qj(1–qj).

The p-values (and confidence intervals) show us that all of the kappa values are significantly different from zero.

Real Statistics Function: The Real Statistics Resource Pack contains the following supplemental function:

KAPPA(R1, j, lab, alpha, tails, orig): if lab = FALSE (default) returns a 7 × 1 range consisting of κ if j = 0 (default) or κj if j > 0 for the data in R1 (where R1 is formatted as in range B4:E15 of Figure 1), plus the standard error, z-stat, z-crit, p-value and lower and upper bounds of the 1 – alpha confidence interval, where alpha = α (default .05) and tails = 1 or 2 (default). If lab = TRUE then an extra column of labels is included in the output. If orig = TRUE then the original calculation for the standard error is used; default is FALSE.

For Example 1, KAPPA(B4:E15) = .2968 and KAPPA(B4:E15,2) = .28. The complete output for KAPPA(B4:E15,,TRUE) is shown in Figure 3.


Figure 3 – Output from KAPPA function
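Note that KAPPA (and the formulas above) expects the data already tallied into counts, one row per subject and one column per category, as in range B4:E15. If the raw data instead lists each judge’s rating for every subject, a short sketch along these lines (a Python illustration with hypothetical labels, not a Real Statistics feature) can build the required counts matrix:

import numpy as np

def to_counts(ratings, categories):
    # ratings: one list per subject containing each judge's category label
    # categories: the full list of possible category labels, in column order
    counts = np.zeros((len(ratings), len(categories)), dtype=int)
    index = {c: j for j, c in enumerate(categories)}
    for i, row in enumerate(ratings):
        for label in row:
            counts[i, index[label]] += 1
    return counts

# hypothetical raw data: 3 subjects, each rated by 4 judges
raw = [["psychotic", "psychotic", "borderline", "none"],
       ["bipolar", "bipolar", "bipolar", "none"],
       ["none", "none", "none", "none"]]
print(to_counts(raw, ["psychotic", "borderline", "bipolar", "none"]))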

Real Statistics Data Analysis Tool: The Reliability data analysis tool supplied in the Real Statistics Resource Pack can also be used to calculate Fleiss’ kappa.

To calculate Fleiss’ kappa for Example 1 press Ctrl-m and choose the Reliability option from the menu that appears. Fill in the dialog box that appears (see Figure 7 of Cronbach’s Alpha) by inserting B4:E15 in the Input Range, choosing the Fleiss’ kappa option and clicking on the OK button.

The output is shown in Figure 4.


Figure 4 – Output from Fleiss’ Kappa analysis tool

Note that if you change the values for alpha (cell C26) and/or tails (cell C27) the output in Figure 4 will change automatically.

153 Responses to Fleiss’ Kappa

  1. Simone says:

    I need your help please regarding the calculation of kappa for a study. I have 5 raters for 13 articles. the rate is mainly based on scores from 1 to 5. I want to check the reliability by calculating kappa for each rater and overall.

    • Charles says:

      Kappa is a measurement of the overall agreement among the raters. You don’t calculate a kappa value for an individual rater.
      The way to calculate Fleiss’s kappa is shown on the referenced webpage.

  2. Ben M says:

    I am conducting a controlled test in which I have 30 evaluators and 10 pairs of samples. I am providing a sample pair to an evaluator to review and determine if the 2 cards are the same or different. Within the 10 pairs I have the 6 pairs that are different and 4 are the same. I know the standards and controls.
    My null hypothesis is that the cards are the same.
    My alternate hypothesis is that the cards are different.

    How would you recommend analyzing the data? One set? Separated?
    In Minitab using 1 set I am obtaining a p value of 1.0 in the Fleiss’ Kappa Statistics section and negative Kappa values.

    Fleiss’ Kappa Statistics

    Response Kappa SE Kappa Z P(vs > 0)
    Different -0.310712 0.05 -6.21425 1.0000
    Same -0.310712 0.05 -6.21425 1.0000
    Thank you in advance!

    • Charles says:

      Ben, regarding whether you create one Fleiss’ Kappa or two (one for the cards that are different and another for the cards that are the same), this really depends on what you are trying to show. Either can be useful.
      I don’t understand why you need to set null/alternative hypotheses or why you have chosen these hypotheses.

  3. Leslie Hoy says:

    Hi, I am needing some advice please.
    I have a database of about 4000 plants and have information on each plant species from various sources. The information was obtained from about 30 different sources (raters). However, many sources (raters) have only provided information for about 10% of the plant database. Hence I have a lot of data on each plant but it is not from each rater. This implies that there are large gaps in the data (but sufficient for me to conclude certain answers). The information has been captured in Excel. This also implies that there are large gaps between data in each column as well as in the rows. It has been recommended that I use Cohen’s Kappa (to test the agreement between raters); however, in my reading I think that Fleiss’ Kappa is more suited. Can you advise on this? Also, can you advise me of a website or youtube site that will explain how to set up a data set such as this correctly and what process and formula I use (preferably a step by step guide)?

    • Charles says:

      Fleiss’s kappa can be used when you have many raters. It is not necessary that each rater rate each subject. The main criteria are that (1) each subject is rated by the same number of raters and (2) the ratings are categorical (i.e. not ordered, such as in a Likert scale or with a decimal value). If the ratings use a Likert scale or a numerical value then the ICC might be a better way to go.
      Before setting up your data, you need to be clear about what sort of data you have (i.e. number of raters, types of ratings, etc.).

  4. Maike says:

    Dear Charles Zaiontz,

    Thank you for this great explanation. I would really appreciate it if you could help me with the following:

    I have 76 raters who had to listen to two audio recordings (one native language & one foreign language) and transcribe what they heard. Some of the raters were primed. I categorized their transcripts as 1 = correct transcription, 2 = biased transcription and 3 = other.

    Now I would like to analyse if the agreement is higher in the primed group than in the control condition and if this agreement is even stronger if the raters were transcribing a non-native language.
    Would Fleiss Kappa be the right choice in this case?

    Many thanks in advance!

    Kind regards,

    • Charles says:


      You can calculate Fleiss’s kappa for the group of primed raters and then calculate another Fleiss’s kappa for the group of unprimed raters and then compare the results. The measurement that you get will be quite limited since you have only two subjects (i.e. the two audio recordings).

      When you try to compare agreement in transcribing native vs. non-native languages, you reduce the number of subjects down to 1, which will violate the assumptions of Fleiss’s kappa. In this case you might as well simply compare the variances of the ratings.

      The following is something I found on the Internet which may be useful


  5. SREEJA PS says:

    Thank you for your wonderful explanation. According to my data set I am getting a Z value of 17.8. Similarly, each category Z value is above 25. What should I infer from this?
    K= 0.437138277
    var= 0.000593
    se= 0.024351591
    k/se =17.95111749

    for each category

    k1 0.722940023 0.301151045 0.270887021 0.154798991 0.392763376 0.497409713 0.401404805 0.473193742 0.296002606
    var(k1) 0.000390278 0.0003871 0.000595739 0.001114339 0.000594559 0.000777897 0.000547207 0.000653159 0.000385771
    se(k1) 0.019755455 0.01967485 0.02440776 0.033381712 0.024383589 0.027890808 0.023392452 0.025556984 0.019641059
    z 36.59445027 15.30639563 11.09839766 4.637239447 16.10769338 17.83418051 17.15958671 18.51524193 15.0706028

    • Charles says:

      It is very difficult for me to answer your questions without some context. It would be better if you sent me an Excel file with your raw data and the analysis.

    • Charles says:


      Thanks for sending me the Excel spreadsheet. This makes everything much clearer.

      I am also getting z values which are high, even higher than yours. The reason for the difference is that I am calculating a different value for the standard errors. The standard errors I am using come from the following paper.

      Fleiss, J. L., J. C. M. Nee, and J. R. Landis. 1979. Large sample variance of kappa in the case of different sets of raters. Psychological Bulletin

      What is the source of the standard errors that you are using?


  6. John Song says:

    Thank you. It’s very useful for understanding the statistical procedure.

    On this page, there’s only an explanation for ‘Within or between Appraisers’.

    Could you add some explanation for ‘Appraisers vs standard’ ?

    It will be a lot help for me. Thank you

  7. M.C. says:

    Dear Sir.

    This website helped me a lot in understanding statistics. Thank you.
    Can you help me with the following?
    I want to measure the interobserver agreement of three observers(raters) for evaluation of gastric tumors.
    There are 445 subjects (n) and only two categories (whether benign or malignant).
    It seems that Fleiss’ Kappa calculates the overall agreement of the three observers.
    Can the Fleiss’ Kappa be used to calculate agreement between two observers?
    For example, between observer 1 vs observer 2, observer 2 vs observer 3, and observer 3 vs observer 1? (I’m guessing not. Then should I use weighted Cohen’s kappa three times instead?)

    Thank you.

  8. Rose Callahan says:

    Thank you for your explanation! I think I know why my previous attempts to calculate a kappa came out with nonsense, but I am not sure how to set up my data.

    I have three raters, each watches two videos for each of 50 subjects. On each of the two videos, they score two different aspects on different scales. On video one, they score 6 items for aspect “A” on a 1-5 scale, and another for aspect “B” on a 0-14 scale. On the second video, they again score 6 items for aspect “C” on a 1-5 scale, and another for aspect “D” on a 0-12 scale. So I really have 14 evaluations per subject per rater, and two of them have a large number of possible scores (13-15). We were adding all the “A” items to get a score out of 30, and similarly all the “C” items to get a score out of 30, to only have 4 evaluations per subject per rater. I am afraid this is artificially lowering our inter-rater reliability, because it seems much less likely that any 2 raters will all agree on a score out of 15 or 30 than a score out of 5.

    Do I need to run 14 kappa tests? And then, how do I get an overall inter-rater reliability rating from them? Or can the kappa test be set up to handle large numbers of possible scores? Should I be doing something totally different?

    • Charles says:

      From your description, you can calculate 14 ICC values (or 4 ICC values based on the combined scores). If you have a way of calculating a combined rating, you can use this to compute one ICC.
      If you have 14 ICC values, you can create a combined value based on what you plan to use the ICC for. E.g. you could use the minimum ICC value.

  9. Felix says:

    Hi Dr. Zaiontz,

    Great tutorial! Straight to the point and instantly gratifying! I have a 22 raters evaluating 50 different scenarios. They then categorize the scenarios into 6 possible groups (nominal variable) according to what they believe is the correct category. I have calculated the interrater reliability using fleiss kappa according to your methods but I am also interested in overall rater reliability of all of the judges compared to a gold standard. The gold standard rating was done through the consensus of three independent raters using a tool. What test would you run to compare rater reliability to a gold standard?

    thank you!

  10. Pingback: Interrater reliability or Kappa Statistic in excel - Page 2

  11. Fernando says:

    Thank you for your explanation, I have a question
    I’m following the formula for kj, but the values I get are not the same as your table, could help reviewing whether the values in your table are correct?
    Thank you

  12. Malin says:

    Dear Mr Zaiontz,
    Thank you for excellent explanations of the Kappa statistics.
    I have a few questions:
    Regarding the z-value, it seems to always compare the result to K=zero. You say so on this web-site, and my statistical software has the same default which I cannot change. I find that a bit strange, it is not always good enough to be just a little better than tossing a coin. To introduce a new method you might want to prove it to be significantly better than fair Kappa ( which often is set at K=0.40, but I am aware that definitions vary between different authors).
    If I want to compare Kappa values between two different sample populations, with the hypothesis that the Kappa of the test population is significant higher than Kappa of the reference population, I believe I have to calculate the z-score manually by subtracting the mean of the reference population from the test population and divide by SE of the test population. Then I calculate the p-value choosing a one-sided or two-sided test.
    Do you agree on this, or are you of another opinion?
    (I have a material with 3 observers/judges and 106 samples at two different departments with different training, and the size is chosen by a power calculation based on previous publications on the topic.)
    Best regards,

    • Charles says:

      Dear Malin,

      I don’t know any tests to determine whether kappa is higher than some value p. I am not sure whether any simple approaches will work as the case for testing the correlation coefficient reveals. The approach for determining whether a correlation coefficient is significantly different from zero is quite different from that used to determine whether the correlation coefficient exceeds some value. This is because in the first case, you can assume a normal distribution (at least under the null hypothesis) and use a t distribution, while in the second case you can’t assume a normal or t distribution.

      I also don’t know any methods for comparing two values of kappa based on two different measurements. The approach that you are considering may be appropriate if you can figure out what a suitable pooled s.e. should be. Again we can look at the correlation coefficient where the pooled s.e. is not so obvious.

      I have found the following research paper, which may be useful, although I have not read it myself.

      McKenzie DP. Mackinnon AJ. Peladeau N. Onghena P. Bruce PC. Clarke DM. Harrigan S. McGorry PD. Comparing correlated kappas by resampling: is one level of agreement significantly different from another?. Journal of Psychiatric Research. 30(6):483-92, 1996 Nov-Dec.


  13. Paula says:

    Hi Charles
    Thank you for a very useful website.
    I followed the instructions to calculate Fleiss Kappa, including s.s. and CIs, but I run into trouble with the p-value. My Excel does not seem to like the command with NORMDIST, and I tried following the steps to install your software but no luck so far. Therefore, is there a formula to calculate the p-value? Or how can I ‘unpack’ the command you suggested above? Is there a way to ‘tell’ Excel what to do in a different way?
    Many thanks

    • Charles says:

      The website uses the function NORMSDIST not NORMDIST. NORMSDIST(x) is equivalent to NORM.S.DIST(x,TRUE) on newer versions of Excel (although the older version still works).

      • Paula says:

        Thank you for your reply, Charles.
        And thank you for picking up my mistake – all sorted now! My p-value came up with 0 – by reading the previous comments on here, I’m guessing that this is acceptable. With regards to reporting the inter-rater reliability, would you say that it is better to report Kappa and CIs, or should I report the p-value as well? Sorry, I’m quite new to this.
        Many thanks again.

  14. David says:


    first of all i want to thank you for the article.

    I have a question concerning Fleiss’ Kappa. You wrote: “[…] in the case of Fleiss’ kappa, there may be many more than m judges and not every judge needs to evaluate each subject; what is important is that each subject is evaluated m times.”

    So my problem is this: we have around n=350 subjects, k=7 categories and m=6 judges. My problem is as follows: (a) not every judge evaluated each subject and (b) every judge can evaluate each subject into 2 categories (it’s a pre-evaluation; after the pre-evaluation we want to meet and discuss our results and finalize our categorization).

    Is there any way i can still work with Fleiss’ Kappa? Is in this way m=12, because each judge can evaluate each (problematic) subject into 2 categories?

    Thanks in advance!

    • David says:

      One remark concerning (b): a subject can be evaluated into 2 categories, but doesn’t have to.

    • Charles says:

      I don’t completely understand the scenario. You say that there are 7 categories but you also say there are 2 categories. Which is it? Perhaps it would be helpful to have an example with some data.

  15. elsayedamr says:

    I got it .. thank you very much .. I will read it and study it … if there any question I will send it to you .. thank you very much again.

  16. elsayedamr says:

    Thank you very much … I need help
    what do you recommend to carry out a content validity index

  17. Amanda says:

    Hi Charles.

    Thanks for the wonderful tools and explanations. I am trying to use the Real statistics free download to perform a Fleiss’ kappa analysis and running into some issues. When I copy the sample data from Figure 1 into Excel and use the Fleiss’s kappa option under the reliability procedures menu the output field comes back with every cell filled in #N/A. Can you suggest some solutions to this issue?


    • Charles says:

      Perhaps that is because you also used column A in the Input Range. The input should not include this column.

  18. fariha says:


    Can Cohen’s Kappa or Fleiss Kappa be used for any other statistical agreement data? Or is the ‘rater’ here only for human respondents? Can the ‘rater’ also mean a previous study?

    Thank you.

  19. sanitha says:

    Dear Dr. Leinder
    Thank you very much for the article. I am working on biotic indices. I want to see the agreement of the indices to evaluate the ecological status. The indices are classified into 5 categories (Bad, poor, Moderate, Good and high). I have about 610 samples.
    I would like to know which test to apply. Kappa or ICC.

    The categories used are i.e Bad=1, Poor= 2 Moderate= 3 etc.

    In general, how do i arrange the data. I have arranged the data as following
    Example 1
    Stn BI1 BI2 BI3 BI4 BI6 BI 7 BI8
    Stn1 1 2 3 2 3 4 3 3
    Stn 2 1 2 4 2 3 4 3 4

    Example 2
    Bad Poor Mod Good High
    Stn 1 1 2 4 1 0
    Stn 2 1 2 2 3 0

    Which is the correct method? If example 2 is correct, is there an easy method for it? I did a few samples manually, but 610 samples is too much.


  20. Maram says:


    Looking at the equation to compute kappa for the jth category, I can see that the denominator includes the number of subjects assigned to that category. I would like to know your opinion about the following conclusions I loosely deduce from this equation:
    Categories that are bigger in the data and get a higher number of subjects assigned to them, might probably have higher values of kappa than those much smaller categories (with very few subjects belonging to them) just due to the fact that they are bigger (i.e., kappa is biased towards size of categories in the data itself)

    • Maram says:

      Additionally, can you suggest any other inter-rater agreement measure that’s less sensitive to categories size (assuming 3 annotators)?

      • Charles says:

        As I said in my previous comment, I am not certain whether Fleiss’ Kappa is indeed sensitive to category size. In any case, the bigger issue is whether you have categorical data or ordered data. In the first case, Fleiss’ kappa is a reasonable choice. In the second case, the intraclass correlation is a reasonable choice.

    • Charles says:

      I haven’t had time to really look at the formulas to see whether your observation is true or not, but I noted that for Example 1, the exact opposite is true. The two categories with the highest number of subjects have the lowest kappa values.

  21. Maram says:


    Thanks for the great tutorial and tool. I am working with thousands of samples, 2 categories (yes and no) and 3 annotators. When I computed Fleiss Kappa using the Excel tool, the results I get show that the per-category kappa values are equal for both categories (e.g., “yes” kappa = “no” kappa = 0.47). I repeated that computation over a completely different set of samples but with same setting (two categories, 3 annotators) and I also get the same result (i.e., per-category kappa values are equal across the categories). Is this normal? Could you please explain why this is happening?

  22. azhar stapa says:

    Hello, I need help to figure out how to get the Fleiss kappa value if I have 5 expert evaluations and the scale used for the items is 1 to 10.

    item expert 1 expert 2 expert 3
    1 8 9 10
    2 8 8 8
    3 9 9 9
    4 10 10 10
    81 9 9 9


  23. Vincent says:


    I have a question in the case of reference value known.

    How to deal with the data matrix if we need to evaluate overall Fleiss Kappa when the reference value is known for each sample? Just assume we already know the actual kind of illness for these 12 patients, and now 6 judges give their judgment on each patient.

    This situation is actually a kind of Attribute MSA and MINITAB gives its answers, but I worked through the formulas several times and was not able to get the exact answer in this case. However, the derivation for the case where the reference value is unknown is the same as MINITAB’s (like your case here).

    Appreciate your reply.

    • Charles says:

      I believe that you are looking for an implementation of Attribute Gage R&R. I plan to look into doing this in one of the next releases of the software.

  24. Pitambar Behera says:

    Our P value in the row is as mentioned below:
    m 3
    n 1000
    pa 0.4815
    pe 0.253083444
    kappa 0.305812683
    s.e 0.010177511
    z 30.04788446
    p-value 0.0000

    alpha 0.05
    lower 0.285865127
    upper 0.325760238
    The P value in the column is as indicated below:
    q 0.074 0.206333333 0.334333333 0.290333333 0.094666667
    b 0.068524 0.163759889 0.222554556 0.206039889 0.085704889
    k 0.708131458 0.395456356 0.167245385 0.221833528 0.377709556
    s.e 0.018257419 0.018257419 0.018257419 0.018257419 0.018257419
    z 38.7859573 21.66003666 9.160406974 12.15032275 20.68800442
    p 0.00000 0.00000 0.00000 0.00000 0.00000

    Is there any issue if our p-value is zero?
    Please reply as soon as possible.

    • Charles says:

      The p-value can be zero (probably a very small number). I have not checked your calculations, but you can get a zero answer. You can check your answer by using Real Statistics’ KAPPA function.

      • Pitambar Behera says:

        Thanks a lot, sir. I have sent you an e-mail with the Excel sheets for the calculation. Can you please have a look and give the solution?

  25. Trina says:

    Hi! Thanks for this page.
    I have two questions:
    1. I would like to compare Fleiss kappa values, for example, 0.43 (95% CI 0.40, 0.46) and 0.53 (0.43, 0.56). The confidence intervals overlap – is this enough to say that the two Fleiss kappa values are not statistically significantly different from each other? I have read that this may not be the case, link: http://www.cscu.cornell.edu/news/statnews/stnews73.pdf
    2. I often see % agreement reported in articles when using Cohens kappa (2 observers). Is this something that is done with Fleiss kappa? If so, how is it calculated (is it number of cases where all observers rate the same/total number of cases)? I have noticed that the % agreement I obtain when calculating Light’s Kappa instead of Fleiss kappa is always higher…

  26. BC says:

    Hi Charles,

    I am getting unusual results following the guide when the raters are at almost perfect agreement.

    I have 9 cases, each with 4 categorical ratings by 8 raters.

    Cases\Ratings 1 2 3 4
    1 0 0 0 8
    2 0 0 0 8
    3 0 0 0 8
    4 0 1 0 7
    5 0 0 0 8
    6 0 0 0 8
    7 0 0 0 8
    8 0 0 0 8
    9 0 0 0 8

    The kappa is -0.014 and the p-value is 1.177. Will you be able to advise as I am uncertain what’s going on?



    • Charles says:

      Unfortunately, many of these measures give strange results in the extreme cases. The best way to look at these situations is that when you have almost perfect agreement you don’t really need to use Fleiss’s Kappa. Not a very satisfying answer, but it is true.

      • BC says:

        Thanks Charles for the clarification. It’s a good answer to clarify my doubts as I was sieving through all my equations in excel and found no errors.

        In a scenario like this, will analysing by ICC for inter-rater reliability (multiple rater) be better? What’s your recommendation?

        • Charles says:

          You would use the ICC with rankings that are quantitative. Fleiss’s kappa is used with categorical data. In situations like this you would simply note that there is obvious agreement and not use any of the usual measures.

  27. Catherine says:

    Thank you for this page. It has helped but I am now wondering if Fleiss’ kappa is the correct statistical test for our purposes. I am running a test-retest reliability study on a biological response to a stimulus. There are 4 response pattern categories. Six subjects underwent 4 repeated tests. We want to know if the same subject will attain the same test response to the same stimuli under the same conditions.

    Currently, my subjects are the rows, and the response pattern categories are the columns, with each test week as the ‘rater’.

    In your opinion, should we use Fleiss’ kappa or ICC?

    Thanks for this page, I am a stats novice, so am very grateful for having a clear ‘recipe’ to follow.


    • Charles says:

      If the data is ordered (especially with values such as 41.7, etc.) ICC will take the order into account. If the data is categorical (no order) then Fleiss’ is the approach to use.

      • Catherine Crofts says:

        Data is categorical so we will stick with Fleiss’ kappa. Thank you so much for your assistance and your prompt reply.


  28. Mattia says:

    Ok Charles thank you very much.

  29. Henk says:

    Dear Charles,

    i computed standard error and confidence interval according to your instructions. May I ask you where you found the formula for the confidence interval? Or did you come up with it on your own?

    Thanks a lot for providing this information!

  30. Mattia says:

    Thank you very much for your reply, Charles.

    I am not quite sure what I need to use, though.

    I will give an example to make the understanding of the case easier.

    I have 10 criteria that are hypothesised to be important in establishing a diagnosis. 30 experts will rate the importance of each criterion on a likert scale from 1 (not important) to 5 (very important).

    What I want to know, is what statistic (besides CV) I can use to evaluate the agreement between all the 30 experts on each single criterion.

    Many thanks!

    • Charles says:


      If you want to rate the agreement between all 30 experts across all 10 criteria then you could use the intraclass coefficient (ICC) as described on the webpage Intraclass Coefficient

      For your situation, instead of Judges A, B, C and D in Example 1, you would have the 30 experts (these are the columns), and instead of 8 wines you would have the 10 criteria. The table would be filled with ratings 1 to 5.

      If instead you want to evaluate agreement one criterion at a time, then I don’t have any specific advice to offer. I have come across the following paper which may be helpful. I have only read the Abstract and I am not sure about the paper’s quality or applicability to your problem.



  31. David says:

    Dear Charles,
    I am trying to identify the appropriate measure to use for my inter-rater agreement test. After much research, I believe it may be the Fleiss’ kappa measure, but am not sure as I have not been able to locate literature which suggests it can be used in my specific case presented below.

    I have developed a multiple choice test with 58 items based on teaching scenarios and what teachers might do in a particular classroom situation. Then I had 5 expert teachers take the test so that I could attain an agreed upon answer key. Now I want to run a test which provides me with a value which justifies the answer key as having substantial agreement (0.61 – 0.80), perfect agreement (0.81 – 1.00), etc.

    I cannot seem to find literature specific to obtaining inter-rater agreement of an answer key to a multiple choice test. In short, my question is what index is appropriate to test the level of agreement for this situation and what research is there to back that decision?

    Any advice would be greatly appreciated.

    • David says:

      Hi Charles,
      I would also like to clarify that the 58 multiple choice questions each have three choices, which only 1 is correct. These three choices are nominal or categorical in nature.

    • Charles says:

      Generally, in situations where it is clear what the correct answers to a multiple choice test are, you might run Cronbach’s alpha to check the internal consistency of the test, but it seems that you are testing for agreement among experts as to what the correct answers are, and the categories are nominal. This seems to be a good fit for Fleiss’ kappa.

      • David says:

        Thank you Charles.

        I have forged ahead on your recommendation. The kappa value that was returned using the program provided on the domain https://mlnl.net/jg/software/ira/ (Geertzen, 2012)
        was 0.77, which according to Fleiss (2003) represents excellent agreement beyond chance. Would you agree with this assessment of the returned k value? Have you seen the on-line program I refer to above?

        Thank you, David

        • Charles says:

          I am not familiar with the program you referenced.
          You could use the Fleiss’ Kappa option of the Real Statistics Reliability data analysis tool to calculate Fleiss’ kappa.

  32. Mattia says:

    Dear Charles,

    I am working on a study proposal for my dissertation which will need to get approval from a Ethics Committee, therefore I need to define analysis of data in advance.

    I will have n criteria, and raters will rate on a 1-5 Likert scale the importance of each criteria in establishing a diagnosis.

    In similar studies I have seen that percentage agreement, coefficient of variation and Kendall’s W are commonly used.

    From what I have been able to understand, however, Kendall’s W tests the level of agreement between n raters across ALL items, which means it is a measure of the overall agreement.

    However, what I want to know is which items consensus has been established on; in other words, I need a test that assesses agreement for each individual item (besides descriptive stats such as CV and percentage agreement).

    I have found that in a previous study kappa was used. However I am not quite sure this is appropriate, would you please shed a light on this?

    Thank you for your time!


    • Mattia says:

      Comment added:

      I have also found a study where ICC was used to assess agreement across all items and then kappa was used to assess agreement within each item.

      Is this appropriate?

      Why ICC for overall agreement and kappa for within item agreement?

      If you also had a reference for this that would be great.

      Many thanks.

      • Charles says:


        ICC is another way to measure agreement across all subjects. In Fleiss’s kappa the ratings can be thought of as categories (Biology, Math, Reading). Any ordering between the categories is not taken into account. Thus Fleiss’s kappa would generally not be used if the rankings were something like a Likert scale (1, 2, 3, 4, 5). ICC can be used in the case where the rankings are ordered or when the rankings are continuous numbers (34.5, 12.7, etc.).

        As I explained in my previous response, both of these measurements assess agreement across all the subjects.


    • Charles says:


      Cohen’s Kappa is used to assess the level of agreement between two raters of many subjects based on one rating criterion.
      Fleiss’s Kappa extends Cohen’s Kappa to the case where you have more than two raters (still only one criterion).

      Both of these tests measure agreement across all the subjects.

      If by item you mean subject it is pretty easy to assess agreement: for two raters, either they agree or they don’t, and even for more than two raters you don’t need any fancy measurements to assess agreement.


  33. Robert says:

    Hi Charles,

    I’m working on a content analysis project and would like to compute a Fleiss’ kappa but I’m not sure how to organize the data. There are five coders, and they’ve coded 20 Tweets for content in 19 categories. Coding values are nominal categories, one through three.

    Can you toss me a bone?



    • Charles says:

      Since the 5 raters are rating 20 Tweets, the Tweets are the rows and the sum of numbers in each row is 5.
      Since there are 3 nominal ratings 1 through 3, there are 3 columns. The cell in the ith row and jth column contains the number of the raters who have given the ith Tweet a rating of j.
      The problem is that you have 19 such ratings, one for each category (if I have understood the problem correctly), which means that you have to calculate Fleiss’ Kappa 19 times, once per category. There may be a multivariate Fleiss’ Kappa that does the job you want, but I am not familiar with such a measure.
      If instead you

      • Robert says:

        You do indeed understand the problem correctly. Currently I’m organizing the data manually — converting raw scores into categorical agreement figures (0-5-0, 1-4-0, and so on). It’ll take a while but I’ll give it a shot. (Isn’t there some way to get Excel to do this sort of computation so I can avoid human error?)

        • Charles says:

          There may be some easier way of doing this reorganization of the data in Excel, but I’d have to see how the data is formatted to really provide a good answer. But by the time I figured it out you will have probably reorganized the data manually. Sorry about that.

          • Robert says:


            Thanks for your mindful replies. Let me run one more thing past your eyes and then I’ll stop.

            My five coders coded several categories for variables we didn’t expect to see in very many of the 20 tweets, so all of the coders made substantially the same decisions of “not present” (or a code of “2”). Nearly every tweet was coded the same way by every coder, so the frequency grid looks like this for 15 of the Tweets:


            There are five cases of a coder that coded “present” (or a “1”), so five of the data lines look like this:


            What puzzles me is that the reported kappa is -0.053, and I don’t know how to interpret this result. In another category there is a single disagreement of this type, and the result is a kappa of -0.010.

            I’ve computed kappa for the data sets using two templates and I get the same thing so I think I’m computing it correctly. What does a kappa of -0.010 mean?

            Thanks in advance for your help!


          • Charles says:

            When most of the responses are the same, kappa will sometimes give strange results. To me it is a deficiency in the kappa statistic, but if, for example, there is only a single disagreement, there really isn’t much reason to use kappa.

  34. Robert says:

    Hi Charles,

    Need some advice. I don’t know how to organize data so I can compute a Fleiss’ kappa. Here’s what I’ve got:

    Five coders have coded 20 Tweets for content in 19 categories with a data range of 3 (1=present, 2=not present, 3=can’t tell). Right now the data exist in an Excel workbook with each coder’s data on separate sheets. I get this much: k=categories (there are three), and m=number of judges (there are five). What is n? The number of Tweets? The number of content categories in the Tweets? How do I convert the five spreadsheets into a grid that expresses agreements in each category?

    I can’t get my mind around it. Any tip you might have would be very much appreciated!


  35. Lam says:


    In my study, each subject is to be rated on a nominal scale from 1 to 5. If each subject was rated by 2 raters, but the 2 raters were drawn from a pool of 6 raters, I can apply Fleiss’s kappa in assessing the inter-rater reliability, right? Is there any requirement on the minimum times that each rater participated, i.e. should I exclude the rater that rated, say extremely, one subject only?

    Moreover, if each subject was rated by different number of raters, what method can I use instead? I find that you have mentioned an extension of Fleiss’s kappa as linked, http://conservancy.umn.edu/bitstream/99941/1/v03n4p537.pdf, but seems that it can be applied for dichotomous variable only and thus not suitable for me.

    Would Krippendorff’s alpha be appropriate? But I am not sure if I can treat those raters that do not rate the subject as missing data, since this type of “missing” is not random in nature.

    Thanks a lot in advanced.


    • Charles says:

      If I understand the problem correctly, you could use the intraclass correlation (ICC), but this measurement requires that all the subjects have the same number of raters. If the number of raters is not too unbalanced and the smallest number of raters is not too small, perhaps you can randomly eliminate some ratings to create a balanced model (alternatively you might be able to use some multiple imputation approach).
      I am not familiar with Krippendorff’s alpha, but I do believe there are techniques for handling missing data.

      • Lam says:

        Thanks Charles,

        For my own experience, I used to use ICC for continuous variables and kappa for discrete/ nominal variables. As the data this time is in nominal scale, may I confirm if ICC can still be applied?

        In addition, I could now probably fix the number of raters to be 3 each time (not the same raters). Would you suggest whether ICC or Fleiss’s kappa would be more appropriate?

        Sorry that I have got so many questions. Thanks again!


        • Charles says:

          To use ICC the data doesn’t need to be continuous, but it does need to be ordinal. E.g. you could use it with a Likert scale (say 1 to 7). If the data has no order then you shouldn’t use ICC. Fleiss’ kappa is only used with nominal data and order is not taken into account. If your data is unordered then you should use Fleiss’ kappa. If it is ordered you should use ICC.

  36. Philipp Lenarz says:


    1. I have a set of data with 38 raters rating the content of videos with regards to 8 items* that are all based on the same ordinal scale (very good, good, adequate, insufficient). My results are puzzling for me, because when I’m calculating Fleiss’ Kappa for each item individually (matrix: 1*38) the coefficients (mostly highly significant) are all smaller than when I’m calculating Kappa for all items together (matrix: 8*38). Shouldn’t the overall coefficient (K=0,31) be within the range of individual coefficients (K=-0,13 to 0,28)?

    2. Dr. Leidner wrote above that “There are no “criteria” for how to interpret Kappa – you can compare inter-annotator agreement, but to judge its absolute value depends on the task’s inherent difficulty. There are some “guidelines” in the literature, but they are bogus.” My problem is that I have to interpret the effect sizes and I have no idea if my effects are strong or weak. From the info I gave you, can you say if my coefficients are high enough to speak of a good rater agreement? And do you know of any helpful literature on this topic.

    Thanks for your help,

    *Their task is to assess the driving competency of driving license applicants that we filmed during a simulated test drive.

  37. Raphaela says:

    Sorry Charles, I didn’t notice that your email was available. Thanks

    • Charles says:


      I just checked the spreadsheets you sent me and see that the problem is that the sums of the scores in the rows are not all the same. If these sums are not the same you will see errors in all the cells.


  38. Raphaela says:

    Good morning,
    can I download RealStat again to calculate Fleiss’ Kappa (i.e. calculate the agreement among multiple judges and multiple variables)? Please let me know. Thank you

    • Charles says:

      Yes. The new version of this function is now available. You can also use the new Reliability data analysis tool.

      • Raphaela says:

        Charles, I got a problem. I have a 0-1 matrix, 11 variables and 15 judges. I applied Fleiss Kappa (Reliability data analysis tool) and it shows alpha 0,05 and the table below it has errors. Any idea? Thx

        • Charles says:

          Without seeing the spreadsheet I can’t tell what went wrong. The most common problem is that the input data is not formatted correctly. If you send me the spreadsheet I will be able to help you better.

  39. Raphaela says:

    ok. Charles, Thank you 🙂

  40. Raphaela says:

    Professor, how do I calculate Fleiss’ Kappa with Real Statistics? Do I click on the data analysis tool and then what?


    • Charles says:


      Currently you need to use the KAPPA function as described on the referenced webpage.

      Tomorrow I will issue a new release of the Real Statistics Resource Pack. This release has an improved version of the KAPPA function. In addition there will be a new Reliability data analysis tool which will calculate Fleiss’s Kappa.


  41. Ali Baykuş says:

    In Figure 2, B17 and B18 rows should be interchanged.
    Have a nice day.

  42. Stacy says:


    I am interested in determining the degree of agreement between 4 raters on an observation instrument that involves a rating system of 1 – 4 (levels of proficiency with a target attribute). I believe that Fleiss would be the appropriate measure in this case. Is this correct?


  43. Katharina says:

    Dear Dr. Leinder,

    I want to test the reliability of a newly constructed diagnostic interview. Therefore, raters watched videos of diagnosticians interviewing patients and answered for most of the items on a one-to-five likert scale and for some on a yes/no-basis (as the interviewers did). The agreement between the raters, measured by Fleiss Kappa, is thought to be my estimation of the interrater-reliability. Does that sound right so far?

    The problem is that some of the items are multiple choice items. Is it still possible to use Fleiss kappa? It seems the statistics programs can’t handle that… If not, could you give me a hint what kind of calculation I could use instead? Intraclass correlations (some items are based on categorical scales…)?

    And one last (general) question: Is there a relation between interrater reliability and criterion validity – other than reliability being a condition for validity? My prof is always talking about a paper called “study into criterium validity…” (based on the data I described). I didn’t challenge him so far cause I’m not sure I’m right…

    Thanks so much in advance.

    Best regards

    • Charles says:

      Dear Kathi,

      You have addressed your question to Dr. Leinder. Since I am not Dr. Leinder, perhaps this comment was not intended for me.

      Please give me an idea of some of the questions, especially the multiple choice questions.

      I don’t know which type of relationship your professor was talking about between interrater reliability and criterion validity.


  44. yvonne says:

    I am comparing the reliability of 3 different imaging tests for classification of a condition that has 10 different classes. There are 4 raters and 123 cases. Each reader reviews each case 3 separate times, each time using one of the three different models. Each model will have a Fleiss’ kappa. The reliability of the models can then be assessed comparing the three Fleiss’ kappas?
    There is no good gold standard for the condition.


    • Charles says:

      Hi Yvonne,
      If I understand your situation correctly, Fleiss’s Kappa will measure the degree to which the 4 raters agree on their classification of each imaging test. If your goal is to select which of the three imaging tests provides the most agreement between the raters then it may be worthwhile comparing the kappa measurements (you still wouldn’t know whether any differences are statistically significant). I wouldn’t necessarily call this a measure of the reliability of the models.

  45. dove says:

    Thanks so much for the timely response!

    I am attempting to assess the reliability of the assessment tool, not the observers (nurses), though I understand that I may need to assess the observers separately.

    What I understand you to say is that I can rate the symptoms as y/n and the rated questions (1 through 9) as Likert style. In doing so, Cronbach’s Alpha is the best tool to test all questions together given the goal is to test the tool.

    Thanks for your reply,
    greatly appreciated.


  46. dove says:


    So grateful for your web-site and resources. If I might test my idea:

    I am attempting to test the reliability of an assessment tool.
    The tool has nine (9) questions with responses range from 0-3 and
    Three (3) fields where the evaluator can select any combination of symptoms or select none of the symptoms from items on the symptom check list (8-10 items available to choose).

    As I understand it, I should separate out the responses into two data fields and measure the reliability of the 9 questions using the ICC (IntraClass Correlation) formula and test the second half (reported symptoms) using the Fleiss Kappa analysis formula.

    Is this correct?
    I greatly appreciate your response.

    • Charles says:

      It really depends on what you are trying to measure. E.g. you could use Cronbach’s alpha for the nine questions to determine the internal consistency of the questions. Are you evaluating the reliability of the questionnaire or the degree of agreement between the evaluators (I assume these are the nurses)?

      For the symptoms you could treat each symptom as a True/False question (True = patient assessed to have that symptom, False = assessed not to have that symptom). The analysis can then proceed as for the first group of questions (with my same question to you, namely what are you really trying to assess?).


  47. I did the observation for a process in an operating room. I have 2 sets of observers: assigned observers and participant observers. The results for 14 categories were as follows:
    1 2
    1 230 250
    2 259 260
    3 260 260
    4 249 251
    5 260 260
    6 238 256
    7 250 252
    8 217 229
    9 212 218
    10 258 256
    11 189 197
    12 245 248
    13 245 248
    14 254 258
    15 260 260

    and the results of Fleiss’s Kappa was:
    Fleiss Kappa for 480 raters = 0.0402 SE = 0.0008
    95%CI = 0.0387 to 0.0417.

    So, could you please explain what the results mean? Is there agreement between the 2 observers?

    • Charles says:


      There is no clear cut agreement as to what is a good enough value for Fleiss’s kappa, but a value of 0.0402 is quite low, indicating that there is not much agreement.

      You mention in your email that there are 14 categories and 480 raters, but I don’t understand the results for the 14 categories you listed, starting with the fact that there are 15 categories in your list.


  48. Alessandra Maggioni says:

    Dear Sir,
    I’m having some troubles in calculating the kappa for my sample. I have 189 patients, 3 evaluators and 2 possible diagnoses but I am not able to apply the formulas you suggest, maybe because of the different translations of the formulas themselves (I’m Italian). I am really having many problems and I need this kappa for my final project at university so I was wondering if you could help me or if you have an excel spreadsheet which I can put my numbers in, please. Thank you very much

    • Charles says:

      Dear Alessandra,

      If I am understanding the problem correctly, your input should be similar to Figure 1 in the referenced webpage with two data columns (one for each diagnosis) and 189 rows (one for each patient). The sum of the values in each row should be 3 (for the 3 evaluators).

      You can get a copy of the worksheet I used to create Figure 1 by downloading the Worksheet Examples file at http://www.real-statistics.com/free-download/real-statistics-examples-workbook/. Once you have this you can modify it to fit your specific problem.


  49. Nara says:

    Dear Sir,
    Could the number of judges (m) be different for each subject?
    for example:
    test subject #1 is evaluated by 5 psychologists
    test subject #2 is evaluated by 6 psychologists

    and so on.

  50. lena says:

    is it possible to use Fleiss Kappa for 45 respondents rating 5 different interventions each on a total of ten??

  51. Eloi says:

    Hi, I am having problems in Excel whenever I write the “,”
    So instead I am writing “;” but I don’t think it is the same since at some point the numeric result becomes #VALUE!

    Please help. Thanks

    • Charles says:

      Hi Eloi,
      I am not sure what is happening, but I know that for languages that use a comma as the decimal point, Excel uses a semi-colon instead of a comma to separate arguments in a function. This may be what is causing the problem you have identified. In this case, depending on what you are trying to do you may need to use a semi-colon instead of a comma. Alternatively you can change the number/currency defaults in Windows (via the Control Panel) so that a period and not a comma is used as the decimal point.

  52. Philipp says:

    I should correct that the no. of subjects is what I need to calculate. 4 is what we’re currently planning with but this might not be enough to prove a rater agreement (Fleiss’ Kappa) of .7 or higher at a .95 confidence level.


  53. Philipp says:

    Please excuse if I’m not using the right terms in English.
    For a study that I’m conducting I need to calculate the minimum sample size for Fleiss’ Kappa for the following parameters:
    no. of raters: 50
    no. of subjects: 4
    Fleiss’ Kappa >= .7
    alpha = .05
    power = .80

    Is there a formula for this like for other statistical measures? Does the standard error decrease with the no. of raters and with the no. of subjects? I’ve been researching the web on this for a while but cannot find the information I need.

    Thanks for your help,

  54. Colin says:


    The formula of kappa for the jth category needs a small change. I think the summation should be from i = 1 to i = n.


    • Charles says:

      You are correct. I have just made the change that you have suggested. Thanks as always for identifying this change.

  55. alvin says:

    hi there,
    Im havin some little problem hope u can help me out, i wan to calculate the inter rater reliability so i have theree experts with 10 question that examined the test so i need 3 rows and 10 colum rite? but the question is can i use the fleiss’s kappa? tq

    • Charles says:


      I am not sure what you mean by “theree experts with 10 question that examined the test”, and so I can’t tell for sure whether this is the right tool. In any case the tool would be used differently than the approach that you described.

      The implementation of Fleiss’s kappa that I have included in the Real Statistics Resource Pack has one row for each subject being evaluated (perhaps these are the questions in your case) and one column for each rating (e.g. if the questions are answered using a Likert scale of 1 to 5 then there would be 5 columns). Since there are 3 experts the sum of the values in each row would be 3.

      You could also use the intraclass coefficient to evaluate ratings. See the webpage http://www.real-statistics.com/reliability/intraclass-correlation/. It all depends on what you are trying to accomplish.


  56. Sylvia says:

    I have a questionnaire which different respondents fill in to provide a person with feedback. Let’s say I have a questionnaire with 20 questions and 7 people answer the questionnaire using a scale from 1 to 6. Can I use Fleiss’ Kappa to show the consistency in their rating among the respondents? I thought in my example the rows would be the questions and the columns would be my scale (1 to 6).
    Does Fleiss’ Kappa only show agreement (i.e. same rating or different rating) or is it also able to show the quality of agreement (i.e. will agreement be higher if the ratings are different but closer together than if they are from opposite ends of the scale)?

    • Charles says:

      If I understand correctly you will have one column for each question. I.e. 7 rows and 20 columns. Fleiss’s kappa does not take into account the level of agreement/disagreement between the raters. If this is important you might want to use the ICC.

  57. Kiki says:

    Dear Dr Leinder
    Thank you so much for the article.
    My work is on pain reactions in preterm babies. I have 4 raters who have rated the same 20 babies across 20 variables with a yes or no (whether they see the particular pain reaction or not).
    May I use Fleiss kappa for my scenario as you have explained? If I can, then may I interpret it the same way and give an overall kappa and kappa for each variable as you have done? Your answer would be greatly appreciated.

    • Charles says:

      Based on my understanding of your question, the answer is yes.

    • Arnie says:


      Did you figure out how to adapt the provided Fleiss Kappa spreadsheet to your needs? If I understand the structure of your study correctly, I have a similar structure in a job analysis study I am conducting. I have three raters. Each rater will rate 40 interview transcripts for the presence or absence of 50 employee characteristics. The characteristics are not mutually exclusive. All 50 characteristics will receive a rating of 1 (characteristic is present) or 0 (characteristic is not present) from all three raters. I think this means I have to run 50 Fleiss Kappa calculations to get rater agreement on each characteristic and then average the Fleiss Kappas to get an overall rater agreement. I’m not sure how to do this most efficiently in Excel.

  58. Mette Johansen says:

    Thank you for a great example of how to use Fleiss Kappa in Excel!
    I would like to estimate standard error of Fleiss Kappa and then the 95 % CI. How can I do that in Excel?

    • Charles says:

      Hi Mette,
      I will update the webpage explaining how to estimate the standard error and confidence interval. You should see this within the next day or two.

    • Charles says:

      If you look at the referenced page you will see how to calculate the standard error and 95% CI. The example on that page has also now been added to the Examples Workbook (which you can download for free).

  59. Colin:

    There are no “criteria” for how to interpret Kappa – you can compare inter-annotator agreement, but to judge its absolute value depends on the task’s inherent difficulty. There are some “guidelines” in the literature, but they are bogus.

    • Pedram says:

      It is absolutely correct.
      Could you please give me a specific reference for such a statement? I do agree with what you have mentioned; however, I do not have access to the reference for your statement. I did search for “Colin” but I did not find anything useful!

  60. Colin says:

    What are the criteria for interpreting Fleiss’ Kappa? Are they the same as for Cohen’s Kappa: less than 0% no agreement, 0-20% poor, 20-40% fair, 40-60% moderate, 60-80% good, 80% or higher very good?
