Cohen’s kappa takes into account disagreement between the two raters, but not the degree of disagreement. This is especially relevant when the ratings are ordered (as they are in Example 2 of Cohen’s Kappa).

To address this issue, there is a modification to Cohen’s kappa called **weighted Cohen’s kappa**. The weighted kappa is calculated using a predefined table of weights that measure the degree of disagreement between the two raters: the higher the disagreement, the higher the weight. The table of weights should be a symmetric matrix with zeros on the main diagonal (i.e. where the two judges agree) and positive values off the main diagonal. The farther apart the judgments are, the higher the weight assigned.

We show how this is done for Example 2 of Cohen’s Kappa, where we have reordered the rating categories from highest to lowest to make things a little clearer. We will use a linear weighting, although higher penalties can be assigned, for example, to the Never × Often assessments.

**Example 1**: Repeat Example 2 of Cohen’s Kappa using the weights in range G6:J9 of Figure 1, where the weight of the Never × Often disagreement is twice the weight of the other disagreements.

**Figure 1 – Weighted kappa**

We first calculate the table of expected values (assuming that outcomes are by chance) in range A14:E19. This is done exactly as for the chi-square test of independence. E.g. cell B16 contains the formula =B$10*$E7/$E$10.
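The expected-value calculation can also be sketched outside Excel. The following is a minimal Python illustration, not part of the Real Statistics software; the 3 × 3 table of counts is hypothetical (it is not the data from Figure 1) and the function name `expected_counts` is our own.

```python
# Hypothetical 3 x 3 table of observed counts
# (rows: rater 1, columns: rater 2). Not the data from Figure 1.
observed = [
    [10, 5, 1],
    [4, 12, 6],
    [2, 3, 7],
]


def expected_counts(table):
    """Expected counts under chance agreement: row total * column total / n,
    as in the chi-square test of independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print([round(x, 3) for x in row])
```

Each entry plays the same role as the Excel formula =B$10*$E7/$E$10: the column total times the row total, divided by the grand total.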

The weighted value of kappa is calculated by first summing the products of the elements of the observation table and the corresponding weights, and then dividing by the sum of the products of the elements of the expectation table and the corresponding weights. Since the weights measure disagreement, weighted kappa is equal to 1 minus this quotient.

For Example 1, the weighted kappa (cell H15) is given by the formula

=1-SUMPRODUCT(B7:D9,H7:J9)/SUMPRODUCT(B16:D18,H7:J9)
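As a cross-check of the SUMPRODUCT formula, here is a minimal Python sketch of the same calculation. The observed counts are hypothetical (not the data from Figure 1), the function name `weighted_kappa` is our own, and the weights are linear disagreement weights for three ordered categories.

```python
def weighted_kappa(observed, weights):
    """Weighted kappa using disagreement weights (zeros on the main diagonal)."""
    k = len(observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Weighted sum of the observed counts (numerator SUMPRODUCT).
    obs_sum = sum(observed[i][j] * weights[i][j]
                  for i in range(k) for j in range(k))
    # Weighted sum of the expected counts (denominator SUMPRODUCT).
    exp_sum = sum(row_totals[i] * col_totals[j] / n * weights[i][j]
                  for i in range(k) for j in range(k))
    return 1 - obs_sum / exp_sum


# Hypothetical observed counts and linear weights |i - j|.
observed = [[10, 5, 1], [4, 12, 6], [2, 3, 7]]
linear = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]

kappa_w = weighted_kappa(observed, linear)
print(round(kappa_w, 4))
```

Because the same factor n would divide both sums, it makes no difference whether counts or proportions are used in the quotient.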

Note that if we assign all the weights on the main diagonal to be 0 and all the weights off the main diagonal to be 1, we have another way to calculate the unweighted kappa, as shown in Figure 2.

**Figure 2 – Unweighted kappa**
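The equivalence with the unweighted kappa can be checked numerically. The sketch below uses a hypothetical 3 × 3 table of counts (not the data from Figure 2) and compares the 0/1-weights calculation with the usual formula (p_o − p_e)/(1 − p_e); all variable names are our own.

```python
# Hypothetical observed counts (rows: rater 1, columns: rater 2).
observed = [[10, 5, 1], [4, 12, 6], [2, 3, 7]]
k = len(observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Weights: 0 on the main diagonal, 1 everywhere else.
weights = [[0 if i == j else 1 for j in range(k)] for i in range(k)]

obs_sum = sum(observed[i][j] * weights[i][j]
              for i in range(k) for j in range(k))
exp_sum = sum(row_totals[i] * col_totals[j] / n * weights[i][j]
              for i in range(k) for j in range(k))
kappa_from_weights = 1 - obs_sum / exp_sum

# Usual unweighted kappa from observed and chance agreement.
p_o = sum(observed[i][i] for i in range(k)) / n
p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
kappa_classic = (p_o - p_e) / (1 - p_e)

print(round(kappa_from_weights, 6), round(kappa_classic, 6))
```

The two values agree because, with 0/1 weights, the weighted sums are simply the observed and expected amounts of disagreement, 1 − p_o and 1 − p_e.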

**Observation**: Using the notation from Cohen’s Kappa where $p_{ij}$ are the observed probabilities, $e_{ij} = p_i q_j$ are the expected probabilities and $w_{ij}$ are the weights (with $w_{ij} = w_{ji}$), then

$$\kappa_w = 1 - \frac{\sum_i \sum_j w_{ij}\, p_{ij}}{\sum_i \sum_j w_{ij}\, e_{ij}}$$

The standard error is given by the following formula:

where

Note too that the weighted kappa can be expressed as

where

From these formulas, hypothesis testing can be done and confidence intervals calculated, as described in Cohen’s Kappa.
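Once kappa and its standard error are known, the interval and test can be sketched as follows. This is only an illustration under stated assumptions: it uses the normal approximation, the standard error value is hypothetical (it is not computed from the formulas on this page), and the same standard error is used for the test of kappa = 0, which is a simplification. The function name `kappa_interval` is our own.

```python
from statistics import NormalDist


def kappa_interval(kappa, se, alpha=0.05):
    """Two-sided 1 - alpha confidence interval, normal approximation."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return kappa - z_crit * se, kappa + z_crit * se


kappa = 0.500951  # weighted kappa reported for Figure 1
se = 0.1          # hypothetical standard error, for illustration only

lower, upper = kappa_interval(kappa, se)

# Assumed test of H0: kappa = 0 using the same standard error.
z = kappa / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(round(lower, 4), round(upper, 4), round(p_value, 6))
```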

**Real Statistics Function**: The Real Statistics Resource Pack contains the following function:

**WKAPPA**(R1, R2, *lab, alpha*) = returns a 4 × 1 range with values kappa, the standard error and left and right endpoints of the 1 – *alpha* confidence interval (alpha defaults to .05) where R1 contains the observed data (formatted as in range M7:O9 of Figure 2) and R2 contains the weights (formatted as in range S7:U9 of the same figure).

If range R2 is omitted, it defaults to the unweighted situation where the weights on the main diagonal are all zeros and the other weights are ones. Range R2 can also be replaced by a number *r*. A value of *r* = 1 means the weights are linear (as in Figure 1), and a value of 2 means the weights are quadratic. In general, the equivalent weights range would contain zeros on the main diagonal and values |*i* − *j*|^{r} in the *i*th row and *j*th column when *i* ≠ *j*.
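To make the role of *r* concrete, the following Python sketch builds the equivalent weights table for a given number of categories. It is an illustration of the rule just described, not the WKAPPA implementation; the function name `power_weights` is our own.

```python
def power_weights(k, r):
    """k x k disagreement weights: 0 on the diagonal, |i - j|**r elsewhere."""
    return [[0 if i == j else abs(i - j) ** r for j in range(k)]
            for i in range(k)]


linear = power_weights(3, 1)     # r = 1: linear weights
quadratic = power_weights(3, 2)  # r = 2: quadratic weights

for name, table in (("linear", linear), ("quadratic", quadratic)):
    print(name)
    for row in table:
        print(row)
```

With three categories the linear table gives the most distant pair twice the weight of adjacent pairs, while the quadratic table gives it four times the weight.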

If *lab* = TRUE then WKAPPA returns a 4 × 2 range where the first column contains labels which correspond to the values in the second column. The default is *lab* = FALSE.

**Observation**: Referring to Figures 1 and 2, we have WKAPPA(B7:D9,G6:J9) = WKAPPA(B7:D9,1) = .500951 and WKAPPA(M7:O9) = .495904. If we highlight a 4 × 2 range and enter WKAPPA(B7:D9, G6:J9,TRUE,.05) we obtain the output in range Y7:Y10 of Figure 3. For WKAPPA(M7:O9,,TRUE,.05) we obtain the output in range AA8:AB11 of Figure 7 of Cohen’s Kappa.

**Real Statistics Data Analysis Tool**: The **Reliability** data analysis tool supplied in the Real Statistics Resource Pack can also be used to calculate Cohen’s weighted kappa.

To calculate Cohen’s weighted kappa for Example 1, press **Ctrl-m** and choose the **Reliability** option from the menu that appears. Fill in the dialog box that appears (see Figure 7 of Cronbach’s Alpha) by inserting B7:D9 in the **Input Range** and G7:J9 in the **Weights Range**, making sure that **Column headings included with data** is not selected, and choosing the **Weighted kappa** option. The output is shown on the left side of Figure 3.

Alternatively you can simply place the number 1 in the **Weights Range** field. If instead you place 2 in the **Weights Range** field (quadratic weights) you get the results on the right side of Figure 3.

**Figure 3 – Weighted kappa with linear and quadratic weights**

Hi Charles,

I would like to calculate test-retest agreement per item for a 7-point Likert scale measure (strongly agree to strongly disagree).

How to weight each response?

Kate,

You have many choices for this, but one approach is to use C(|k-h|+1,2) as the weight where h and k are the two Likert scale measurements.

If you treat the Likert scale as interval data, then you can use (h-k)^2 instead.

In the next release of the Real Statistics software I plan to introduce Gwet’s AC2 and Krippendorff’s alpha measurements of interrater reliability. These types of weights will be discussed on the website at that time. The new release should be available shortly.

Charles
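The two weighting schemes mentioned in the reply above can be tabulated for a 7-point scale with a short Python sketch. This is only an illustration; the variable names are our own, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

points = 7  # 7-point Likert scale, coded 1..7

# Ordinal-style weights C(|k - h| + 1, 2).
ordinal = [[comb(abs(k - h) + 1, 2) for k in range(1, points + 1)]
           for h in range(1, points + 1)]

# Interval-style (quadratic) weights (h - k)^2.
quadratic = [[(h - k) ** 2 for k in range(1, points + 1)]
             for h in range(1, points + 1)]

print("C(|k-h|+1,2):", ordinal[0])    # weights against a rating of 1
print("(h-k)^2:     ", quadratic[0])
```

Both tables are symmetric with zeros on the main diagonal, so either could be supplied as a weights range.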

Dear Charles,

Can you please share any idea about the following concerns:

I wanted to find the inter-rater reliability index with two raters of a math test grading scheme. I made codes, each of which considers both the score to be credited for the answer and the strategy used (e.g. code 21 means a score of 2 points, with strategy A used, code 22 means a score of 2 pts. with strategy B). With this grading scheme, I end up with a lot of categories (codes) to choose in rating student’s answer. But after comparison of codes, I noticed that more than half of these codes received 0 frequency from both raters (because students’ answers are almost similar).

Concern 1: Can I exclude those categories (codes) with 0 frequency from both raters in making the contingency table?

Concern 2: I would want to verify – is weighted Cohen’s Kappa suitable to use for my type of data?

Your response would be greatly appreciated. Thanks!

Cherel,

1. Yes, you can exclude categories with 0 frequency from both raters

2. Yes, you can use Cohen’s kappa if you can figure out what the suitable weightings are.

Charles

Hello Charles,

First of all, thank you for responding to my previous post.

Regarding your reply for concern 2, may I know any method to use to determine what weightings I should use? It’s unclear to me what’s the basis for choosing linear or quadratic weights.

Thank you for the help!

Cherel,

You can use weightings if there is order in the ratings. E.g. if your judges are assigning ratings of Gold, Silver or Bronze then perhaps you would assign a weight of 5 for Gold, 3 for Silver and 2 for Bronze.

Charles

Hello Charles,

I think I already got what weighting means. Thank you for clarifying it to me.

With the type of codes I have, I believe I should do unweighted Cohen’s kappa. However, is it possible that I will break my codes into two categories, i.e. score and strategy used, and then do a separate Cohen’s kappa computation for each? The codes for the strategy used will then be analyzed using Cohen’s kappa, while the scores will be analyzed with linear weighted Cohen’s kappa. Am I thinking it right?

Thanks!

Cherel,

This sounds like a reasonable approach. I can’t say whether it does what you need, since I don’t know this, but it certainly could be a good approach.

Charles

Dear Charles

I am interested to calculate kappa squared in mediation analysis. Could you please guide me for the calculation in excel?

Bahaman,

The Real Statistics webpage http://www.real-statistics.com/reliability/cohens-kappa/ describes Cohen’s kappa.

If instead you are looking for Cohen’s kappa squared as an effect size for mediation analysis, please see the following:

http://lib.ugent.be/fulltxt/RUG01/002/214/023/RUG01-002214023_2015_0001_AC.pdf

http://quantpsy.org/pubs/preacher_kelley_2011.pdf

http://etd.library.vanderbilt.edu/available/etd-05182015-161658/unrestricted/Lachowicz_Thesis_20150526.pdf

Charles

I cannot perform the control+m in excel to know the std. error and confidence interval. Is there any way to compute the standard error and confidence interval using excel? thank you.

Iza,

You need to install the Real Statistics software to use the control+m features.

The referenced webpage explains how to calculate the standard error and confidence interval in Excel (without using the Real Statistics software), but it is not so easy to do.

Charles

Thank you for your helpful information 🙂

I calculated the formula for weighted kappa, and I wonder at what level of significance (p value) this kappa coefficient is statistically significant.

Heemin,

As usual, alpha = .05 is used to measure statistical significance. Remember though that significance means significantly different from zero.

Charles

Hello, Charles.

First of all, thank you for your quick reply! 🙂

I have some questions on what you wrote above.

Does alpha = .05 mean the p-value?

Then how can I figure out whether my kappa coefficient values are under the significance level of .05?

Are alpha and p-value different?

To elaborate on my questions, I obtained weighted kappa coefficient values using the formula you wrote above. However, I am not sure whether these values are statistically significant. When I use SPSS to calculate unweighted kappa, the p-values are presented in the table. With these, I can reject the null hypothesis and conclude that the two variables agree to the degree of the obtained value. However, using Excel I’m not sure whether my weighted kappa values are statistically significant or not. I hope my questions are clear to you ;'(

Thanks!

See the webpage referred to in my previous response.

Charles

Heemin,

alpha is the goal and the p-value is the actual value calculated. See the following for a further explanation:

Null and Alternative Hypotheses

Charles

Hello Charles,

I would like to evaluate the agreement of 2 raters rating surgical performance. The rating is from 1 to 5, 1 being the poorest and 5 being the best. Can I use weighted kappa? How can I do that in SPSS? If I have 3 raters, can I still use weighted kappa? Thank you!

Jian,

With two raters, it seems like a fit for weighted kappa. I don’t use SPSS and so I can’t comment on how to do this in SPSS.

With three raters, you can’t use the usual weighted kappa.

Charles

Hi Charles,

Thank you for your reply. I really want to learn how to do weighted Kappa on excel, but I cannot understand the tutorial above. Is it possible for me to face time with you so you can help me with this? Thank you

Hi Jian,

Sorry but I don’t have enough time to do this sort of thing. I am happy to answer specific questions though.

What don’t you understand from the tutorial?

Charles

Thank you for the ever useful and interesting explanations you provide.

If we have 2 raters who were asked to rate results according to mild, moderate, severe scale. Let us assume that all the assumptions are met and this is a straightforward weighted kappa analysis.

I understand that the agreement between these 2 raters can be calculated using weighted kappa, but is it possible to show the rate of agreement between the raters AT EACH LEVEL of the grading scale?

for example, can we find out if the agreement was higher when the ratings were mild compared to other grades?

Hamid,

Before I can answer your question, I need to understand what you mean by “can we find out if the agreement was higher when the ratings were mild compared to other grades?” Obviously if both raters rated mild they are in complete agreement. Perhaps you mean what “can we find out if the agreement was higher when one rater gives a mild rating compared to the level of agreement when one rater gives a moderate or severe rating?” If this is what you mean, then you need to define how to measure the level of agreement when one rater assigns a mild rating. What sort of measurement seems appropriate?

Charles

dear

thanks for your helpful website.

I am wondering if I can use Fleiss’ kappa for a 5-point Likert scale (Strongly disagree, disagree, neither, agree, strongly agree) instead of weighted kappa, because we are looking for agreement between 10 raters (testing the content validity), and in our research we are not concerned about the weights.

Enas

Hi Charles,

I have some categorical/ nominal data. The categories are “Certain”, “Probable”, “Possible”, “Unlikely” and “Unassessable”. There is no order to these. I have been advised to use weighted quadratic kappa by a statistician however I am slightly unsure as to how this would apply for my data. Would it be better to use linear weighted kappa?

Thanks

Hanna

Hanna,

I can certainly see an order “Certain” > “Probable” > “Possible” > “Unlikely” > “Unassessable”, although I am not sure about the last of these categories. Since there is an order you could use weighted kappa.

Charles

Thank you for your response.

Just to give a bit of context to my data- Pharmacists rate potential adverse drug events using the categories I have mentioned above.

The difference between the 1st and 2nd categories would have the same level of importance (if not more) than the difference between the 2nd and the 3rd categories. But the difference between the 1st and 4th categories would be more important than both of the differences mentioned above. So, would quadratic weights be more appropriate than linear weights here?

Many thanks

Hanna

Hanna,

Keep in mind that you can choose whatever weights you want. They don’t have to be linear or quadratic.

Charles

Hi Charles, I’m a newbie in SPSS. Is it possible to do the whole thing in SPSS? Like calculation of weighted kappa, drawing the table etc.? If so, what would be the command? How do I determine quadratic weights for weighted kappa?

Sorry Arthur, but I don’t use SPSS.

Charles

Hi Charles,

Thank you for the useful guidance here.

I have two questions.

First, if someone calculates unweighted Cohen’s K when actually their data are ordinal so it would be more appropriate for them to calculate weighted Cohen’s K, would the result be a more or less conservative estimate of reliability?

Second, I have some categorical data, with 4 categories. Three of these are ordered, e.g., low, medium, high, but one of them is a sort of “other” category and so not really ordered. Would you classify this variable as ordinal or nominal?

I have chosen to classify it as nominal, and therefore, have calculated unweighted Cohen’s K, yielding a significant k value of .509 (hence my first question on interpretation).

Thanks so much.

Amber

Amber:

1. I don’t have any reason to believe that it would be a conservative estimate. It would be a different estimate. Better to use the value for the weighted kappa.

2. I guess it depends on what “other” really means. If it means “I don’t know”, you might be better off dropping those values and treating the variable as ordinal, especially if only a small percentage of the respondents answer “other”.

Charles

Thanks to anyone who will answer.

I have a table 2 x 2 with this data:

16 0

4 0

With a calculator I get k = 0 with CI from 0 to 0.

it looks so strange, is it correct?

Edward,

Sorry, but I don’t understand the point of analyzing such lopsided data.

Charles

Hi Charles,

I’m a complete beginner. I have to apply Cohen’s kappa to some 2 × 2 tables showing the adherence to a specific protocol. For example, for the data in the table

53 1

1 5

I get a kappa of +0,815 with CI: 0,565 to 1,000

But with the data

16 0

4 0

i get k=0 with CI: 0 to 0

I mechanically applied this calculator for my thesis; I’m really inexpert. You say these are lopsided data, so can’t I apply Cohen’s kappa?

Edward,

You can use Cohen’s kappa.

I get the same results for the first example.

For the second example, I get kappa = 0 with a CI of -.87 to +.87.

Charles
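The kappa values for the two 2 × 2 tables in this exchange can be reproduced with a short Python sketch that uses 0/1 disagreement weights. Only the point estimates are computed here; the confidence intervals quoted above are not reproduced. The function name `kappa_2way` is our own.

```python
def kappa_2way(table):
    """Unweighted Cohen's kappa via 0/1 disagreement weights."""
    k = len(table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Observed and expected counts off the main diagonal (disagreements).
    obs_off = sum(table[i][j] for i in range(k) for j in range(k) if i != j)
    exp_off = sum(row_totals[i] * col_totals[j] / n
                  for i in range(k) for j in range(k) if i != j)
    return 1 - obs_off / exp_off


first = kappa_2way([[53, 1], [1, 5]])
second = kappa_2way([[16, 0], [4, 0]])

print(round(first, 3))   # 0.815
print(round(second, 3))  # 0.0
```

In the second table one rater used a single category for every subject, so the observed disagreement equals the disagreement expected by chance and kappa is exactly zero.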

Thank you so much Charles, you’re very kind.

Hi Charles,

thank you for this great explanation.

For a paper I calculated the Weighted K (wK). However I’m wondering how to interpret my wK…

Can I interpret my wK just like the unweighted K? So, for example, a wK greater than 0.61 corresponds to substantial agreement (as reported in https://www.stfm.org/fmhub/fm2005/May/Anthony360.pdf for the unweighted k)?

Thank you for your help!

Daniele.

Daniele,

Yes, I would think that the interpretation of weighted kappa is similar to unweighted kappa. Keep in mind that not everyone agrees with the rankings shown in Table 2 of the referenced paper (nor any other scale of agreement).

Charles

Thanks for your reply!

Yes, I know that there isn’t an agreement about the rankings… I found several types of ranks for the kappa interpretation.

Can you suggest a reference of reliable (in your opinion) rankings?

My field is neuroimaging (medicine), so it is not supposed to be an “exact science”…

Best regards,

Daniele.

Daniele,

The different interpretations are all fairly similar. I can’t say that one is better than another.

Charles

Can this concept be extended to three raters (i.e., is there a weighted Fleiss kappa)?

Rose,

I don’t know of a weighted version of Fleiss kappa or a three rater version of weighted kappa. Perhaps ICC or Kendall’s W will provide the required functionality for you.

Charles

Dear Charles,

Thank you for providing the overall information on kappa.

I’m reviewing the statistical analysis used in reliability studies, and kappa is widely used in them. However, in many cases using ordinal scores, the authors just said that kappa was used in the study. I’m wondering whether I can tell if they used weighted or unweighted kappa in those papers when the exact name is not mentioned. In addition, if they used the weighted kappa, can I distinguish the type of weighted kappa (linear or quadratic) when it is not mentioned and the statistical table shows the K value only?

Regards,

Miran

Miran,

I don’t know how you could determine this. My guess (and this is only a guess) is that unless they said otherwise they used the unweighted kappa.

Charles

Dear Charles,

Thank you for your answer. I’ve judged that a paper used the unweighted kappa unless they mention about the weighted kappa so far.

Regards,

Miran

I am curious about the application of weighted kappa in the following scenario. I had two raters complete a diagnostic checklist with 12 different criteria. The response to each criteria was either 1 or 0 (present or absent). If a specific number of criteria were present, then an overall criteria was coded 1 (if not, 0). Are dichotomous responses considered ordered in this case? Is weighted kappa the appropriate statistic for reliability?

Chris,

Dichotomous responses are generally considered to be categorical, although depending on what the data represents they could be considered to be ordered. E.g. Male = 0 and Female = 1 is not really ordered, while 0 = Low and 1 = High could be considered ordered.

Regarding your specific case, I understand that if a rater finds that say 6 or more criteria are met then the score is 1, while if fewer than 6 are met then the rating is 0. This could qualify as ordered. Based on what you have described you might be able to use weighted kappa, but I would have to hear more about the scenario before I could give a definitive answer.

Note that the coding that I have described throws away a lot of the data. You might be better off just counting the number of times the criteria are met and using this as the rating. Then you could use weighted kappa with this number as the weights. You might also be able to use the intraclass correlation coefficient.

Charles

Hi Charles,

What is the best method of determining the correct predefined weight to use?

Cheers,

John

John,

You need to decide what weights to use based on your knowledge of the situation. The usual weights are linear and quadratic.

Charles

Hi Charles,

I would like to know what would be the minimum sample size for a reliability re-test of a newly developed questionnaire of 100 items, ordinal scale (1-5), considering 80% power (0.05 type I error) to detect an acceptable weighted Kappa coefficient ≥0.6 in a two-tail single group comparison. Could you please help me with that or provide me any good reference that contains this information?

Thank you for your help.

Deborah.

Deborah,

I found the following article on the Internet which may provide you with the information that you are looking for.

http://www.ime.usp.br/~abe/lista/pdfGSoh9GPIQN.pdf

Charles

Hi Charles,

Thank you so much for your help. I managed the sample size but now I have another problem. I am developing and evaluating the psychometric properties of a multidimensional psychological scale. One of the measurements is the test-retest reliability, for which I handed out two copies of the same questionnaire for participants to complete within an interval of 15 days. I now need to compare both results, for each participant, to see how much the outcomes have changed between measurements. The scale has 100 items, each of them with 1-5 categorical responses (from never to always, for example). Because it is categorical, I have been advised to use weighted kappa (0-1.0) for this calculation and I need a single final kappa score. Do you have any idea about which software to use and how to calculate it? I haven’t found anything explaining the practical calculation in software. Thank you!

Deborah,

The referenced webpage describes in detail how to calculate the weighted kappa. It also describes how to use the Real Statistics Weighted Kappa data analysis tool and KAPPA function.

Charles

I have 25 items that are rated 0-4 (from Unable to Normal). How should I calculate the inter-rater reliability between two raters and the intra-rater reliability between two sessions for each item? If this should be a weighted kappa, then how do I calculate the 95% confidence interval in Excel? How do I calculate a single final kappa score?

Rick,

I don’t completely understand your scenario. Are you measuring the 25 items twice (presumably based on different criteria)? Are you trying to compare these two ways of measuring? You might be able to use Weighted Kappa. The referenced webpage shows how to calculate the 95% confidence interval in Excel. I don’t understand what you mean by calculating “single final kappa score”, since the usual weighted kappa gives such a final score.

Charles

Dear Charles,

How do you compute the 95% CI for weighted Kappa? Is there anything already in your excel tool?

Thanks in advance for your help

Best regards

Cédric

Cédric,

The calculation of the 95% CI for the unweighted version of Cohen’s kappa is described on the webpage Cohen’s Kappa.

Shortly I will add the calculation of the 95% CI for the weighted Kappa to the website. I also plan to add support for calculating confidence intervals for weighted kappa to the next release of the Real Statistics Resource Pack. This will be available in a few days.

Charles

Cédric,

I have now added support for s.e. and confidence intervals for Cohen kappa and weighted kappa to the latest release of the Real Statistics software, namely Release 3.8.

Charles

Dear Charles,

thank you for the add-on and all the good explanations!

I will have rather large kappa and weights tables (20 items and weights ranging from 0 to 3). Can I extend the tables according to my needs or do I have to expect problems?

Best,

Niels

Niels,

You should be able to extend the tables. Alternatively you can use the Real Statistics WKAPPA function or the weighted kappa option in the Real Statistics Reliability data analysis tool.

Charles

How would one calculate a standard error for the weighted kappa? and thus a p value.

Richard,

I haven’t implemented a standard error for the weighted kappa yet. I found the following article which may be useful for you:

http://www.itc.nl/~rossiter/teach/R/R_ac.pdf (see page 19).

Charles

Hi Charles,

What do you do if judges do not know ahead of time how many students are being interviewed. For example, judges are asked to identify off-road glances from a video and place them in 3 different categories. Judge1 may identify 10 glances, while Judge2 only 5. They may agree completely on those 5 identified by Judge 2, but do the other 5 non-identified glances count as disagreements then?

Susana,

One way to approach this situation could be to assign a 4th category, namely “non-identified glance”. It is then up to you to determine what weight you want to assign to category k x category 4 (for k = 1, 2, 3, 4).

Charles

Hello Charles,

In the literature a weighted kappa > .60 is considered good, but what is this based on? I can’t find any article that investigates this topic, so where does this .60 come from?

greets

Rob

Rob,

There really isn’t clear-cut agreement about what is an acceptable level. I have seen .6 or .7, but I would treat all of these cautiously. The following paper goes into more detail about acceptable levels for kappa and other measures, including references in the literature.

http://www.bartlett.ucl.ac.uk/graduate/csh/attachments/indexing_reliability.pdf

Charles

Your table of weights is a symmetric matrix with zeros in the main diagonal (i.e. where there is agreement between the two judges) and positive values off the main diagonal. Elsewhere, for example here http://www.medcalc.org/manual/kappa.php and here http://www.icymas.org/mcic/repositorio/files/Conceptos%20de%20estadistica/Measurement%20of%20Observer%20Agreement.pdf the main diagonal has values of 1.

I figured it out. Instead of using w = 1-(i/(k-1)) you are using w = i/(k-1). k is the number of categories and i the difference in categories.

I checked the WKAPPA results of the example given in against a calculated example in the reference cited in my prior post (Kundel&Polansky) and it all works out.

The nice thing about the WKAPPA function is that you can use a subjective set of weights and you are not limited to linear and quadratic weighting. Thanks!

Klaus,

I used zeros on the main diagonal instead of ones since it seemed more intuitive to me. As you point out, both approaches are equivalent.

Charles
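The equivalence discussed in this exchange can be checked numerically. The sketch below uses a hypothetical 3 × 3 table of counts, builds linear disagreement weights d = |i − j|/(k − 1) and the corresponding agreement weights a = 1 − d, and compares the two forms of weighted kappa. The variable names are our own.

```python
# Hypothetical observed counts (rows: rater 1, columns: rater 2).
observed = [[10, 5, 1], [4, 12, 6], [2, 3, 7]]
k = len(observed)
n = sum(sum(row) for row in observed)

# Observed and expected proportions.
p = [[c / n for c in row] for row in observed]
row_p = [sum(row) for row in p]
col_p = [sum(col) for col in zip(*p)]
e = [[row_p[i] * col_p[j] for j in range(k)] for i in range(k)]

# Disagreement weights (zeros on the diagonal) and agreement weights
# (ones on the diagonal).
disagree = [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
agree = [[1 - d for d in row] for row in disagree]


def wsum(w, q):
    """Sum of w[i][j] * q[i][j] over all cells."""
    return sum(w[i][j] * q[i][j] for i in range(k) for j in range(k))


# Form used on this page: 1 - weighted observed / weighted expected.
kappa_disagree = 1 - wsum(disagree, p) / wsum(disagree, e)

# Agreement-weights form: (P_o - P_e) / (1 - P_e).
p_o, p_e = wsum(agree, p), wsum(agree, e)
kappa_agree = (p_o - p_e) / (1 - p_e)

print(round(kappa_disagree, 6), round(kappa_agree, 6))
```

The two results match because the proportions sum to 1, so the weighted agreement is 1 minus the weighted disagreement for both the observed and the expected tables.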

Sir

BTW, the matrix in Figure 1 is not symmetric.

Colin

Colin,

Another good catch. I inadvertently switched two cells. I have now changed the webpage with the symmetric weights that I had intended and should have used. Thanks for your diligence. As usual you have helped make the site better and more reliable for everyone.

Charles

Sir

I think the design of predefined table of weights is a little arbitrary. Different people may make different table of weights.

Colin,

I agree. I have provided the most common weights.

Charles