Cohen’s kappa is a measure of the agreement between two raters who each classify a finite number of subjects into a set of categories, with agreement due to chance factored out. The two raters either agree in their rating (i.e. the category that a subject is assigned to) or they disagree; there are no degrees of disagreement (i.e. no weightings).
We illustrate the technique via the following example.
Example 1: Two psychologists (judges) evaluate 50 patients as to whether they are psychotic, borderline or neither. The results are summarized in Figure 1.
Figure 1 – Data for Example 1
We use Cohen’s kappa to measure the reliability of the diagnosis by measuring the agreement between the two judges, subtracting out agreement due to chance, as shown in Figure 2.
Figure 2 – Calculation of Cohen’s kappa
The diagnoses in agreement are located on the main diagonal of the table in Figure 1. Thus the percentage of agreement is 34/50 = 68%. But this figure includes agreement that is due to chance. For example, the Psychotic diagnosis represents 16/50 = 32% of Judge 1’s diagnoses and 15/50 = 30% of Judge 2’s diagnoses, so 32% ∙ 30% = 9.6% of the agreement on this diagnosis is expected by chance, i.e. 9.6% ∙ 50 = 4.8 of the cases. In a similar way, we see that 11.04 of the Borderline agreements and 2.42 of the Neither agreements are due to chance, which means that a total of 4.8 + 11.04 + 2.42 = 18.26 of the diagnoses are expected by chance. Subtracting out the agreement due to chance, we get that there is agreement 49.6% of the time, where

$$\kappa = \frac{34 - 18.26}{50 - 18.26} = .496$$
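As a quick check of this arithmetic, the short sketch below (Python rather than the Excel of Figure 2) reproduces κ from the counts quoted above.

```python
# Reproduce the Example 1 arithmetic for Cohen's kappa
n = 50                              # total number of patients
observed_agree = 34                 # diagonal total from Figure 1
chance_agree = 4.8 + 11.04 + 2.42   # expected chance agreements per category

p_a = observed_agree / n            # observed proportion of agreement (0.68)
p_e = chance_agree / n              # chance proportion of agreement (0.3652)

kappa = (p_a - p_e) / (1 - p_e)
print(round(kappa, 3))              # 0.496
```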
Some key formulas in Figure 2 are shown in Figure 3.
Definition 1: If pa = the proportion of observations in agreement and pε = the proportion in agreement due to chance, then Cohen’s kappa is

$$\kappa = \frac{p_a - p_\varepsilon}{1 - p_\varepsilon}$$
Observation: Another way to calculate Cohen’s kappa is illustrated in Figure 4, which recalculates kappa for Example 1.
Figure 4 – Alternative calculation of Cohen’s kappa
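Figure 4 carries this out in an Excel worksheet, but the same alternative route, computing κ directly from the observed counts and their marginal totals, can be sketched in a few lines of Python. The function name and the small 2×2 table below are illustrative only (the table is not the Example 1 data).

```python
import numpy as np

def kappa_from_counts(table):
    """Cohen's kappa from a k x k table of observed counts
    (rows = rater A's categories, columns = rater B's)."""
    counts = np.asarray(table, dtype=float)
    n = counts.sum()                       # total number of subjects
    row_totals = counts.sum(axis=1)        # rater A's marginal counts
    col_totals = counts.sum(axis=0)        # rater B's marginal counts

    p_a = np.trace(counts) / n                        # observed agreement
    p_e = (row_totals * col_totals).sum() / n ** 2    # chance agreement
    return (p_a - p_e) / (1 - p_e)

print(round(kappa_from_counts([[20, 5], [10, 15]]), 3))   # 0.4 for this toy table
```

Passing the 3×3 table from Figure 1 to kappa_from_counts returns .496 (after rounding), the value obtained above.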
Property 1: 1 ≥ pa ≥ κ
Proof: Clearly pa ≤ 1. Also, since 0 ≤ pa ≤ 1 and 0 ≤ pε < 1, we have pa ∙ pε ≤ pε, from which it follows that κ ≤ pa (see the derivation below). Thus 1 ≥ pa ≥ κ.
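Spelled out, the derivation uses only Definition 1 together with the facts that pa ∙ pε ≤ pε and 1 − pε > 0:

$$\kappa = \frac{p_a - p_\varepsilon}{1 - p_\varepsilon} \leq \frac{p_a - p_a p_\varepsilon}{1 - p_\varepsilon} = \frac{p_a(1 - p_\varepsilon)}{1 - p_\varepsilon} = p_a \leq 1$$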
Observation: Note that since pa ≥ 0, Definition 1 gives

$$\kappa = \frac{p_a - p_\varepsilon}{1 - p_\varepsilon} \geq \frac{-p_\varepsilon}{1 - p_\varepsilon}$$

and this lower bound tends to −∞ as pε approaches 1. Thus, κ can take any negative value, although we are generally interested only in values of kappa between 0 and 1. Cohen’s kappa of 1 indicates perfect agreement between the raters and 0 indicates that any agreement is totally due to chance.
There isn’t clear-cut agreement on what constitutes good or poor levels of agreement based on Cohen’s kappa. One common, although not always very useful, set of criteria is: less than 0% no agreement, 0–20% poor, 20–40% fair, 40–60% moderate, 60–80% good, 80% or higher very good.
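If it is convenient to apply this rule of thumb programmatically, it can be written as a small lookup. The sketch below simply encodes the scale quoted above (expressed as proportions rather than percentages); the function name is arbitrary.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the rough qualitative scale quoted above."""
    if kappa < 0:
        return "no agreement"
    for upper, label in [(0.20, "poor"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "good")]:
        if kappa < upper:
            return label
    return "very good"

print(interpret_kappa(0.496))   # "moderate" -- the Example 1 value
```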
A key assumption is that the judges act independently, an assumption which isn’t easy to satisfy completely in the real world.
Observation: Provided npa and n(1–pa) are large enough (usually > 5), κ is approximately normally distributed with an estimated standard error calculated as follows.
Let nij = the number of subjects for which rater A chooses category i and rater B chooses category j, and let pij = nij/n. Also let ni = the number of subjects for which rater A chooses category i and mj = the number of subjects for which rater B chooses category j.
The standard error of κ is then estimated by the large-sample formula

$$s.e.(\kappa) = \frac{1}{(1-p_\varepsilon)\sqrt{n}} \sqrt{\sum_i p_{ii}[1-(p_i+q_i)(1-\kappa)]^2 + (1-\kappa)^2 \sum_{i \neq j} p_{ij}(q_i+p_j)^2 - [\kappa - p_\varepsilon(1-\kappa)]^2}$$

where pi = ni/n and qj = mj/n.
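A direct Python translation of this formula is sketched below, offered as a cross-check rather than as the exact worksheet logic of Figure 5; applied to the table in Figure 1 it should reproduce the standard error shown there, up to rounding.

```python
import numpy as np

def kappa_and_se(table):
    """Cohen's kappa and its large-sample standard error, from a k x k
    table of counts (rows = rater A's categories, columns = rater B's)."""
    counts = np.asarray(table, dtype=float)
    n = counts.sum()
    p = counts / n                  # p[i, j] = proportion of subjects in cell (i, j)
    pi = p.sum(axis=1)              # rater A's marginal proportions (p_i)
    qi = p.sum(axis=0)              # rater B's marginal proportions (q_i)

    p_a = np.trace(p)               # observed proportion of agreement
    p_e = (pi * qi).sum()           # proportion of agreement expected by chance
    kappa = (p_a - p_e) / (1 - p_e)

    # The three terms under the square root in the formula above
    term1 = (np.diag(p) * (1 - (pi + qi) * (1 - kappa)) ** 2).sum()
    weights = (qi[:, None] + pi[None, :]) ** 2     # (q_i + p_j)^2 for cell (i, j)
    off_diag = p * weights
    np.fill_diagonal(off_diag, 0.0)                # keep only the i != j cells
    term2 = (1 - kappa) ** 2 * off_diag.sum()
    term3 = (kappa - p_e * (1 - kappa)) ** 2

    se = np.sqrt(term1 + term2 - term3) / ((1 - p_e) * np.sqrt(n))
    return kappa, se
```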
Example 2: Calculate the standard error for Cohen’s kappa of Example 1, and use this value to create a 95% confidence interval for kappa.
The calculation of the standard error is shown in Figure 5.
Figure 5 – Calculation of standard error and confidence interval
We see that the standard error of kappa is .10625 (cell M9), and so the 95% confidence interval for kappa is (.28767, .70414), as shown in cells O15 and O16.
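The confidence-interval arithmetic itself is easy to reproduce from the two quantities just quoted (κ ≈ .496, s.e. = .10625); the short sketch below uses the normal critical value from scipy. (The statsmodels function cohens_kappa can also produce κ, its standard error and a confidence interval directly from the observed table, as an independent check.)

```python
from scipy.stats import norm

kappa, se, alpha = 0.496, 0.10625, 0.05
z = norm.ppf(1 - alpha / 2)              # about 1.96 for a 95% interval

lower, upper = kappa - z * se, kappa + z * se
print(round(lower, 5), round(upper, 5))  # roughly .28775 and .70425; cells O15 and
                                         # O16 differ slightly because kappa is
                                         # rounded to .496 here
```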
Observation: In Example 1, ratings were made by people. The raters could also be two different measurement instruments, as in the next example.
Example 3: A group of 50 college students is given a self-administered questionnaire asking how often they have used recreational drugs in the past year: Often (more than 5 times), Seldom (1 to 4 times) or Never (0 times). On another occasion the same students are asked the same question in an interview. The following table shows their responses. Determine how closely their answers agree.
Figure 6 – Data for Example 3
Since the figures are the same as in Example 1, once again kappa is .496.
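When the raw responses are available as two per-student vectors (questionnaire answer and interview answer for each student) rather than as a summary table, κ can be computed directly from those vectors. A minimal sketch using scikit-learn follows; the six responses listed are purely illustrative and are not the Figure 6 data.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-student responses (not the actual Figure 6 data)
questionnaire = ["Often", "Never", "Seldom", "Never", "Often", "Seldom"]
interview     = ["Often", "Seldom", "Seldom", "Never", "Often", "Never"]

# Unweighted Cohen's kappa between the two measurement "instruments"
print(cohen_kappa_score(questionnaire, interview))   # 0.5 for this toy data
```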
Observation: Cohen’s kappa takes into account disagreement between the two raters, but not the degree of disagreement. This is especially relevant when the ratings are ordered (as they are in Example 3). A weighted version of Cohen’s kappa can be used to take the degree of disagreement into account. See Weighted Cohen’s Kappa for more details.
Another modified version of Cohen’s kappa, called Fleiss’ kappa, can be used where there are more than two raters. See Fleiss’ Kappa for more details.
Real Statistics Function: The Real Statistics Resource Pack contains the following function:
WKAPPA(R1) = Cohen’s kappa where R1 contains the observed data (formatted as in range B5:D7 of Figure 2).
Thus for Example 1, WKAPPA(B5:D7) = .496.
Actually, WKAPPA is an array function which also returns the standard error and confidence interval. This version of the function is described in Weighted Kappa. The full output from WKAPPA(B5:D7) is shown in range AB8:AB11 of Figure 7.
Real Statistics Data Analysis Tool: The Reliability data analysis tool supplied by the Real Statistics Resource Pack can also be used to calculate Cohen’s kappa. To calculate Cohen’s kappa for Example 1 press Ctrl-m and choose the Reliability option from the menu that appears. Fill in the dialog box that appears (see Figure 7 of Cronbach’s Alpha) by inserting B4:D7 in the Input Range and choosing the Cohen’s kappa option.
Figure 7 – Cohen’s Kappa data analysis
If you change the value of alpha in cell AB6, the values for the confidence interval (AB10:AB11) will change automatically.