The intraclass correlation (ICC) assesses the reliability of ratings by comparing the variability of different ratings of the same subject to the total variation across all ratings and all subjects. The ratings are assumed to be quantitative. We illustrate the technique, as applied to Likert scales, via the following example.
Example 1: Four judges each assess eight types of wine for quality by assigning a score from 0 to 9, with the ratings given in Figure 1. Each judge tastes each wine once. We would like to determine whether the wines can be rated reliably by different judges.
Figure 1 – Data for Example 1
We can see from the data that there is a fair amount of consistency between the ratings of the different judges, with a few noticeable differences.
We will assume that the four judges constitute a random sample from a larger population of judges and use Excel's Anova: Two-Factor Without Replication data analysis tool (i.e. a repeated measures analysis). The results are given in Figure 2.
Figure 2 – Calculation of Intraclass Correlation
Here the rows relate to the between-subjects factor (the wines) and the columns relate to the judges (the raters). The error term is the Judge × Subject interaction. We have added row 29, which contains the calculation of the ICC (in cell I29) using the formula

ICC = (MSRow – MSE)/(MSRow + (k – 1)MSE + k(MSCol – MSE)/n)
We will now explain this formula. From Definition 1 of Two Factor ANOVA without Replication, we have the model

x_ij = μ + β_i + α_j + ε_ij

where μ is the overall mean, β_i is the effect of the ith subject (row), α_j is the effect of the jth rater (column), and ε_ij is the error term.
The intraclass correlation is then

ICC = var(β)/(var(α) + var(β) + var(ε))
Thus there are three types of variability:
var(β): variability due to differences in the subjects (i.e. the wines).
var(ε): variability due to differences in the evaluations of the subjects by the judges (e.g. judge B really likes wine 3, while judge C finds it to be very bad).
var(α): variability due to differences in the rating levels/scale used by the judges (e.g. judges B and C both find wine 1 to be the worst, but while judge C assigns wine 1 a Likert rating of 0, judge B gives it a somewhat higher rating of 2).
Each of these can be estimated as follows:
var(β) = (MSRow – MSE)/k = (26.89 – 2.28)/4 = 6.15
var(ε) = MSE = 2.28
var(α) = (MSCol – MSE)/n = (2.45 – 2.28)/8 = 0.02
where n = number of rows (i.e. subjects = wines for Example 1) and k = number of columns (i.e. raters = judges). A consistent (although biased) estimate of the intraclass correlation is therefore given by
ICC = 6.15/(6.15 + 2.28 + 0.02) = 0.728
This can also be expressed by

ICC = (MSRow – MSE)/(MSRow + (k – 1)MSE + k(MSCol – MSE)/n) = (26.89 – 2.28)/(26.89 + 3(2.28) + 4(2.45 – 2.28)/8) = 0.728

which is the formula used in cell I29 of Figure 2.
This relatively high value of the ICC shows that there is a fair degree of agreement between the judges.
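The arithmetic above is easy to verify outside of Excel. The following Python snippet (a sketch, not part of the Real Statistics workbook) recomputes the variance components and both forms of the ICC directly from the mean squares in Figure 2:

```python
# Mean squares from the ANOVA output in Figure 2
ms_row, ms_col, ms_err = 26.89, 2.45, 2.28
n, k = 8, 4                      # n = 8 wines (subjects), k = 4 judges (raters)

var_b = (ms_row - ms_err) / k    # subject variability: approx. 6.15
var_e = ms_err                   # error variability: 2.28
var_a = (ms_col - ms_err) / n    # rater variability: approx. 0.02

icc = var_b / (var_b + var_e + var_a)
print(round(icc, 3))             # 0.728

# Equivalent mean-squares form (the formula in cell I29)
icc_ms = (ms_row - ms_err) / (ms_row + (k - 1) * ms_err + k * (ms_col - ms_err) / n)
print(round(icc_ms, 3))          # 0.728
```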
Real Statistics Function: The Real Statistics Resource Pack contains the following supplemental function:
ICC(R1) = intraclass correlation coefficient of R1 where R1 is formatted as in the data range B5:E12 of Figure 1.
For Example 1, ICC(B5:E12) = .728. This function is actually an array function which provides additional capabilities, as described in Intraclass Correlation Continued.
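If Excel is not available, the whole calculation can also be reproduced from the raw ratings. Below is a minimal Python sketch of the two-factor ANOVA without replication followed by the ICC formula. The ratings matrix is a hypothetical placeholder (the actual scores are in the range B5:E12 of Figure 1); substitute the real data to obtain .728:

```python
import numpy as np

# Hypothetical ratings matrix: rows = subjects (wines), columns = raters (judges).
# Replace with the actual scores from range B5:E12 of Figure 1.
x = np.array([
    [2, 4, 0, 3],
    [5, 6, 5, 4],
    [8, 9, 7, 8],
    [3, 2, 3, 5],
    [6, 7, 6, 6],
    [1, 2, 2, 1],
    [9, 8, 9, 7],
    [4, 5, 3, 4],
], dtype=float)

n, k = x.shape                   # n subjects (rows), k raters (columns)
grand = x.mean()

# Two-factor ANOVA without replication
ss_row = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
ss_col = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
ss_err = ((x - grand) ** 2).sum() - ss_row - ss_col  # rater x subject interaction

ms_row = ss_row / (n - 1)
ms_col = ss_col / (k - 1)
ms_err = ss_err / ((n - 1) * (k - 1))

icc = (ms_row - ms_err) / (ms_row + (k - 1) * ms_err + k * (ms_col - ms_err) / n)
print(round(icc, 3))
```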
Real Statistics Data Analysis Tool: The Reliability data analysis tool supplied in the Real Statistics Resource Pack can also be used to calculate the ICC. We show how to use this tool in Intraclass Correlation Continued.
Observation: There are a number of other measures of ICC in use. We have presented the most useful of these measures above. Click here for information about these other versions of ICC.
Observation: We now show how to calculate an approximate confidence interval for the ICC, following the approach of McGraw and Wong (1996). We start by defining the following:

a = k·ICC/(n(1 – ICC))

b = 1 + k·ICC(n – 1)/(n(1 – ICC))

v = (a·MSCol + b·MSE)² / [(a·MSCol)²/(k – 1) + (b·MSE)²/((n – 1)(k – 1))]
From these we calculate the lower and upper bounds of the 1 – α confidence interval as follows:

lower = n(MSRow – F*·MSE) / (F*[k·MSCol + (kn – k – n)MSE] + n·MSRow)

upper = n(F**·MSRow – MSE) / (k·MSCol + (kn – k – n)MSE + n·F**·MSRow)

where F* = FINV(α/2, n – 1, v) and F** = FINV(α/2, v, n – 1) are the right-tail critical values of the F distribution. The results for Example 1 are shown in Figure 3.
Figure 3 – 95% confidence interval for ICC
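Assuming the definitions above, the interval can be computed as in the following Python sketch, where scipy's f.ppf supplies the F critical values that Excel's FINV provides:

```python
from scipy.stats import f

# ANOVA mean squares and dimensions from Figure 2
ms_row, ms_col, ms_err = 26.89, 2.45, 2.28
n, k, alpha = 8, 4, 0.05

icc = (ms_row - ms_err) / (ms_row + (k - 1) * ms_err + k * (ms_col - ms_err) / n)

# Definitions from the observation above
a = k * icc / (n * (1 - icc))
b = 1 + k * icc * (n - 1) / (n * (1 - icc))
v = (a * ms_col + b * ms_err) ** 2 / (
    (a * ms_col) ** 2 / (k - 1) + (b * ms_err) ** 2 / ((n - 1) * (k - 1))
)

f_lo = f.ppf(1 - alpha / 2, n - 1, v)   # F* with df (n - 1, v)
f_hi = f.ppf(1 - alpha / 2, v, n - 1)   # F** with df (v, n - 1)

lower = n * (ms_row - f_lo * ms_err) / (
    f_lo * (k * ms_col + (k * n - k - n) * ms_err) + n * ms_row
)
upper = n * (f_hi * ms_row - ms_err) / (
    k * ms_col + (k * n - k - n) * ms_err + n * f_hi * ms_row
)
print(round(lower, 3), round(upper, 3))
```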
Observation: The overall level of a judge's ratings indicates the severity or leniency of that judge. The raters can also be questions in a test; in this case, the rating level corresponds to the difficulty or leniency of the question.
Observation: The measure of ICC is dependent on the homogeneity of the population of subjects being measured. For example, if the raters are measuring the level of violence in the general population, the value of var(β) may be high compared to var(α) and var(ε), thus making ICC high. If instead the raters are measuring levels of violence in a population of inmates from maximum security prisons, the value of var(β) may be low compared to var(α) and var(ε), thus making ICC low.
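To make this concrete, here is a small numeric illustration with hypothetical variance components (the numbers are invented for illustration only): the rater-related variability var(α) and var(ε) are held fixed while the subject variability var(β) changes.

```python
var_a, var_e = 0.5, 2.0            # rater and error variability (held fixed)

for var_b in (20.0, 1.0):          # heterogeneous vs. homogeneous subjects
    icc = var_b / (var_b + var_a + var_e)
    print(f"var(beta) = {var_b:4.1f} -> ICC = {icc:.2f}")
# var(beta) = 20.0 -> ICC = 0.89  (heterogeneous subjects, e.g. general population)
# var(beta) =  1.0 -> ICC = 0.29  (homogeneous subjects, e.g. maximum security inmates)
```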