In Intraclass Correlation we reviewed the most common form of the intraclass correlation coefficient (ICC). We now review other approaches to ICC as described in the classic paper on the subject (Shrout and Fleiss). In that paper the following three classes are described:

**Class 1**: For each of the *n* subjects a set of *k* raters is chosen at random from a population of raters and each of these raters rates that subject, but each subject is potentially rated by different raters.

**Class 2**: *k* raters are chosen at random from a population of raters and these *k* raters rate all *n* subjects.

**Class 3**: Each of the *n* subjects is rated by the same *k* raters and the results address only these *k* raters.

The ICC values for these classes are respectively called ICC(1, 1), ICC(2, 1) and ICC(3, 1). Each of these measures the reliability of a single rater. We can also consider the reliability of the mean rating. The intraclass correlations for these are designated ICC(1, *k*), ICC(2, *k*) and ICC(3, *k*).

**Real Statistics Function**: The Real Statistics Resource Pack contains the following supplemental function:

**ICC**(R1, *class*, *type*, *lab*, *alpha*): outputs a column range consisting of the intraclass correlation coefficient ICC(*class*, *type*) of R1, where R1 is formatted as in the data range of Figure 1 of Intraclass Correlation, plus the lower and upper bounds of the 1 – *alpha* confidence interval of ICC. If *lab* = TRUE then an extra column of labels is added to the output. The default values are *class* = 2, *type* = 1, *lab* = FALSE and *alpha* = .05.

For example, the output from the formula =ICC(B5:E12,2,1,TRUE,.05) for Figure 1 of Intraclass Correlation is shown in Figure 1 below.

**Figure 1 – Output from ICC function**

**Real Statistics Data Analysis Tool**: The **Reliability** data analysis tool supplied in the Real Statistics Resource Pack can also be used to calculate the ICC.

To calculate ICC for Example 1, press **Ctrl-m** and choose the **Reliability** option from the menu that appears. Fill in the dialog box that appears (see Figure 3 of Cronbach’s Alpha) by inserting B5:E12 in the Input Range and choosing the ICC option. The output is shown in Figure 2.

**Figure 2 – Output from ICC data analysis tool**

We next show how to calculate the various versions of ICC.

### Class 1 model

For class 1 the model used is

$$x_{ij} = \mu + \beta_j + \varepsilon_{ij}$$

where *μ* is the population mean of the ratings for all the subjects, *μ + β_{j}* is the population mean for the *j*th subject and *ε_{ij}* is the residual, where we assume that the *β_{j}* are normally distributed with mean 0 and that the *ε_{ij}* are independently and normally distributed with mean 0 (and the same variance). This is a one-way ANOVA model with random effects.

As we saw in One-way ANOVA Basic Concepts

The subjects are the groups/treatments in the ANOVA model. In this case, the intraclass correlation, called ICC(1,1), is

$$\text{ICC}(1,1) = \frac{\text{var}(\beta)}{\text{var}(\beta) + \text{var}(\varepsilon)}$$

The unbiased estimate for var(*β*) is (*MS_{B} – MS_{W}*)/*k* and the unbiased estimate for var(*ε*) is *MS_{W}*. A consistent (although biased) estimate for ICC is

$$\text{ICC}(1,1) = \frac{MS_B - MS_W}{MS_B + (k-1)\,MS_W}$$

For Example 1 of Intraclass Correlation, we can calculate the ICC as shown in Figure 3.

**Figure 3 – Calculation of ICC(1, 1)**

First we use Excel’s **Anova: Single Factor** data analysis tool, selecting the data in Figure 1 of Intraclass Correlation and grouping the data by **Rows** (instead of the default **Columns**). Alternatively we can first transpose the data in Figure 1 of Intraclass Correlation (so that the wines become the columns and the judges become the rows) and use the Real Statistics **Single Factor Anova** data analysis tool.

The value of ICC(1, 1) is shown in cell I22 of Figure 3, using the formula shown in the figure.
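The same one-way ANOVA calculation can be sketched outside Excel. The following is a minimal Python sketch, not the Real Statistics implementation; the function name `icc_1_1` is hypothetical. Because the Example 1 data is not reproduced on this page, the sketch uses the 7-judge, 5-wine table quoted by a reader in the comments below, purely as illustrative input.

```python
# Table quoted by a reader in the comments: one row per judge, one column per wine.
judge_rows = [
    [8, 4, 2, 5, 4],
    [6, 4, 5, 6, 5],
    [6.5, 3, 8.5, 7, 5.5],
    [3, 4, 5, 6, 7],
    [8, 7, 5.5, 8, 6.5],
    [3, 3.5, 7, 9, 8],
    [7.5, 5, 4.5, 5.5, 8.5],
]


def icc_1_1(subject_rows):
    """Estimate ICC(1,1) from a one-way ANOVA grouped by subject.

    subject_rows[i] holds the k ratings given to subject i.
    """
    n = len(subject_rows)        # number of subjects (groups)
    k = len(subject_rows[0])     # ratings per subject
    grand_mean = sum(sum(row) for row in subject_rows) / (n * k)
    subject_means = [sum(row) / k for row in subject_rows]

    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_within = sum((x - m) ** 2
                    for row, m in zip(subject_rows, subject_means)
                    for x in row)

    ms_b = ss_between / (n - 1)        # MS_B
    ms_w = ss_within / (n * (k - 1))   # MS_W
    return (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)


# Transpose so that each row is a subject (wine), i.e. "grouping by rows".
wine_rows = [list(col) for col in zip(*judge_rows)]
icc = icc_1_1(wine_rows)
print(f"ICC(1,1) = {icc:.3f}")
```

For this table the estimate comes out at about .114. The transpose step plays the same role as grouping by **Rows** in the Excel tool: the subjects, not the raters, must be the ANOVA groups.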

The confidence interval is calculated using the following formulas:
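The worksheet formulas quoted in the comments below (=I35/FINV(I31/2,I33,I34), =I35*FINV(I31/2,I34,I33), =(I36-1)/(I36+I32-1) and =(I37-1)/(I37+I32-1)) correspond to the standard Shrout and Fleiss interval. As a reconstruction, assuming those cells hold *F*, *alpha*, *k* and the between and within degrees of freedom, and writing $F_{1-\alpha/2}(a, b)$ for the right-tail critical value that FINV(*alpha*/2, *a*, *b*) returns:

$$F = \frac{MS_B}{MS_W}, \qquad F_L = \frac{F}{F_{1-\alpha/2}\big(n-1,\; n(k-1)\big)}, \qquad F_U = F \cdot F_{1-\alpha/2}\big(n(k-1),\; n-1\big)$$

$$\text{lower} = \frac{F_L - 1}{F_L + k - 1}, \qquad \text{upper} = \frac{F_U - 1}{F_U + k - 1}$$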

For Example 1 of Intraclass Correlation, the 95% confidence interval of ICC(1, 1) is (.434, .927) as described in Figure 4.

**Figure 4 – 95% confidence interval for ICC(1,1)**

ICC(1, 1) measures the reliability of a single rater. We can also consider the reliability of the mean rating. The intraclass correlation in this case is designated ICC(1, *k*) and is calculated by the formulas
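In the notation of the one-way ANOVA above, the standard Shrout and Fleiss expressions for the mean rating are as follows. This is a reconstruction rather than a copy of the original formulas; $F_L$ and $F_U$ denote the same quantities used for the ICC(1, 1) interval, namely $F = MS_B/MS_W$ divided by the critical value $F_{1-\alpha/2}(n-1, n(k-1))$ and multiplied by the critical value $F_{1-\alpha/2}(n(k-1), n-1)$ respectively.

$$\text{ICC}(1,k) = \frac{MS_B - MS_W}{MS_B}, \qquad \text{lower} = 1 - \frac{1}{F_L}, \qquad \text{upper} = 1 - \frac{1}{F_U}$$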

ICC(1, 4) for Example 1 of Intraclass Correlation is therefore .914 with a 95% confidence interval of (.754, .981).

### Class 2 model

This is the model that is described in Intraclass Correlation. For Example 1 of Intraclass Correlation, we determined that ICC(2, 1) = .728 with a 95% confidence interval of (.408, .950). These are the results for a single rater. The corresponding formulas for the mean rating are as follows:
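In the terminology of Two Factor ANOVA without Replication (rows are the *n* subjects, columns are the *k* raters), the standard Shrout and Fleiss point estimate for the mean rating is shown here as a reconstruction, not a copy of the original formula:

$$\text{ICC}(2,k) = \frac{MS_{Row} - MS_E}{MS_{Row} + (MS_{Col} - MS_E)/n}$$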

ICC(2, 4) for Example 1 of Intraclass Correlation is therefore .914 with a 95% confidence interval of (.734, .987).
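One way to sanity-check these figures is the Spearman-Brown step-up, which algebraically maps the single-rater estimate ICC(2, 1) to ICC(2, *k*). The sketch below applies it to the ICC(2, 1) values quoted above; the helper name `spearman_brown` is hypothetical, and applying the same mapping to the interval bounds is an assumption that happens to reproduce the quoted interval.

```python
def spearman_brown(icc_single, k):
    """Step a single-rater reliability up to the reliability of the mean of k raters."""
    return k * icc_single / (1 + (k - 1) * icc_single)


k = 4  # number of judges in Example 1

# ICC(2,1) and its 95% confidence bounds, as quoted in the text (rounded values).
single = {"ICC": 0.728, "lower": 0.408, "upper": 0.950}

mean_rating = {name: spearman_brown(value, k) for name, value in single.items()}
for name, value in mean_rating.items():
    print(f"{name}: {value:.3f}")
```

The bounds come out as .734 and .987, matching the interval above. The point estimate prints as .915 rather than .914, presumably because the input .728 is itself rounded.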

### Class 3 model

The class 3 model is similar to the class 2 model, except that var(*α*) is not used. The intraclass correlation, called ICC(3, 1), is given by the formula

$$\text{ICC}(3,1) = \frac{\text{var}(\beta)}{\text{var}(\beta) + \text{var}(\varepsilon)}$$

Using the terminology of Two Factor ANOVA without Replication (as for case 2), we see that (*MS_{Row} – MS_{E}*)/*k* is an estimate for var(*β*) and *MS_{E}* is an estimate for var(*ε*). A consistent (although biased) estimate for ICC is

$$\text{ICC}(3,1) = \frac{MS_{Row} - MS_E}{MS_{Row} + (k-1)\,MS_E}$$

For Example 1 of Intraclass Correlation, we can calculate ICC(3, 1) and its 95% confidence interval as shown in Figure 5 (referring to the worksheet in Figure 2 of Intraclass Correlation).

**Figure 5 – Calculation of ICC(3,1) and 95% confidence interval**

ICC(3, 4) for Example 1 is therefore .915 with a 95% confidence interval of (.748, .981).
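To see how the class 2 and class 3 single-rater estimates differ in practice, here is a minimal Python sketch based on the mean squares of a two-factor ANOVA without replication. The function names are hypothetical, the ICC(2, 1) expression is the standard Shrout and Fleiss one (which this page does not restate), and the input is the 7-judge, 5-wine table quoted by a reader in the comments below rather than Example 1.

```python
# Table quoted by a reader in the comments: one row per judge, one column per wine.
judge_rows = [
    [8, 4, 2, 5, 4],
    [6, 4, 5, 6, 5],
    [6.5, 3, 8.5, 7, 5.5],
    [3, 4, 5, 6, 7],
    [8, 7, 5.5, 8, 6.5],
    [3, 3.5, 7, 9, 8],
    [7.5, 5, 4.5, 5.5, 8.5],
]


def two_way_mean_squares(ratings):
    """ratings[i][j] = score of subject i by rater j (fully crossed design).

    Returns (MS_Row, MS_Col, MS_E, n, k).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(col) / n for col in zip(*ratings)]

    ss_row = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_col = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_tot = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_tot - ss_row - ss_col

    ms_row = ss_row / (n - 1)
    ms_col = ss_col / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return ms_row, ms_col, ms_e, n, k


def icc_2_1(ratings):
    ms_row, ms_col, ms_e, n, k = two_way_mean_squares(ratings)
    return (ms_row - ms_e) / (ms_row + (k - 1) * ms_e + k * (ms_col - ms_e) / n)


def icc_3_1(ratings):
    ms_row, _, ms_e, _, k = two_way_mean_squares(ratings)
    return (ms_row - ms_e) / (ms_row + (k - 1) * ms_e)


# Subjects (wines) must be the rows, raters (judges) the columns.
wine_rows = [list(col) for col in zip(*judge_rows)]
class2 = icc_2_1(wine_rows)
class3 = icc_3_1(wine_rows)
print(f"ICC(2,1) = {class2:.3f}, ICC(3,1) = {class3:.3f}")
```

For this table the sketch gives ICC(2, 1) of about .118, the value the reader reports in the comments, and ICC(3, 1) of about .121. The only difference between the two is the rater-variance term in the denominator, which class 3 leaves out.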

**Observation**: Class 3 is not so commonly used since by definition it doesn’t allow generalization to other raters.

**Observation**: ICC(3, *k*) = Cronbach’s alpha. For Example 1 of Intraclass Correlation, we see that =CRONALPHA(B5:E12) has value .915, just as we saw above for ICC(3, 4).
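The equality can be checked numerically by computing Cronbach’s alpha from its usual variance formula and ICC(3, *k*) from the ANOVA mean squares. The following Python sketch does this with hypothetical helper names; it does not call CRONALPHA, and it uses the 7-judge, 5-wine table quoted by a reader in the comments below as input, since the Example 1 data is not reproduced here.

```python
from statistics import variance

# Table quoted by a reader in the comments: one row per judge, one column per wine.
judge_rows = [
    [8, 4, 2, 5, 4],
    [6, 4, 5, 6, 5],
    [6.5, 3, 8.5, 7, 5.5],
    [3, 4, 5, 6, 7],
    [8, 7, 5.5, 8, 6.5],
    [3, 3.5, 7, 9, 8],
    [7.5, 5, 4.5, 5.5, 8.5],
]


def cronbach_alpha(ratings):
    """ratings[i][j] = score of subject i by rater j; raters play the role of items."""
    k = len(ratings[0])
    item_variances = [variance(col) for col in zip(*ratings)]
    total_variance = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def icc_3_k(ratings):
    """ICC(3,k) = (MS_Row - MS_E) / MS_Row from a two-factor ANOVA without replication."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    ss_row = k * sum((sum(row) / k - grand) ** 2 for row in ratings)
    ss_col = n * sum((sum(col) / n - grand) ** 2 for col in zip(*ratings))
    ss_tot = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_row = ss_row / (n - 1)
    ms_e = (ss_tot - ss_row - ss_col) / ((n - 1) * (k - 1))
    return (ms_row - ms_e) / ms_row


# Subjects (wines) as rows, raters (judges) as columns.
wine_rows = [list(col) for col in zip(*judge_rows)]
alpha = cronbach_alpha(wine_rows)
icc_mean = icc_3_k(wine_rows)
print(f"alpha = {alpha:.4f}, ICC(3,k) = {icc_mean:.4f}")
```

Both quantities come out at about .4916 for this table, agreeing up to floating-point error.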

Does this ICC use an absolute agreement or consistency definition?

Thanks!

Charles,

Please look at the following webpage: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3670985/

Charles

Hello Charles,

I have a problem of testing ICC values by splitting the data in groups. Two raters are measuring each individual. The sample can be divided based on severity. Say I have ICC1 for group 1 and ICC2 for group 2. I can find CI no problem, which are exact. But I need to compare the ICC1 and ICC2 to determine measurements in one group are better agreeable than those of the other group. I attempted a large sample theory but it did not work and it fails badly for small samples. I am looking at n=128 vs. n=9 in one of the comparisons.

Is there a variance formula for ICC in a non-cluster randomized study?

Thanks for your reply in advance!

ps. I tried Fisher’s transformation via atanh in R, which reproduced the exact CI very well. But ICC is not the same as rho (correlation).

Sorry, but I don’t know how to address this. Perhaps someone else in the community can help.

Charles

Dr. Zaiontz:

Thank you very much for your site. In employment testing it is common to conduct oral interviews using three raters/evaluators. With large applicant pools, it may be necessary to assemble multiple panels, and applicants are randomly assigned to only one of the panels. Theoretically, the three interviewers assigned to a panel are randomly assigned from a population of potential interviewers (I say “theoretically” because it’s not easy finding volunteers to sit on these panels!)

It is common for personnel analysts to use Cronbach’s Alpha as a proxy for ICC Class 3 to estimate the mean reliability for a single three-rater panel. Am I correct that ICC Class 2 is a more appropriate estimate? It would seem to be nearly identical to your original example of the wine tasting (which also sounds like a lot more fun than interviews). Thanks.

Bruce,

If the raters are truly randomly chosen then perhaps ICC class 1 would be correct.

Note that ICC(3,k) is mathematically equal to Cronbach’s alpha.

Based on your description, probably the ICC class 2 model is best.

Charles

Hi Charles,

Thank you for this post.

In Figure 5 under the Class 3 Model, the formula for the ICC is displayed in cell K29, referring to another cell H12. However I do not see where the formula (or value) in cell H12 is defined. Can you please clarify what H12 refers to?

Thanks

Paul,

Sorry about that. H12-1 has the same value as the contents of cell I24. In fact, I need to fix the formula to use I24 instead of H12-1.

Charles

Hello Charles,

Thanks for the detailed explanation. I am not sure which ICC class I should use to analyse the following:

I have 15 MR Images from 15 subjects. I recruited 4 observers to define the organ contours e.g. Lt and Rt eyes from the images. I recorded the volume of what they had drawn and wanted to assess the inter-observer variability. I ran the repeated measure one way ANOVA and calculated the ICC using the formula on the page “Intraclass Correlation”. The ICC is extremely low while the raw data (volume measured) seems to be similar among different observers. Am I doing it the wrong way? Is it correct to consider ICC class 2? And should I consider the ICC with 4 observers, i.e. ICC(2,4) instead of ICC(2,1) which gave me a very low ICC that seems not making sense?

Many thanks!

Winky

Winky,

1. Which version of the ICC to use, ICC(1,.), ICC(2,.) or ICC(3,.), depends on what you are trying to measure, as described on the referenced webpage.

2. Whether you use ICC(k,1) or ICC(k,4) depends on whether you want to measure the reliability of an individual (the first version) or the average for the group (the second version). Here k = 1, 2 or 3.

3. Regarding whether you are doing the calculations correctly, if you send me an Excel file with your data and calculations, I will try to see whether the calculations are correct. You can send it to the email address on the Contact Us webpage.

Charles


Hi Charles,

Thanks for the great article, it has been highly useful. I had a question regarding 95% CI interpretation in regards to ICCs.

I executed an ICC on data from 3 judges and came up with the following results:

ICC 0.752241391

95% Lower CI -0.497986439

95% Upper CI 0.946442785

Obviously that is a fairly high ICC and degree of agreement/reliability. However, I am not sure how to interpret the 95% CIs. I have always been taught that a CI containing 0 or 1 means the test statistic is not significant. Is that the case here as well?

I have seen previous comments alluding to this same question but could not glean a definitive answer from them on this topic.

Thank you for your help,

Shelby

Shelby,

Since zero is in the 95% CI, the test statistic (namely the ICC value) is not significantly different from 0. Thus even though the calculated ICC value looks pretty high, statistically speaking the population value of the ICC may be zero.

The fact that your 95% CI is so wide is probably due to the fact that you have very few subjects being evaluated.

Charles

Hi Charles,

In that case, can we use the ICC or is it not valid?

Aznila

Aznila,

What are you referring to when you say “in that case”?

Charles

Hi Charles.

I have been trying to figure out whether I should use an ICC or Fleiss Kappa analysis. I have multiple raters, who rated 17 groups on 8 different categories. The ratings range low to high. Every rater doesn’t rate all groups, but all groups are rated multiple times on each category. I am not sure how to set up the data or which I should use. I haven’t gotten a handle on how to use the software you’ve provided. Can you provide any additional guidance?

Lyn,

Fleiss’s Kappa is used when the ratings are categorical. ICC is used when the ratings are numerical, even ordinal data which can be viewed as numeric.

Fleiss’s Kappa does not require that every rater rate every subject, but that all subjects get the same number of ratings. ICC does require that every rater rate every subject.

It might be that your situation is a partial fit for ICC and a partial fit for Fleiss’s kappa. That may be a problem and so you may need to find a different measurement.

Charles

Dear Dr. Charles,

First, let me congratulate you again for your site and for the help you provide us!

I would like to put to your attention this paper that describes an interesting case of Class 2 Inter-Rater and Intra-Rater Reliability assessment, in which there are k raters, each one rating all n subjects of the population, by performing m measurements on every subject.

The peculiarity of this approach is that the m measurements by the i-th rater on the j-th subject are not averaged together but are considered individually, so allowing more detail in model variances estimation.

This is the link:

http://ptjournal.apta.org/content/ptjournal/74/8/777.full.pdf

I’m not one of the authors so I’m not doing self-promotion, I just hope you and all the community can find it useful!

Regards

Piero

Thanks Piero for sharing this with us.

Charles

Hello Charles,

I may be wrong, but should the lower CI level for the mean used in the class 2 model example (0.914) be 0.860 as opposed to 0.734?

Thanks for your time.

Just realised I was wrong. Please ignore my comment.

Apologies and thank you for your time.

First, thanks for this post on ICC calculations using Excel. In the first “Intraclass Correlation” I believe you calculated ICC(2,4) = 0.728, whereas in the “Intraclass Correlation Continued” you refer to this as ICC(2,1) = 0.728, whereas for k=1, the calculation for ICC reverts to a simpler formula (with no dependence on k) which yields a value of 0.914.

As I understand it the ICC(2,4) applies when one is interested in the reliability of the average group score for the wine, whereas the ICC(2,1) examines the judge to judge reliability. Might it be inferred that ICC(2,1) relates to the reliability of the wine score, but that ICC(2,4) focuses on the reliability of the judges?

Robert,

In the first “Intraclass Correlation” I calculated ICC(2,1) = 0.728, although I was not so clear about the terminology since ICC(2,1) is the usual intraclass correlation.

Also ICC(2,4) = 0.914.

I am not sure about what you intend by “reliability of the wine score” and “reliability of the judges”. The ICC is measuring the reliability/agreement of the judges in evaluating wine scores.

Charles

Thanks for your reply. I think my confusion is because the formula for ICC(2,1) is a function of k, whereas ICC(2,k) is not a function of k! I’ve read elsewhere that the inference is that ICC(2,1) estimates the reliability of a single observer rating the wines (even though you use k=4 to obtain this estimate), whereas ICC(2,4) is the reliability estimate using the mean of the 4 judges to assess the wine quality (but is not a function of k=4).

Robert,

That is an interesting observation. Of course, k is indirectly used in the calculation of ICC(2,1) when calculating MSRow and MSE.

Charles

Hi Charles;

I got the result of ICC = 0.649, CI 95% (lower bound (-2.366) and upper bound (0.964)).

How can I report this result please?

Many thanks.

Here is the reporting for different data:

A high degree of reliability was found between XXX measurements. The average measure ICC was .827 with a 95% confidence interval from .783 to .865 (F(162,972) = 5.775, p < .001).

I believe that the lower bound for the ICC is -1/(n-1). Thus your CI lower bound of -2.366 seems quite surprising.

Charles

Many thanks for your reply.

Can I assume the lower bound is zero and report it as 0.65 (0 to 0.96)?

Regards

Thank you Charles.

Something went wrong with the message I sent and made it impossible to understand.

Here follows my message again:

1. What makes me uneasy about ICC is that after having calculated the ICC value and the 95% confidence interval, I find no rational way of interpreting the results, which is of course essential for myself and for the jury concerned. I read your answer to Sravanti of July 24, 2015: there is no agreement as to what is an acceptable value for ICC, although you have typically seen .7 used. So what to do if ICC < .7?

2. I read about Cohen's interpretation of effect size of an experimental manipulation:

phi square = theta square/sigma square.

For phi square = .01 the effect is called "small". For phi square = .0625 the effect is called "medium big". For phi square = .16 the effect is called "big"

A parameter derived from phi square is eta square = phi square/(phi square + 1) with a range [0;1]. It looks like a correlation coefficient, and its estimator gives an impression of the relative size of the factors and combination of factors in Variance Analysis. So, an effect is "small" for eta square = .010, is "medium big" for eta square = .059, and "big" for eta square = .138.

Wouldn't there be a way of transposing to ICC ? Thank you once more.

Erik

Erik,

Sorry, but I don’t know a way of transposing these to ICC. Please note that even the effect size guidelines (small, medium, large) by Cohen and others are really rough and not appropriate for all circumstances.

Charles

Thank you, Charles for your detailed answer.

1. What makes me uneasy about ICC is that after having calculated the ICC value and the 95% confidence interval I find no rational way of interpreting the results, which is of course essential for myself and the jury concerned.

I read your answer to Sravanti on July 24, 2015: there is no agreement as to what is an acceptable value for ICC, although you have typically seen .7 used. So what to do if ICC < .7?

2. I read about Cohen’s interpretation of the effect size of an experimental manipulation: phi square = theta square/sigma square.

Phi Square = .01 > “small”

Phi Square = .0625 > “medium big”

Phi Square = .16 > “big”

A parameter derived from phi square is eta square = phi square/(phi square + 1) with a range [0;1]. It looks like a correlation coefficient and its estimator gives an impression of the relative size of the effects of the factors and combination of factors in a Variance Analysis. So an effect is “small” for eta square = .010, “medium big” for eta square = .059, and “big” for eta square = .138.

Wouldn’t there be a way of transposing to ICC?

Thank you once more

Erik

Erik,

I agree with you. These measurements seem most valuable when they show a problem or when they are very high. Middle values seem less useful.

Charles

Thank you Charles. There is no need to worry about your delayed answer. I hope you are fine and I needed anyway more time to understand…

I struggled further through the matter and realize that my questions in the beginning were not always very adequate.

The state of affairs for me is now as follows:

1. After having defined the appropriate statistical model, one calculates an ICC, which is an estimate of the population mean value rho of a very great number of samples taken under identical conditions.

2. How to interpret this ICC value? It remains an open question to me since there is no agreement about it. Wouldn’t it be logical to consider only the lower limit of the confidence interval?

3. I believe that even the confidence interval, say 95%, relative to that estimate can be questioned. Since that interval is calculated around a one-time estimate of rho, it can either contain rho or not. Are those chances equal to 95% if you cannot repeat an experiment a great number of times?

I begin to wonder if ICC calculations are suited for Case 2 wine tastings.

The example I gave you comes from the following book (pp. 244-247): Wines: Their Sensory Evaluation by Maynard A. Amerine and Edward B. Roessler (1976, 1983, W.H. Freeman and Company).

This is the data table:

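| Judge | W1 | W2 | W3 | W4 | W5 |
|---|---|---|---|---|---|
| 1 | 8 | 4 | 2 | 5 | 4 |
| 2 | 6 | 4 | 5 | 6 | 5 |
| 3 | 6.5 | 3 | 8.5 | 7 | 5.5 |
| 4 | 3 | 4 | 5 | 6 | 7 |
| 5 | 8 | 7 | 5.5 | 8 | 6.5 |
| 6 | 3 | 3.5 | 7 | 9 | 8 |
| 7 | 7.5 | 5 | 4.5 | 5.5 | 8.5 |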

Erik

Erik,

How to interpret the ICC value: Generally this statistic is used as a measurement of the agreement between raters. I still don’t completely understand the problem here.

Confidence Interval. I agree that only the lower bound may be of interest.

Regarding the issue of basing the confidence interval only on one sample, this is the usual situation not just for ICC but for confidence intervals of all sorts of statistics (t tests, regression, etc.). Our goal is to measure our confidence of the value of a specific population parameter based on the corresponding statistic from one sample. Since we only have one sample we can’t be certain of the value of the population parameter, but the larger the sample the narrower the confidence interval, and so the more confident we are in its value.

Case Wine Tasting: Why do you believe that the ICC calculations are not suited to this example?

Charles

I forgot to write in the text of my first question that I referred to Example 1 in your Intraclass Correlation chapter (four columns for the judges and eight rows for the wines).

Erik

Charles,

I read the article “SF” and I wonder if what I understood is correct:

Say Rho(2,1) is the population mean of all ICCs between the single scores related to a given situation. When the null hypothesis Ho says rho is equal to zero, this is equivalent to saying: the mean square expectation between wines = 0, the mean square expectation between judges and/or the residual mean square expectation being different from zero.

ICC(2,1) is an estimate of rho(2,1) and when that ICC-value lies in the 95% confidence area we conclude at the 95% level of confidence that the judges have been consistent.

Question: Consistency of a rater does not necessarily mean reliability, nor does it give rise to agreement between raters. What does the ICC exactly assess?

I met an example (out of a book) that puzzles me a lot:

Seven judges rated five wines and this is the result two-way ANOVA showed:

– F wines < F critic .5: we accept Ho of no significant differences between the means of the wine scores.

– F judges < F critic .5: we accept Ho of no significant differences between the means of the judges' scores. The judges are consistent in their scoring.

– ICC(2,1) value equal to .118 (!) and confidence interval -.06 and .68. The ICC thus lies in the non-significant zone. There too, can we simply conclude the judges are consistent?

Question: Looking at the correlations of the judges' scores this is so unlikely. How to interpret what the figures show exactly?

Please give me your view on those two items?

And again: thank you very much

Erik

Erik,

Sorry for the delayed response.

As the Shrout and Fleiss article says ICC measures reliability. On page 425, they make a distinction between consistency of ratings using ICC(3,1) and agreement of ratings using ICC(2,1).

An ICC(2,1) of .118 indicates a low level of agreement between the judges. Since zero lies in the confidence interval of (-.06, .68), we cannot reject the null hypothesis that the population ICC is zero, i.e. that there is no agreement between the raters. I can’t comment on whether this is unlikely since I don’t have access to the data.

Charles

Hi Charles,

What do negative values for the lower bound of the 95% CI tell us?

icc 0.6 (95% CI= lower= -0.0; upper= 0.8).

Thank you.

If the lower bound is -0.0, as in your example, then this likely means a very small negative number. In any case, when a statistic takes values in a range of, say, 0 to 1 and the lower confidence value is negative, it should be viewed as zero.

Charles

Hi Charles!

This is the result of an ICC test. I would appreciate it if you could let me know whether the confidence interval is narrow enough to come to any conclusion.

ICC = .87(CI 95%=.466-.997),F(2,34) = 7.7, p < .005

best regards

Behrouz

Obviously the narrower the better, but I would say that you have a pretty good level of confidence that there is agreement among the raters. (As always, there is some risk that this is less so.)

Charles

Hello Charles,

My objective is to compute the intraclass coefficient. I want to assess the degree of agreement between raters on the items of a new proposed tool. The tool is rated on a 5-point Likert scale. There are two sets of raters: Rater group 1 has psychologists (2 of them) and Rater group 2 has educators (3 of them). It is a fully crossed model.

My question is :

1) How do I interpret these results? I obtained an intraclass coefficient using SPSS 20. For the psychologist group, I obtained a coefficient of 0.54, with a 95% confidence interval with lower bound .197 and upper bound .754. For the educator group, I obtained a coefficient of 0.54, with a 95% confidence interval with lower bound .257 and upper bound .729. These are average measures.

2) I am interested in the degree of agreement for each of the items of the tool. Isn’t this coefficient value an indicator of the overall scale?

Hello,

1. There isn’t agreement as to what is an acceptable value for ICC, although I have typically seen .7 used. With such a small sample and therefore such large confidence intervals (.197, .754) and (.257, .729) it is pretty hard to derive a lot of meaning from the results, except that they seem significantly different from zero.

2. For one item, I would simply use the variance, but again with such a small sample, this is not going to tell you very much.

Charles

Hi! Charles,

How we calculate ICC with one way anova for teams with different no. of people in each team. From example cited above we can use it only when there are same no. of people in each team. Kindly help.

Thanks,

Sapnaa

Sapnaa,

The approach that I present is only valid for equal group sizes. There are a number of techniques available for unequal group sizes. E.g. see http://digitalcommons.wayne.edu/cgi/viewcontent.cgi?article=1301&context=jmasm.

Charles

In figure 3 above, in cell K37 and K37 there is a reference to cell I28. As far as I can see this cell is empty. Can you inform me what should be the right reference?

I assume it has to refer to F, so the reference should be either I35 or K23. Correct?

Hi Carel,

The correct formulas in cells I36, I37, I38 and I39 are:

=I35/FINV(I31/2,I33,I34)

=I35*FINV(I31/2,I34,I33)

=(I36-1)/(I36+I32-1)

=(I37-1)/(I37+I32-1)

I will correct the references made in cells K36, K37, K38 and K39.

Thanks for catching this error.

Charles