Internal consistency reliability is the extent to which the measurements of a test remain consistent over repeated administrations to the same subject under identical conditions. A test is reliable if it yields consistent results for the same measure, i.e. its measurements are not dominated by random error; it is unreliable if repeated measurements give different results.

Since measurement is never perfectly accurate, even two measurements of the same quantity can differ. We can therefore partition an observed value *x* into the true value *t* and an error term *e*. Thus we have *x = t + e.*

**Definition 1**: The **reliability** of *x* is a measure of internal consistency and is the correlation coefficient *r _{xt}* of *x* and *t*.

Proof: See Proof of Basic Property

- Split-Half Methodology
- Kuder and Richardson Formula 20
- Cronbach’s Alpha
- Cohen’s Kappa
- Weighted Cohen’s Kappa
- Fleiss’ Kappa
- Intraclass Correlation
- Kendall’s Coefficient of Concordance (W)
- Bland-Altman Analysis
- Item Analysis
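As a small taste of the methods listed above, here is a sketch (with made-up questionnaire data) of Cronbach's alpha, computed from its standard formula: alpha = k/(k-1) * (1 - sum of item variances / variance of total scores).

```python
# Minimal sketch of Cronbach's alpha on hypothetical data.
from statistics import variance

def cronbach_alpha(items):
    """items: list of k lists, each holding one item's scores across subjects."""
    k = len(items)
    total = [sum(scores) for scores in zip(*items)]   # each subject's total score
    item_var = sum(variance(col) for col in items)    # sum of item variances
    return k / (k - 1) * (1 - item_var / variance(total))

# Hypothetical questionnaire: 3 items answered by 5 subjects.
items = [
    [3, 4, 3, 5, 4],
    [2, 4, 4, 5, 3],
    [3, 5, 4, 5, 4],
]
print(round(cronbach_alpha(items), 3))  # prints 0.904
```

The same calculation can of course be done in Excel with the worksheet functions described on this site; the sketch is only meant to show the formula in action.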

Thank you Mr. Charles Zaiontz. This website is amazing.

As an archaeologist, I have little knowledge of statistics. I am trying to test the reliability (consistency) of a method we use for categorizing lithic raw materials. But I am not sure how to approach it, or maybe I am overthinking this.

I have conducted a blind test where 9 individuals have categorized the same 144 lithic artefacts. The participants were free to label the categories themselves, so the number of groups was not fixed. A total of 20 different groups were chosen.

Do you have any ideas on how I can easily show the reliability, or degree of consistency with this? Maybe I have misunderstood the concept of reliability?

Any help is appreciated.

Alexander,

Glad you like the website.

It sounds like a fit for Fleiss’s Kappa, but I need to understand better what you mean by “The participants were free to label the categories themselves, so the number of groups were not fixed”.

Charles

Thank you for the quick reply, Charles.

What I meant was that of the total of 144 lithics, only 18 of them were FLINT. The participants were free to label the lithics as they chose, meaning some of them labeled them correctly as FLINT; however, some participants might have labeled some of them as CHICKEN or QUARTZ. I hope this makes sense.

Most examples I have seen use a fixed set of scores, like 1 to 5, which is why I was not sure if Fleiss's kappa was suitable in my case. Then again, maybe percent agreement will suffice in my study.

Thank you for your help.

Alexander,

I guess you need to decide on what you want to measure: consistency with the right answer or consistency between the raters.

Also, you need to decide whether some ratings are better than others (i.e. whether a weighting factor is needed).
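For illustration, here is a minimal Python sketch (with made-up labels, not your actual data) showing that Fleiss's kappa only requires each artefact to be rated by the same number of raters; free-form labels are simply tallied into category counts first, so the categories need not be fixed in advance.

```python
# Sketch of Fleiss's kappa with free-text labels (hypothetical data).
from collections import Counter

def fleiss_kappa(ratings):
    """ratings: list of items, each a list of the labels the raters assigned."""
    n = len(ratings[0])                     # raters per item (must be constant)
    cats = sorted({c for item in ratings for c in item})
    counts = [Counter(item) for item in ratings]
    N = len(ratings)
    # proportion of all assignments falling in each category
    p = {c: sum(ct[c] for ct in counts) / (N * n) for c in cats}
    # per-item agreement, then overall observed and expected agreement
    P = [(sum(ct[c] ** 2 for c in cats) - n) / (n * (n - 1)) for ct in counts]
    P_bar = sum(P) / N
    P_e = sum(v ** 2 for v in p.values())
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical example: 4 artefacts, 3 raters each, free-text labels.
ratings = [
    ["FLINT", "FLINT", "FLINT"],
    ["FLINT", "QUARTZ", "FLINT"],
    ["QUARTZ", "QUARTZ", "QUARTZ"],
    ["FLINT", "QUARTZ", "CHERT"],
]
print(round(fleiss_kappa(ratings), 3))  # prints 0.268
```

With your data the list would have 144 items of 9 labels each; agreement with the *correct* category (rather than between raters) would be a different calculation.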

Charles

This is an awesome pack for statistical calculation. Thanks for it.

Thanks to Mr. Charles Zaiontz for his elaborate examples that are making life easy for us to learn how to analyse our research data using Excel scientific packages.

Please keep it up.

This website is recommended especially for students and Professors in Measurement and Testing. Very helpful indeed.