# Reliability

Reliability is the extent to which the measurements of a test remain consistent over repeated tests of the same subject under identical conditions. An experiment is reliable if it yields consistent results for the same measure, i.e. its measurements are largely free of random error. It is unreliable if repeated measurements give different results.

Since measurements are never perfectly accurate, taking the same measurement twice can give different values. We can therefore partition an observed value $x$ into a true value $t$ and an error term $e$, so that $x = t + e$.

Definition 1: The reliability of $x$ is a measure of internal consistency and is the correlation coefficient $r_{xt}$ of $x$ and $t$.
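To make Definition 1 concrete, here is a minimal simulation sketch in Python. It is an illustration only: in real data the true value t is never observed, so the correlation of x and t can be computed directly only when the data are simulated. The normal distributions, the means and standard deviations, and the assumption that the error e is independent of t are all choices made for this example, not part of the definition. Under that independence assumption the correlation should come out close to sd(t)/sd(x).

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def simulate_reliability(n=10_000, sd_true=10.0, sd_error=5.0, seed=0):
    """Simulate x = t + e and return the correlation of x and t.

    Assumption (for illustration): t and e are independent normal
    variables; the mean of 100 for t is arbitrary.
    """
    rng = random.Random(seed)
    t = [rng.gauss(100.0, sd_true) for _ in range(n)]   # true values
    e = [rng.gauss(0.0, sd_error) for _ in range(n)]    # measurement errors
    x = [ti + ei for ti, ei in zip(t, e)]               # observed values
    return pearson(x, t)


r_xt = simulate_reliability()
print(round(r_xt, 3))
```

Raising `sd_error` lowers the correlation, and setting it to zero gives a correlation of 1, which matches the intuition that a measurement with no random error is perfectly reliable.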

Property 1:

Proof: See Proof of Basic Property

• Internal Consistency Reliability
• Interrater Reliability
• Item Analysis

### 13 Responses to Reliability

1. dory says:

This website is recommended, especially for students and professors in Measurement and Testing. Very helpful indeed.

2. Kateregga Abdul Karim says:

Thanks to Mr. Charles Zaiontz for his elaborate examples, which make it easy for us to learn how to analyse our research data using Excel scientific packages.

3. Dhruv Pandya says:

This is an awesome pack for statistical calculation. Thanks for it.

4. Alexander Frivoll says:

Thank you Mr. Charles Zaiontz. This website is amazing.

As an archaeologist, I have little knowledge of statistics. I am trying to test the reliability (consistency) of a method we use for categorizing lithic raw materials. But I am not sure how to approach it, or maybe I am overthinking this.
I have conducted a blind test where 9 individuals have categorized the same 144 lithic artefacts. The participants were free to label the categories themselves, so the number of groups were not fixed. A total of 20 different groups were chosen.

Do you have any ideas on how I can easily show the reliability, or degree of consistency with this? Maybe I have misunderstood the concept of reliability?

Any help is appreciated.

• Charles says:

Alexander,
It sounds like a fit for Fleiss’s Kappa, but I need to understand better what you mean by “The participants were free to label the categories themselves, so the number of groups were not fixed”.
Charles

• Alexander Frivoll says:

Thank you for the quick reply, Charles.

What I meant was that of the total of 144 lithics, only 18 were FLINT. The participants were free to label the lithics as they chose, meaning some of them labeled them correctly as FLINT, while others might have labeled some of them as CHICKEN or QUARTZ. Does this make sense?

Most examples I have seen use fixed numeric scores, like 1 – 5, which is why I was not sure if Fleiss’s kappa was suitable in my case. Then again, maybe percent agreement will suffice in my study.

• Charles says:

Alexander,
I guess you need to decide on what you want to measure: consistency with the right answer or consistency between the raters.
Also, you need to decide whether some ratings are better than others (i.e. whether to use a weighting factor).
Charles
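For readers following this thread, here is a minimal sketch of how Fleiss's kappa, mentioned above, could be computed when raters supply free-text labels instead of fixed numeric scores. It measures consistency between the raters only, not agreement with the correct answer, and it treats all disagreements as equally serious (no weighting). The function name `fleiss_kappa` and the toy data are made up for illustration; every item is assumed to be labeled by the same number of raters.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss's kappa for a list of items, each a list of rater labels.

    Labels can be any hashable values (e.g. strings); the set of
    categories is simply whatever labels appear in the data.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    category_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    p_observed = agreement_sum / n_items
    total = n_items * n_raters
    # Agreement expected by chance from the overall label frequencies.
    p_expected = sum((c / total) ** 2 for c in category_totals.values())
    if p_expected == 1.0:
        return float("nan")  # only one label was ever used; kappa is undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


# Made-up example: 4 artefacts, each labeled by 3 raters.
toy_ratings = [
    ["FLINT", "FLINT", "FLINT"],
    ["FLINT", "QUARTZ", "FLINT"],
    ["QUARTZ", "QUARTZ", "QUARTZ"],
    ["CHERT", "QUARTZ", "CHERT"],
]
print(round(fleiss_kappa(toy_ratings), 3))
```

A value of 1 indicates perfect agreement between raters, while values near 0 indicate agreement no better than chance.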

5. Chav says:

Hello Charles, I’ve downloaded the resource pack and the worksheets and found them very educational, although I’ve had to learn a lot as I’m new to statistics. I’m doing my MA and conducting a study on the relationship between social media and college students’ academic performance. What test of reliability would you suggest for this study? Also, do you have any suggestions on what questions I should include in my questionnaire?

Thank you very much and God bless

• Charles says:

Hello Chav,
I am very pleased that the resource pack and example worksheets have been helpful.
Most people would use Cronbach’s alpha to check the internal consistency (reliability) of their questionnaire, although other tools are also used.
Regarding specific questions for your questionnaire, let’s see if anyone else in the community has some suggestions.
Charles
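For readers who want to see the calculation, here is a minimal sketch of Cronbach's alpha using the standard formula alpha = k/(k − 1) · (1 − sum of item variances / variance of total scores), where k is the number of items. The function name `cronbach_alpha` and the sample scores are made up for illustration. The sketch assumes every respondent answered every item, all items are scored in the same direction, and sample variances are used.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("at least two items are required")
    if any(len(row) != k for row in scores):
        raise ValueError("every respondent needs a score for every item")

    # Sample variance of each item (column) across respondents.
    item_variance_sum = sum(variance(column) for column in zip(*scores))
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Made-up example: 5 respondents answering 3 Likert-type items.
sample_scores = [
    [4.0, 5.0, 4.0],
    [2.0, 3.0, 2.0],
    [5.0, 5.0, 4.0],
    [3.0, 3.0, 3.0],
    [1.0, 2.0, 2.0],
]
print(round(cronbach_alpha(sample_scores), 3))
```

Demographic questions would normally be left out of the score matrix, since they are not intended to measure the same underlying construct.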

6. Nicole Draper says:

Dear Charles,
Thank you for this wonderful website. I am a nurse living and working in Australia, as well as a DHealth candidate. Statistics is something that makes me feel both terrified and overwhelmed! I am currently analysing my data from a 20-question survey I conducted with our doctors, nurses and allied health professionals as part of my studies. I had 148 respondents out of a possible 244, so a 61% response rate. I would like to ensure the tool is reliable and so have been looking into kappa statistics to determine inter-rater reliability. Is this the tool you would suggest?
I have been analysing my data using Excel, just using filters for each of the responses. I can see that nurses, for example, have responded to questions that were doctor-specific, so I think some of my data will need to be cleaned up. Some of the questions were purely demographic (the first 3) and the others were about communication, understanding of a program’s aims and objectives, and whether they had received education on a particular topic at university or in the clinical setting. Thanks so much; I really appreciate any and all advice you have.

• Charles says:

Nicole,
Glad you have gotten value from the website.
The most commonly used tool to check the reliability of a questionnaire is Cronbach’s alpha. See the following webpage: Cronbach’s alpha.
Keep in mind that you need to handle demographic questions separately.
Charles

7. Excellent website. I wonder if you can tell me what statistic I should be using. We have an instrument which measures the length of the eyeball (IOL Master). The company recommends that you take five readings, from which it calculates the mean. Looking at the individual readings, there is some variability. To reduce the variability, I believe that we need to increase the number of measurements, i.e. 5, 10, or 20. I can easily measure the same finding on a subject (or subjects) multiple times for each condition, or we can have a number of people take two measurements for each condition (test-retest repeatability). Which is the best way, and what statistic should we use? Thanks in advance for your reply.

• Charles says:

Jeffrey,
If I understand correctly, you are asking which is better: (a) 2k measurements by the same person or (b) 2 measurements by k different people.
I would guess (b) but I am not sure that I am right. Perhaps someone else can add their view on this.
Charles