What is Inter-Rater Reliability?
Inter-rater reliability measures the agreement between subjective ratings by multiple raters, inspectors, judges, or appraisers. It answers the question, is the rating system consistent? High inter-rater reliability indicates that multiple raters’ ratings for the same item are consistent. Conversely, low reliability means they are inconsistent.
For example, judges evaluate the quality of academic writing samples using ratings of 1 – 5. When multiple raters assess the same writing, how similar are their ratings?
Evaluating inter-rater reliability is vital for understanding how likely a measurement system is to misclassify an item. A measurement system is invalid when ratings do not have high inter-rater reliability because the judges frequently disagree.
For the writing example, if the judges give vastly different ratings to the same writing, you cannot trust the results because the ratings are inconsistent. However, if the ratings are very similar, the rating system is consistent.
Examples of Inter-Rater Reliability by Data Types
Ratings data can be binary, categorical, or ordinal. Examples of these ratings include the following:
- Inspectors rate parts using a binary pass/fail system.
- Judges give ordinal scores of 1 – 10 for ice skaters.
- Doctors diagnose diseases using a categorical set of disease names.
In all these examples, inter-rater reliability studies have multiple raters evaluate the same set of items, people, or conditions. Then subsequent analysis quantifies the consistency of ratings between raters.
Methods for Evaluating Inter-Rater Reliability
Evaluating inter-rater reliability involves having multiple raters assess the same set of items and then comparing the ratings for each item. Are the ratings a match, similar, or dissimilar?
There are multiple methods for evaluating rating consistency. I’ll start with percent agreement because it highlights the concept of inter-rater reliability at its most basic level. Then I’ll explain how several more sophisticated analyses improve upon it.
Percent agreement is simply the average amount of agreement expressed as a percentage. Using this method, the raters either agree, or they don’t. It’s a binary outcome with no shades of grey. In other words, this form of inter-rater reliability doesn’t give partial credit for being close.
Imagine we have three judges evaluating the writing samples. They use a rating scale of 1 to 5. The school performs a study to see how closely they agree and records the results in the table below.
|Writing Sample||Judge 1||Judge 2||Judge 3|
Next, count the number of agreements between pairs of judges in each row. With three judges, there are three pairings and, hence, three possible agreements per writing sample. I’ll add columns to record the rating agreements using 1s and 0s for agreement and disagreement, respectively. The final column is the total number of agreements for that writing sample.
|Writing Sample||Judge 1||Judge 2||Judge 3||1 & 2||1 & 3||2 & 3||Total|
Finally, we sum the number of agreements (1 + 3 + 3 + 1 + 1 = 9) and divide by the total number of possible agreements (3 * 5 = 15). Therefore, the percentage agreement for the inter-rater reliability of this dataset is 9/15 = 60%.
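The pairwise counting above can be sketched in a few lines of Python. The ratings below are hypothetical: they are chosen only so that the per-sample agreement totals come out to 1, 3, 3, 1, and 1, matching the sum in the calculation above. The function name `percent_agreement` is likewise just an illustrative choice.

```python
from itertools import combinations

# Hypothetical ratings (rows = writing samples, columns = Judges 1-3).
# They are NOT the study's data; they only reproduce the totals 1, 3, 3, 1, 1.
ratings = [
    (5, 4, 4),
    (3, 3, 3),
    (4, 4, 4),
    (2, 3, 3),
    (1, 2, 1),
]


def percent_agreement(rows):
    """Share of rater pairs, across all items, that gave identical ratings."""
    agreements = 0
    possible = 0
    for row in rows:
        for a, b in combinations(row, 2):  # every pair of judges
            agreements += int(a == b)
            possible += 1
    return agreements / possible


# Agreements per writing sample (the "Total" column).
per_sample = [sum(a == b for a, b in combinations(row, 2)) for row in ratings]
print(per_sample)                  # [1, 3, 3, 1, 1]
print(percent_agreement(ratings))  # 0.6
```

Because the function loops over every pair of raters, the same sketch works for any number of judges, not just three.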
Weaknesses of Percent Agreement
While this is the simplest form of inter-rater reliability, it falls short in several ways. First, it doesn’t account for agreements that occur by chance, which causes the percent agreement method to overestimate inter-rater reliability. Second, it doesn’t factor in the degree of agreement, only absolute agreement. It’s either a match or not. On a scale of 1 – 5, two judges scoring 4 and 5 is much better than scores of 1 and 5!
Unsurprisingly, statisticians have developed various methods to account for both facets. In the following sections, I provide an overview of these more sophisticated inter-rater reliability methods. Then I’ll perform a more thorough analysis using these methods and interpret the inter-rater reliability results using an example dataset.
Cohen’s and Fleiss’s Kappa Statistics
Kappa statistics, like percent agreement, measure absolute agreement and treat all disagreements equally. However, they factor in the role of chance when evaluating inter-rater reliability. Kappa statistics are valid for binary, categorical, and ordinal ratings.
For example, imagine that our rating system uses coin tosses. If we toss two coins, they’ll “agree” half the time (two heads or two tails) due to random chance alone. Our system must beat that random agreement to have high inter-rater reliability.
There are the following two forms of kappa statistics:
- Cohen’s kappa: Compares two raters.
- Fleiss’s kappa: Expands Cohen’s kappa for more than two raters.
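To make the chance correction concrete, here is a minimal sketch of Cohen’s kappa using its usual definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater’s own rating frequencies. The function name and the example data are illustrative assumptions; statistical software will normally compute this for you.

```python
from collections import Counter
from itertools import product


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who rated the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)

    # Observed agreement: share of items with identical ratings.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over categories.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Coin-toss "raters": enumerate the four equally likely outcomes of two coins.
outcomes = list(product("HT", repeat=2))  # HH, HT, TH, TT
coin_a = [a for a, _ in outcomes]
coin_b = [b for _, b in outcomes]
print(cohens_kappa(coin_a, coin_b))                        # 0.0
print(cohens_kappa([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))      # 1.0
```

The two coins agree on half of the outcomes, yet kappa is 0 because that is exactly the agreement chance alone produces.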
Kappa statistics can technically range from -1 to 1. However, in most cases, they’ll be between 0 and 1. Higher values correspond to higher inter-rater reliability (IRR).
- Kappa < 0: IRR is less than chance. (Rare.)
- Kappa = 0: IRR is at a level that chance would produce.
- Kappa > 0: IRR is higher than chance alone would create.
0.75 is a standard benchmark for a minimally good kappa value. However, acceptable kappa values vary greatly by subject area, and higher is better. Frequently, you’ll want at least 0.9.
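For more than two raters, the following is a rough sketch of Fleiss’s kappa based on its standard formula: average the proportion of agreeing rater pairs per item, then correct for the chance agreement implied by the overall category proportions. The ratings are hypothetical and the function name is an illustrative choice, not output from any particular software package.

```python
from collections import Counter


def fleiss_kappa(rows):
    """Fleiss's kappa; rows[i] holds the ratings every rater gave to item i."""
    n_items = len(rows)
    n_raters = len(rows[0])
    if n_raters < 2 or any(len(row) != n_raters for row in rows):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    category_totals = Counter()
    per_item_agreement = []
    for row in rows:
        counts = Counter(row)
        category_totals.update(counts)
        # Proportion of ordered rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    p_bar = sum(per_item_agreement) / n_items  # mean observed agreement
    total = n_items * n_raters
    p_e = sum((c / total) ** 2 for c in category_totals.values())  # chance

    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical ratings: five writing samples, three judges each.
ratings = [(5, 4, 4), (3, 3, 3), (4, 4, 4), (2, 3, 3), (1, 2, 1)]
print(round(fleiss_kappa(ratings), 3))  # 0.458
```

For this hypothetical data the raw pairwise agreement is 60%, but after removing chance agreement the kappa is only about 0.46, which illustrates why percent agreement tends to overstate reliability.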
Statistical software can calculate confidence intervals and p-values for hypothesis testing purposes.
For inter-rater reliability, the hypotheses for testing a kappa statistic are the following:
- Null: Chance causes the observed agreement between raters.
- Alternative: Chance does not cause the observed agreement between raters.
When the p-value is less than your significance level, reject the null hypothesis. Your sample data provide sufficient evidence to conclude that agreements exist between raters in the population more frequently than chance would produce.
Kendall’s Coefficient of Concordance
Kendall’s coefficient of concordance, also known as Kendall’s W, is a measure of inter-rater reliability that accounts for the strength of the relationship between multiple ratings. It measures the extent of agreement rather than only absolute agreement. In other words, it differentiates between near misses versus not close at all. Think of it as a correlation amongst all raters that assesses the strength of the relationship between their ratings.
Given these properties, Kendall’s coefficient is valuable to use alongside kappa when you have ordinal ratings, such as our 1 – 5 ratings for writing samples. Again, two raters giving a writing sample ratings of 4 and 5 is much better than ratings of 1 and 5!
Kendall’s coefficient ranges from 0 to 1, where higher values indicate stronger inter-rater reliability. Values greater than 0.9 are excellent, and 1 indicates perfect agreement.
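As a sketch of how the coefficient can be computed, the code below ranks each rater’s ratings (tied ratings share the average rank), sums the ranks per item, and applies the usual tie-corrected formula W = 12S / (m²(n³ - n) - m·ΣT), where m is the number of raters, n the number of items, S the sum of squared deviations of the rank sums from their mean, and T the tie term (t³ - t) summed over each rater’s groups of tied ratings. The helper names and the example ratings are assumptions made for illustration.

```python
from collections import Counter


def average_ranks(values):
    """Rank values 1..n; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(ratings_by_rater):
    """Kendall's W; ratings_by_rater[r][i] is rater r's rating of item i."""
    m = len(ratings_by_rater)
    n = len(ratings_by_rater[0])
    if m < 2 or n < 2 or any(len(r) != n for r in ratings_by_rater):
        raise ValueError("need >= 2 raters who each rated the same >= 2 items")

    rank_rows = [average_ranks(r) for r in ratings_by_rater]
    rank_sums = [sum(row[i] for row in rank_rows) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)

    # Tie correction: (t^3 - t) for every group of t tied ratings per rater.
    tie_term = sum(t ** 3 - t
                   for r in ratings_by_rater
                   for t in Counter(r).values())
    denominator = m ** 2 * (n ** 3 - n) - m * tie_term
    if denominator == 0:
        raise ValueError("W is undefined when every rater ties all items")
    return 12 * s / denominator


# Hypothetical raters who never match exactly but order the samples identically.
rater_a = [1, 2, 3, 4]
rater_b = [2, 3, 4, 5]
print(kendalls_w([rater_a, rater_b]))              # 1.0
print(kendalls_w([[1, 2, 3], [3, 2, 1]]))          # 0.0
```

In the first example the two raters have zero exact matches, so percent agreement would be 0%, yet W is 1 because they order the samples identically. Note that W works on ranks, so it captures consistent ordering rather than matching values.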
Statistical software can calculate confidence intervals and p-values for hypothesis testing purposes.
For inter-rater reliability, the hypotheses for testing Kendall’s coefficient are the following:
- Null: No association exists between ratings by different raters.
- Alternative: An association exists between ratings by different raters.
When the p-value is less than your significance level, reject the null hypothesis. Your sample data provide sufficient evidence to conclude that an association exists between raters in the whole population.
In summary, a high Kendall’s coefficient suggests that raters use the same criteria to evaluate the samples.
Interpreting Inter-Rater Reliability Example
Let’s use what you learned about inter-rater reliability. For this example, we’ll revisit rating the writing samples. However, we’ll increase the number of raters to five, and they all rate the same 15 writing samples. All raters receive training about how to score them. Now, the school wants to evaluate inter-rater reliability. The CSV dataset for this example is Inter-Rater_Reliability.
Interpreting the Results
Below is the statistical output for the inter-rater reliability assessment. We’ll evaluate both kappa and Kendall’s coefficient because our ratings are ordinal.
All five raters matched on 6 out of 15 (40%) writing samples. That sounds low, but let’s assess kappa and Kendall’s coefficient of concordance.
Each rating (1 – 5) has a kappa value. The raters agreed most frequently on the best writing samples with ratings of 5 (kappa = 0.736534) and agreed the least often on the lower rating of 2 (0.602754). The overall kappa is 0.672965. The p-value of 0.000 indicates that we can reject the null and conclude that agreement occurs more frequently than by chance.
While overall kappa is statistically significant, the value is slightly below the minimally acceptable value of 0.75. More work is necessary to improve inter-rater reliability.
On the other hand, Kendall’s coefficient of concordance is a very high 0.966317. The significant p-value indicates we can reject the null and conclude that an association exists between the ratings.
Given these mixed results, how do we interpret inter-rater reliability for this dataset?
While kappa indicates that the school should improve the absolute agreement, Kendall’s coefficient suggests that the judges use the same criteria. Higher ratings tend to correspond with other higher ratings, and lower scores align with other lower scores. There are no vast differences between the judges for a given sample.
In short, the training program has been partially successful. With some tweaks, the school should be able to increase the frequency of absolute matches and produce acceptable inter-rater reliability.
Additional Analyses Beyond Inter-Rater Reliability
Strictly speaking, inter-rater reliability measures only the consistency between raters, just as the name implies. However, there are additional analyses that can provide crucial information. For example, does an individual rater agree with themselves when measuring the same item multiple times? That’s intra-rater reliability.
Additionally, the best studies don’t only compare raters to each other but also to a known standard. While the raters might give consistent ratings, are those ratings hitting the correct bullseye? In other words, the various raters might be collectively consistent—but consistently wrong. Inter-rater reliability alone can’t make that determination.
By comparing ratings to a standard value, one that experts agree is correct, a study can measure not only the variability between raters but how close they are to the proper rating.
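A simple way to sketch that comparison is to compute, for each rater, the share of items scored exactly as the standard. The standard values, judge names, and ratings below are all hypothetical and exist only to illustrate the idea.

```python
def agreement_with_standard(ratings_by_rater, standard):
    """Fraction of items each rater scored exactly as the reference standard."""
    result = {}
    for rater, ratings in ratings_by_rater.items():
        if len(ratings) != len(standard):
            raise ValueError(f"{rater} did not rate every item")
        matches = sum(r == s for r, s in zip(ratings, standard))
        result[rater] = matches / len(standard)
    return result


# Hypothetical expert-agreed ratings for five writing samples.
standard = [5, 3, 4, 2, 1]

# Hypothetical judges: 2 and 3 agree with each other on every sample,
# but both frequently miss the standard.
ratings_by_rater = {
    "Judge 1": [5, 3, 4, 2, 1],
    "Judge 2": [4, 3, 4, 3, 2],
    "Judge 3": [4, 3, 4, 3, 2],
}

print(agreement_with_standard(ratings_by_rater, standard))
# {'Judge 1': 1.0, 'Judge 2': 0.4, 'Judge 3': 0.4}
```

Judges 2 and 3 are perfectly consistent with each other, yet each matches the standard on only 40% of the samples, which is the kind of problem inter-rater reliability alone would not reveal.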