Reliability and validity are criteria by which researchers assess measurement quality. Measuring a person or item involves assigning scores to represent an attribute. This process creates the data that we analyze. However, to provide meaningful research results, that data must be good. And not all data are good!
How do researchers assess data quality? They can’t just assume they have good measurements. Typically, researchers need to collect data using an instrument and evaluate the quality of the measurements. In other words, they conduct an assessment before the primary research to assess reliability and validity.
For data to be good enough to allow you to draw meaningful conclusions from a research study, they must be reliable and valid. What are the properties of good measurements? In a nutshell, reliability relates to the consistency of measures, and validity addresses whether the measurements are quantifying the correct attribute.
In this post, learn about reliability vs. validity, their relationship, and the various ways to assess them.
Learn more about Experimental Design: Definition, Types, and Examples.
Reliability refers to the consistency of the measure. High reliability indicates that the measurement system produces similar results under the same conditions. If you measure the same item or person multiple times, you want to obtain comparable values. They are reproducible.
If you take measurements multiple times and obtain very different values, your data are unreliable. Numbers are meaningless if repeated measures do not produce similar values. What’s the correct value? No one knows! This inconsistency hampers your ability to draw conclusions and understand relationships.
Suppose you have a bathroom scale that displays very inconsistent results from one time to the next. It’s very unreliable. It would be hard to use your scale to determine your correct weight and to know whether you are losing weight.
Inadequate data collection procedures and low-quality or defective data collection tools can produce unreliable data. Additionally, some characteristics are more challenging to measure reliably. For example, the length of an object is concrete. On the other hand, a psychological construct, such as conscientiousness, depression, and self-esteem, can be trickier to measure reliably.
When assessing studies, evaluate data collection methodologies and consider whether any issues undermine their reliability.
Validity refers to whether the measurements reflect what they’re supposed to measure. This concept is a broader issue than reliability. Researchers need to consider whether they’re measuring what they think they’re measuring. Or do the measurements reflect something else? Does the instrument measure what it says it measures? It’s a question that addresses the appropriateness of the data rather than whether measurements are repeatable.
Validity is a smaller concern for tangible measurements like height and weight. You might have a biased bathroom scale if it tends to read too high or too low—but it still measures weight. Validity is a bigger concern in the social sciences, where you can measure elusive concepts such as positive outlook and self-esteem. If you’re assessing the psychological construct of conscientiousness, you need to confirm that the instrument poses questions that appraise this attribute rather than, say, obedience.
Reliability vs Validity
A measurement must be reliable first before it has a chance of being valid. After all, if you don’t obtain consistent measurements for the same object or person under similar conditions, it can’t be valid. If your scale displays a different weight every time you step on it, it’s unreliable, and it is also invalid.
So, having reliable measurements is the first step towards having valid measures. Validity is necessary for reliability, but it is insufficient by itself.
Suppose you have a reliable measurement. You step on your scale a few times in a short period, and it displays very similar weights. It’s reliable. But the weight might be incorrect.
Just because you can measure the same object multiple times and get consistent values, it does not necessarily indicate that the measurements reflect the desired characteristic.
How can you determine whether measurements are both valid and reliable? Assessing reliability vs. validity is the topic for the rest of this post!
|Importance||Similar measurements for the same person/item under the same conditions.||Measurements reflect what they’re supposed to measure.|
|Assessment||Stability of results across time, between observers, within the test.||Measures have appropriate relationships to theories, similar measures, and different measures.|
|Relationship||Unreliable measurements typically cannot be valid.||Valid measurements are also reliable.|
How to Assess Reliability
Reliability relates to measurement consistency. To evaluate reliability, analysts assess consistency over time, within the measurement instrument, and between different observers. These types of consistency are also known as—test-retest, internal, and inter-rater reliability. Typically, appraising these forms of reliability involves taking multiple measures of the same person, object, or construct and assessing scatterplots and correlations of the measurements. Reliable measurements have high correlations because the scores are similar.
Analysts often assume that measurements should be consistent across a short time. If you measure your height twice over a couple of days, you should obtain roughly the same measurements.
To assess test-retest reliability, the experimenters typically measure a group of participants on two occasions within a few days. Usually, you’ll evaluate the reliability of the repeated measures using scatterplots and correlation coefficients. You expect to see high correlations and tight lines on the scatterplot when the characteristic you measure is consistent over a short period, and you have a reliable measurement system.
This type of reliability establishes the degree to which a test can produce stable, consistent scores across time. However, in practice, measurement instruments are never entirely consistent.
Keep in mind that some characteristics should not be consistent across time. A good example is your mood, which can change from moment to moment. A test-retest assessment of mood is not likely to produce a high correlation even though it might be a useful measurement instrument.
This type of reliability assesses consistency across items within a single instrument. Researchers evaluate internal reliability when they’re using instruments such as a survey or personality inventories. In these instruments, multiple items relate to a single construct. Questions that measure the same characteristic should have a high correlation. People who indicate they are risk-takers should also note that they participate in dangerous activities. If items that supposedly measure the same underlying construct have a low correlation, they are not consistent with each other and might not measure the same thing.
This type of reliability assesses consistency across different observers, judges, or evaluators. When various observers produce similar measurements for the same item or person, their scores are highly correlated. Inter-rater reliability is essential when the subjectivity or skill of the evaluator plays a role. For example, assessing the quality of a writing sample involves subjectivity. Researchers can employ rating guidelines to reduce subjectivity. Comparing the scores from different evaluators for the same writing sample helps establish the measure’s reliability. Learn more about inter-rater reliability.
Related post: Interpreting Correlation
Cronbach’s alpha measures the internal consistency, or reliability, of a set of survey items. Use this statistic to help determine whether a collection of items consistently measures the same characteristic. Learn more about Cronbach’s Alpha.
How to Assess Validity
Validity is more difficult to evaluate than reliability. After all, with reliability, you only assess whether the measures are consistent across time, within the instrument, and between observers. On the other hand, evaluating validity involves determining whether the instrument measures the correct characteristic. This process frequently requires examining relationships between these measurements, other data, and theory. Validating a measurement instrument requires you to use a wide range of subject-area knowledge and different types of constructs to determine whether the measurements from your instrument fit in with the bigger picture!
An instrument with high validity produces measurements that correctly fit the larger picture with other constructs. Validity assesses whether the web of empirical relationships aligns with the theoretical relationships.
The measurements must have a positive relationship with other measures of the same construct. Additionally, they need to correlate in the correct direction (positively or negatively) with the theoretically correct constructs. Finally, the measures should have no relationship with unrelated constructs.
If you need more detailed information, read my post that focuses on Measurement Validity. In that post, I cover the various types, how to evaluate them, and provide examples.
Experimental validity relates to experimental designs and methods. To learn about that topic, read my post about Internal and External Validity.
Whew, that’s a lot of information about reliability vs. validity. Using these concepts, you can determine whether a measurement instrument produces good data!