Validity in research, statistics, psychology, and testing evaluates how well test scores reflect what they’re supposed to measure. Does the instrument measure what it claims to measure? Do the measurements reflect the underlying reality? Or, do they quantify something else?
For example, does an intelligence test assess intelligence or another characteristic, such as education or the ability to recall facts?
Researchers need to consider whether they’re measuring what they think they’re measuring. Validity addresses the appropriateness of the data rather than whether measurements are repeatable (reliability). However, for a test to be valid, it must first be reliable (consistent).
Evaluating validity is crucial because it helps establish which tests to use and which to avoid. If researchers use the wrong instruments, their results can be meaningless!
Validity is usually less of a concern for tangible measurements like height and weight. You might have a cheap bathroom scale that tends to read too high or too low—but it still measures weight. For those types of measurements, you’re more interested in accuracy and precision. However, other types of measurements are not as straightforward.
Validity is often a more significant concern in psychology and the social sciences, where you measure intangible constructs such as self-esteem and positive outlook. If you’re assessing the psychological construct of conscientiousness, you need to ensure that the measurement instrument asks questions that evaluate this characteristic rather than, say, obedience.
Psychological assessments of unobservable latent constructs (e.g., intelligence, traits, abilities, proclivities, etc.) have a specific application known as test validity, which is the extent that theory and data support the interpretations of test scores. Consequently, it is a critical issue because it relates to understanding the test results.
Researchers validate tests using different lines of evidence. An instrument can be strong for one type of validity but weaker for another. Consequently, it is not a black or white issue—it can have degrees.
In this vein, there are many different types of validity and ways of thinking about it. Let’s take a look at several of the more common types. Each kind is a line of evidence that can help support or refute a test’s overall validity. In this post, learn about face, content, criterion, discriminant, concurrent, predictive, and construct validity.
If you want to learn about experimental validity, read my post about internal and external validity. Those types relate to experimental design and methods.
Face validity is the simplest and weakest type. Does the measurement instrument appear “on its face” to measure the intended construct? For a survey that assesses thrill-seeking behavior, you’d expect it to include questions about seeking excitement, getting bored quickly, and risky behaviors. If the survey contains these questions, then “on its face,” it seems like the instrument measures the construct that the researchers intend.
While this is a low bar, it’s an important issue to consider. Never overlook the obvious. Ensure that you understand the nature of the instrument and how it assesses a construct. Look at the questions. After all, if a test can’t clear this fundamental requirement, the other types of validity are a moot point. However, when a measure satisfies face validity, understand it is an intuition or a hunch that it feels correct. It’s not a statistical assessment. If your instrument passes this low bar, you still have more validation work ahead of you.
Content validity is similar to face validity in that you assess it qualitatively—but it’s a more rigorous form. The process often involves assessing individual questions on a test and asking experts whether each item appraises the characteristics that the instrument is designed to cover. This process compares the test against the researcher’s goals and the theoretical properties of the construct. Researchers systematically determine whether each question contributes and that no aspect is overlooked.
For example, if researchers are designing a survey to measure the attitudes and activities of thrill-seekers, they need to determine whether the questions sufficiently cover both of those aspects.
Criterion validity relates to the relationships between the variables in your dataset. If your data are valid, you’d expect to observe a particular correlation pattern between the variables. Researchers typically assess criterion validity by correlating different types of data. For whatever you’re measuring, you expect it to have particular relationships with other variables.
For example, measures of anxiety should correlate positively with the number of negative thoughts. Anxiety scores might also correlate positively with depression and eating disorders. If we see this pattern of relationships, it supports criterion validity. Our measure for anxiety correlates with other variables as expected.
This type is also known as convergent validity because scores for different measures converge or correspond as theory suggests. You should observe high correlations (either positive or negative).
Related post: Interpreting Correlation
This type is the opposite of criterion validity. If you have valid data, you expect particular pairs of variables to correlate positively or negatively. However, for other pairs of variables, you expect no relationship.
For example, if self-esteem and locus of control are not related in reality, their measures should not correlate. You should observe a low correlation between scores.
It is also known as divergent validity because it relates to how different constructs are differentiated. Low correlations (close to zero) indicate that the values of one variable do not relate to the values of the other variables—the measures distinguish between different constructs.
Concurrent validity evaluates the degree to which a measure of a construct correlates with other simultaneous measures of that construct. For example, if you administer two different intelligence tests to the same group, there should be a strong, positive correlation between their scores.
Predictive validity evaluates how well a construct predicts an outcome. For example, standardized tests such as the SAT and ACT are intended to predict how high school students will perform in college. If these tests have high predictive ability, test scores will have a strong, positive correlation with college achievement. Testing this type of validity requires administering the assessment and then measuring the actual outcomes.
A test with high construct validity correctly fits into the big picture with other constructs. Consequently, this type incorporates aspects of criterion, discriminant, concurrent, and predictive validity. A construct must correlate positively and negatively with the theoretically appropriate constructs, have no correlation with the correct constructs, correlate with other measures of the same construct, etc.
Construct validity combines the theoretical relationships between constructs with empirical relationships to see how closely they align. It evaluates the full range of characteristics for the construct you’re measuring and determines whether they all correlate correctly with other constructs, behaviors, and events.
As you can see, validity is a complex issue, particularly when you’re measuring abstract characteristics. To properly validate a test, you need to incorporate a wide range of subject-area knowledge and determine whether the measurements from your instrument fit in with the bigger picture!
Nevo, Baruch (1985), Face Validity Revisited, Journal of Educational Measurement.