Missing data refers to the absence of data entries in a dataset where values are expected but not recorded. They’re the blank cells in your data sheet. Missing values for specific variables or participants can occur for many reasons, including incomplete data entry, equipment failures, or lost files. Missing data are a problem, and not only because they reduce the sample size. In some cases, they can also bias your results.
Data gaps can significantly impact research integrity because they fail to represent the actual values intended for measurement. Understanding the root cause of these gaps is crucial because it determines whether and how to address them.
Read on to learn more about the types of missing data, how they affect your results, and when and how to address them.
Types of Missing Data Explained
Missing data are not all created equal. The types differ in how they affect your dataset and the conclusions you draw from your analysis, and each requires a different strategy to maintain the integrity of your findings.
Missing data can be a form of selection bias because the observations that remain may no longer represent the population you intended to study.
Let’s delve into three types of missing data with examples to illustrate how they might appear in real-world datasets and affect your analysis. We’ll go from the best to worst kind.
Missing Completely at Random (MCAR)
When data are missing completely at random (MCAR), the likelihood of missing values is the same across all observations. In other words, the causes for the missing data are entirely unrelated to the data itself and affect all observations equally. Consequently, you can disregard the potential for the bias that occurs with other kinds of absent values.
For a bone density study I worked on, we measured the subjects’ activity levels for 12 hours with accelerometers and load monitors. Invariably, those monitors would fail randomly, and we’d lose some data. Those data are MCAR because all observations had an equal probability of containing missing values.
Fortunately, when your data gaps are MCAR, you can usually ignore them.
Because MCAR gaps are independent of every measured and unmeasured variable in the study, the remaining observations are still a random sample of the full dataset. MCAR data reduce the sample size and the precision of the sample estimates but tend not to introduce bias. Standard hypothesis tests and confidence intervals account for the loss automatically because they use the reduced sample size, which produces larger standard errors and wider intervals rather than distorted estimates.
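To make the MCAR behavior concrete, here is a minimal simulation sketch using only the Python standard library. The population values, the 30% loss rate, and the helper name `standard_error` are assumptions chosen for illustration, not figures from the bone density study.

```python
import random
import statistics

random.seed(42)

# Hypothetical population of activity readings (arbitrary units).
full = [random.gauss(100, 15) for _ in range(10_000)]

# MCAR: every reading has the same 30% chance of being lost,
# regardless of its value or anything else about the subject.
observed = [x for x in full if random.random() > 0.30]


def standard_error(values):
    """Standard error of the mean for a list of numbers."""
    return statistics.stdev(values) / len(values) ** 0.5


full_mean = statistics.mean(full)
observed_mean = statistics.mean(observed)

print(f"full:     n={len(full)}, mean={full_mean:.2f}, SE={standard_error(full):.3f}")
print(f"observed: n={len(observed)}, mean={observed_mean:.2f}, SE={standard_error(observed):.3f}")
```

The two means should land close together, while the standard error for the observed data is larger. That is the loss of precision without bias described above.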
Missing at Random (MAR)
Despite its name, MAR occurs when the absence of data is not purely random. The probability of missing data is not equal for all measurements. Gaps are more likely for some observations than others. However, observed variables in the dataset predict that unequal probability. Crucially, after you account for those observed variables, the probability doesn’t relate to the missing information itself. Hence, statisticians say that the data gaps correlate with observed values and not the unobserved (missing) values.
For instance, consider a medical study tracking the effects of a new medication. If patients from a particular region are less likely to complete follow-up visits — perhaps due to longer travel distances — their follow-up data would be missing. If the dataset includes the patients’ geographical information, this missing data would be classified as MAR. The missingness depends on the observed geographic location but not directly on the unobserved follow-up outcomes themselves. By acknowledging the role of geography in the availability of follow-up data, researchers can adjust their analyses to better estimate the medication’s effects across all regions.
Analysis of MAR missing data can produce biased results when analysts don’t correctly handle them. This bias occurs because missing values systematically differ from observed values, changing the properties of your sample. It is no longer a representative sample!
However, despite being non-random, MAR is a middle ground where your results can be unbiased when you use the correct methods. If your model includes the observed variables that predict the absent values, the remaining missingness behaves as if it were random, and you can treat it much like MCAR. Modern techniques for handling absent values often begin with the assumption of MAR, as it allows for more nuanced analyses that can account for observed patterns in the dataset.
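The following sketch illustrates the MAR idea with simulated data loosely modeled on the medication example. The region labels, outcome means, and missingness rates are all hypothetical. It compares a naive average of the available follow-ups with an estimate that adjusts for the observed region.

```python
import random
import statistics

random.seed(7)

# Hypothetical patients: region is always recorded; the follow-up outcome may be missing.
patients = []
for _ in range(20_000):
    region = "remote" if random.random() < 0.5 else "urban"
    # Assumption for illustration: outcomes differ by region.
    outcome = random.gauss(60 if region == "remote" else 50, 5)
    # MAR: the chance of a missed visit depends only on the observed region.
    p_missing = 0.6 if region == "remote" else 0.1
    observed = None if random.random() < p_missing else outcome
    patients.append((region, outcome, observed))

# Known only because this is a simulation.
true_mean = statistics.mean(o for _, o, _ in patients)

# Naive complete-case mean ignores region and is pulled toward urban patients.
naive_mean = statistics.mean(obs for _, _, obs in patients if obs is not None)

# Adjusted estimate: average within each region, then weight by region size.
adjusted_mean = 0.0
for r in ("remote", "urban"):
    n_region = sum(1 for reg, _, _ in patients if reg == r)
    region_mean = statistics.mean(
        obs for reg, _, obs in patients if reg == r and obs is not None
    )
    adjusted_mean += region_mean * n_region / len(patients)

print(f"true={true_mean:.2f}  naive={naive_mean:.2f}  adjusted={adjusted_mean:.2f}")
```

Under these assumptions, the naive mean underestimates the true mean because remote patients are underrepresented, while the region-weighted estimate recovers it. The adjustment works only because region, the variable driving the missingness, is in the dataset.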
Missing Not at Random (MNAR)
This type occurs when the probability of missing data relates to the absent values themselves, indicating a deeper issue within the dataset. It’s a problem because the information you would need to model the missingness is exactly the information you don’t have.
In health surveys, individuals with more severe symptoms might be less likely to report their health status. This pattern creates a dataset where sicker individuals are underrepresented.
MNAR is the most challenging type of missing data. It can introduce significant biases because the absent values are systematically different from the ones you record. For instance, if lower-income people are less likely to report their earnings, analysis of these data will likely overestimate the average income.
You might be unable to analyze MNAR data without producing biased results. And, unlike MAR, you can’t correct the bias using your observed variables. In this case, you should critically evaluate your results and compare them to other studies to assess the potential for bias and its degree.
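As a rough illustration of the income example, the sketch below simulates MNAR reporting. The income distribution and the reporting probabilities (40% below the median, 90% above) are invented for the demonstration.

```python
import random
import statistics

random.seed(11)

# Hypothetical incomes drawn from a right-skewed (lognormal) distribution.
incomes = [random.lognormvariate(10.8, 0.5) for _ in range(20_000)]
median_income = statistics.median(incomes)


def reports(income):
    """MNAR: the chance of reporting depends on the value that goes unrecorded."""
    p_report = 0.4 if income < median_income else 0.9
    return random.random() < p_report


reported = [x for x in incomes if reports(x)]

true_mean = statistics.mean(incomes)       # known only because this is a simulation
reported_mean = statistics.mean(reported)  # what the analyst would actually see

print(f"true mean     = {true_mean:,.0f}")
print(f"reported mean = {reported_mean:,.0f}")
```

The reported mean overstates the true mean, and nothing in the reported values alone reveals by how much. That is why comparing against other studies is the suggested check.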
Handling Missing Data
When dealing with missing data, researchers must decide on the best strategy to ensure their analysis remains robust and meaningful.
You typically have three options: accept, remove, or recreate them through estimation.
- Accepting: Leave the blank cells in your dataset and analyze.
- Deletion: Remove data points or records that have missing values. There are two primary methods:
  - Listwise: This technique removes an entire record when any value is missing. It’s straightforward and ensures that only complete cases are analyzed.
  - Pairwise: Unlike listwise deletion, pairwise deletion uses all available data by analyzing pairs of variables without missing values. This method includes more data points in specific statistical analyses but likely has unequal sample sizes for different pairs.
- Imputation: Fill in missing data with estimated values. The simplest form replaces absent values with the variable’s mean, median, or mode. More sophisticated methods, like regression imputation, predict missing values based on related information in the dataset.
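The deletion and simple imputation options above can be sketched on a tiny table. The records and column names below are hypothetical, and `None` stands in for a blank cell.

```python
from itertools import combinations
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "weight": 70.0, "density": 1.10},
    {"age": 51, "weight": None, "density": 0.95},
    {"age": 46, "weight": 82.5, "density": None},
    {"age": None, "weight": 64.0, "density": 1.05},
    {"age": 29, "weight": 77.0, "density": 1.20},
    {"age": 62, "weight": None, "density": 0.90},
]
columns = ["age", "weight", "density"]

# Listwise deletion: drop any record with at least one missing value.
listwise = [r for r in rows if all(r[c] is not None for c in columns)]

# Pairwise deletion: for each pair of variables, count the records
# where both values are present. The counts can differ between pairs.
pairwise_n = {
    (a, b): sum(1 for r in rows if r[a] is not None and r[b] is not None)
    for a, b in combinations(columns, 2)
}

# Mean imputation: replace each missing value with the observed mean of its column.
col_means = {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}
imputed = [
    {c: (r[c] if r[c] is not None else col_means[c]) for c in columns}
    for r in rows
]

print("listwise rows kept:", len(listwise))
print("pairwise sample sizes:", pairwise_n)
print("imputed weight for second record:", imputed[1]["weight"])
```

Here, listwise deletion keeps only two of six records, while pairwise deletion keeps three or four records depending on the pair of variables. Mean imputation keeps all six but fills every gap in a column with the same value.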
Analyzing Missing Data Discussion
Accepting missing data is best for MCAR because they are unlikely to bias your results.
The deletion methods simplify the data handling process but reduce the sample size. Critically, deletion can introduce bias when the absent values are not MCAR.
Imputation helps maintain statistical power by estimating missing values and addressing reduced sample sizes. However, it risks introducing bias if the calculated values do not accurately reflect the correct values. Choosing the proper imputation method is crucial, as incorrect assumptions can result in misleading analysis outcomes.
For MAR data, advanced techniques like regression or multiple imputation can produce unbiased estimates. Consequently, they offer a significant advantage over the deletion methods.
Note: Using a measure of central tendency to replace missing values will still yield biased results for MAR data.
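To see why mean imputation falls short for MAR data while a regression-based approach can help, consider this simulation sketch. The variables `x` and `y`, the linear relationship between them, and the missingness rates are assumptions made for the example. It uses single regression imputation fitted by ordinary least squares, not full multiple imputation.

```python
import random
import statistics

random.seed(3)

# Hypothetical data: x is always observed, and y depends on x.
n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [10 + 2 * xi + random.gauss(0, 1) for xi in x]

# MAR: y is more likely to be missing when the observed x is high.
y_obs = [
    None if random.random() < (0.7 if xi > 0 else 0.1) else yi
    for xi, yi in zip(x, y)
]

pairs = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]
mean_x = statistics.mean(p[0] for p in pairs)
mean_y = statistics.mean(p[1] for p in pairs)

# Mean imputation: every gap gets the observed mean of y.
mean_filled = [mean_y if yi is None else yi for yi in y_obs]

# Regression imputation: fit y ~ x on complete cases, then predict the gaps.
slope = sum((px - mean_x) * (py - mean_y) for px, py in pairs) / sum(
    (px - mean_x) ** 2 for px, _ in pairs
)
intercept = mean_y - slope * mean_x
reg_filled = [
    intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y_obs)
]

true_mean = statistics.mean(y)  # known only because this is a simulation
mean_imputed_mean = statistics.mean(mean_filled)
reg_imputed_mean = statistics.mean(reg_filled)

print(f"true={true_mean:.2f}  mean-imputed={mean_imputed_mean:.2f}  "
      f"regression-imputed={reg_imputed_mean:.2f}")
```

In this setup, the mean-imputed estimate stays biased low because it ignores `x`, while the regression-imputed estimate lands near the true mean. Single regression imputation still understates the variability of the filled-in values, which is one reason multiple imputation is often preferred in practice.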
Navigating missing information is an essential skill in statistical analysis. By understanding the types of missing data and implementing strategies to manage them, researchers can ensure more accurate and reliable outcomes. Effective handling of absent values enriches the quality of your analysis and bolsters the credibility of your findings in the broader research community.
Remember, the goal is to handle missing data, anticipate and mitigate its occurrence, and ensure your dataset is representative and comprehensive.