What is Imputation?
Imputation in statistics is the process of replacing missing data points with plausible values. This technique is crucial because missing values can bias the statistical results. When applied correctly, imputed data reduce this bias.
It’s an unfortunate reality that most datasets have incomplete data. Clearly, missing values decrease the effective sample size, which is undesirable by itself. However, they also distort your results when patterns amongst the missing values exist.
For example, your temperature gauge might be more likely to malfunction at higher temperatures. Or study participants in particular demographic groups are less likely to respond. These patterns systematically distort your sample’s characteristics, skewing the results.
In other words, missing values can increase both random and systematic error.
Imputation provides plausible replacement values that restore the full sample size. And when used correctly, it can reduce the bias that missing values create.
The impact of missing data depends on the nature of the missing data and how many values are missing. Here’s a brief summary of the main types that you’ll need to know for data imputation:
- Missing Completely at Random (MCAR): The chance that a value is missing is unrelated to any variable, observed or unobserved. Results remain unbiased but lose precision because of the reduced sample size.
- Missing at Random (MAR): The chance that a value is missing relates to other, observed variables but not to the missing value itself. Imputation might be required to produce valid results.
- Missing Not at Random (MNAR): Data is missing due to unobserved factors related to the missing data itself. This type is the most challenging and can create biased results that are difficult to correct. In most cases, it is better to recognize the limitations. We should understand the nature and direction of bias and not assume that imputation can fully fix it.
Learn more about Missing Data Overview: Types, Implications & Handling and about Precision.
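To make the difference between these mechanisms concrete, here is a small simulation sketch based on the temperature gauge example. Everything in it is an assumption for illustration: the temperature distribution, the 30% and 80% loss rates, and the 30-degree cutoff are made up. It compares the mean of the readings that survive under MCAR with the mean when the gauge fails more often at high temperatures (an MNAR-style pattern, because the chance of losing a reading depends on the reading itself).

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical "true" temperatures; in practice you never see all of these.
temps = [rng.gauss(25, 5) for _ in range(10_000)]

# MCAR: every reading has the same 30% chance of being lost.
mcar_observed = [t for t in temps if rng.random() >= 0.30]

# Value-dependent missingness: the gauge loses 80% of readings above 30 degrees.
hot_fail_observed = [t for t in temps if not (t > 30 and rng.random() < 0.80)]

true_mean = statistics.mean(temps)
mcar_mean = statistics.mean(mcar_observed)
hot_fail_mean = statistics.mean(hot_fail_observed)

print(f"mean of all readings:            {true_mean:.2f}")
print(f"mean under MCAR:                 {mcar_mean:.2f}")
print(f"mean when hot readings are lost: {hot_fail_mean:.2f}")
```

Under MCAR, the observed mean should stay close to the full-data mean with fewer readings behind it. When the hot readings drop out, the observed mean is pulled downward, and collecting more data with the same faulty gauge would not fix that.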
Why Use Imputation?
In the past, the most common way to handle missing data was to exclude subjects with incomplete data rather than use data imputation. Statisticians refer to it as “complete case” analysis. Indeed, excluding incomplete cases is the default process for most statistical software when it encounters missing values.
This method is best for MCAR data because it does not bias the results. However, it reduces the sample size which, in turn, produces wider confidence intervals (lower precision). It also complicates comparisons when different analyses use different data subsets.
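Here is a rough sketch of what complete case analysis does to your sample. The data are simulated and the 20% missing rates are assumptions chosen for illustration. Two variables are each missing completely at random, so each single-variable analysis uses a different subset, and the complete cases are smaller than either one.

```python
import random
import statistics

rng = random.Random(1)

def maybe_missing(value, rate):
    """Return None with probability `rate`, otherwise the value (MCAR)."""
    return None if rng.random() < rate else value

# Hypothetical records with two variables, each missing about 20% of the time.
rows = [
    (maybe_missing(rng.gauss(50, 10), 0.20), maybe_missing(rng.gauss(100, 15), 0.20))
    for _ in range(1000)
]

x_available = [x for x, _ in rows if x is not None]
y_available = [y for _, y in rows if y is not None]
# Complete case analysis keeps only rows where nothing is missing.
complete_cases = [(x, y) for x, y in rows if x is not None and y is not None]

def standard_error(values):
    """Standard error of the mean."""
    return statistics.stdev(values) / len(values) ** 0.5

se_x_available = standard_error(x_available)
se_x_complete = standard_error([x for x, _ in complete_cases])

print(f"rows with x: {len(x_available)}, rows with y: {len(y_available)}, "
      f"complete cases: {len(complete_cases)}")
print(f"SE of mean(x), all available x: {se_x_available:.3f}")
print(f"SE of mean(x), complete cases:  {se_x_complete:.3f}")
```

The larger standard error for the complete cases translates directly into a wider confidence interval, even though the estimate itself is not biased under MCAR.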
For MAR data, the problems of not using imputation worsen because of the potential bias.
I must admit that when I was first learning statistics, I was skeptical about imputation. It sounds like you’re trying to get something for nothing. Just fill in those blank spots in your dataset with values that sound good. However, methodological advances have produced effective imputation techniques.
No one thinks that imputing values is as good as having real data. However, it helps you manage a tradeoff.
MAR data can bias your results. This problem occurs because the missing values are systematically different from the non-missing values in some fashion. In some cases, imputation provides sufficient benefits to offset real problems that missing data creates.
A broad range of imputation methods exists. The idea behind all these methods is that you can estimate a plausible missing value using other information in the dataset. So, you’re not getting something for nothing.
Using imputation to reduce bias requires understanding the nature of your missing data and its prevalence in your dataset. Let’s look at some imputation methods, going from simple to more complex. This progression represents the changes in the state of the art over time.
Simple Methods
Instead of dropping missing data, simple imputation methods fill in missing values to keep the full sample size. While this helps preserve precision, these simpler techniques can introduce their own bias. These methods used to be standard but are generally no longer recommended because of the problems they introduce. Indeed, my original reluctance to impute missing values stemmed from these earlier approaches’ limitations.
Mean Imputation
This imputation method replaces missing values with the mean of the observed data. It’s easy to use and doesn’t affect the sample mean. However, it weakens or strengthens correlations between variables depending on the approach.
Replacing all missing values for a variable with a single value (the mean) weakens correlations. However, replacing missing values with conditional means (e.g., subgroup means) strengthens correlations.
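The weakening effect of overall mean imputation is easy to see in a simulation. The sketch below uses made-up data (a linear relationship with noise and about 40% of the y values missing completely at random); the `pearson` helper is written out by hand so the example needs only the standard library.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2)
xs = [rng.gauss(0, 1) for _ in range(1000)]
# y is None (missing) for about 40% of rows, completely at random.
ys = [None if rng.random() < 0.40 else 2 * x + rng.gauss(0, 1) for x in xs]

observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
observed_mean = statistics.mean([y for _, y in observed])

# Mean imputation: every missing y gets the same value.
imputed_ys = [observed_mean if y is None else y for y in ys]

r_observed = pearson([x for x, _ in observed], [y for _, y in observed])
r_imputed = pearson(xs, imputed_ys)

print(f"mean of y, observed only:   {observed_mean:.3f}")
print(f"mean of y, after imputing:  {statistics.mean(imputed_ys):.3f}")
print(f"correlation, observed only: {r_observed:.3f}")
print(f"correlation, after imputing: {r_imputed:.3f}")
```

The mean is unchanged, but the imputed rows all sit on a flat line at the mean of y regardless of their x values, which drags the correlation toward zero.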
Hot Deck
Hot-deck imputation replaces missing values using similar observations within the same dataset. This method chooses a complete observation that closely matches the one with missing data. It then uses this complete observation’s values to fill in the gaps. By using information from similar cases, hot-deck imputation can better preserve relationships within the data compared to simpler methods like mean imputation.
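As a minimal sketch, here is one way a nearest-neighbor style of hot-deck imputation could look. The records, the `hot_deck_impute` function, and the choice to match on a single variable (age) are all hypothetical; real implementations typically match on several variables and may pick randomly among several close donors.

```python
def hot_deck_impute(records, match_key, target_key):
    """Fill missing `target_key` values from the donor closest on `match_key`."""
    donors = [r for r in records if r[target_key] is not None]
    completed = []
    for record in records:
        filled = dict(record)  # copy so the original records stay untouched
        if filled[target_key] is None:
            donor = min(donors, key=lambda d: abs(d[match_key] - filled[match_key]))
            filled[target_key] = donor[target_key]
        completed.append(filled)
    return completed

# Hypothetical survey records; None marks a missing income.
records = [
    {"age": 25, "income": 30000},
    {"age": 32, "income": 42000},
    {"age": 47, "income": 61000},
    {"age": 51, "income": None},
    {"age": 29, "income": None},
    {"age": 60, "income": 58000},
]

completed = hot_deck_impute(records, match_key="age", target_key="income")
for row in completed:
    print(row)
```

The 51-year-old receives the income of the 47-year-old donor, and the 29-year-old receives the income of the 32-year-old donor, because those are the closest complete observations on age.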
Cold Deck
Cold deck imputation uses similar data from an external dataset to replace missing values in the active dataset. This process can involve using data from a previous study, expert knowledge, or another reliable source to replace missing values. It risks introducing bias if the external data aren’t well-matched to the current dataset.
Regression Imputation
This method uses a regression model to predict missing values based on other variables in the dataset. While it can provide reasonable estimates, the imputed values fit perfectly along the regression line without accounting for any error. This lack of residual variance overstates the precision of the imputed data. It exaggerates relationships between variables, making the results appear more significant than they are.
The graph below displays the problem. A regression model generated the imputed data points in red to replace missing values. While these values reflect the underlying relationship between the variables, they don’t factor in the variability you see in the real data. This method exaggerates the strength of the existing relationship because the new data points fit it exactly.
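The same problem shows up numerically. The following sketch uses simulated data, so the slope, noise level, and 50% missing rate are assumptions for illustration. It fits a least squares line to the observed pairs, fills each missing y with its fitted value, and compares the correlation before and after.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

rng = random.Random(3)
xs = [rng.gauss(0, 1) for _ in range(500)]
# About half of the y values are missing completely at random.
ys = [None if rng.random() < 0.50 else 2 * x + rng.gauss(0, 2) for x in xs]

obs_x = [x for x, y in zip(xs, ys) if y is not None]
obs_y = [y for y in ys if y is not None]
intercept, slope = fit_line(obs_x, obs_y)

# Regression imputation: missing values land exactly on the fitted line.
imputed_ys = [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]

r_observed = pearson(obs_x, obs_y)
r_imputed = pearson(xs, imputed_ys)
print(f"correlation, observed pairs only: {r_observed:.3f}")
print(f"correlation, after imputation:    {r_imputed:.3f}")
```

Because the imputed points have zero residual, the correlation after imputation is noticeably higher than the correlation in the data you actually observed.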
These methods are simple to use, but they can bias the results. Hence, statisticians usually do not recommend them today. The following techniques are more sophisticated and address some of these issues.
More Advanced Imputation Methods
As time went by, statisticians developed more sophisticated imputation methods. These methods use relationships in the dataset to impute missing values while incorporating natural variability.
Stochastic Regression Imputation
This method improves upon basic regression imputation by adding random error (or unexplained variance) to the imputed values. This process makes them more realistic. Instead of just predicting a missing value using the fitted value from a regression model, this approach includes scatter to reflect the uncertainty.
While it addresses the issue of overly precise estimates seen in regular regression imputation, it still has limitations. It assumes the regression model is correct, which can lead to biased results if the model is misspecified. Additionally, while the added scatter helps, a single imputed dataset still tends to understate the uncertainty because the analysis treats the imputed values as if they were real observations.
Even with these improvements, statisticians do not widely recommend stochastic regression imputation today. More robust methods, like multiple imputation (MI), provide better ways to handle missing data. They capture more uncertainty and variability.
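To see what the added scatter does, here is a sketch that imputes the same simulated dataset two ways: with fitted values only, and with fitted values plus a random draw based on the residual standard deviation. The data, the normal error assumption, and the helper names are all illustrative choices, not a prescribed implementation.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def fit_line(xs, ys):
    """Least squares fit; returns intercept, slope, and residual SD."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (sse / (len(xs) - 2)) ** 0.5

rng = random.Random(4)
xs = [rng.gauss(0, 1) for _ in range(500)]
ys = [None if rng.random() < 0.50 else 2 * x + rng.gauss(0, 2) for x in xs]

obs_x = [x for x, y in zip(xs, ys) if y is not None]
obs_y = [y for y in ys if y is not None]
a, b, resid_sd = fit_line(obs_x, obs_y)

# Deterministic: fitted value only. Stochastic: fitted value plus random error.
deterministic = [a + b * x if y is None else y for x, y in zip(xs, ys)]
stochastic = [a + b * x + rng.gauss(0, resid_sd) if y is None else y
              for x, y in zip(xs, ys)]

r_observed = pearson(obs_x, obs_y)
r_deterministic = pearson(xs, deterministic)
r_stochastic = pearson(xs, stochastic)
print(f"observed pairs:           r = {r_observed:.3f}")
print(f"regression imputation:    r = {r_deterministic:.3f}")
print(f"stochastic regression:    r = {r_stochastic:.3f}")
```

The stochastic version should land much closer to the correlation in the observed pairs. What it cannot do is tell the downstream analysis that half of those y values were generated rather than measured.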
Multiple Imputation (MI)
Multiple Imputation (MI) is different from the previous methods. Instead of using one number to fill in missing values, MI generates several values for each missing data point, reflecting the uncertainty in the imputation process.
For example, MI accounts not only for the natural variation in the data but also for the uncertainty in the regression model itself—specifically, its coefficients. These coefficients are estimates. Therefore, they are uncertain. The MI process draws new, slightly different coefficient values from a range of possible values each time it imputes.
MI generates multiple versions of the dataset, each with different imputed values from these slightly varied models. After creating these datasets, analysts perform standard analyses on each and combine the results to produce better estimates. These estimates reflect the uncertainty in both the missing data and the model used to fill those gaps.
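The sketch below walks through that workflow on simulated MAR data. It is a simplified illustration under stated assumptions, not how any particular package implements MI: to vary the coefficients between imputations, it refits the regression on a bootstrap resample of the observed pairs, which is a stand-in I chose for drawing coefficients from their distribution. The pooling step uses the standard combining formulas (commonly called Rubin's rules), where the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least squares fit; returns intercept, slope, and residual SD."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (sse / (len(xs) - 2)) ** 0.5

rng = random.Random(42)
n = 1000
xs = [rng.gauss(0, 1) for _ in range(n)]
ys_full = [10 + 2 * x + rng.gauss(0, 1) for x in xs]  # unknown in real life
# MAR: y is more likely to be missing when the observed x is positive.
ys = [None if (x > 0 and rng.random() < 0.7) else y for x, y in zip(xs, ys_full)]

observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
complete_case_mean = statistics.mean([y for _, y in observed])

m = 20  # number of imputed datasets
estimates, variances = [], []
for _ in range(m):
    # Refit on a bootstrap resample so the coefficients differ each time.
    sample = [rng.choice(observed) for _ in observed]
    a, b, sd = fit_line([p[0] for p in sample], [p[1] for p in sample])
    completed = [y if y is not None else a + b * x + rng.gauss(0, sd)
                 for x, y in zip(xs, ys)]
    # Standard analysis on each completed dataset: the mean and its variance.
    estimates.append(statistics.mean(completed))
    variances.append(statistics.variance(completed) / n)

# Pool the m results.
pooled = statistics.mean(estimates)
within = statistics.mean(variances)
between = statistics.variance(estimates)
total = within + (1 + 1 / m) * between

print(f"full-data mean (unknown in practice): {statistics.mean(ys_full):.3f}")
print(f"complete-case mean:                   {complete_case_mean:.3f}")
print(f"MI pooled mean:                       {pooled:.3f}")
print(f"pooled standard error:                {total ** 0.5:.3f}")
```

The between-imputation term is what a single imputation leaves out. It makes the pooled standard error larger than the standard error from any one completed dataset, which is the honest price of not having observed those values.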
Using Multiple Imputation
Thanks to these advancements, multiple imputation is the current state-of-the-art method. Consequently, most major statistical software packages support multiple imputation, each with its own set of commands.
In R, popular packages like `mice` and `Amelia` handle MI, with the `mice()` function being particularly well known.
In Stata, the `mi` suite of commands, such as `mi impute` and `mi estimate`, provides a set of tools for implementing and analyzing imputed datasets.
SAS users can use the `PROC MI` and `PROC MIANALYZE` procedures to perform MI and pool results.
SPSS also offers MI through the `Multiple Imputation` procedure under the “Analyze” menu.


Thank you once again Prof. In all the techniques explained above, where does the KNN method fall in?