Outliers are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. Unfortunately, all analysts will confront outliers and be forced to make decisions about what to do with them. Given the problems they can cause, you might think that it’s best to remove them from your data. But that’s not always the case. Removing outliers is legitimate only for specific reasons.
Outliers can be very informative about the subject area and data collection process. It’s essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. Unfortunately, resisting the temptation to remove outliers inappropriately can be difficult. Outliers increase the variability in your data, which decreases statistical power. Consequently, excluding outliers can cause your results to become statistically significant.
In my previous post, I showed five methods you can use to identify outliers. However, identification is just the first step. Deciding how to handle outliers depends on investigating their underlying cause.
In this post, I’ll help you decide whether you should remove outliers from your dataset and how to analyze your data when you can’t remove them. The proper action depends on what causes the outliers. In broad strokes, there are three causes for outliers: data entry or measurement errors, sampling problems and unusual conditions, and natural variation.
Let’s go over these three causes!
Data Entry and Measurement Errors and Outliers
Errors can occur during measurement and data entry. During data entry, typos can produce weird values. Imagine that we’re measuring the height of adult men and gather the following dataset.
In this dataset, the value of 10.8135 is clearly an outlier. Not only does it stand out, but it’s an impossible height value. Examining the numbers more closely, we conclude the zero might have been accidental. Hopefully, we can either go back to the original record or even remeasure the subject to determine the correct height.
These types of errors are easy cases to understand. If you determine that an outlier value is an error, correct the value when possible. That can involve fixing the typo or possibly remeasuring the item or person. If that’s not possible, you must delete the data point because you know it’s an incorrect value.
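A simple plausibility check can catch impossible values like this one. The snippet below is only a minimal sketch: apart from 10.8135, the heights are made-up illustration values, and the bounds are my own assumption about what counts as a plausible adult height in meters.

```python
# Hypothetical heights in meters; only 10.8135 comes from the example above.
heights_m = [1.7812, 1.6921, 1.7504, 10.8135, 1.8462]

# Assumed plausibility bounds for adult height (illustrative, not a standard).
MIN_PLAUSIBLE_M = 0.5
MAX_PLAUSIBLE_M = 2.8


def find_impossible_values(values, low, high):
    """Return (index, value) pairs that fall outside the plausible range."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


suspects = find_impossible_values(heights_m, MIN_PLAUSIBLE_M, MAX_PLAUSIBLE_M)
print(suspects)
```

A check like this only flags the record. You still need to go back to the source to correct it, or delete it if you can't.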
Sampling Problems Can Cause Outliers
Inferential statistics use samples to draw conclusions about a specific population. Studies should carefully define a population and then draw a random sample specifically from it. That’s the process by which a study can learn about a population.
Unfortunately, your study might accidentally obtain an item or person that is not from the target population. There are several ways this can occur. For example, unusual events or characteristics can occur that deviate from the defined population. Perhaps the experimenter measures the item or subject under abnormal conditions. In other cases, you can accidentally collect an item that falls outside your target population, and, thus, it might have unusual characteristics.
Related post: Inferential vs. Descriptive Statistics
Examples of Sampling Problems
Let’s bring this to life with several examples!
Suppose a study assesses the strength of a product. The researchers define the population as the output of the standard manufacturing process. The normal process includes standard materials, manufacturing settings, and conditions. If something unusual happens during a portion of the study, such as a power failure or a machine setting drifting off the standard value, it can affect the products. These abnormal manufacturing conditions can cause outliers by creating products with atypical strength values. Products manufactured under these unusual conditions do not reflect your target population of products from the normal process. Consequently, you can legitimately remove these data points from your dataset.
During a bone density study that I participated in as a scientist, I noticed an outlier in the bone density growth for a subject. Her growth value was very unusual. The study’s subject coordinator discovered that the subject had diabetes, which affects bone health. Our study’s goal was to model bone density growth in pre-adolescent girls with no health conditions that affect bone growth. Consequently, her data were excluded from our analyses because she was not a member of our target population.
If you can establish that an item or person does not represent your target population, you can remove that data point. However, you must be able to attribute a specific cause or reason for why that sample item does not fit your target population.
Natural Variation Can Produce Outliers
The previous two causes of outliers represent problems that you need to correct. However, natural variation can also produce outliers, and that’s not necessarily a problem.
All data distributions have a spread of values. Extreme values can occur, but they have lower probabilities. If your sample size is large enough, you’re bound to obtain unusual values. In a normal distribution, approximately 1 in 370 observations (about 0.27%) will be at least three standard deviations away from the mean. However, random chance can also place extreme values in smaller datasets! In other words, the process or population you’re studying might produce weird values naturally. There’s nothing wrong with these data points. They’re unusual, but they are a normal part of the data distribution.
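You can check the three-standard-deviation tail probability for a normal distribution directly. This sketch uses only Python's standard library. The function name and the sample sizes are my own choices for illustration. The two-sided probability comes out to roughly 0.0027, or about 1 in 370.

```python
import math


def two_sided_tail_probability(k):
    """P(|Z| >= k) for a standard normal Z, via the complementary error function."""
    return math.erfc(k / math.sqrt(2))


p = two_sided_tail_probability(3)
print(f"P(|Z| >= 3) = {p:.5f}, about 1 in {1 / p:.0f}")

# Chance that n independent normal observations include at least one such value.
for n in (30, 100, 1000):
    chance = 1 - (1 - p) ** n
    print(f"n = {n}: {chance:.3f}")
```

The loop shows why large samples are bound to contain unusual values while small samples only get them occasionally.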
Example of Natural Variation Causing an Outlier
For example, I fit a model that uses historical U.S. Presidential approval ratings to predict how later historians would ultimately rank each President. It turns out a President’s lowest approval rating predicts the historian ranks. However, one data point severely affects the model. President Truman doesn’t fit the model. He had an abysmal lowest approval rating of 22%, but later historians gave him a relatively good rank of #6. If I remove that single observation, the R-squared increases by over 30 percentage points!
However, there was no justifiable reason to remove that point. While it was an oddball, it accurately reflects the potential surprises and uncertainty inherent in the political system. If I remove it, the model makes the process appear more predictable than it actually is. Even though this unusual observation is influential, I left it in the model. It’s bad practice to remove data points simply to produce a better fitting model or statistically significant results.
If the extreme value is a legitimate observation that is a natural part of the population you’re studying, you should leave it in the dataset. I’ll explain how to analyze datasets that contain outliers you can’t exclude shortly!
To learn more about the example above, read my article about it, Understanding Historians’ Rankings of U.S. Presidents using Regression Models.
Guidelines for Dealing with Outliers
Sometimes it’s best to keep outliers in your data. They can capture valuable information that is part of your study area. Retaining these points can be hard, particularly when it reduces statistical significance! However, excluding extreme values solely due to their extremeness can distort the results by removing information about the variability inherent in the study area. You’re forcing the subject area to appear less variable than it is in reality.
When considering whether to remove an outlier, you’ll need to evaluate if it appropriately reflects your target population, subject area, research question, and research methodology. Did anything unusual happen while measuring these observations, such as power failures, abnormal experimental conditions, or anything else out of the norm? Is there anything substantially different about an observation, whether it’s a person, item, or transaction? Did measurement or data entry errors occur?
If the outlier in question is:
- A measurement error or data entry error, correct the error if possible. If you can’t fix it, remove that observation because you know it’s incorrect.
- Not a part of the population you are studying (i.e., unusual properties or conditions), you can legitimately remove the outlier.
- A natural part of the population you are studying, you should not remove it.
When you decide to remove outliers, document the excluded data points and explain your reasoning. You must be able to attribute a specific cause for removing outliers. Another approach is to perform the analysis with and without these observations and discuss the differences. Comparing results in this manner is particularly useful when you’re unsure about removing an outlier and when there is substantial disagreement within a group over this question.
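Running the analysis both ways can be as simple as fitting the same model twice. The sketch below uses a hand-rolled least-squares line on made-up data in which the last point is the observation in question. The data, the `fit_line` helper, and the index of the questioned point are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Hypothetical data; the observation at index 7 is the one under discussion.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 4.0]
questioned = 7

slope_with, _, r2_with = fit_line(xs, ys)

xs_kept = [x for i, x in enumerate(xs) if i != questioned]
ys_kept = [y for i, y in enumerate(ys) if i != questioned]
slope_without, _, r2_without = fit_line(xs_kept, ys_kept)

print(f"with:    slope = {slope_with:.2f}, R-squared = {r2_with:.2f}")
print(f"without: slope = {slope_without:.2f}, R-squared = {r2_without:.2f}")
```

Reporting both sets of results, along with the reason the point is in question, lets readers judge how much the conclusion depends on it.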
Statistical Analyses that Can Handle Outliers
What do you do when you can’t legitimately remove outliers, but they violate the assumptions of your statistical analysis? You want to include them but don’t want them to distort the results. Fortunately, there are various statistical analyses that are up to the task. Here are several options you can try.
Nonparametric hypothesis tests are robust to outliers. For these alternatives to the more common parametric tests, outliers won’t necessarily violate their assumptions or distort their results.
In regression analysis, you can try transforming your data or using a robust regression analysis available in some statistical packages.
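As one example of a robust regression method, the Theil-Sen estimator takes the median of the slopes between every pair of points, so a single wild value has little pull on the fit. This is my own choice of illustration, not necessarily the method your package offers. The helper name and the data are hypothetical.

```python
from itertools import combinations
from statistics import median


def theil_sen(xs, ys):
    """Theil-Sen line: median of pairwise slopes, then median of the implied intercepts."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # skip pairs with identical x to avoid division by zero
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


# Hypothetical data: y = 2x exactly, except for an outlier at the last point.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 4.0]

slope, intercept = theil_sen(xs, ys)
print(slope, intercept)
```

Here the robust fit recovers the line followed by the other seven points, even though the outlier stays in the dataset.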
Finally, bootstrapping techniques use the sample data as they are and don’t make assumptions about distributions.
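Here is a minimal sketch of a percentile bootstrap confidence interval for the median. The sample values, the number of resamples, and the fixed seed are assumptions I made for illustration. The idea is to resample the data with replacement many times, recompute the statistic each time, and read the interval off the sorted results.

```python
import random
import statistics


def bootstrap_ci(data, stat, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = sorted(
        stat(rng.choices(data, k=n))  # resample with replacement, same size as data
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical sample that includes one legitimate extreme value (18.7).
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 18.7, 11.7, 12.0]

lower, upper = bootstrap_ci(sample, statistics.median)
print(f"sample median = {statistics.median(sample)}")
print(f"95% bootstrap CI for the median: ({lower}, {upper})")
```

The extreme value stays in the data and takes part in the resampling, so the interval reflects the variability it contributes.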
These types of analyses allow you to capture the full variability of your dataset without violating assumptions and skewing results.