Outliers are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. Unfortunately, all analysts will confront outliers and be forced to make decisions about what to do with them. Given the problems they can cause, you might think that it’s best to remove them from your data. But, that’s not always the case. Removing outliers is legitimate only for specific reasons.

Outliers can be very informative about the subject-area and data collection process. It’s essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. Unfortunately, resisting the temptation to remove outliers inappropriately can be difficult. Outliers increase the variability in your data, which decreases statistical power. Consequently, excluding outliers can cause your results to become statistically significant.

In my previous post, I showed five methods you can use to identify outliers. However, identification is just the first step. Deciding how to handle outliers depends on investigating their underlying cause.

In this post, I’ll help you decide whether you should remove outliers from your dataset and how to analyze your data when you can’t remove them. The proper action depends on what causes the outliers. In broad strokes, there are three causes for outliers—data entry or measurement errors, sampling problems and unusual conditions, and natural variation.

Let’s go over these three causes!

## Data Entry and Measurement Errors and Outliers

Errors can occur during measurement and data entry. During data entry, typos can produce weird values. Imagine that we’re measuring the height of adult men and gather the following dataset.

In this dataset, the value of 10.8135 is clearly an outlier. Not only does it stand out, but it’s an impossible height value. Examining the numbers more closely, the zero might have been accidental. Hopefully, we can either go back to the original record or even remeasure the subject to determine the correct height.

These types of errors are easy cases to understand. If you determine that an outlier value is an error, correct the value when possible. That can involve fixing the typo or possibly remeasuring the item or person. If that’s not possible, you must delete the data point because you know it’s an incorrect value.
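A simple range check can flag impossible values like this one before any analysis begins. The snippet below is a minimal sketch, not a definitive procedure. It assumes heights are recorded in meters. Every height except 10.8135 is a hypothetical value invented for illustration, and so are the plausibility bounds and the helper name `flag_impossible`.

```python
# Hypothetical heights (assumed to be in meters); only 10.8135 comes from the example above.
heights_m = [1.7681, 1.8021, 1.7412, 10.8135, 1.8304, 1.7756]

# Assumed plausibility bounds for adult height in meters (illustrative only).
MIN_PLAUSIBLE_M = 1.0
MAX_PLAUSIBLE_M = 2.6


def flag_impossible(values, low, high):
    """Return (index, value) pairs that fall outside the plausible range."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


suspects = flag_impossible(heights_m, MIN_PLAUSIBLE_M, MAX_PLAUSIBLE_M)
for index, value in suspects:
    # Flag for follow-up instead of deleting: the value may be correctable.
    print(f"Row {index}: {value} is outside the plausible range; check the original record")
```

The check only flags values for investigation. Whether to correct or delete a flagged value still depends on whether you can recover the true measurement.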

## Sampling Problems Can Cause Outliers

Inferential statistics use samples to draw conclusions about a specific population. Studies should carefully define a population, and then draw a random sample from it specifically. That’s the process by which a study can learn about a population.

Unfortunately, your study might accidentally obtain an item or person that is not from the target population. There are several ways this can occur. For example, unusual events or characteristics can occur that deviate from the defined population. Perhaps the experimenter measures the item or subject under abnormal conditions. In other cases, you can accidentally collect an item that falls outside your target population, and, thus, it might have unusual characteristics.

**Related post**: Inferential vs. Descriptive Statistics

### Examples of Sampling Problems

Let’s bring this to life with several examples!

Suppose a study assesses the strength of a product. The researchers define the population as the output of the standard manufacturing process. The normal process includes standard materials, manufacturing settings, and conditions. If something unusual happens during a portion of the study, such as a power failure or a machine setting drifting off the standard value, it can affect the products. These abnormal manufacturing conditions can cause outliers by creating products with atypical strength values. Products manufactured under these unusual conditions do not reflect your target population of products from the normal process. Consequently, you can legitimately remove these data points from your dataset.

In a bone density study that I participated in as a scientist, I noticed an outlier in the bone density growth for a subject. Her growth value was very unusual. The study’s subject coordinator discovered that the subject had diabetes, which affects bone health. The goal of our study was to model bone density growth in pre-adolescent girls with no health conditions that affect bone growth. Consequently, her data were excluded from our analyses because she was not a member of our target population.

If you can establish that an item or person does not represent your target population, you can remove that data point. However, you must be able to attribute a specific cause or reason for why that sample item does not fit your target population.

## Natural Variation Can Produce Outliers

The previous causes of outliers are bad things. They represent different types of problems that you need to correct. However, natural variation can also produce outliers—and it’s not necessarily a problem.

All data distributions have a spread of values. Extreme values can occur, but they have lower probabilities. If your sample size is large enough, you’re bound to obtain unusual values. In a normal distribution, approximately 1 in 370 observations will be at least three standard deviations away from the mean. However, random chance might include extreme values in smaller datasets! In other words, the process or population you’re studying might produce weird values naturally. There’s nothing wrong with these data points. They’re unusual, but they are a normal part of the data distribution.
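The three-standard-deviation tail probability can be checked directly from the normal distribution, along with how quickly such values become likely as the sample grows. This is a minimal sketch using Python’s standard library. The function names and the sample sizes are my own choices for illustration, and the second function assumes the observations are independent.

```python
import math


def two_sided_tail_probability(k):
    """P(|Z| >= k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))


def chance_of_at_least_one(p, n):
    """Probability that at least one of n independent observations is that extreme."""
    return 1 - (1 - p) ** n


p3 = two_sided_tail_probability(3)
print(f"P(|Z| >= 3) = {p3:.5f}, or about 1 in {1 / p3:.0f}")

# Illustrative sample sizes (assumed, not from the post).
for n in (30, 100, 1000):
    print(f"n = {n}: chance of at least one such value = {chance_of_at_least_one(p3, n):.3f}")
```

Note that these probabilities apply only to normally distributed data. In skewed or heavy-tailed distributions, values beyond three standard deviations can be considerably more common.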

**Related post**: Normal Distribution and Measures of Variability

### Example of Natural Variation Causing an Outlier

For example, I fit a model that uses historical U.S. Presidential approval ratings to predict how later historians would ultimately rank each President. It turns out a President’s lowest approval rating predicts the historian ranks. However, one data point severely affects the model. President Truman doesn’t fit the model. He had an abysmal lowest approval rating of 22%, but later historians give him a relatively good rank of #6. If I remove that single observation, the R-squared increases by over 30 percentage points!

However, there was no justifiable reason to remove that point. While it was an oddball, it accurately reflects the potential surprises and uncertainty inherent in the political system. If I remove it, the model makes the process appear more predictable than it actually is. Even though this unusual observation is influential, I left it in the model. It’s bad practice to remove data points simply to produce a better fitting model or statistically significant results.

If the extreme value is a legitimate observation that is a natural part of the population you’re studying, you should leave it in the dataset. I’ll explain how to analyze datasets that contain outliers you can’t exclude shortly!

## Guidelines for Dealing with Outliers

Sometimes it’s best to keep outliers in your data. They can capture valuable information that is part of your study area. Retaining these points can be hard, particularly when it reduces statistical significance! However, excluding extreme values solely due to their extremeness can distort the results of the study by removing information about the variability inherent in the study area. You’re forcing the subject area to appear less variable than it is in reality.

When considering whether to remove an outlier, you’ll need to evaluate if it appropriately reflects your target population, subject-area, research question, and research methodology. Did anything unusual happen while measuring these observations, such as power failures, abnormal experimental conditions, or anything else out of the norm? Is there anything substantially different about an observation, whether it’s a person, item, or transaction? Did measurement or data entry errors occur?

If the outlier in question is:

- A measurement error or data entry error, correct the error if possible. If you can’t fix it, remove that observation because you know it’s incorrect.
- Not a part of the population you are studying (i.e., unusual properties or conditions), you can legitimately remove the outlier.
- A natural part of the population you are studying, you should not remove it.

When you decide to remove outliers, document the excluded data points and explain your reasoning. You must be able to attribute a specific cause for removing outliers. Another approach is to perform the analysis with and without these observations and discuss the differences. Comparing results in this manner is particularly useful when you’re unsure about removing an outlier and when there is substantial disagreement within a group over this question.
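The with-and-without comparison described above can be as simple as computing the same summary twice and reporting both. The following is a minimal sketch under stated assumptions: the measurements are hypothetical, the `questionable` index set stands in for whichever observations your group disagrees about, and `summarize` is a hypothetical helper name. In a real study you would rerun your full analysis instead of summary statistics alone.

```python
import statistics

# Hypothetical measurements; the last value is the observation under debate.
values = [12.1, 11.8, 12.6, 12.3, 11.9, 12.4, 12.0, 19.7]
questionable = {7}  # indices of observations whose status is disputed


def summarize(data):
    """Return the summary statistics to compare across the two analyses."""
    return {
        "n": len(data),
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "sd": statistics.stdev(data),
    }


with_outliers = summarize(values)
without_outliers = summarize([v for i, v in enumerate(values) if i not in questionable])

# Report both versions side by side so readers can see what the exclusion changes.
for name in with_outliers:
    print(f"{name:>6}: with = {with_outliers[name]:.2f}, without = {without_outliers[name]:.2f}")
```

Keeping the disputed indices in one explicit place also documents exactly which data points were excluded, which supports the reporting step.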

## Statistical Analyses that Can Handle Outliers

What do you do when you can’t legitimately remove outliers, but they violate the assumptions of your statistical analysis? You want to include them but don’t want them to distort the results. Fortunately, there are various statistical analyses up to the task. Here are several options you can try.

Nonparametric hypothesis tests are robust to outliers. For these alternatives to the more common parametric tests, outliers won’t necessarily violate their assumptions or distort their results.

In regression analysis, you can try transforming your data or using a robust regression analysis available in some statistical packages.

Finally, bootstrapping techniques use the sample data as they are and don’t make assumptions about distributions.

These types of analyses allow you to capture the full variability of your dataset without violating assumptions and skewing results.
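To make the bootstrapping option concrete, here is a minimal sketch of a percentile bootstrap confidence interval for the median, using only Python’s standard library. The percentile method is just one of several bootstrap variants, chosen here for simplicity. The sample data, the function name `bootstrap_ci`, and the number of resamples are assumptions for illustration.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.median, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    estimates = sorted(
        stat(rng.choices(data, k=n))  # resample with replacement, same size as the data
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical sample with one legitimate extreme value left in place.
sample = [12.1, 11.8, 12.6, 12.3, 11.9, 12.4, 12.0, 19.7, 12.2, 11.7]

low, high = bootstrap_ci(sample)
print(f"Median = {statistics.median(sample)}, 95% bootstrap CI = ({low}, {high})")
```

The extreme value stays in the data and takes part in the resampling, so the interval reflects the variability actually observed instead of an artificially trimmed version of it.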

Marlon says

Jim, how can you prove that a dataset is normally distributed using Excel?

Mohd Shehzoor Hussain says

Thank you for your reply Jim.

Can we use median and IQR to measure central tendency (CT) and variance if data is skewed? If yes, I have two questions.

1) what tests are there to measure the change in median and IQR?

2) what do we do with the outliers?

Mohd Shehzoor Hussain says

Hi Jim, how do you get standard deviations for data set without a proper bell curve due to outliers?

Jim Frost says

Hi Mohd,

You can calculate standard deviations using the usual formula regardless of the distribution. However, only in the normal distribution does the SD have special meaning that you can relate to probabilities. If your data are highly skewed, it could affect the standard deviations that you’d expect to see and what counts as an outlier. It’s always important to graph the distribution of your data to help you understand things like outliers. I’ll often make the point throughout my books that using graphs and numerical measures together will help you better understand your data. This case is a good illustration of that principle: using SDs as a measure of whether a value is an outlier makes more sense when you can see the distribution of your data in a graph.

Thanks for writing!

Jimoh, S. O. says

Thank you very much for this post. It is very clear and informative.

Helge says

Thank you! Yes, we were not interested in individuals with ongoing infections, so it seems legit to conclude that the 19 were not our target population. I use Stata for my analyses, and I added the command “vce(robust)” to the syntax to apply robust standard errors to account for any kind of violation of assumptions. Is it possible to say if this was a good or bad idea? 🙂 I understand that robust standard errors account for heteroscedasticity, but since I also used the lrtest syntax (likelihood ratio test) to examine whether variables were heteroscedastic or homoscedastic, and added the syntax “residuals(independent, by(variable)” to allow for heteroscedasticity for the heteroscedastic variables, it might be unnecessary to use both robust standard errors as well as allowing for heteroscedasticity?

Thank you so much. 🙂

Jim Frost says

My hope would be that after dropping those 19 cases with ongoing infections you won’t need to use anything outside the normal analysis. I’d only make adjustments for outliers or heteroscedasticity if you see evidence in the residual plots that there is a need to do so.

Unfortunately, I don’t use Stata and I’m not familiar with their commands. So, I don’t know which ones would be appropriate should there be the issues you describe. But, here’s the general approach that I recommend.

Start with regular analysis first and see how that works. Check the residual plots and assumptions. Typically, when you use an alternative method there is some sort of tradeoff. Don’t go down that road unless residual analysis indicates you need to! If you find that you do need to make adjustments, the residual plots should give you an idea of what needs fixing. Start with minimal corrections and escalate only as needed.

Helge says

Thank you for your swift answer. The 19 were removed due to suspected ongoing infection (e.g. having a cold, HIV or hepatitis), as the variable was a biomarker for inflammation. So the decision was made based on the idea that ongoing infection would bias the biomarkers we were looking at. I will however run the analyses with and without the 19 and compare results, as you suggest. Thank you very much, I really appreciate your work, Jim.

Jim Frost says

Hi Helge,

Ah, so knowing that additional information makes all the difference! It sounds like removing them is the correct action. That’s the additional mystery I suspected was there!

In this post, I talk about this as a sampling problem. It’s similar to the situation I describe with the bone density study and the subject who had diabetes.

It sounds like in your study you’ve defined your target population as something like healthy individuals or individuals without a condition that affects inflammation. That’s the target population that you are studying. In this light, an individual with a condition that affects inflammation would not be a part of that target population. So, if you identify these conditions, those are the specific reasons you can attribute to those individuals.

Based on that information, I’d concur in principle with removing those observations. The results from your study should inform you about the target population. In your report, I’d be sure to clearly define the target population and then explain how you excluded individuals outside of that population. The study is designed to understand the healthy/normal population and not individuals with conditions that affect inflammation.

Additionally, with this information, it’s probably not necessary to run the analyses with and without those subjects–unless it’s informative in some way.

You’re very welcome! And, I’m really glad that you wrote back with the followup information. Your study provides a great example to other readers by highlighting issues that real world studies face involving outliers and deciding how to handle them.

Helge says

Thank you so much for explaining this subject so well! I hope it is okay to ask one question: In multilevel models using a dataset with a number of extreme outliers in a medium-size dataset (19 subjects above 95th percentile in a continuous variable in a dataset of 147 subjects), would you say that the multilevel modeling technique is robust enough to handle the outliers? Or should these 19 be removed in order to not violate assumptions?

Jim Frost says

Hi Helge,

Multilevel modeling is a generalization of linear least squares modeling. As such, it is highly sensitive to outliers.

Whether you should remove these observations is a separate matter though. Use the principles outlined in this blog post to help you make that decision. Do these subjects represent your target population as it is defined for your study? Removing even several outliers is a big deal. So, removing 19 would be far beyond that! On the face of it, removing all 19 doesn’t sound like a good idea. But, as you hopefully gathered from this blog post, answering that question depends on a lot of subject-area knowledge and real close investigation of the observations in question. It’s not possible to give you a blanket answer about it.

I’d recommend really thinking about the target population for your study and taking a very close look at these observations. How is it that you obtained so many subjects above the 95th percentile? Maybe they do represent your target population and you wouldn’t want to artificially reduce the variability? Keep in mind, simply being an extreme value is not enough by itself to warrant exclusion. You need to find an additional reason you can attribute to every data point you exclude.

Again, it’s hard for me to imagine removing 19 observations, or 13% of your dataset! It seems like there must be more to the story here. It’s impossible for me to say what it is, of course, but you should investigate.

If you leave some or all of these outliers in the dataset, you might need to change something about the analysis. However, you should try the regular analysis first, and then check the residual plots and assess the regression assumptions. If you’re lucky, your model might satisfy the assumptions and you won’t need to make adjustments.

If your model does violate assumptions, you can try transforming the data or possibly using a robust regression analysis that you can find in some statistical software packages. These techniques reduce the impact of outliers, including making it so they don’t violate the assumptions.

Another thing to consider is comparing the results with and without the outliers and understanding how it changes the outcomes. As I mention in this post, if a research group is in disagreement or completely unsure about what to do, you can analyze it both ways and report the differences, etc.

Best of luck with your study!