Outliers are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. Unfortunately, all analysts will confront outliers and be forced to make decisions about what to do with them. Given the problems they can cause, you might think that it’s best to remove them from your data. But, that’s not always the case. Removing outliers is legitimate only for specific reasons.

Outliers can be very informative about the subject-area and data collection process. It’s essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. Unfortunately, resisting the temptation to remove outliers inappropriately can be difficult. Outliers increase the variability in your data, which decreases statistical power. Consequently, excluding outliers can cause your results to become statistically significant.

In my previous post, I showed five methods you can use to identify outliers. However, identification is just the first step. Deciding how to handle outliers depends on investigating their underlying cause.

In this post, I’ll help you decide whether you should remove outliers from your dataset and how to analyze your data when you can’t remove them. The proper action depends on what causes the outliers. In broad strokes, there are three causes for outliers—data entry or measurement errors, sampling problems and unusual conditions, and natural variation.

Let’s go over these three causes!

## Data Entry and Measurement Errors and Outliers

Errors can occur during measurement and data entry. During data entry, typos can produce weird values. Imagine that we’re measuring the height of adult men and gather the following dataset.

In this dataset, the value of 10.8135 is clearly an outlier. Not only does it stand out, but it’s an impossible height value. Examining the numbers more closely, we conclude the zero might have been accidental. Hopefully, we can either go back to the original record or even remeasure the subject to determine the correct height.

These types of errors are easy cases to understand. If you determine that an outlier value is an error, correct the value when possible. That can involve fixing the typo or possibly remeasuring the item or person. If that’s not possible, you must delete the data point because you know it’s an incorrect value.

## Sampling Problems Can Cause Outliers

Inferential statistics use samples to draw conclusions about a specific population. Studies should carefully define a population, and then draw a random sample from it specifically. That’s the process by which a study can learn about a population.

Unfortunately, your study might accidentally obtain an item or person that is not from the target population. There are several ways this can occur. For example, unusual events or characteristics can occur that deviate from the defined population. Perhaps the experimenter measures the item or subject under abnormal conditions. In other cases, you can accidentally collect an item that falls outside your target population, and, thus, it might have unusual characteristics.

**Related post**: Inferential vs. Descriptive Statistics

### Examples of Sampling Problems

Let’s bring this to life with several examples!

Suppose a study assesses the strength of a product. The researchers define the population as the output of the standard manufacturing process. The normal process includes standard materials, manufacturing settings, and conditions. If something unusual happens during a portion of the study, such as a power failure or a machine setting drifting off the standard value, it can affect the products. These abnormal manufacturing conditions can cause outliers by creating products with atypical strength values. Products manufactured under these unusual conditions do not reflect your target population of products from the normal process. Consequently, you can legitimately remove these data points from your dataset.

During a bone density study that I participated in as a scientist, I noticed an outlier in the bone density growth for a subject. Her growth value was very unusual. The study’s subject coordinator discovered that the subject had diabetes, which affects bone health. Our study’s goal was to model bone density growth in pre-adolescent girls with no health conditions that affect bone growth. Consequently, her data were excluded from our analyses because she was not a member of our target population.

If you can establish that an item or person does not represent your target population, you can remove that data point. However, you must be able to attribute a specific cause or reason for why that sample item does not fit your target population.

## Natural Variation Can Produce Outliers

The previous causes of outliers are bad things. They represent different types of problems that you need to correct. However, natural variation can also produce outliers—and it’s not necessarily a problem.

All data distributions have a spread of values. Extreme values can occur, but they have lower probabilities. If your sample size is large enough, you’re bound to obtain unusual values. In a normal distribution, approximately 1 in 370 observations will be at least three standard deviations away from the mean. However, random chance might include extreme values in smaller datasets! In other words, the process or population you’re studying might produce weird values naturally. There’s nothing wrong with these data points. They’re unusual, but they are a normal part of the data distribution.
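If you'd like to check the three-standard-deviation figure yourself, here is a minimal Python sketch. It assumes the data follow an exactly normal distribution and that observations are independent; the function names are my own for illustration. It computes the two-sided tail probability and the chance that a sample of a given size contains at least one such value.

```python
import math

def two_sided_tail(k: float) -> float:
    """P(|Z| >= k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))

def prob_at_least_one(k: float, n: int) -> float:
    """Chance that n independent normal observations include at least
    one value that is k or more standard deviations from the mean."""
    return 1 - (1 - two_sided_tail(k)) ** n

p3 = two_sided_tail(3)
print(f"P(|Z| >= 3) = {p3:.5f}, or about 1 in {1 / p3:.0f}")

# How likely is it to see at least one such value in samples of various sizes?
for n in (30, 100, 1000):
    print(f"n = {n:>4}: P(at least one) = {prob_at_least_one(3, n):.3f}")
```

Even modest sample sizes have a non-trivial chance of containing a three-sigma value, which is why an extreme value alone is not evidence of a problem.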

**Related post**: Normal Distribution and Measures of Variability

### Example of Natural Variation Causing an Outlier

For example, I fit a model that uses historical U.S. Presidential approval ratings to predict how later historians would ultimately rank each President. It turns out a President’s lowest approval rating predicts the historian ranks. However, one data point severely affects the model. President Truman doesn’t fit the model. He had an abysmal lowest approval rating of 22%, but later historians gave him a relatively good rank of #6. If I remove that single observation, the R-squared increases by over 30 percentage points!

However, there was no justifiable reason to remove that point. While it was an oddball, it accurately reflects the potential surprises and uncertainty inherent in the political system. If I remove it, the model makes the process appear more predictable than it actually is. Even though this unusual observation is influential, I left it in the model. It’s bad practice to remove data points simply to produce a better fitting model or statistically significant results.

If the extreme value is a legitimate observation that is a natural part of the population you’re studying, you should leave it in the dataset. I’ll explain how to analyze datasets that contain outliers you can’t exclude shortly!

## Guidelines for Dealing with Outliers

Sometimes it’s best to keep outliers in your data. They can capture valuable information that is part of your study area. Retaining these points can be hard, particularly when it reduces statistical significance! However, excluding extreme values solely due to their extremeness can distort the results by removing information about the variability inherent in the study area. You’re forcing the subject area to appear less variable than it is in reality.

When considering whether to remove an outlier, you’ll need to evaluate if it appropriately reflects your target population, subject-area, research question, and research methodology. Did anything unusual happen while measuring these observations, such as power failures, abnormal experimental conditions, or anything else out of the norm? Is there anything substantially different about an observation, whether it’s a person, item, or transaction? Did measurement or data entry errors occur?

If the outlier in question is:

- A measurement error or data entry error, correct the error if possible. If you can’t fix it, remove that observation because you know it’s incorrect.
- Not a part of the population you are studying (i.e., unusual properties or conditions), you can legitimately remove the outlier.
- A natural part of the population you are studying, you should not remove it.

When you decide to remove outliers, document the excluded data points and explain your reasoning. You must be able to attribute a specific cause for removing outliers. Another approach is to perform the analysis with and without these observations and discuss the differences. Comparing results in this manner is particularly useful when you’re unsure about removing an outlier and when there is substantial disagreement within a group over this question.
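To make the with-and-without comparison concrete, here is a rough Python sketch. The data are hypothetical values I made up for illustration, and `fit_line` is my own helper that fits a simple least squares line. The point is only to show the mechanics of reporting both sets of results side by side.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x.
    Returns (intercept, slope, r_squared)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

# Hypothetical data: the last observation is the outlier in question.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 4.0]
suspect = 7  # index of the questioned observation

full = fit_line(x, y)
x_trim = [v for i, v in enumerate(x) if i != suspect]
y_trim = [v for i, v in enumerate(y) if i != suspect]
trimmed = fit_line(x_trim, y_trim)

for label, (a, b, r2) in (("with outlier", full), ("without outlier", trimmed)):
    print(f"{label:>16}: intercept={a:.2f}  slope={b:.2f}  R-squared={r2:.3f}")
```

A large gap between the two fits tells you the observation is influential. It does not tell you whether removing it is justified; that still depends on the cause.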

## Statistical Analyses that Can Handle Outliers

What do you do when you can’t legitimately remove outliers, but they violate the assumptions of your statistical analysis? You want to include them but don’t want them to distort the results. Fortunately, there are various statistical analyses up to the task. Here are several options you can try.

Nonparametric hypothesis tests are robust to outliers. For these alternatives to the more common parametric tests, outliers won’t necessarily violate their assumptions or distort their results.

In regression analysis, you can try transforming your data or using a robust regression analysis available in some statistical packages.
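As one illustration of what a robust fit can look like, the sketch below implements the Theil-Sen estimator, which uses the median of all pairwise slopes. This is just one robust technique that I picked because it is easy to write from scratch; the robust regression procedure in your statistical package may use a different method. The data are hypothetical.

```python
from itertools import combinations
from statistics import median

def theil_sen(xs, ys):
    """Theil-Sen estimator: the slope is the median of all pairwise
    slopes, and the intercept is the median of y - slope * x."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # skip pairs with identical x to avoid dividing by zero
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return intercept, slope

# Hypothetical data: a roughly linear trend plus one extreme value at the end.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 4.0]

intercept, slope = theil_sen(x, y)
print(f"Theil-Sen fit: intercept={intercept:.2f}  slope={slope:.2f}")
```

Because medians ignore how far away the extreme slopes are, the unusual point stays in the dataset but has limited pull on the fitted line.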

Finally, bootstrapping techniques use the sample data as they are and don’t make assumptions about distributions.
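Here is a minimal sketch of a percentile bootstrap confidence interval for the median, using only Python's standard library. The sample values, the number of resamples, and the fixed seed are all assumptions for illustration, and the percentile method is only one of several bootstrap interval methods.

```python
import random
from statistics import median

def bootstrap_ci(data, stat=median, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical sample that retains one legitimate extreme value (41.0).
sample = [12.1, 13.4, 11.8, 12.9, 13.1, 12.5, 41.0, 12.2, 13.0, 12.7]

low, high = bootstrap_ci(sample)
print(f"median = {median(sample)}, 95% CI = ({low}, {high})")
```

The extreme value remains in every step of the analysis. It simply appears in some resamples and not in others, so its influence is reflected in the interval rather than removed from it.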

These types of analyses allow you to capture the full variability of your dataset without violating assumptions and skewing results.

Ana says

Thanks so much for this. Do you have any recommended reading on this that would also be something I could cite in order to justify my choice of not removing my outliers? I am studying a particular group (classical musicians) and in a sample of over 700 I have 12 outliers for one of the measures. When I look at them individually across 9 other measures I have no reason to lean towards the possibility of measurement error (although not sure how to justify this entirely). But it looks to me that they engaged well with the 9 measures of the survey and that these are legitimate observations. Those 12 don’t affect the assumptions but do indeed change the means a bit. Thanks so much for your thoughts.

Louis says

Thanks Jim for fast reply.

Consider my question a purely theoretical one. Of course, I am aware of the need to understand the why of those outliers, and unless there is a solid reason, I would keep the data. Maybe I can be a bit more precise: imagine a field case study in which several treatments were performed in a design with randomized blocks (10 blocks – the 10 replicates). Each block contains at least 20 plants. As one block represents the statistical unit, measures performed on plants should be averaged. Now, one series of measures is performed on, let’s say, 10 plants in each block, but the averaged data finally show a significant outlier in block 1 as compared with the other blocks. Later, another series of measures (a different analysis than previously) is performed on 5 plants, different ones than the 10 in the first analysis. The averaged data show a significant outlier in block 2, as compared with the other blocks. The question is how to deal with outliers in this case. Let’s assume the outliers should be removed (whatever the reason is): should I remove blocks 1 and 2 from my whole dataset? Should I only remove data from block 1 in the first analysis, and block 2 in the second analysis, because they were performed on distinct groups of individuals? Or should I consider the most important analysis (for example, the first one, with the outlier in block 1) and remove the block 1 data from the second analysis as well? My question is also about dealing with outliers when variables are independent or not. I know my question is a bit strange; it is only for curiosity.

Thanks a lot for your time !

Louis says

Thanks Jim for this interesting post. I have a little theoretical and very basic question: imagine a trial with 10 replicates per condition (each replicate contains, let’s say, 20 individuals), and 2 evaluated independent variables (independent because the measures were taken from different groups of individuals within each replicate). I am assuming that these outliers come from natural variation. There is, for example, a significant outlier in replicate 1 for variable 1, and one significant outlier in replicate 2 for variable 2. How should I deal with those outliers? Could I remove those outliers independently for each variable, or should I connect them between variables, i.e., remove the data from replicates 1 and 2 for each of the 2 variables? Also, it would be interesting if you could say a few words about the case when the variables are dependent. Thanks a lot for your reply!

Jim Frost says

Hi Louis,

If I understand your question correctly, I’d remove only the outliers if you determine that they really need to be removed. I wouldn’t remove observations in other replicates just because a different replicate has an outlier.

As for the dependent case, I’m assuming you mean multiple observations on the same individuals. In that case, if you know that one observation is an outlier, yes, you’d probably remove that individual completely. As usual, you’d want to investigate why it is an outlier. It would be strange if the same individual has regular values and then suddenly one is an outlier. Maybe it’s a fixable data entry error and you just need to correct it? Or, perhaps, that’s just normal fluctuation for an individual that you’re capturing. So, investigate the underlying cause.

But, if you determine it is an outlier, it seems likely you’d need to remove the individual entirely.

King Mofasa says

Thanks very much for your reply. I will certainly keep them in.

King Mofasa says

Thanks for taking the time to explain this in simple words. I would like to ask the following question: regression analysis is sensitive to multivariate outliers. Most of the references I have reviewed suggest that multivariate outliers should be removed. I don’t see any other suggestion anywhere other than removal. I am hesitant to remove them as the cases seem valid, just different from the rest of the cases. I personally find myself against removing ANY valid outlier. Is there a way to keep multivariate outliers and at least winsorize them (similar to univariate outliers)?

Jim Frost says

Hi King,

Your hesitancy is very wise. I talk about this in this article, but it’s important to distinguish between unusual values that are caused by some sort of problem (unusual conditions, data entry errors, subject is from the wrong population, etc.) versus unusual values that are caused by the natural variation in the process you’re studying. I mention the regression case where one observation was very unusual when it came to predicting the eventual ranking of U.S. Presidents by historians. However, that unusual value was a normal part of the process, so I left it in.

That’s the important distinction that you need to evaluate for these outliers. If they’re valid, then you don’t want to remove them using any method. They provide important information about the natural variability in your subject-area. If you remove them, you’re losing that important information.

I hope that helps! My regression ebook discusses outliers, unusual values, and leverage points in much greater detail from the perspective of regression analysis specifically. It might be helpful.

Tamara says

Hi Jim,

Can you explain the process of winsorizing to deal with outliers that are not measurement errors or mistakes but outliers that are true from the data set?

I had a few outliers in my data set and I winsorized the outliers by changing the outliers to the largest and smallest values that are nearest to them which are not outliers themselves.

Thank you for your continued knowledge about statistical techniques.

Jim Frost says

Hi Tamara,

Winsorizing is a process that either reduces the weight of an unusual value or changes it to be a value that is not so unusual. It sounds like you used a method that changes them to less extreme values.

I’m not a big fan of this process for several reasons. For one thing, I think it’s vital to learn why these outliers are happening. You never know what valuable information you might learn about your study area through this investigation. Ideally, you should determine whether these points are valid data or not.

If they are a valid part of the population you’re studying, changing these values will mean that your sample doesn’t reflect the true variability in the population and could lead you to draw incorrect conclusions. In short, you’ll draw conclusions based on an assumption that there is less variability than what actually exists.

If the data are not a valid part of the population (measurement/data entry error, unusual conditions, drawn from a different population, etc.), then those points should not be included at all.

So, in my mind, Winsorizing doesn’t address either condition, and it reduces your chances of learning something new about your population.

Methods that reduce/remove outliers will usually increase the power of your test and make the results look stronger. So, it can be hard to resist the temptation to use automated methods. Whenever possible, really look into each outlier and learn what is happening.

I suppose the case for Winsorizing is that if, for whatever reason, you cannot assess the outliers and make the determination I describe, but you highly suspect that they’re invalid values and can’t prove it, Winsorizing might be a better middle-ground approach than just removing suspected but unproven bad data. You’d have to carefully weigh that decision based on the data you have about the outliers and subject-area knowledge.

I hope this helps!

Youssef Karam says

Hi Jim,

I am a student carrying out a study on the compensation of directors and how this compensation is mainly affected by the performance of the firm. Three outliers were noticed (graphically and by parameters): two of them are related to years of experience and age, and one is related to % profit, knowing that profit is also included as a variable among others (sales, market value,…). Which of the three do you recommend keeping, knowing that a significant improvement was achieved (Multiple R, R-squared, Error, Adjusted R-squared, parameters, standard errors of parameters, p-values) when all three were omitted?

Thank you

Youssef

Lauren says

Hi Jim,

Thank you so much for your reply. That all makes sense – your reasoning is articulated very well! It is a tricky area but I feel a lot more confident with the decision I have made now from reading this. Thank you again.

Kind regards,

Lauren.

Lauren says

Hi,

I did an experiment and through visual inspection I have identified 7 major outliers from the data set. Most of the outliers belong to one participant who appears to have found the experiment particularly hard. Is this a justified reason for not including these outliers in further analysis?

Thanks,

Lauren.

Jim Frost says

Hi Lauren,

This can be tricky! One thing you need to ask yourself is whether your population normally contains individuals who find the task particularly hard. In other words, does the individual in question simply represent the normal range of skill in the population you’re studying? If yes, then you should include the participant because s/he represents part of the variability in the population.

However, if there is some underlying condition or factor that makes it so the participant doesn’t reflect the normal range of abilities, you can consider removing. I’m thinking of things like some sort of medical/psychological condition that makes the subject not a part of your target population. For example, I excluded the girl with diabetes from the bone density study because we were studying bone density in girls who didn’t have conditions that affected it. Or, perhaps there were unusual conditions that made that particular session more difficult. Fire alarms, interruptions, etc.

There should be some reason for excluding this participant beyond just it being extra hard. If you can say, it was extra hard BECAUSE he has a condition that makes it hard to focus and that is not the population under study. Ok. Or, it was extra hard because fire alarms kept going off during the test. That’s OK too. But, if it was just extra hard just because the subject was on the low end of the distribution of abilities, I’d say that’s typically not enough reason by itself.

John Grenci says

Hey Jim, what about dealing with zeros? Particularly, where you have many of them? Say 50% of your values are zeros, and the context, in my case anyway, is that we are talking about scrap steel per bar of steel (some have it, some don’t). Leaving them in would give a weird distribution for modeling, but taking them out would not only remove much of the dataset but also defeat the whole essence of cause and effect. Do we standardize in some way?

I have your latest book. I do not recall if you address that. I have not gotten thru all of it 🙂 . typing this from work. thanks John

Hui says

Hi Jim,

When we have a dataset to deal with, which should we treat first: missing data or outliers?

Jim Frost says

Hi Hui,

You should determine how you’ll handle missing data before you even begin data collection. After you collect the data, you can assess outliers.

If you’re going to toss out observations with missing data, it’s probably easier to do that first and then assess outliers, but the order probably doesn’t matter too much.

However, if you’re going to estimate the values of the missing data, it’s probably better to generate those estimates after removing the outliers because the outliers will affect those estimates. Here’s the logic for removing outliers first. By removing outliers, you’ve explicitly decided that those values should not affect the results, which includes the process of estimating missing values.

Both cases suggest removing outliers first, but it’s more critical if you’re estimating the values of missing data. I’m not sure there is much literature on this issue, but you should determine whether studies in your field employ standard procedures regarding this question.

Marlon says

Jim, how can you prove that a dataset is normally distributed using Excel?

Mohd Shehzoor Hussain says

Thank you for your reply Jim.

Can we use the median and IQR to measure central tendency and variance if the data are skewed? If yes, I have two questions.

1) what tests are there to measure the change in median and IQR?

2) what do we do with the outliers?

Mohd Shehzoor Hussain says

Hi Jim, how do you get standard deviations for a dataset without a proper bell curve due to outliers?

Jim Frost says

Hi Mohd,

You can calculate standard deviations using the usual formula regardless of the distribution. However, only in the normal distribution does the SD have special meaning that you can relate to probabilities. If your data are highly skewed, it could affect the standard deviations that you’d expect to see and what counts as an outlier. It’s always important to graph the distribution of your data to help you understand things like outliers. I’ll often make the point throughout my books that using graphs and numerical measures together will help you better understand your data. This example is a good illustration of that principle: the meaning of SDs as a measure of being an outlier makes more sense when you can see the distribution of your data in a graph.

Thanks for writing!

Jimoh, S. O. says

Thank you very much for this post. It is very clear and informative.

Helge says

Thank you! Yes, we were not interested in individuals with ongoing infections, so it seems legit to conclude that the 19 were not our target population. I use Stata for my analyses, and I added the command “vce(robust)” to the syntax to apply robust standard errors to account for any kind of violation of assumptions. Is it possible to say if this was a good or bad idea? 🙂 I understand that robust standard errors account for heteroscedasticity, but since I also used the lrtest syntax (likelihood ratio test) to examine whether variables were heteroscedastic or homoscedastic, and added the syntax “residuals(independent, by(variable))” to allow for heteroscedasticity for the heteroscedastic variables, it might be unnecessary to use both robust standard errors as well as allowing for heteroscedasticity?

Thank you so much. 🙂

Jim Frost says

My hope would be that after dropping those 19 cases with ongoing infections you won’t need to use anything outside the normal analysis. I’d only make adjustments for outliers or heteroscedasticity if you see evidence in the residual plots that there is a need to do so.

Unfortunately, I don’t use Stata and I’m not familiar with their commands. So, I don’t know which ones would be appropriate should there be the issues you describe. But, here’s the general approach that I recommend.

Start with regular analysis first and see how that works. Check the residual plots and assumptions. Typically, when you use an alternative method there is some sort of tradeoff. Don’t go down that road unless residual analysis indicates you need to! If you find that you do need to make adjustments, the residual plots should give you an idea of what needs fixing. Start with minimal corrections and escalate only as needed.

Helge says

Thank you for your swift answer. The 19 were removed due to suspected ongoing infection (e.g. having a cold, HIV or hepatitis), as the variable was a biomarker for inflammation. So the decision was made based on the idea that ongoing infection would bias the biomarkers we were looking at. I will however run the analyses with and without the 19 and compare results, as you suggest. Thank you very much, I really appreciate your work, Jim.

Jim Frost says

Hi Helge,

Ah, so knowing that additional information makes all the difference! It sounds like removing them is the correct action. That’s the additional mystery I suspected was there!

In this post, I talk about this as a sampling problem. It’s similar to the situation I describe with the bone density study and the subject who had diabetes.

It sounds like in your study you’ve defined your target population as something like healthy individuals or individuals without a condition that affects inflammation. That’s the target population you are studying. In this light, an individual with a condition that affects inflammation would not be a part of that target population. So, if you identify these conditions, those are the specific reasons you can attribute to those individuals.

Based on that information, I’d concur in principle with removing those observations. The results from your study should inform you about the target population. In your report, I’d be sure to clearly define the target population and then explain how you excluded individuals outside of that population. The study is designed to understand the healthy/normal population and not individuals with conditions that affect inflammation.

Additionally, with this information, it’s probably not necessary to run the analyses with and without those subjects, unless it’s informative in some way.

You’re very welcome! And, I’m really glad that you wrote back with the followup information. Your study provides a great example to other readers by highlighting issues that real world studies face involving outliers and deciding how to handle them.

Helge says

Thank you so much for explaining this subject so well! I hope it is okay to ask one question: In multilevel models using a dataset with a number of extreme outliers in a medium-size dataset (19 subjects above 95th percentile in a continuous variable in a dataset of 147 subjects), would you say that the multilevel modeling technique is robust enough to handle the outliers? Or should these 19 be removed in order to not violate assumptions?

Jim Frost says

Hi Helge,

Multilevel modeling is a generalization of linear least squares modeling. As such, it is highly sensitive to outliers.

Whether you should remove these observations is a separate matter though. Use the principles outlined in this blog post to help you make that decision. Do these subjects represent your target population as it is defined for your study? Removing even several outliers is a big deal. So, removing 19 would be far beyond that! On the face of it, removing all 19 doesn’t sound like a good idea. But, as you hopefully gathered from this blog post, answering that question depends on a lot of subject-area knowledge and real close investigation of the observations in question. It’s not possible to give you a blanket answer about it.

I’d recommend really thinking about the target population for your study and taking a very close look at these observations. How is it that you obtained so many subjects above the 95th percentile? Maybe they do represent your target population and you wouldn’t want to artificially reduce the variability? Keep in mind, simply being an extreme value is not enough by itself to warrant exclusion. You need to find an additional reason you can attribute to every data point you exclude.

Again, it’s hard for me to imagine removing 19 observations, or 13% of your dataset! It seems like there must be more to the story here. It’s impossible for me to say what it is, of course, but you should investigate.

If you leave some or all of these outliers in the dataset, you might need to change something about the analysis. However, you should try the regular analysis first, and then check the residual plots and assess the regression assumptions. If you’re lucky, your model might satisfy the assumptions and you won’t need to make adjustments.

If your model does violate assumptions, you can try transforming the data or possibly using a robust regression analysis that you can find in some statistical software packages. These techniques reduce the impact of outliers, including making it so they don’t violate the assumptions.

Another thing to consider is comparing the results with and without the outliers and understanding how it changes the outcomes. As I mention in this post, if a research group is in disagreement or completely unsure about what to do, you can analyze it both ways and report the differences, etc.

Best of luck with your study!