Outliers are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. Unfortunately, all analysts will confront outliers and be forced to make decisions about what to do with them. Given the problems they can cause, you might think that it’s best to remove them from your data. But, that’s not always the case. Removing outliers is legitimate only for specific reasons.
Outliers can be very informative about the subject-area and the data collection process. It's essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. Unfortunately, it can be difficult to resist the temptation to remove outliers inappropriately: outliers increase the variability in your data, which decreases statistical power, so excluding them can make your results statistically significant.
In my previous post, I showed five methods you can use to identify outliers. However, identification is just the first step. Deciding how to handle outliers depends on investigating their underlying cause.
In this post, I’ll help you decide whether you should remove outliers from your dataset and how to analyze your data when you can’t remove them. The proper action depends on what causes the outliers. In broad strokes, there are three causes for outliers—data entry or measurement errors, sampling problems and unusual conditions, and natural variation.
Let’s go over these three causes!
Data Entry and Measurement Errors and Outliers
Errors can occur during measurement and data entry. During data entry, typos can produce weird values. Imagine that we’re measuring the height of adult men and gather the following dataset.
In this dataset, the value of 10.8135 is clearly an outlier. Not only does it stand out, but it's an impossible height. Examining the number more closely, we conclude that an extra zero was probably typed accidentally (the intended value was perhaps 1.8135). Hopefully, we can either go back to the original record or remeasure the subject to determine the correct height.
These types of errors are easy cases to understand. If you determine that an outlier value is an error, correct the value when possible. That can involve fixing the typo or possibly remeasuring the item or person. If that’s not possible, you must delete the data point because you know it’s an incorrect value.
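As a quick illustration of catching these cases programmatically, here is a minimal Python/pandas sketch; the specific height values (other than the 10.8135 from the example) and the plausible range are hypothetical:

import pandas as pd

# Hypothetical heights (in meters); 10.8135 is the suspected data entry error
heights = pd.Series([1.78, 1.81, 1.91, 10.8135, 1.72, 1.84])

# Flag values outside a plausible range for adult height
plausible = heights.between(1.4, 2.3)
print(heights[~plausible])  # candidates to investigate, correct, or remove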
Sampling Problems Can Cause Outliers
Inferential statistics use samples to draw conclusions about a specific population. Studies should carefully define a population and then draw a random sample from it. That process is how a study learns about a population.
Unfortunately, your study might accidentally obtain an item or person that is not from the target population. There are several ways this can occur. For example, unusual events or characteristics can occur that deviate from the defined population. Perhaps the experimenter measures the item or subject under abnormal conditions. In other cases, you can accidentally collect an item that falls outside your target population, and, thus, it might have unusual characteristics.
Related post: Inferential vs. Descriptive Statistics
Examples of Sampling Problems
Let’s bring this to life with several examples!
Suppose a study assesses the strength of a product. The researchers define the population as the output of the standard manufacturing process. The normal process includes standard materials, manufacturing settings, and conditions. If something unusual happens during a portion of the study, such as a power failure or a machine setting drifting off the standard value, it can affect the products. These abnormal manufacturing conditions can cause outliers by creating products with atypical strength values. Products manufactured under these unusual conditions do not reflect your target population of products from the normal process. Consequently, you can legitimately remove these data points from your dataset.
During a bone density study that I participated in as a scientist, I noticed an outlier in the bone density growth for a subject. Her growth value was very unusual. The study’s subject coordinator discovered that the subject had diabetes, which affects bone health. Our study’s goal was to model bone density growth in pre-adolescent girls with no health conditions that affect bone growth. Consequently, her data were excluded from our analyses because she was not a member of our target population.
If you can establish that an item or person does not represent your target population, you can remove that data point. However, you must be able to attribute a specific cause or reason for why that sample item does not fit your target population.
Natural Variation Can Produce Outliers
The previous causes of outliers are bad things. They represent different types of problems that you need to correct. However, natural variation can also produce outliers—and it’s not necessarily a problem.
All data distributions have a spread of values. Extreme values can occur, but they have lower probabilities. If your sample size is large enough, you're bound to obtain unusual values. In a normal distribution, roughly 1 in 370 observations will fall at least three standard deviations from the mean. And even in smaller datasets, random chance can produce extreme values! In other words, the process or population you're studying might produce weird values naturally. There's nothing wrong with these data points. They're unusual, but they are a normal part of the data distribution.
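You can verify that figure directly; a minimal Python sketch using SciPy:

from scipy.stats import norm

# Two-sided probability of landing at least 3 standard deviations from the mean
p = 2 * norm.sf(3)
print(p)      # about 0.0027
print(1 / p)  # roughly 1 observation in 370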
Related post: Normal Distribution and Measures of Variability
Example of Natural Variation Causing an Outlier
For example, I fit a model that uses historical U.S. Presidential approval ratings to predict how later historians would ultimately rank each President. It turns out a President's lowest approval rating predicts the historians' rankings. However, one data point severely affects the model. President Truman doesn't fit the model. He had an abysmal lowest approval rating of 22%, but later historians gave him a relatively good rank of #6. If I remove that single observation, the R-squared increases by over 30 percentage points!
However, there was no justifiable reason to remove that point. While it was an oddball, it accurately reflects the potential surprises and uncertainty inherent in the political system. If I remove it, the model makes the process appear more predictable than it actually is. Even though this unusual observation is influential, I left it in the model. It’s bad practice to remove data points simply to produce a better fitting model or statistically significant results.
If the extreme value is a legitimate observation that is a natural part of the population you’re studying, you should leave it in the dataset. I’ll explain how to analyze datasets that contain outliers you can’t exclude shortly!
To learn more about the example above, read my article about it, Understanding Historians’ Rankings of U.S. Presidents using Regression Models.
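To see the general phenomenon, here is a hedged Python sketch with synthetic data (not the actual Presidents dataset): a single observation that departs from an otherwise clear trend can change R-squared substantially.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(20, 70, 30)                # e.g., lowest approval ratings
y = 40 - 0.5 * x + rng.normal(0, 3, 30)    # a ranking-like response

# Add one observation that breaks the pattern (low approval, good rank)
x_all = np.append(x, 22.0)
y_all = np.append(y, 6.0)

def r_squared(x, y):
    return sm.OLS(y, sm.add_constant(x)).fit().rsquared

print(r_squared(x_all, y_all))  # with the unusual point
print(r_squared(x, y))          # without it: noticeably higher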
Guidelines for Dealing with Outliers
Sometimes it's best to keep outliers in your data. They can capture valuable information that is part of your study area. Retaining these points can be hard, particularly when doing so reduces statistical significance! However, excluding extreme values solely because of their extremeness can distort the results by removing information about the variability inherent in the study area. You're forcing the subject area to appear less variable than it is in reality.
When considering whether to remove an outlier, you’ll need to evaluate if it appropriately reflects your target population, subject-area, research question, and research methodology. Did anything unusual happen while measuring these observations, such as power failures, abnormal experimental conditions, or anything else out of the norm? Is there anything substantially different about an observation, whether it’s a person, item, or transaction? Did measurement or data entry errors occur?
If the outlier in question is:
- A measurement error or data entry error, correct the error if possible. If you can’t fix it, remove that observation because you know it’s incorrect.
- Not a part of the population you are studying (i.e., unusual properties or conditions), you can legitimately remove the outlier.
- A natural part of the population you are studying, you should not remove it.
When you decide to remove outliers, document the excluded data points and explain your reasoning. You must be able to attribute a specific cause for removing outliers. Another approach is to perform the analysis with and without these observations and discuss the differences. Comparing results in this manner is particularly useful when you’re unsure about removing an outlier and when there is substantial disagreement within a group over this question.
Statistical Analyses that Can Handle Outliers
What do you do when you can’t legitimately remove outliers, but they violate the assumptions of your statistical analysis? You want to include them but don’t want them to distort the results. Fortunately, there are various statistical analyses up to the task. Here are several options you can try.
Nonparametric hypothesis tests are robust to outliers. For these alternatives to the more common parametric tests, outliers won’t necessarily violate their assumptions or distort their results.
In regression analysis, you can try transforming your data or using a robust regression analysis available in some statistical packages.
Finally, bootstrapping techniques use the sample data as they are and don’t make assumptions about distributions.
These types of analyses allow you to capture the full variability of your dataset without violating assumptions and skewing results.
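As an illustration of two of these options, here is a minimal Python sketch with made-up data using SciPy: a Mann-Whitney U test (a nonparametric alternative to the two-sample t-test) and a bootstrap confidence interval for the median.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = np.append(rng.normal(50, 5, 30), 95)  # includes one extreme value
group_b = rng.normal(53, 5, 30)

# Nonparametric test: ranks blunt the influence of the extreme value
print(stats.mannwhitneyu(group_a, group_b))

# Bootstrap CI for the median, using the sample data as they are
res = stats.bootstrap((group_a,), np.median, confidence_level=0.95, random_state=rng)
print(res.confidence_interval)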
Comments

Aaron says
Hi Jim,
What percentage of a dataset is it usually acceptable to remove?
Jim Frost says
Hi Aaron,
Please read the post. That’s not something you can put a one-size-fits-all percentage on. Some unusual values are informative and represent the natural variation of the subject matter. Only values that are incorrect, have problems, or are from a different population should be removed. That’s the determination rather than some preset percentage.
M says
Hi Jim,
I am dealing with data from cancer patients and running different analyses with it. I want to investigate the correlation between some of the variables (a variable related to tumor measurements, let's call it X, and some cancer-related markers Y and Z). There have been issues with data gathering/processing before, and we have had to drop data points (this is a retrospective study and we are dealing with old reports that are sometimes not as comprehensive or well-documented as we wished they'd be). After my most recent analysis, I noticed a data point that is 10x larger than the closest X value and which skews the scatterplots for the correlation of X to Y and Z. The correlation for X-Z is significant with the outlier and not significant without it (Spearman correlation); R remains almost the same. I am wondering whether I should drop that data point, especially considering I will run Cox regression models and Kaplan-Meier analyses with these data. I am worried the X outlier will skew the regressions as well, and I am unsure whether it can be considered "natural" variation when the difference from the rest of the data points is so striking.
Thanks for the help!
Sir Ermaya says
Hi Jim, I enjoy reading your article.
I am writing a paper with 5 variables; it looks like a path analysis model.
How can I make dummy data for the model so that, in the end, the variables affect each other perfectly? Is there any AI or program that can produce a perfect result for this, in order to avoid flagged VIF values and multicollinearity? Thank you very much for the response. God bless
Sarah says
Thanks for explaining this point very well.
I have a question about outliers when we estimate an SVAR model with monthly data and our sample includes extreme values for a certain period (i.e., the financial crisis). Our specification includes 12 lags. Does this number of lags determine how many observations we need to drop from our sample? For example, do we need to drop the whole year that includes the extreme observations, or do we just remove the few months that have these extreme observations?
Jim Frost says
Hi Sarah,
That’s a great question!
The number of lags (12 in this case) primarily determines the temporal structure of the model and how past values influence current estimates. Typically, the choice of lags does not directly dictate how many observations to drop.
Instead of dropping values, consider using a dummy variable for the period of outliers (e.g., financial crisis) to account for structural breaks.
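If it helps, here is a minimal Python/pandas sketch of building such a dummy; the monthly dates and the crisis window are hypothetical placeholders, not a recommendation for this specific sample:

import pandas as pd

# Hypothetical monthly sample with a dummy flagging a crisis window
dates = pd.date_range("2000-01-01", "2015-12-01", freq="MS")
df = pd.DataFrame({"date": dates})
df["crisis"] = df["date"].between("2008-09-01", "2009-06-30").astype(int)
print(df["crisis"].sum())  # number of flagged months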
Dave Wilson says
Hi Jim,
Very much enjoy your articles. I'm trying to find a better way to handle outliers for small sample sizes. We do proficiency testing where several engineers make the same measurement with the same equipment and we compare. Traditionally this has used z-scores, but sometimes we have fewer than 10 samples. I tried using a t-test, but, for example, with five measurements (4.7, 4.8, 4.8, 4.8, 4.8), the 4.7 shows as a large outlier (t-value of 4.0), which, using the eye test, seems extreme. Any thoughts you can share? Cheers, Dave
Jim Frost says
Hi Dave,
Finding outliers with small datasets can be challenging given the limited data.
Let's start with Z-scores. They're probably not a good choice for your scenario for several reasons. The usual rule of treating Z-scores more extreme than +/-3 as a cutoff applies only to normally distributed data, which yours isn't. Also, small datasets limit the potential extremeness of Z-scores. With n = 5, Z-scores can't be more extreme than +/-1.789, and the Z-score for the 4.7 value is right at that limit. You just can't get higher Z-scores with such small datasets. So, you'll never get a Z-score of +/-3 with n = 5. In fact, you need n >= 11 for that to even be possible.
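You can check this bound with the five measurements from your comment; a quick Python sketch:

import numpy as np

values = np.array([4.7, 4.8, 4.8, 4.8, 4.8])
z = (values - values.mean()) / values.std(ddof=1)
print(z.round(3))            # the 4.7 sits at about -1.789

n = len(values)
print((n - 1) / np.sqrt(n))  # maximum possible |Z| for n = 5, about 1.789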
I’m not sure how you’re performing the t-test but that tells you about the mean value. It won’t tell you about outliers.
Instead, I would use subject area knowledge, such as an understanding of the instrument’s precision. Are any values far enough outside the bulk of values to be unreasonable given the instrument’s precision? Compare values to historical values. Do any stick out as being historically unusual? This comes down more to expertise given such limited data and an understanding of the measurement instruments.
I wish I had something more concrete!
Doug Guidry says
Hi Jim,
I would appreciate your opinion on the following Business Interruption Insurance claim:
My client is a personal injury law firm whose revenue (fees) is based on contingency fees, which fluctuate from month to month. Some cases take years to close, and fees are only collected when the case is settled.
The total fees for each of the three years are as follows: Year 1: $4,200,000; Year 2: $1,400,000; Year 3: $6,000,000.
In one month in year 3, our client collected $4,000,000, which represents 35% of the total received in the three-year period.
From these amounts, the insurance company excluded the high month of $4,100,000 and the low month of $2,400 with the following explanation:
"For purposes of this average, the highest and lowest months have been excluded to normalize the average. This is done for purposes of extreme outliers (high and low) that skew the average heavily in either direction."
We contend that the outliers were produced by natural variation and should not be removed because they are a natural part of the business.
By removing the high and low months, our client's claim was cut in half.
Thanking you in advance.
Doug Guidry, CPA
Jim Frost says
Hi Doug,
For starters, my response will be specifically about the statistical aspects of this situation. I'm not familiar with the business, insurance, or legal aspects of it.
The insurance company is using a trimmed mean, which is a valid method in particular cases involving inferential statistics, where you use a sample to estimate the properties of a population. Specifically, you'd use it when outliers reduce the quality of your population estimate (such as the mean). By removing the outliers, the sample average more closely reflects the population. That seems to be the insurance company's rationale. For more information, read my post about Trimmed Means.
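For readers unfamiliar with the mechanics, a trimmed mean simply drops a fixed share of the smallest and largest values before averaging. A minimal Python sketch with hypothetical fee values (not the actual figures from this claim):

from scipy import stats

# Hypothetical monthly fees with one extreme month
fees = [2400, 60000, 85000, 90000, 110000, 150000, 4000000]

print(sum(fees) / len(fees))                        # ordinary mean, uses every month
print(stats.trim_mean(fees, proportiontocut=0.15))  # drops ~15% from each tail first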
However, I don’t think using a trimmed mean is appropriate here because this doesn’t seem to be a case where you’re estimating a population. If I’m understanding the situation correctly, the population here is all your client’s fees. You have all the data for that entire population. Hence, no estimation is necessary. Therefore, there’s no need to remove outliers. As you say, it’s a natural part of the variability in the population. We know this because you have the population of all fees your client received.
An analogy would be a restaurant bill. You'd never remove the highest and lowest priced items because the bill reflects the population of items that the diners ordered. The highest and lowest priced items are part of the natural variability in the population of items they ordered. Consequently, you simply add the items up, and that is your population value. No estimation is necessary and, therefore, no correction for outliers is needed.
In short, the insurance company is using an inferential statistics procedure inappropriately because you have data for the entire population.
That’s my take on it. I hope that helps!
Peter Collard (@sybaseguru) says
If you exclude data points that fall within your population then you have either to change your definition of population or commit fraud. The pharmaceutical community appear to be very good at the latter, judging by recent escapades.
Y.N.A. says
Dear Jim,
I have a question about outliers. What do you do with a participant who is an outlier based on one variable (in my case, age, years of employment or number of working hours per week), but not based on other variables. I do not include the variables ‘age’, ‘years of employment’ and ‘number of working hours’ in my analyses, I only use these variables to describe the sample. In this case, do I remove the participant when calculating the mean age, years of employment and number of working hours, but include him in further analyses? Or do I remove him completely from the dataset?
Kind regards,
Y.N.A.
Jim Frost says
Hi,
Removing potential outliers can be a tricky question. And you have an interesting case. The answer is potentially, yes, you might need to remove that subject from your analysis completely. However, the answer will depend on your subject-area knowledge.
The subject is unusual according to some of your variables but not others. You’ll need to consider whether those unusual values will affect your main outcome of interest in a manner that is not consistent with your target population. There’s no way to determine that statistically. I’m not sure what your primary outcome(s) are or the nature of your analysis, but if those variables you mention indicate that the subject does not represent your target population, you might need to remove that person from the analysis altogether.
Here’s an example that I had in my bone density research. Our target population was healthy pre-adolescent teens with no conditions that affect bone density. Turns out that one of our subjects had a condition that affects bone density. However, her other data were all typical values. We excluded her from the study because her condition put her outside our target population. We weren’t studying people with that condition but rather a normal, healthy population. And her condition could affect our primary outcome of interest (changes in her bone density).
However, if she was unusual in a different way that didn’t put her outside our target population and didn’t directly affect the outcome we were studying, we might have decided to leave her in the experiment.
That’s the kind of decision-making you’ll have to make here.
Fats says
Thank you, this makes a lot more sense!!!!
Fats says
Hi,
I have a question regarding statistical transformation and outliers.
Firstly, why is statistical transformation actually needed? It seems like we are changing the data.
Secondly, from reading, I can see there are two different orders in which to treat outliers. Should outliers be checked first and then non-normality, or should non-normality be checked first and then outliers? Which way is better for rectifying outlying scores?
Thanks!!!
Jim Frost says
Hi Fats,
I’ll start by saying that I’m not a big fan of data transformations. They’re always my last resort.
You're right, they change the data, but the idea is that they change the data into a form that satisfies the assumptions of the analysis, such as normality. Of course, the results apply to the transformed data, which reduces their usefulness. You can use back transformations to get some information in the natural data units. Again, I use them as a last resort when nothing else will do the trick.
Your second question is very insightful. Many people will use Z-scores to identify outliers. However, if the original data doesn’t follow a normal distribution, then the Z-scores won’t either. For example, if your data are right skewed, the Z-scores will also be right skewed. If you see a Z-score of three, that might indicate an outlier, but for right skewed data that could be a typical and valid score!
And you should never remove outliers solely to produce a normal distribution. If your data don’t naturally follow a normal distribution and you delete outliers to force it to follow one, you’re distorting the results!
So, to answer your question, there’s really only one way to use normality with outliers. If you suspect that your data should follow a normal distribution (via other research, subject-area knowledge, etc.) then using a measure like Z-scores can help you identify potential outliers. And, again, they’re just candidates. Just because they’re unusual values doesn’t mean they’re necessarily outliers that should be removed. They might represent a normal part of the process or subject area!
However, you should not look at a distribution (at any point of the process), see that it’s non-normal and assume there must be outliers.
Tread very lightly when using normality as a guide for finding outliers! And don't connect normality/non-normality with the presence or absence of outliers. At most, Z-scores might highlight potential candidates as outliers in particular situations (i.e., when you strongly suspect that the distribution should be normal for theoretical reasons, not because you want the distribution to be normal to satisfy assumptions).
A common mistake that analysts make occurs with distributions that are naturally skewed. Skewed distributions produce more unusual values than the normal distribution, but there’s nothing wrong with those values. They’re a natural part of the study area.
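To make that concrete, here is a small Python sketch with simulated right-skewed (lognormal) data; the point is that values beyond |Z| = 3 occur far more often than the normal distribution implies, even though nothing is wrong with them.

import numpy as np

rng = np.random.default_rng(7)
skewed = rng.lognormal(mean=0, sigma=1, size=100_000)  # naturally right-skewed

z = (skewed - skewed.mean()) / skewed.std(ddof=1)
print((np.abs(z) > 3).mean())  # well above the ~0.3% expected under normality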
Isa says
Hello Jim, I am now dealing with a tricky situation. I have a variable with a wide range between its minimum and maximum (min = 25, max = 3000, mean = 350); however, the majority of points are concentrated at one end. This means that I have a few observations that are leveraging the results. However, they are legitimate observations, so it doesn't make sense to exclude them. Besides, the linearity assumption is satisfied, and when introducing all variables, the model is homoscedastic. To assess the linear behaviour, I plotted a GAM model versus a linear one, and the lines were very similar (I don't know if this is a good approach, but someone suggested it). So, I was wondering, in cases where some points are very far from the majority, do I have to transform the data? If I transform my data, some variables change their significance, so I have to be cautious about it. Thank you
Jim Frost says
Hi Isa, there’s a lot to unpack in there and, as always, the context matters a lot, and I don’t know anything about that.
But here are some points to keep in mind.
For linear assumptions, remember you don’t need to worry about the distribution of the IVs and DV but rather the residuals.
It sounds like you have a variable that is naturally right skewed. In that case, having unusually high values is expected, and you shouldn't remove them. They're a natural part of the process for skewed variables.
It doesn't sound like you've identified any problem with your residuals. Have you checked them against all the assumptions? If the residuals look good and the model makes theoretical sense, I don't see any obvious problem. Unless you have additional reasons, I don't see why you should transform your data.
Sarah says
Hi,
Thanks for this enlightening article! I have a question for you. I have an outlier for one of my variables. However, I realised that this outlier is from a participant who dropped out of the longitudinal study. Therefore, this observation will not be included in the regression model predicting long-term outcomes because of the missing data. I was thus wondering if I should transform the outlier or leave it as is, since it will not affect the statistical power of the regression analysis. I hope this is clear.
Thank you!
Jim Frost says
Hi Sarah,
If you’re removing the observation from the study because the subject dropped out, then it seems like you don’t have a problem. That outlier can’t affect the model if you’re removing the observation for other reasons. Or am I misunderstanding something?
Azy says
Hi Jim,
Thank you for your notes. I have a question: is there any acceptable percentage of cases to delete due to outliers (say, a range between 2% and 10%)? Any suggestions or specific references that I can continue reading for my thesis writing? Thank you, Jim
Jim Frost says
Hi Azy,
Going by a percentage isn’t a good idea. Here’s why.
Imagine you discover that 50% of your observations had serious measurement errors. In that case, you’d be justified in removing 50% of your data.
However, if every single data point was valid, related to your population of study, and measured accurately and precisely, then removing any of them would be incorrect.
What you need to do is find specific reasons why particular data points should be excluded. Identify a problem for each observation you're considering removing. Just being unusual isn't enough by itself. If there is no legitimate reason for their removal, they might represent a natural part of the variation in what you're studying!
That’s probably not the answer you’re hoping for, but it’s the best way to approach removing outliers.
Farida says
The best article ever! Thanks a lot for your informative text. I found an outlier and kept it in my study, so I was afraid that I was wrong about this. Your article clarified things for me.
hadar klein (@KleinHadar) says
Hi! Thanks for the article, it’s very helpful.
I analyzed my data using LASSO regression and my data follows most assumptions:
Homoscedasticity, relationships between the IVs and the DV are linear, there is no multicollinearity, the residuals are independent and normally distributed, and their variance is constant.
But I do have several leverage points and outliers that aren't measurement errors. Does having them mean my results are weaker, or does it mean they're invalid?
Jim Frost says
Hi Hadar,
Those leverage points and outliers are potential complications that you'll need to figure out how to handle. There are pros and cons both ways.
First, before doing anything else, I'd recommend comparing your model with all data points to the model without them. Remove them one by one and determine whether any affect the model greatly. It's possible that some (or even none) affect the results to a substantial degree. The ones whose removal doesn't affect the results, you probably don't have to worry about. If removing any of them does affect the results substantially, then you'll have to make some decisions.
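One common way to run that comparison in practice is with influence diagnostics such as Cook's distance, then refitting without the most influential point. A hedged Python sketch with synthetic data using statsmodels and ordinary least squares (not your LASSO model, where such diagnostics are less standard):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y = 2 + 1.5 * x + rng.normal(scale=0.5, size=50)
y[0] += 6                                  # create one unusual observation

model = sm.OLS(y, sm.add_constant(x)).fit()

# Cook's distance ranks observations by how much their removal shifts the fit
cooks_d, _ = model.get_influence().cooks_distance
drop = np.argmax(cooks_d)

refit = sm.OLS(np.delete(y, drop), sm.add_constant(np.delete(x, drop))).fit()
print(model.params, refit.params)          # compare coefficients with and without it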
If they are legitimate data points that reflect your population (however you define it), there are reasons to leave them in. The outliers might be just the normal part of the subject area’s variability. However, if they don’t represent your population or represent abnormal conditions, or measurement errors (which you say they aren’t), those are reasons to take them out. These decisions can lead to a big debate. I also recommend looking at studies with similar models so you can see how they handled it.
You’ll need to document what and why you do whatever you decide to do. I discuss these issues in more detail in my Regression book.
I hope that helps at least somewhat. Unfortunately, it's impossible for me to give more specific advice because that requires in-depth knowledge about the subject matter and the model.
Nemani Satish says
Hi Jim, I am studying market anomalies (the small firm effect) for two indices over 10 years of data, 2011 to 2021. Due to the COVID-19 pandemic, there was a market crash from February to April 2020. These three months of data largely affect the entire dataset. If I calculate the same statistical tests without these three months, the results are quite the opposite of the previous results (annualized returns, monthly returns, correlation, etc.). Kindly suggest what to do with this type of data. Is it okay if I remove these three months from the dataset, or is there another method to reduce the impact?
Pavitra Vijayakumar says
In a house rent prediction dataset, outliers are present in the number of bedrooms, the number of bathrooms, and the area. Should I remove the outliers?
Fred says
Hi Jim,
This information about outliers is incredibly helpful and so easy to understand. I intend on using some of this information in my dissertation. I am hoping to find a source to reference this information. Is this information also located in any of your books?
Thank you so much, Fred
Jim Frost says
Hi Fred,
Thanks for your kind words! I’m so glad it is helpful!
I talk about outliers in my Hypothesis Testing book because they can affect the results greatly!
Melissa says
Hello Jim, thank you for the thorough explanation and resources!
I'm having an issue with flagging the outliers in my dataset. For instance, the outliers are being flagged with a '0' and a dot, rather than being flagged with 0's and 1's.
Steve says
Hi Jim,
Suppose I have a dataset of 3, 4, 4, 5, 9. The average is 5. However, 9 is sort of an outlier relative to the first four numbers. Is there some kind of weighted average that will give the 9 less influence and bring the overall average closer to 4, which is the average of the first four numbers?
Jim Frost says
Hi Steve, how about the median? See my post about measures of central tendency for more details!
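For those numbers specifically, here is a quick Python check of how the median compares to the mean:

import statistics

data = [3, 4, 4, 5, 9]
print(statistics.mean(data))    # 5, pulled upward by the 9
print(statistics.median(data))  # 4, barely affected by the extreme value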
Faheem Jan says
Hi Jim, I have time series data in a 24 x 2192 matrix, in which each row represents one of the 24 hours of a day and each column represents a single day. From a boxplot, it seems there are many outliers in the dataset. The first five years (of six years total) are used as the validation set and the last year as the testing set. When I measure the accuracy of the model, the mean absolute error and mean absolute percentage error are quite large, so my supervisor suggested it may be due to outliers and recommended a moving window filter method, but I could not implement it in R. Please help me implement such a method in R, or suggest another outlier treatment method that would minimize my forecast error.
Regards,
abdullah says
I have secondary data for 120 companies, and I have outliers for around 20 companies. The differences are high in ACP and CCC. I want to treat these outliers. Can you suggest how I can?
Jim Frost says
Hi Abdullah,
Your first question should be whether they are truly outliers or part of the natural variation. If they are outliers, then you need to wonder why you're obtaining so many of them! That's very unusual.
There's no way I can tell whether they're truly outliers and whether you need to do anything about them. But follow the guidelines I present. Learn about them, why they occur, etc. Determine whether they're natural variation in the subject area or truly outliers caused by one of the reasons I discuss. Making that determination will help you decide what to do. Also, look at similar studies to see how they handled them.
Wamiti says
Hi Jim,
I am working with 36 paired samples from 18 study sites, a pair each from the dry and wet seasons. These are measures of the biomass of invertebrates. One observation in the wet season is an outlier (it has a value of 5.52 g compared to the mean of 1.45 g). While most other samples had insect invertebrates, this one was dominated by snails! Like the other invertebrates, snails also constitute (or potentially constitute) the diet of my study species, a waterbird. This outlier, from your notes, reflects a natural condition since it forms part of what waterbirds may feed on. A normality test of the wet season dataset gives a Shapiro-Wilk value of 0.83 and a p-value of 0.004 without normalization, while the dry season had W = 0.84, p-value = 0.007.
I guess I should include this outlier in the analysis since it's a natural condition, and make notes/observations that sampling may have happened in a site with specific conditions that favor the survival of the snail species in question. Do you agree? Any other thoughts will be appreciated, and many thanks for your educative and enlightening posts. We take our time to read through because they are valuable.
Jim Frost says
Hi Wamiti,
So, first a caveat. These types of decisions always use a large amount of subject area knowledge. And, that’s not exactly my area.
With that out of the way, it sure sounds like natural variability to me. However, consider how you're defining your population or study area. If it falls outside what you're defining as the population you're studying, that's another reason to exclude it. Like in the bone density study I was involved in: we defined our target population as healthy individuals with no diseases that affect bone density. One of our potential subjects had diabetes, which affects the bones. We had to exclude her from the study because she wasn't part of our target population, even though people with diabetes are part of the larger population. Of course, our results applied only to those without a condition that affects bone density.
So, factor that in too. How are you defining your study area? To what population do you want to generalize your findings?
It’s hard for me to give you a concrete answer! But that’s the type of issue I’d think about. Is it natural variation in your target population? Or is it outside your target population?
I hope that helps at least somewhat. Discussions with someone else in the field or assessing similar studies might be helpful to see how others have handled similar situations.
Tiffany says
Hello, thanks so much for your explanation! I have a couple of questions since my colleagues and I are having some trouble dealing with variability in our cell-based assays:
1. We plan to use Grubbs test on 4 replicate values to remove outliers. Is it okay to proceed with our computations even if some set-ups have an outlier removed (i.e. we’ll be averaging 3 values only), while others may not have any outliers (we’ll be averaging 4 values since there’s no outlier)?
2. Which statistical value should we compute for to ensure that our trials are valid? We are looking at B-score, Z-score, Z-factor, Z’-score, but we are not quite sure what the difference of these are and which one is more appropriate for cell assays done in 96-well plates.
Thank you so much! Would mean a lot if you could share your insights and expertise since we have no statistician on our team. 🙂
Jim Frost says
Hi Tiffany,
Why are you already so sure that you'll be removing values from the set-ups? Would they definitely be outliers for removal or a part of the natural variation? Would you investigate and try to understand the cause of these outliers?
I'd recommend investigating the outliers, understanding the reason they occurred, and then determining whether some identifiable event or problem occurred with the set-up that makes the value invalid because it's not part of the normal variation. If there's no identifiable reason, it might just be part of the normal variability. Typically, you don't remove values only because they're unusual. Usually, you need some identifiable situation or condition that caused them to be invalid because they don't represent the population/conditions that you're studying.
I'm asking these questions because it seems like you're already planning to remove a large proportion of values without understanding what is causing them. You don't want to remove them if they are valid values that just happen to be a bit unusual (but a normal part of the variation). On the other hand, if they truly are invalid values, then you'd want to understand why you're getting so many of them!
I also think that performing a Grubbs test (or any hypothesis test) on a sample size of 4 is problematic!
Please read my related post about 5 Ways to Find Outliers. In that article, I write about methods such as Z-scores and the Grubbs test, and particularly their limitations. Note that with a sample size of only 4, your maximum Z-score can be only 1.5, which won't be flagged as an outlier. I'm not familiar with using the Z-factor, aka Z prime or Z', to find outliers. My understanding is that it is an effect size for differences between sample means. I'm not sure how, or if, you can use it to identify outliers. I believe it is typically used to identify potentially interesting effects. It is different from Z-values.
I hope that helps at least point you in the right direction!
KG says
Hello,
Thank you for this post. I have a question that is not really about outliers, but I can’t figure out what keywords to search to find any answers. I am working with a very large dataset relating fishing effort to spatial location. The sampling unit is individual fishing boats; all fishing boats were surveyed on random days with the goal of capturing 20% of the population. When a boat was surveyed, variables collected included the target fish species, how much they caught, how many anglers were aboard the boat, how many days they were out fishing before returning to shore, and the “block” they were fishing in. I want to look at summary statistics by block and species, but I have numerous instances where only one boat was recorded fishing in a given block. I am unsure as to whether I should drop any blocks that have only one observation or even any blocks with fewer than three observations. Do you have any suggestions or can you point me in the right direction? Thank you in advance!
Jim Frost says
Hi KG,
I'm going to assume you mean blocks in the experimental sense, where you group observations by similar conditions to reduce variability. Typically, blocks contain nuisance variables that you're not particularly interested in but need to control for. However, you indicate that you want to understand the summary statistics by block, so perhaps you mean something else? Are they geographic regions?
That’s a bit of a tricky situation. It doesn’t sound like you have enough observations for the blocks. The more complex answer depends on how precise your estimates must be, the variability in the data, and, potentially, the type of analyses you might want to perform. More simply, I’d say that even three is too few. The problem is that estimates of the mean with only three data points are too unstable. While there’s no concrete answer that covers all situations (see the more complex answer), a good rule of thumb is that you probably would want 10 data points per mean. In a pinch, you can go a bit lower but I wouldn’t go too much lower.
Is there a way to combine blocks meaningfully?
Marco De Nardi says
Thanks for the feedback.
Marco
Marco De Nardi says
Hi Jim,
As part of a technical assistance project in an Eastern European country, we are developing a framework (institutional, technical, and IT) to regularly collect and analyze data on milk quality from milk producers (total bacterial count, somatic cell counts, etc.) in specific regions of this country. These data are then aggregated into quarterly periods (3 months each), and geometric means are calculated. Specific thresholds for these parameters would indicate whether producers are producing milk according to quality standards or not. Looking at the dataset, there are clearly outliers (individual farms) influencing the mean calculation. I don't feel comfortable removing these data (they could very well be true values), and I am reflecting on the best way to analyze them (calculating the mean) while taking the outliers' effect into consideration. What would you suggest?
Thanks for the feedback
Marco
Jim Frost says
Hi Marco,
The key question you need to ask yourself is whether these outliers are a normal part of the population you're studying. Populations have a normal amount of variability, and some populations have a lot of it. If these farms are unusual but still part of your defined population, I'd lean towards leaving them in.
However, if there is something unusual about them that makes them demonstrably not a part of the population you define, then you have reason to exclude them. For example, you might define your population as farms that use method X for producing milk. If some farms use method Y, it would be legitimate to remove them. Or perhaps there was some other unusual circumstance that affected their milk which is not a part of the study.
So, it comes down to a close understanding of your study, the target population, and the specific farms.
If you do leave in these outliers (which it sounds like you're leaning towards), you might need to use another type of analysis, such as a nonparametric analysis. For more information, read my post about parametric versus nonparametric analyses.
Boruch Fishman says
Hi.
I've been measuring the correlation between the number of international adoptions countries make and their cases of coronavirus per million. Using t-tests comparing the COVID-19-per-million rate in the 35 countries that adopt to the rate in all 214 countries, they were significantly different populations. Likewise, the correlation was positive with Pearson coefficients. When I looked at the association in regression equations (using 32 randomly picked countries), the effect of international adoptions was sometimes significant, depending on which other explanatory variables I included in the equation. The association was even significant in a mediator analysis. And my whole paper explains why the association should be significant. However, the dataset of international adoptions per country has a big outlier: the USA, which adopts the most and has a large COVID-19-per-million rate. When I take out the USA, international adoptions are not significant in the regression equations.
I don’t yet have the software for bootstrapping. However, I found equations which suggest that if I add about 40-50 countries, my results will be significant regardless of the distribution.
But will a maneuver like this pass peer review? How can I focus statistically on the measurements for the USA?
Melisa says
If you are analyzing an entire data set (descriptive) rather than a sample of the group, is there any reason to remove outliers?
Jim Frost says
Hi Melisa,
I’d say that usually you don’t. Descriptive statistics assumes that you want to understand that particular group. And, if that particular group has an outlier, then understanding that group would suggest leaving the outlier in. So, I’m having a difficult time thinking why you’d want to remove an outlier in that case. If you’re tempted to use that group to understand a larger picture, and that’s the motivation for removing an outlier, that’s not descriptive statistics. You’re simply describing a group with outliers and all. Removing an outlier would be an incomplete/inaccurate description of that group.
However, I suppose it’s possible that if a measurement was invalid, that could be a legitimate reason. For example, imagine you’re measuring a group for some characteristic but if the measurement device was incorrectly calibrated for one subject, it might be valid to remove that value from the group. If you can show that the measurement is invalid (doesn’t represent the subject), that’s probably a good reason to exclude it for a descriptive study because it will make the description inaccurate.
I guess that would be the main reason in my mind for excluding data from a descriptive study. If a measurement doesn’t accurately reflect the subject due to some glitch or temporary condition, you wouldn’t want to include it because it makes the description of the group as whole inaccurate.
Ana says
Thanks so much for this. Do you have any recommended reading on this that would also be something I could cite in order to justify my choice of not removing my outliers? I am studying a particular group (classical musicians) and in a sample of over 700 I have 12 outliers for one of the measures. When I look at them individually across 9 other measures I have no reason to lean towards the possibility of measurement error (although not sure how to justify this entirely). But it looks to me that they engaged well with the 9 measures of the survey and that these are legitimate observations. Those 12 don’t affect the assumptions but do indeed change the means a bit. Thanks so much for your thoughts.
Louis says
Thanks Jim for fast reply.
See, my question is only a theoretical one. Of course, I am aware of the need to understand the why of those outliers, and unless there is a solid reason, I would keep the data. Maybe I can be a bit more precise: imagine it is a field case study where several treatments were performed in a randomized block design (10 blocks, i.e., 10 replicates). Each block contains at least 20 plants. As one block represents the statistical unit, measures performed on plants should be averaged. Now, one series of measures is performed on, let's say, 10 plants in each block, but the averaged data show a significant outlier in block 1 compared with the other blocks. Later, another series of measures (a different analysis than the first) is performed on 5 plants, different ones from the 10 in the first analysis. The averaged data show a significant outlier in block 2 compared with the other blocks. The question is how to deal with outliers in this case. I mean, let's assume the outliers should be removed (whatever the reason is): should I remove blocks 1 and 2 from my whole dataset? Should I only remove data from block 1 in the first analysis and block 2 in the second analysis, because they were performed on distinct groups of individuals? Or should I consider the most important analysis (for example, the first one, with the outlier in block 1) and remove the block 1 data from the second analysis as well? My question also concerns dealing with outliers when variables are independent or not. I know my question is a bit strange; it is only out of curiosity.
Thanks a lot for your time !
Louis says
Thanks Jim for this interesting post. I have a little theoretical and very basic question: imagine a trial with 10 replicates per condition (each replicate contains, let's say, 20 individuals) and 2 evaluated independent variables (independent because the measures were taken from different groups of individuals within each replicate). I am assuming that these outliers come from natural variation. There is, for example, a significant outlier in replicate 1 for variable 1, and one significant outlier in replicate 2 for variable 2. How should I then deal with those outliers? Could I remove those outliers independently for each variable, or should I connect them between variables, i.e., remove the data from replicates 1 and 2 for both variables? Then, maybe it would also be interesting if you could say some words about when the variables are dependent. Thanks a lot for your reply!
Jim Frost says
Hi Louis,
If I understand your question correctly, I’d remove only the outliers if you determine that they really need to be removed. I wouldn’t remove observations in other replicates just because a different replicate has an outlier.
As for the dependent case, I'm assuming you mean multiple observations on the same individuals. In that case, if you know that one observation is an outlier, yes, you'd probably remove that individual completely. As usual, you'd want to investigate why it is an outlier. It would be strange if the same individual has regular values and then suddenly one is an outlier. Maybe it's a fixable data entry error and you just need to correct it? Or perhaps that's just normal fluctuation for an individual that you're capturing. So, investigate the underlying cause.
But, if you determine it is an outlier, it seems likely you’d need to remove the individual entirely.
king mofasa says
Thanks very much for your reply. I will certainly keep them in.
King Mofasa says
Thanks for taking the time to explain this in simple words. I would like to ask the following question: regression analysis is sensitive to multivariate outliers. Most of the references I have reviewed suggest that multivariate outliers should be removed. I don’t see any other suggestion anywhere other than removal. I am hesitant to remove them as the cases seem valid, just different from the rest of the cases. I personally find myself against removing ANY valid outlier. Is there a way to keep multivariate outliers and at least winsorize them (similar to univariate outliers)?
Jim Frost says
Hi King,
Your hesitancy is very wise. I talk about this in this article, but it's important to distinguish between unusual values that are caused by some sort of problem (unusual conditions, data entry errors, a subject from the wrong population, etc.) and unusual values that are caused by the natural variation in the process you're studying. I mention the regression case where one observation was very unusual when it came to predicting the eventual ranking of U.S. Presidents by historians. However, that unusual value was a normal part of the process, so I left it in.
That’s the important distinction that you need to evaluate for these outliers. If they’re valid, then you don’t want to remove them using any method. They provide important information about the natural variability in your subject-area. If you remove them, you’re losing that important information.
I hope that helps! My regression ebook discusses outliers, unusual values, and leverage points in much greater detail from the perspective of regression analysis specifically. It might be helpful.
Tamara says
Hi Jim,
Can you explain the process of winsorizing to deal with outliers that are not measurement errors or mistakes but outliers that are true from the data set?
I had a few outliers in my dataset, and I winsorized them by changing them to the largest and smallest values nearest to them that are not outliers themselves.
Thank you for your continued knowledge about statistical techniques.
Jim Frost says
Hi Tamara,
Winsorizing is a process that either reduces the weight of an unusual value or changes it to a value that is not so unusual. It sounds like you used a method that changes outliers to less extreme values.
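For readers unfamiliar with the mechanics, here is a minimal Python sketch using SciPy's winsorize function on made-up values; it caps the top 10% of observations at the next-largest value rather than deleting them.

import numpy as np
from scipy.stats.mstats import winsorize

data = np.array([2, 3, 3, 4, 4, 5, 5, 6, 7, 40])  # 40 is the extreme value

adjusted = winsorize(data, limits=[0, 0.1])  # cap the top 10% (here, one value)
print(adjusted)  # the 40 is replaced by 7, the next-largest value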
I’m not a big fan of this process for several reasons. For one thing, I think it’s vital to learn why these outliers are happening. You never know what valuable information you might learn about your study area through this investigation. Ideally, you should determine whether these points are valid data or not.
If they are a valid part of the population you’re studying, changing these values will mean that your sample doesn’t reflect the true variability in the population and could lead you to draw incorrect conclusions. In short, you’ll draw conclusions based on an assumption that there is less variability than what actually exists.
If the data are not a valid part of the population (measurement/data entry error, unusual conditions, drawn from a different population, etc.), then those points should not be included at all.
So, in my mind, Winsorizing doesn’t address either condition, and it reduces your chances of learning something new about your population.
Methods that reduce/remove outliers will usually increase the power of your test and make the results look stronger, so it can be hard to resist the temptation to use automated methods. Whenever possible, really look into each outlier and learn what is happening.
I suppose the case for Winsorizing is that if, for whatever reason, you cannot assess the outliers and make the determination I describe, but you highly suspect that they're invalid values and can't prove it, Winsorizing might be a better middle-ground approach than just removing suspected but unproven bad data. You'd have to carefully weigh that decision based on the data you have about the outliers and subject-area knowledge.
I hope this helps!
Youssef Karam says
Hi Jim,
I am a student carrying out a study on the compensation of directors and how this compensation is mainly affected by the performance of the firm. Three outliers were noticed (graphically and by the parameters): two of them are related to years of experience and age, and one is related to % profit, noting that profit is also included as a variable among others (sales, market value, …). Which of the three do you recommend keeping, given that a significant improvement was achieved (multiple R, R-squared, error, adjusted R-squared, parameters, standard errors of parameters, p-values) when all three were omitted?
Thank you
Youssef
Lauren says
Hi Jim,
Thank you so much for your reply. That all makes sense – your reasoning is articulated very well! It is a tricky area but I feel a lot more confident with the decision I have made now from reading this. Thank you again.
Kind regards,
Lauren.
Lauren says
Hi,
I did an experiment and through visual inspection I have identified 7 major outliers from the data set. Most of the outliers belong to one participant who appears to have found the experiment particularly hard. Is this a justified reason for not including these outliers in further analysis?
Thanks,
Lauren.
Jim Frost says
Hi Lauren,
This can be tricky! One thing you need to ask yourself is whether your population normally contains individuals who find the task particularly hard. In other words, does the individual in question simply represent the normal range of skill in the population you’re studying? If yes, then you should include the participant because s/he represents part of the variability in the population.
However, if there is some underlying condition or factor that means the participant doesn't reflect the normal range of abilities, you can consider removing them. I'm thinking of things like some sort of medical/psychological condition that makes the subject not a part of your target population. For example, I excluded the girl with diabetes from the bone density study because we were studying bone density in girls who didn't have conditions that affected it. Or perhaps there were unusual conditions that made that particular session more difficult: fire alarms, interruptions, etc.
There should be some reason for excluding this participant beyond the task just being extra hard. If you can say it was extra hard BECAUSE he has a condition that makes it hard to focus, and that condition is not part of the population under study, okay. Or it was extra hard because fire alarms kept going off during the test. That's okay too. But if it was extra hard simply because the subject was on the low end of the distribution of abilities, I'd say that's typically not enough reason by itself.
John Grenci says
Hey Jim, what about dealing with zeros? Particularly where you have many of them? Say 50% of your values are zeros; the context in my case, anyway, is that we are talking about scrap steel per bar of steel (some bars have it, some don't). Leaving them in would give a weird distribution for modeling, but taking them out would not only remove much of the dataset but also defeat the whole essence of cause and effect. Do we standardize in some way?
I have your latest book. I do not recall if you address that. I have not gotten through all of it 🙂. Typing this from work. Thanks, John
Hui says
Hi Jim,
When we have a dataset to deal with, missing data or outliers, which do we treat first?
Jim Frost says
Hi Hui,
You should determine how you’ll handle missing data before you even begin data collection. After you collect the data, you can assess outliers.
If you’re going to toss out observations with missing data, it’s probably easier to do that first and then assess outliers, but the order probably doesn’t matter too much.
However, if you’re going to estimate the values of the missing data, it’s probably better to generate those estimates after removing the outliers because the outliers will affect those estimates. Here’s the logic for removing outliers first. By removing outliers, you’ve explicitly decided that those values should not affect the results, which includes the process of estimating missing values.
In short, the order matters little when you're simply dropping incomplete observations, but it's more important to remove outliers first if you're estimating the values of missing data. I'm not sure there is much literature on this issue, but you should determine whether studies in your field employ standard procedures regarding this question.
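For readers who like to see the workflow in code, here's a minimal Python sketch of that ordering (pandas assumed; the column name, cutoff, and imputation strategy are hypothetical and only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with both a missing value and one
# implausible reading that has been identified as a recording error.
df = pd.DataFrame({"value": [4.1, 3.8, np.nan, 4.4, 97.0, 4.0]})

# 1) Remove the outlier first, and only because it's a documented error,
#    not merely because it's extreme.
df = df[df["value"].isna() | (df["value"] < 50)].copy()

# 2) Then estimate the missing value, so the imputation isn't
#    pulled upward by the outlier that was already excluded.
df["value"] = df["value"].fillna(df["value"].median())
print(df)
```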
Marlon says
Jim, how can you test whether a dataset is normally distributed using Excel?
Mohd Shehzoor Hussain says
Thank you for your reply Jim.
Can we use the median and IQR to measure central tendency and variation if the data are skewed? If yes, I have two questions.
1) What tests are there to measure a change in the median and IQR?
2) What do we do with the outliers?
Mohd Shehzoor Hussain says
Hi Jim, how do you get standard deviations for a dataset that doesn't follow a proper bell curve due to outliers?
Jim Frost says
Hi Mohd,
You can calculate standard deviations using the usual formula regardless of the distribution. However, only in the normal distribution does the SD have a special meaning that you can relate to probabilities. If your data are highly skewed, that can affect the standard deviations you'd expect to see and what counts as an outlier. It's always important to graph the distribution of your data to help you understand things like outliers. I often make the point throughout my books that using graphs and numerical measures together helps you better understand your data. This is a good illustration of that principle: the SD's meaning as a yardstick for outliers makes much more sense when you can also see the distribution of your data in a graph.
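As a quick illustration of that point, here's a small Python sketch on simulated, hypothetical data. It shows that the usual SD formula works on skewed data, but the familiar "about 68% within one SD" interpretation does not hold, and it computes the median and IQR as robust alternatives:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed data (a lognormal sample).
skewed = rng.lognormal(mean=0, sigma=1, size=1000)

# The usual SD formula applies regardless of the distribution's shape...
mean, sd = skewed.mean(), skewed.std(ddof=1)

# ...but the share of values within 1 SD only matches the familiar
# ~68% rule when the data are roughly normal.
within_1sd = np.mean(np.abs(skewed - mean) <= sd)
print(f"mean={mean:.2f}, sd={sd:.2f}, within 1 SD: {within_1sd:.0%}")

# Robust alternatives for skewed data: median and IQR.
q1, q3 = np.percentile(skewed, [25, 75])
print(f"median={np.median(skewed):.2f}, IQR={q3 - q1:.2f}")
```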
Thanks for writing!
Jimoh, S. O. says
Thank you very much for this post. It is very clear and informative.
Helge says
Thank you! Yes, we were not interested in individuals with ongoing infections, so it seems legitimate to conclude that the 19 were not part of our target population. I use Stata for my analyses, and I added the command "vce(robust)" to the syntax to apply robust standard errors to account for any kind of violation of assumptions. Is it possible to say whether this was a good or bad idea? 🙂 I understand that robust standard errors account for heteroscedasticity, but I also used the lrtest syntax (likelihood ratio test) to examine whether variables were heteroscedastic or homoscedastic and added the syntax "residuals(independent, by(variable)" to allow for heteroscedasticity for the heteroscedastic variables. Might it be unnecessary to use both robust standard errors and that explicit allowance for heteroscedasticity?
Thank you so much. 🙂
Jim Frost says
My hope would be that after dropping those 19 cases with ongoing infections you won’t need to use anything outside the normal analysis. I’d only make adjustments for outliers or heteroscedasticity if you see evidence in the residual plots that there is a need to do so.
Unfortunately, I don’t use Stata and I’m not familiar with their commands. So, I don’t know which ones would be appropriate should there be the issues you describe. But, here’s the general approach that I recommend.
Start with regular analysis first and see how that works. Check the residual plots and assumptions. Typically, when you use an alternative method there is some sort of tradeoff. Don’t go down that road unless residual analysis indicates you need to! If you find that you do need to make adjustments, the residual plots should give you an idea of what needs fixing. Start with minimal corrections and escalate only as needed.
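To make that concrete, a rough Python sketch of the "ordinary analysis first, then check the residuals" workflow might look like this (simulated data; statsmodels and matplotlib assumed):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Hypothetical data: one predictor and a continuous response.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2 + 3 * x + rng.normal(scale=1.0, size=100)

# 1) Fit the ordinary model first.
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
print(model.summary())

# 2) Check the residual plots before reaching for robust alternatives.
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted")
plt.show()
```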
Helge says
Thank you for your swift answer. The 19 were removed due to suspected ongoing infection (e.g. having a cold, HIV or hepatitis), as the variable was a biomarker for inflammation. So the decision was made based on the idea that ongoing infection would bias the biomarkers we were looking at. I will however run the analyses with and without the 19 and compare results, as you suggest. Thank you very much, I really appreciate your work, Jim.
Jim Frost says
Hi Helge,
Ah, knowing that additional information makes all the difference! It sounds like removing them is the correct action. That's the additional mystery I suspected was there!
In this post, I talk about this as a sampling problem. It’s similar to the situation I describe with the bone density study and the subject who had diabetes.
It sounds like in your study you've defined your target population as something like healthy individuals, or individuals without a condition that affects inflammation. That's the target population you are studying. In this light, an individual with a condition that affects inflammation would not be a part of that target population. So, if you identify these conditions, those are the specific reasons you can attribute to excluding those individuals.
Based on that information, I’d concur in principle with removing those observations. The results from your study should inform you about the target population. In your report, I’d be sure to clearly define the target population and then explain how you excluded individuals outside of that population. The study is designed to understand the healthy/normal population and not individuals with conditions that affect inflammation.
Additionally, with this information, it's probably not necessary to run the analyses with and without those subjects, unless doing so is informative in some way.
You're very welcome! And, I'm really glad that you wrote back with the follow-up information. Your study provides a great example for other readers by highlighting the issues that real-world studies face involving outliers and deciding how to handle them.
Helge says
Thank you so much for explaining this subject so well! I hope it is okay to ask one question: In a multilevel model fit to a medium-sized dataset with a number of extreme outliers (19 of 147 subjects above the 95th percentile on a continuous variable), would you say that the multilevel modeling technique is robust enough to handle the outliers? Or should these 19 be removed in order not to violate assumptions?
Jim Frost says
Hi Helge,
Multilevel modeling is a generalization of linear least squares modeling. As such, it is highly sensitive to outliers.
Whether you should remove these observations is a separate matter though. Use the principles outlined in this blog post to help you make that decision. Do these subjects represent your target population as it is defined for your study? Removing even several outliers is a big deal, so removing 19 would go far beyond that! On the face of it, removing all 19 doesn't sound like a good idea. But, as you hopefully gathered from this blog post, answering that question depends on a lot of subject-area knowledge and a very close investigation of the observations in question. It's not possible to give you a blanket answer.
I'd recommend really thinking about the target population for your study and taking a very close look at these observations. How is it that you obtained so many subjects above the 95th percentile? Maybe they do represent your target population, and you wouldn't want to artificially reduce the variability? Keep in mind, simply being an extreme value is not enough by itself to warrant exclusion. You need to find an additional reason you can attribute to every data point you exclude.
Again, it's hard for me to imagine removing 19 observations, or 13% of your dataset! It seems like there must be more to the story here. It's impossible for me to say what it is, of course, but you should investigate.
If you leave some or all of these outliers in the dataset, you might need to change something about the analysis. However, you should try the regular analysis first, and then check the residual plots and assess the regression assumptions. If you’re lucky, your model might satisfy the assumptions and you won’t need to make adjustments.
If your model does violate assumptions, you can try transforming the data or possibly using a robust regression analysis, which you can find in some statistical software packages. These techniques reduce the impact of outliers, which can keep them from violating the assumptions.
Another thing to consider is comparing the results with and without the outliers and understanding how that changes the outcomes. As I mention in this post, if a research group is in disagreement or completely unsure about what to do, you can analyze the data both ways and report the differences.
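Here's a rough sketch of that side-by-side comparison in Python (simulated data with artificially injected outliers; statsmodels' RLM with a Huber M-estimator is used as just one example of a robust method):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data with a few extreme points mixed in.
rng = np.random.default_rng(2)
x = rng.normal(size=150)
y = 1 + 2 * x + rng.normal(scale=1.0, size=150)
y[:5] += 15  # inject a handful of artificial outliers

X = sm.add_constant(x)

# Ordinary least squares on all observations.
ols_all = sm.OLS(y, X).fit()

# OLS after excluding the flagged observations (here, the injected ones).
keep = np.ones_like(y, dtype=bool)
keep[:5] = False
ols_trimmed = sm.OLS(y[keep], X[keep]).fit()

# Robust regression that downweights outliers without removing them.
rlm = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print("OLS (all):    ", ols_all.params)
print("OLS (trimmed):", ols_trimmed.params)
print("Robust RLM:   ", rlm.params)
```

Comparing the coefficient estimates across the three fits shows how much influence the outliers have and how a robust method moderates that influence without discarding any data.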
Best of luck with your study!