Outliers are data points that are far from other data points. In other words, they’re unusual values in a dataset. Outliers are problematic for many statistical analyses because they can cause tests to either miss significant findings or distort real results.
Unfortunately, there are no strict statistical rules for definitively identifying outliers. Finding outliers depends on subject-area knowledge and an understanding of the data collection process. While there is no solid mathematical definition, there are guidelines and statistical tests you can use to find outlier candidates.
In this post, I’ll explain what outliers are and why they are problematic, and present various methods for finding them. I’ll close by comparing the different techniques for identifying outliers and sharing my preferred approach.
Outliers and Their Impact
Outliers are a simple concept—they are values that are notably different from other data points, and they can cause problems in statistical procedures.
To demonstrate how much a single outlier can affect the results, let’s examine the properties of an example dataset. It contains 15 height measurements of human males. One of those values is an outlier. The table below shows the mean height and standard deviation with and without the outlier.
Throughout this post, I’ll be using this example CSV dataset: Outliers.
Statistic | With Outlier | Without Outlier | Difference |
Mean | 2.4m (7’ 10.5”) | 1.8m (5’ 10.8”) | 0.6m (~2 feet) |
Standard deviation | 2.3m (7’ 6”) | 0.14m (5.5 inches) | 2.16m (~7 feet) |
From the table, it’s easy to see how a single outlier can distort reality. A single value changes the mean height by 0.6m (2 feet) and the standard deviation by a whopping 2.16m (7 feet)! Hypothesis tests that use the mean with the outlier are off the mark. And, the much larger standard deviation will severely reduce statistical power!
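You can reproduce this kind of before-and-after comparison in a few lines of code. The post’s actual CSV values aren’t listed here, so the 15 heights below are hypothetical stand-ins: 14 plausible male heights plus one ~10.8 m data-entry-style outlier, chosen so the summary statistics roughly match the table.

```python
import statistics

# Hypothetical stand-ins for the example dataset: 14 plausible male
# heights (in metres) plus one extreme outlier. These are NOT the
# post's actual CSV values, just values that reproduce the pattern.
heights = [1.58, 1.60, 1.62, 1.66, 1.70, 1.74, 1.78, 1.80,
           1.82, 1.86, 1.90, 1.94, 1.98, 2.02, 10.81]

without_outlier = heights[:-1]

print(f"With outlier:    mean={statistics.mean(heights):.2f} m, "
      f"stdev={statistics.stdev(heights):.2f} m")
print(f"Without outlier: mean={statistics.mean(without_outlier):.2f} m, "
      f"stdev={statistics.stdev(without_outlier):.2f} m")
```

With these stand-in values, one extreme point pulls the mean from about 1.79 m up to about 2.39 m and inflates the standard deviation from about 0.14 m to about 2.33 m.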
Before performing statistical analyses, you should identify potential outliers. That’s the subject of this post. In the next post, we’ll move on to figuring out what to do with them.
There are a variety of ways to find outliers. All these methods employ different approaches for finding values that are unusual compared to the rest of the dataset. I’ll start with visual assessments and then move on to more analytical assessments.
Let’s find that outlier! I’ve got five methods for you to try.
Sorting Your Datasheet to Find Outliers
Sorting your datasheet is a simple but effective way to highlight unusual values. Simply sort your data sheet for each variable and then look for unusually high or low values.
For example, I’ve sorted the example dataset in ascending order, as shown below. The highest value is clearly different than the others. While this approach doesn’t quantify the outlier’s degree of unusualness, I like it because, at a glance, you’ll find the unusually high or low values.
Graphing Your Data to Identify Outliers
Boxplots, histograms, and scatterplots can highlight outliers.
Boxplots display asterisks or other symbols on the graph to indicate explicitly when datasets contain outliers. These graphs use the interquartile method with fences to find outliers, which I explain later. The boxplot below displays our example dataset. It’s clear that the outlier is quite different than the typical data value.
You can also use boxplots to find outliers when you have groups in your data. The boxplot below shows a different dataset that has an outlier in the Method 2 group. Click here to learn more about boxplots.
Histograms also emphasize the existence of outliers. Look for isolated bars, as shown below. Our outlier is the bar far to the right. The graph crams the legitimate data points on the far left.
Click here to learn more about histograms.
Most of the outliers I discuss in this post are univariate outliers. We look at a data distribution for a single variable and find values that fall outside the distribution. However, you can use a scatterplot to detect outliers in a multivariate setting.
In the graph below, we’re looking at two variables, Input and Output. The scatterplot with regression line shows how most of the points follow the fitted line for the model. However, the circled point does not fit the model well.
Interestingly, the Input value (~14) for this observation isn’t unusual at all because the other Input values range from 10 through 20 on the X-axis. Also, notice how the Output value (~50) is similarly within the range of values on the Y-axis (10 – 60). Neither the Input nor the Output values themselves are unusual in this dataset. Instead, it’s an outlier because it doesn’t fit the model.
This type of outlier can be a problem in regression analysis. Given the multifaceted nature of multivariate regression, there are numerous types of outliers in that realm. In my ebook about regression analysis, I detail various methods and tests for identifying outliers in a multivariate context.
For another advanced multivariate method for detecting outliers, consider using principal component analysis (PCA). This approach is particularly helpful when you have many variables in a high-dimensional dataset. Learn more in Principal Component Analysis Guide and Example.
For the rest of this post, we’ll focus on univariate outliers.
Using Z-scores to Detect Outliers
Z-scores can quantify the unusualness of an observation when your data follow the normal distribution. Z-scores are the number of standard deviations above and below the mean that each value falls. For example, a Z-score of 2 indicates that an observation is two standard deviations above the average while a Z-score of -2 signifies it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean.
To calculate the Z-score for an observation, take the raw measurement, subtract the mean, and divide by the standard deviation. Mathematically, the formula for that process is the following:

Z = (X − μ) / σ
The further away an observation’s Z-score is from zero, the more unusual it is. A standard cut-off for flagging outliers is a Z-score of +/-3 or further from zero. The probability distribution below displays the distribution of Z-scores in a standard normal distribution. Z-scores beyond +/-3 are so extreme you can barely see the shading under the curve.
In a population that follows the normal distribution, Z-score values more extreme than +/- 3 have a probability of 0.0027 (2 * 0.00135), which is about 1 in 370 observations. However, if your data don’t follow the normal distribution, this approach might not be accurate.
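The calculation is simple to script. The heights below are hypothetical stand-ins for the example dataset (the actual CSV values aren’t reproduced here), with one extreme value playing the role of the outlier:

```python
import statistics

def z_scores(data):
    """Z-score of each value: (x - mean) / standard deviation."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [(x - mean) / sd for x in data]

# Hypothetical stand-ins for the example heights (metres); the
# last value plays the role of the outlier.
heights = [1.58, 1.60, 1.62, 1.66, 1.70, 1.74, 1.78, 1.80,
           1.82, 1.86, 1.90, 1.94, 1.98, 2.02, 10.81]

for x, z in zip(heights, z_scores(heights)):
    if abs(z) > 3:
        print(f"{x} m looks like an outlier (Z = {z:.2f})")
```

With these stand-in values, only the extreme point exceeds the +/-3 cutoff, with a Z-score of roughly 3.6. Note that the outlier drags the mean upward, so every other Z-score comes out negative.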
Z-scores and Our Example Dataset
The table below displays the values in the example dataset along with their Z-scores. This approach identifies the same observation as the outlier.
Note that Z-scores can be misleading with small datasets because the maximum Z-score is limited to (n − 1) / √n.*
Indeed, our Z-score of ~3.6 is right near the maximum value for a sample size of 15. Sample sizes of 10 or fewer observations cannot have Z-scores that exceed a cutoff value of +/-3.
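As a quick check of that bound, here is a tiny sketch of the maximum-Z formula:

```python
import math

def max_z(n):
    """Largest Z-score any point in a sample of size n can have:
    (n - 1) / sqrt(n), per Shiffler (1988)."""
    return (n - 1) / math.sqrt(n)

print(round(max_z(15), 2))  # our outlier's Z of ~3.6 sits near this cap
print(round(max_z(10), 2))  # below 3, so a +/-3 cutoff can never fire
```

For n = 15 the cap is about 3.61; for n = 10 it is about 2.85, which is why samples of 10 or fewer can never produce a Z-score beyond +/-3.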
Also, note that the outlier’s presence throws off the Z-scores because it inflates the mean and standard deviation as we saw earlier. Notice how all the Z-scores are negative except the outlier’s value. If we calculated Z-scores without the outlier, they’d be different! Be aware that if your dataset contains outliers, Z-values are biased such that they appear to be less extreme (i.e., closer to zero).
For more information about z-scores, read my post, Z-score: Definition, Formula, and Uses.
The z-score cutoff value is based on the empirical rule. For more information, read my post, Empirical Rule: Definition, Formula, and Uses.
Related posts: Normal Distribution and Understanding Probability Distributions
Using the Interquartile Range to Create Outlier Fences
You can use the interquartile range (IQR), several quartile values, and an adjustment factor to calculate boundaries for what constitutes minor and major outliers. Minor and major denote the unusualness of the outlier relative to the overall distribution of values. Major outliers are more extreme. Analysts also refer to these categorizations as mild and extreme outliers.
The IQR is the middle 50% of the dataset. It’s the range of values between the third quartile and the first quartile (Q3 – Q1). We can take the IQR, Q1, and Q3 values to calculate the following outlier fences for our dataset: lower outer, lower inner, upper inner, and upper outer. These fences determine whether data points are outliers and whether they are mild or extreme.
Values that fall inside the two inner fences are not outliers. Let’s see how this method works using our example dataset.
Click here to learn more about interquartile ranges and percentiles.
Calculating the Outlier Fences Using the Interquartile Range
Using statistical software, I can determine the interquartile range along with the Q1 and Q3 values for our example dataset. We’ll need these values to calculate the “fences” for identifying minor and major outliers. The output below indicates that our Q1 value is 1.714 and the Q3 value is 1.936. Our IQR is 1.936 – 1.714 = 0.222.
To calculate the outlier fences, do the following:
- Take your IQR and multiply it by 1.5 and 3. We’ll use these values to obtain the inner and outer fences. For our example, the IQR equals 0.222. Consequently, 0.222 * 1.5 = 0.333 and 0.222 * 3 = 0.666. We’ll use 0.333 and 0.666 in the following steps.
- Calculate the inner and outer lower fences. Take the Q1 value and subtract the two values from step 1. The two results are the lower inner and outer outlier fences. For our example, Q1 is 1.714. So, the lower inner fence = 1.714 – 0.333 = 1.381 and the lower outer fence = 1.714 – 0.666 = 1.048.
- Calculate the inner and outer upper fences. Take the Q3 value and add the two values from step 1. The two results are the upper inner and upper outlier fences. For our example, Q3 is 1.936. So, the upper inner fence = 1.936 + 0.333 = 2.269 and the upper outer fence = 1.936 + 0.666 = 2.602.
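The three steps above are mechanical enough to script. Here’s a minimal sketch using the Q1 and Q3 values from the output:

```python
def outlier_fences(q1, q3):
    """Tukey-style fences: inner at 1.5 * IQR, outer at 3 * IQR."""
    iqr = q3 - q1
    return {
        "lower_outer": q1 - 3 * iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3 * iqr,
    }

# Quartiles reported for the example dataset.
fences = outlier_fences(q1=1.714, q3=1.936)
for name, value in fences.items():
    print(f"{name}: {value:.3f}")
```

A point between an inner fence and the corresponding outer fence is a minor (mild) outlier; a point beyond an outer fence is a major (extreme) outlier.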
Using the Outlier Fences with Our Example Dataset
For our example dataset, the values for these fences are 1.048, 1.381, 2.269, and 2.602. Almost all of our data should fall between the inner fences, which are 1.381 and 2.269. At this point, we look at our data values and determine whether any qualify as being major or minor outliers. 14 out of the 15 data points fall inside the inner fences—they are not outliers. The 15th data point falls outside the upper outer fence—it’s a major or extreme outlier.
The IQR method is helpful because it uses percentiles, which do not depend on a specific distribution. Additionally, percentiles are relatively robust to the presence of outliers compared to the other quantitative methods.
Boxplots use the IQR method to determine the inner fences. Typically, I’ll use boxplots rather than calculating the fences myself when I want to use this approach. Of the quantitative approaches in this post, this is my preferred method. The interquartile range is robust to outliers, which is clearly a crucial property when you’re looking for outliers!
Related post: What are Robust Statistics?
Finding Outliers with Hypothesis Tests
You can use hypothesis tests to find outliers. Many outlier tests exist, but I’ll focus on one to illustrate how they work. In this post, I demonstrate Grubbs’ test, which tests the following hypotheses:
- Null: All values in the sample were drawn from a single population that follows the same normal distribution.
- Alternative: One value in the sample was not drawn from the same normally distributed population as the other values.
If the p-value for this test is less than your significance level, you can reject the null and conclude that one of the values is an outlier. The analysis identifies the value in question.
Let’s perform this hypothesis test using our sample dataset. Grubbs’ test assumes your data are drawn from a normally distributed population, and it can detect only one outlier. If you suspect you have additional outliers, use a different test.
Grubbs’ outlier test produced a p-value of 0.000. Because it is less than our significance level, we can conclude that our dataset contains an outlier. The output indicates it is the high value we found before.
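SciPy has no built-in Grubbs’ test, but the statistic is simple to compute: G = max|x − x̄| / s, compared against a critical value derived from the t distribution. The sketch below uses the same hypothetical stand-in heights as earlier, since the actual CSV values aren’t listed here.

```python
import math
from scipy import stats

def grubbs_test(data, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier in a sample
    assumed to come from a normal distribution."""
    n = len(data)
    mean = sum(data) / n
    sd = stats.tstd(data)  # sample standard deviation (ddof=1)
    suspect = max(data, key=lambda x: abs(x - mean))
    g = abs(suspect - mean) / sd

    # Critical value from the t distribution with n-2 df.
    t2 = stats.t.ppf(alpha / (2 * n), n - 2) ** 2
    g_crit = ((n - 1) / math.sqrt(n)) * math.sqrt(t2 / (n - 2 + t2))
    return suspect, g, g_crit

# Hypothetical stand-ins for the example heights (metres).
heights = [1.58, 1.60, 1.62, 1.66, 1.70, 1.74, 1.78, 1.80,
           1.82, 1.86, 1.90, 1.94, 1.98, 2.02, 10.81]

suspect, g, g_crit = grubbs_test(heights)
print(f"Suspect value: {suspect}, G = {g:.2f}, critical value = {g_crit:.2f}")
```

With these stand-in values, G comes out around 3.6 against a critical value of roughly 2.5 (n = 15, alpha = 0.05), so the test flags the extreme height, consistent with the tiny p-value reported above.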
If you use Grubbs’ test and find an outlier, don’t remove that outlier and perform the analysis again. That process can cause you to remove values that are not outliers.
Challenges of Using Outlier Hypothesis Tests: Masking and Swamping
When performing an outlier test, you either need to choose a procedure based on the number of outliers or specify the number of outliers for a test. Grubbs’ test checks for only one outlier. However, other procedures, such as the Tietjen-Moore Test, require you to specify the number of outliers. That’s hard to do correctly! After all, you’re performing the test to find outliers! Masking and swamping are two problems that can occur when you specify the incorrect number of outliers in a dataset.
Masking occurs when you specify too few outliers. The additional outliers that exist can affect the test so that it detects no outliers. For example, if you specify one outlier when there are two, the test can miss both outliers.
Conversely, swamping occurs when you specify too many outliers. In this case, the test identifies too many data points as being outliers. For example, if you specify two outliers when there is only one, the test might determine that there are two outliers.
Because of these problems, I’m not a big fan of outlier tests. More on this in the next section!
My Philosophy about Finding Outliers
As you saw, there are many ways to identify outliers. My philosophy is that you must use your in-depth knowledge about all the variables when analyzing data. Part of this knowledge is knowing what values are typical, unusual, and impossible.
I find that when you have this in-depth knowledge, it’s best to use the more straightforward, visual methods. At a glance, data points that are potential outliers will pop out under your knowledgeable gaze. Consequently, I’ll often use boxplots, histograms, and good old-fashioned data sorting! These simple tools provide enough information for me to find unusual data points for further investigation.
Typically, I don’t use Z-scores and hypothesis tests to find outliers because of their various complications. Using outlier tests can be challenging because they usually assume your data follow the normal distribution, and then there’s masking and swamping. Additionally, the existence of outliers makes Z-scores less extreme. It’s ironic, but these methods for identifying outliers are actually sensitive to the presence of outliers! Fortunately, as long as researchers use a simple method to display unusual values, a knowledgeable analyst is likely to know which values need further investigation.
In my view, the more formal statistical tests and calculations are overkill because they can’t definitively identify outliers. Ultimately, analysts must investigate unusual values and use their expertise to determine whether they are legitimate data points. Statistical procedures don’t know the subject matter or the data collection process and can’t make the final determination. You should not include or exclude an observation based entirely on the results of a hypothesis test or statistical measure.
At this stage of the analysis, we’re only identifying potential outliers for further investigation. It’s just the first step in handling them. If we err, we want to err on the side of investigating too many values rather than too few.
In my next post, I’ll explain what you’re looking for when investigating outliers and how that helps you determine whether to remove them from your dataset. Not all outliers are bad and some should not be deleted. In fact, outliers can be very informative about the subject-area and data collection process. It’s important to understand how outliers occur and whether they might happen again as a normal part of the process or study area.
Read my Guidelines for Removing and Handling Outliers.
If you’re learning about hypothesis testing and like the approach I use in my blog, check out my eBook!
Reference
Ronald E. Shiffler (1988) Maximum Z Scores and Outliers, The American Statistician, 42:1, 79-80, DOI: 10.1080/00031305.1988.10475530
Thanks! That’s very helpful. And, actually, it helps to confirm my intuition about the need for control charts, which I was able to get somewhat. It’s not perfect because any one school can differ in meaningful ways from others, but it’s the best approximation of what’s “normal.”
Thanks again for your feedback. Your willingness to provide such a thorough response goes above and beyond what I had anticipated. It was exceptional.
You’re very welcome!
I think pretty much any approach you use, whether it’s control charts or something else, you’ll need to apply to individual schools, unless you have reason to suspect all schools should follow the same distribution.
Jim, I really appreciate the thorough (and timely) response. If you’ll allow me to impose on you just a bit more, I’ll reframe it.
Imagine there’s a class held annually that has 25-30 students. The grade on the mid-term does an excellent job of predicting the grade on the final, say, the mean difference from mid-term to final is -2 and the st.dev. is 4. Over a ten-year period, that mean & st.dev. have been consistent; however for the first 3 years, there were zero “outliers,” say, final exam scores that were 3 or more st.dev. from the mean; however, over the last 7 years, there was always at least one student and as many as three who received grades on the final that were 5 st.dev. greater than their mid-term grades.
What I’m wondering is how to determine if 7 consecutive years with 1-3 outliers is reasonable to expect or if the 3 years with zero outliers are actually the “odd” years.
My gut tells me that the reoccurrence of extreme outliers suggests there’s some sort of explanation, something “systematic” that can explain the consistency of those outliers; however, maybe it’s simply random variation. Also, there is no control because it wouldn’t be appropriate to compare the results from this class with a different one in a different subject or with a different instructor, and it’s not possible to see if the outliers made similar jumps in other courses (e.g., perhaps they become highly motivated after mid-terms) because they’ve only taken that one course.
Thanks again for your input!
Hi Bart,
Even though you reframed the scenario, it’s the same basic idea. You’re trying to determine whether the outcomes over time reflect one underlying distribution or whether that distribution has changed. That’s what control charts are designed to detect. So, I’d still recommend using a control chart in that scenario. In fact, I wrote a post a while ago urging people outside manufacturing to use these charts more often. To learn more about them, read my post about Using Control Charts. If I remember correctly, I actually use changing test scores in an education setting as an example scenario.
Outside of control charts, you’d need to know the distribution of test scores from the years you’d consider normal. Then you could determine the likelihood of producing the outliers in later years if the distribution stayed constant. For example, if the scores follow a normal distribution, then you’d expect that 0.3%, or 1 out of every 370 observations, will fall more than +/- 3 standard deviations from the mean. You could see if the outliers in later years exceed those expected unusual values. By the way, control charts basically have those tests built into them, which is why I recommend them. They have a variety of tests, including whether those points all fall on one side or both sides of the mean. And more. But that’s the type of thing you need to look for and evaluate.
As for your question about which years are the normal or odd years, you won’t be able to answer that using statistics. Statistics can help you determine whether those two sets of years are different or not, but not which one is normal. That takes subject area knowledge.
It does seem like something has changed just reading your description. Although, it’s possible that the first three years were just flukes that happened to not have outliers. The difference between a year without outliers and a year with just one outlier might not be so large.
However, five standard deviations from the mean is extremely large and a single score with that value is almost definitely a red flag all by itself. You’d expect less than 1/100,000 scores to be that many standard deviations away from the mean. That’s got to be a red flag. It’s really off the charts. And if you have more than one that is 5 standard deviations from the mean, there’s really no question.
I don’t know how many students’ scores are being considered here. But there’s a massive difference between 3 standard deviations (1 in 370) and 5 standard deviations (< 1 in 100,000). It’s possible that there are even enough +/-3 SD outliers at play to be a red flag, but the +/-5 SD values have got to be. Also, if they’re all one-sided (i.e., above the mean), that is even more of a red flag because you’d expect a random process to produce them on both sides of the mean equally. Except, note the caveat in the next paragraph.

Check that you’re actually dealing with a normal distribution. Perhaps you’re dealing with a distribution that is naturally right skewed. In that case, those statistics don’t apply. You’d expect more unusually large values as a natural part of the process. So, investigate the shape of the distribution. Many analysts have been tripped up by what they thought were outliers but in fact turned out to be a naturally right-skewed distribution. If it is a right-skewed distribution, you’d still have a bit of a mystery as to why there were no outliers the first several years.

If they are truly red flags, you have to figure out what a “red flag” means. Obviously, those students are doing far better than expected, but is that through legitimate means or not? The stats can’t tell you.
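Those tail probabilities are quick to verify with SciPy. A short check, assuming a normal distribution:

```python
from scipy import stats

# Two-sided tail probability of landing at least k standard
# deviations from the mean of a normal distribution.
for k in (3, 5):
    p = 2 * stats.norm.sf(k)
    print(f"+/-{k} SD: p = {p:.2e}, about 1 in {1 / p:,.0f}")
```

For +/-3 SD this gives about 1 in 370; for +/-5 SD it is well under 1 in a million, so the 1-in-100,000 figure above is actually conservative.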
If you have multiple samples, how can you tell if outliers are “expected”? What I mean is this: imagine a machine that churns out cookies. Each batch has 40 cookies with a mean of, say, 20g, and all cookies weigh between 18-22g except for 1-3 that weigh 38g. How many batches showing this outcome would you need to produce before it’s “normal,” i.e., before 1-3 extreme outliers can be expected in every batch? Also, what if the machine made 3 batches with the same mean and distribution of weights across 40 cookies but no outliers, and then the next X batches had 1-3 extreme outliers? What would the value of X need to be to suggest there’s likely a problem with the machine? Thanks so much!
Hi Bart,
There are different ways to answer this question.
In a real-world manufacturing setting, client requirements will set the spec limits for your cookies. Out of spec cookies are a problem. Typically, manufacturers will have some process related knowledge about the expected machine performance for being able to produce cookies that are in spec. In that way, they can compare actual performance to expected performance and determine whether the machine has a problem. Additionally, they can create control charts to determine whether the cookie machine is a stable process and perform capability analysis to get some standard capability values that they can compare to benchmark values.
That’s the approach from a real manufacturing perspective.
You can also look at it from a more distributional perspective. And you need to really define “outlier.” Some unusual values might be a regular part of the process and not real outliers. They’re expected. For example, human heights follow a normal distribution, and you’ll naturally see people who are unusually tall or short. In that case, it’s not indicative of a problem, just part of the natural process. The same idea applies to your cookie machine. You’ll need to determine whether the unusual cookies are a natural part of the process’ distribution or indicative of a fault in the machine.
To make that type of determination, you’ll need to know the usual distribution of cookie weights for the machine from historical data. Then perform a new study and determine whether the machine is producing more unusual cookies than the usual distribution of cookie weights predicts.
Because I don’t know the previous capability of the machine, I can’t answer your question specifically, but that’s the general process.
In a real production process, the manufacturer would use control charts to track output, and that would be the first way they’d know that something is out of whack. In general, control charts will do what you’re asking about. They’ll tell you if the proportion of defects (i.e., unusual values or outliers) is changing from sample to sample. The control limits on the chart indicate when a process is statistically out of control.
wow this helped me better understand an outlier thx
Hi Jim,
If we are measuring overall user experience using a CSAT with the Likert scale ranging from: (1) Very Unsatisfied (2) Unsatisfied (3) Neither Satisfied Nor Unsatisfied (4) Satisfied (5) Very Satisfied. And the dataset has only a few who scored (1) and (5). Can we say (1) and (2) are outliers in this self-evaluation? Or do we judge outliers from objective measurements (as per your example, measuring heights of a sample)? Kindly clarify. Thank you
Great stuff. Thanks, Jim! A next-level question re: IQR: once we identify the outliers, and then examine why they were there in the first place, would it be advisable to remove them and run the numbers again? I’m thinking the reason for doing this would be to eliminate outliers that we know would never happen again– particularly if we adjust our processes to make SURE they don’t happen again! Also, I’m thinking that doing so would allow us to get to next-level refinements so that we can narrow our ranges going forward. What are your thoughts?
Hi Mike,
After you identify the potential outliers, if you decide that they really don’t belong in the dataset, which takes a lot of deliberation, then I’d certainly remove them and rerun your statistics. The trick is to really determine whether they belong in the dataset (they might be legitimate extreme values) or not. The link in this post to my other post about outliers discusses some of the ways to make that determination.
The DGP of the multiple linear regression model is given by:

Y_i = 0.3 + 2X_1i + 1.5X_2i + ε_i

where ε_i ~ Norm(0, 10).

Understand the DGP carefully and generate 500 observations of each variable in Excel.

And prove that:

In the case of normally distributed data, the SEs of the estimators are efficient, the t-statistics are valid, and the parameters are unbiased.

Generate an outlier with a value of 3000 (in Y_i) and show how a single outlier distorts the distribution of the data such that the SEs of the parameters are no longer efficient, the t-statistics are not valid, and the parameters become biased.

Note: Answers to the above question should be given on the official university answer sheet (which I have given you), but the results from the Excel file should be pasted into this Word document; only a Word document with your Excel results will be accepted.
Hi Jim,
Great article!
Can you elaborate on why it is incorrect to use Grubbs’ test several times to remove more than one outlier in a dataset? In a sufficiently large dataset, why can’t you just run it several times?
I would think that one strategy to automate the procedure could be to take a few extra samples, then run Grubbs’ test a few times.
(assuming we can afford to take maybe 30 samples or more )
Hi Jonas,
What happens if you repeat Grubbs’ test is that it’ll tend to remove data points that are not outliers. The first outlier it finds is based on the entire distribution. Then, you remove that outlier, and the distribution of the remaining data has less variability. A point that was not an outlier before might now appear to be one because of the reduced variability. Statisticians recommend using Grubbs’ test only once per dataset because of its propensity for removing valid data points when you apply it repeatedly.
I hope that helps!
Hey, Jim.
I have a question re the Fitted Line Plot — Output as a function of Input. Visually, the one data point does not “fit the model.” If we look at the residuals, we should get a mean of zero. Can we not use the Standard Deviation of the residuals to calculate the Z-scores for THEM and then determine whether a datum or two can be omitted? Or is that subject to the same issues you mention above regarding the actual observed data?
If using the Z-scores of residuals is not a great idea, can we use percentiles instead? Can we just flag data with a residual less than, say, the 1st percentile or greater than the 99th percentile? Again, I am referring to the residuals here.
Cheers,
Joe
Hi Joe,
My overall point is to use all the methods carefully. Most of them have some drawback but I’m not saying to avoid them. In the end, it really comes down to your subject area knowledge and the investigation of candidate outliers. It’s always possible that an unusual value is part of the natural variation of the process rather than a problematic point!
The residuals you describe are known as standardized residuals and, yes, they are equivalent to Z-scores. They are a good way to identify potential outliers. With raw residuals, you might see a value of X, but whether X is unusual depends on the data units, the data’s variability, etc. Standardized residuals are a good way to determine whether a residual is unusual given the properties of the data because they incorporate those factors. However, the same caveats that apply to Z-values apply to standardized residuals. Just something to be aware of while assessing them. Additionally, remember that it’s normal for about 5% of the residuals to have standardized scores beyond +/-2. Again, use subject-area knowledge and investigate particular data points.
Another useful type of residual is the deleted Studentized residual. These are like the standardized residuals above, but the calculation for the ith deleted Studentized residual does not include the ith observation. That avoids the problem where an unusual residual lowers its own standardized value by inflating the residuals’ variability.
You could convert to percentiles as you describe. Most software I see use either +/- 2 or 3 standard deviations to identify candidate outliers.
In regression, there are multiple ways that an observation can be unusual. Residuals measure unusualness in the y-direction. But you can also have unusual observations in the X-direction. And points that individually affect the model’s coefficient estimates by a larger than usual degree. I can’t remember if you have my regression book, but I discuss those issues in it!
Take care!
Dear Jim,
I hope all is great and well.
I have a master data sheet that includes a few variables. In order to perform my regression, I need to make sure I get rid of the outliers. I understand that there are many ways to remove outliers. I plan to use the box plot method or the Z-score method. The distributions of the data are sometimes normal, sometimes left skewed, and sometimes right skewed. Overall, none is consistent. The master data sheet will be re-sorted based on specific variable values, so I will create a few specific data sheets from the master data sheet.
But the questions that need help are listed below;
1. How do we deal with outliers when the master data sheet includes various distributions? It is not consistent; some are normal and the majority are skewed.
2. Should we apply one method to remove the outliers, or can we apply more than one, like the two methods above? And when should it be applied: to the master data sheet, or after sorting the data as indicated above?
3. If we use the box plot to fix one variable’s column, it will impact the other variables since it eliminates a complete row. That row may have good values for the other variables.
4. Any advice or suggestions in general for dealing with the outliers without significantly impacting the obtained data?
Thank you so much!
Hosam
Hi Jim,
Is there a correct way to run an outlier analysis? What if there are 3 variates (12 variables)? Is there any rule about this? Thank you
It’s best if you use several methods to find outliers. The true outliers will satisfy multiple methods.
You can compare the findings of the different methods and have confidence those data points can be treated as outliers when flagged by different methods independently.
It also helps to have a clear understanding of what your dataset describes in reality, what those outliers really represent, and how they came to exist. If it’s likely they are errors, great; that’s more justification to ignore them.
If they are certainly correctly sampled, you must consider what it means to remove them from your study and how their removal affects the integrity of your analysis.
Hi D,
Thanks for writing! In terms of flagging observations for investigation, I’d agree that if multiple methods find the same values, there’s good reason to investigate them. However, flagging by multiple methods doesn’t necessarily increase the likelihood that removing those values is appropriate.
I definitely agree that understanding what your dataset describes and how the outliers came to be are both crucial tasks.
However, removal really depends on understanding how those values came to exist. That can get fairly complicated. I discuss those issues in my post about determining whether to remove outliers. Identifying the candidate is the easy part. You’re just looking for unusual values. Why they exist and what to do about them is where it can get complicated!
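To make the “flag with multiple methods and compare” workflow concrete, here is a small sketch (the data and function names are my own illustration, not from the post). Note how the two methods can disagree in a small sample: with n = 7 the Z-score can never reach 3 because of the (n − 1)/√n cap discussed later in these comments, while the IQR method still flags the extreme value:

```python
import numpy as np

def zscore_flags(data, threshold=3.0):
    """Flag values more than `threshold` sample standard deviations from the mean."""
    z = (data - data.mean()) / data.std(ddof=1)
    return np.abs(z) > threshold

def iqr_flags(data, k=1.5):
    """Flag values beyond Tukey's fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    return (data < q1 - k * iqr) | (data > q3 + k * iqr)

heights = np.array([1.70, 1.75, 1.78, 1.80, 1.82, 1.85, 2.40])
candidates = zscore_flags(heights) | iqr_flags(heights)   # investigate all of these
agreement = zscore_flags(heights) & iqr_flags(heights)    # flagged by both methods
```

Points flagged by every method are strong candidates for investigation, but as the reply above notes, agreement alone doesn’t justify removal.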
Sorry, I didn’t want to blind you, brother! Thank you, I’m studying it now. I sent you a message with a question.
No worries! 🙂
I will reply to your other question soon.
HELLO JIM, IF YOU DON’T HAVE DATA AT START CAN YOU STILL CRAFT THE RESEARCH QUESTION THEN DESIGN A STUDY TO COLLECT DATA? IF SO HOW?
-Thank you
Hi Kechler, first, please don’t use ALL CAPS! It hurts my eyes!
Yes, you can craft your research question and study design before collecting data. In fact, that’s the preferred approach. If you start collecting data before you have a research question and design, it’s very likely that you won’t be collecting the necessary data to answer your question. I write about this in my post about conducting scientific studies with statistical analyses.
Is it legitimate to detect outliers based on the Z-score for a large population (800K observations) even if it’s not normal?
Or is it more appropriate to use the IQR and then compute upper and lower hinges? Or are there other methods to apply in this case?
Thanks in advance
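As an aside for readers with skewed data like this: one robust alternative (my addition, not part of the original thread) is the Iglewicz–Hoaglin modified Z-score, which swaps the mean and standard deviation for the median and median absolute deviation (MAD). They recommend flagging |M| > 3.5:

```python
import numpy as np

def modified_zscores(data):
    """Iglewicz-Hoaglin modified Z-scores. Robust to skew and to the outliers
    themselves, because the median and MAD barely move when extreme values are
    added. Assumes MAD > 0 (i.e., the data are not mostly identical values)."""
    med = np.median(data)
    mad = np.median(np.abs(data - med))   # median absolute deviation
    return 0.6745 * (data - med) / mad    # 0.6745 scales MAD to ~sigma for normal data

heights = np.array([1.70, 1.75, 1.78, 1.80, 1.82, 1.85, 2.40])
flags = np.abs(modified_zscores(heights)) > 3.5   # their recommended cutoff
```

Unlike the ordinary Z-score, this statistic isn’t inflated by the very outliers it’s hunting, so it behaves sensibly even on large, non-normal datasets.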
Hello Sir Good morning,
I’m a research scholar, and I’m comparing the means of two independent sample datasets (i.e., stock price returns for two pricing methods). In order to run a parametric test, my data should be normally distributed, but when I test normality, I find my data is not normal. My guide is suggesting that I normalize my data using Z-scores and find their areas under the curve, but I’m not able to understand that. Please help me, sir, to normalize my data.
Hi Jim,
I have a dataset with 11 columns and I have written a common function detect_outliers() to find outliers in the columns.
For the first 6 columns, the function works, but for the remaining 5 columns the function returns an empty list even though the columns have outliers. You can see the code below:
################
import numpy as np

def detect_outliers(data):
    """Return the values in `data` whose Z-scores exceed the threshold."""
    outliers = []
    threshold = 3
    mean = np.mean(data)
    std = np.std(data)
    for value in data:
        z_score = (value - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(value)
    return outliers
################
As you can see, I have taken threshold = 3. For the first six columns, the function works because z_score > 3 for the outliers.
But for the rest of the columns, the z_scores for the outliers are only greater than 1 (z_score > 1), so the threshold would have to be 1 for the remaining columns.
Here I have only 11 columns in the dataset. But what if I had 1000 columns? In that case, I can’t check the threshold for each and every column.
Please help me and reply at the earliest!
Hi Suruchi,
Why would you use a Z-score of 1 to detect outliers? I’m not sure why it’s not working, but with such a low threshold you should have more detections. How many observations are in each column?
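For readers with the same “too many columns to check by hand” problem, one option is to vectorize the detector across columns so a single call handles the whole table (a sketch with made-up data; whether a fixed threshold of 3 suits every column is a separate judgment call):

```python
import numpy as np

def zscore_flags_2d(data, threshold=3.0):
    """Column-wise Z-score flags for a 2-D array (rows = observations,
    columns = variables). Returns a boolean mask with data's shape."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    return np.abs(z) > threshold

# Hypothetical data: column 0 carries one extreme value, column 1 is clean.
col0 = np.concatenate([np.linspace(0.0, 1.0, 20), [50.0]])
col1 = np.linspace(0.0, 1.0, 21)
mask = zscore_flags_2d(np.column_stack([col0, col1]))
```

The mask works the same whether you have 11 columns or 1000; each column is standardized against its own mean and standard deviation.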
Hi Jim, thanks for sharing the details on outliers. I have one question and would be happy if you could advise me.
My question is: I am building a machine learning model, and I have a training dataset and a testing dataset. I removed the outliers from the training dataset and built an ML model with a good efficiency level. However, I have a large number of outliers in the testing dataset (which I have to submit as is).
Now, my ML model is less efficient when applied to the unseen test dataset (with outliers).
Can you please advise me on how to achieve more efficiency on the test dataset? Also, I don’t want to lose any observed values in the test dataset.
Thanks.
Rutvij
How to treat outliers ??? Please help me
Hi Narasimha,
Read my follow up post to this one: Guidelines for Removing and Handling Outliers.
I usually use a Q-Q plot to detect outliers – essentially a visualization of what you suggest with the Z-score.
Hi Denny,
Thanks for the suggestion. Unusual Z-scores might stand out more in a plot than in a list. Just be aware of the constraints on Z-scores in small samples and the fact that Z-scores themselves are sensitive to outliers.
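For readers who want to go beyond eyeballing the plot: SciPy’s probplot returns the plotting positions directly, so you can rank points by their vertical distance from the fitted line (the heights data here is my own illustration, echoing the post’s example):

```python
import numpy as np
from scipy import stats

heights = np.array([1.70, 1.75, 1.78, 1.80, 1.82, 1.85, 2.40])
# probplot pairs each ordered value with its theoretical normal quantile and
# fits a least-squares line; points far off that line are outlier candidates.
(theo_q, ordered), (slope, intercept, r) = stats.probplot(heights, dist="norm")
distance = ordered - (slope * theo_q + intercept)   # vertical gap from the line
```

Sorting by the absolute distance gives you the same candidates a visual inspection of the Q-Q plot would, without relying on the eye.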
I haven’t seen this formula related to Z-scores before: (n − 1) / √n.
Can you share more details about where it comes from? It’s not intuitive to me at first glance.
Hi Brion,
I’ve added a reference to this post for this formula. The referenced article discusses this limitation in the context of finding outliers and it includes references to other sources where the limit was derived. In a nutshell, maximizing Z-scores depends on minimizing the standard deviation (or variance). As I showed earlier in this post, the outlier is far from the mean score. While it increases the mean, it drastically increases the standard deviation. The net result of both increases is that it limits the maximum Z-value. In small samples, this limitation is even greater and severely constrains the maximum absolute Z-scores.
In general, an outlier pulls the mean towards it and inflates the standard deviation. Both effects reduce its Z-score. Indeed, our outlier’s Z-score of ~3.6 is greater than 3, but just barely. The Z-score seems to indicate that the value is just across the boundary for being an outlier. However, it’s truly a severe outlier when you observe how unusual it is. Both the boxplot and the IQR method make this clear. And, simply comparing the value to reasonable values, it is far beyond legitimately possible human heights.
The article uses an example of a dataset with 5 values {0, 0, 0, 0, 1 million}. The Z-score for the value of 1 million is only 1.789! Not an outlier using Z-scores!
To quote the article, “The concept of a Z score as a measure of a value’s position within a data set in terms of standard deviations is intuitively appealing. Unfortunately, the behavior of Z is quite constrained for small data sets.”
To illustrate this constraint, I’m including the table below, which lists the maximum absolute Z-score by sample size. Note how absolute Z-scores can exceed 3 only when the sample size is 11 or greater.
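The constraint is also easy to verify numerically. Using the referenced article’s five-value example, the extreme value’s Z-score lands exactly at the (n − 1)/√n ceiling, no matter how large you make it (a quick check of my own, using the sample standard deviation, i.e., ddof=1):

```python
import numpy as np

data = np.array([0.0, 0.0, 0.0, 0.0, 1_000_000.0])
n = len(data)
z = (data - data.mean()) / data.std(ddof=1)   # Z-scores with the sample SD
bound = (n - 1) / np.sqrt(n)                  # maximum possible |Z| for n = 5
print(round(z.max(), 3), round(bound, 3))     # -> 1.789 1.789
```

Replacing 1,000,000 with any larger number leaves z.max() unchanged: the outlier inflates the standard deviation exactly as fast as it inflates its own distance from the mean.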
I hope this helps!
Excellent work. Congratulations