Outliers are data points that are far from other data points. In other words, they’re unusual values in a dataset. Outliers are problematic for many statistical analyses because they can cause tests to either miss significant findings or distort real results.
Unfortunately, there are no strict statistical rules for definitively identifying outliers. Finding outliers depends on subject-area knowledge and an understanding of the data collection process. While there is no solid mathematical definition, there are guidelines and statistical tests you can use to find outlier candidates.
In this post, I’ll explain what outliers are and why they are problematic, and present various methods for finding them. I close by comparing the different techniques for identifying outliers and sharing my preferred approach.
Outliers and Their Impact
Outliers are a simple concept—they are values that are notably different from other data points, and they can cause problems in statistical procedures.
To demonstrate how much a single outlier can affect the results, let’s examine the properties of an example dataset. It contains 15 height measurements of human males. One of those values is an outlier. The table below shows the mean height and standard deviation with and without the outlier.
Throughout this post, I’ll be using this example CSV dataset: Outliers.
| |With Outlier|Without Outlier|Difference|
|Mean|2.4m (7’ 10.5”)|1.8m (5’ 10.8”)|0.6m (~2 feet)|
|Standard deviation|2.3m (7’ 6”)|0.14m (5.5 inches)|2.16m (~7 feet)|
From the table, it’s easy to see how a single outlier can distort reality. A single value changes the mean height by 0.6m (2 feet) and the standard deviation by a whopping 2.16m (7 feet)! Hypothesis tests that use the mean with the outlier are off the mark. And, the much larger standard deviation will severely reduce statistical power!
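If you want to see this effect for yourself, here is a minimal Python sketch. The heights are made-up values chosen for illustration, not the post’s CSV dataset, and the `summarize` helper is just a name I picked.

```python
import statistics

# Hypothetical heights in meters (NOT the post's dataset).
# The last value is a planted outlier.
heights = [1.62, 1.66, 1.70, 1.72, 1.74, 1.76, 1.78, 1.80,
           1.82, 1.84, 1.86, 1.88, 1.92, 1.96, 10.8]

def summarize(values):
    """Return (mean, sample standard deviation) for a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)

mean_with, sd_with = summarize(heights)
mean_without, sd_without = summarize(heights[:-1])  # drop the planted outlier

print(f"With outlier:    mean={mean_with:.2f}  sd={sd_with:.2f}")
print(f"Without outlier: mean={mean_without:.2f}  sd={sd_without:.2f}")
```

Even with invented numbers, one extreme value pulls the mean upward and inflates the standard deviation many times over.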
Before performing statistical analyses, you should identify potential outliers. That’s the subject of this post. In the next post, we’ll move on to figuring out what to do with them.
There are a variety of ways to find outliers. All these methods employ different approaches for finding values that are unusual compared to the rest of the dataset. I’ll start with visual assessments and then move on to more analytical assessments.
Let’s find that outlier! I’ve got five methods for you to try.
Sorting Your Datasheet to Find Outliers
Sorting your datasheet is a simple but effective way to highlight unusual values. Simply sort your datasheet for each variable and then look for unusually high or low values.
For example, I’ve sorted the example dataset in ascending order, as shown below. The highest value is clearly different than the others. While this approach doesn’t quantify the outlier’s degree of unusualness, I like it because, at a glance, you’ll find the unusually high or low values.
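Outside of a spreadsheet, the same idea takes only a couple of lines. This sketch assumes hypothetical heights in meters (not the post’s dataset) and simply prints both ends of the sorted list so the extremes are easy to eyeball.

```python
# Hypothetical heights in meters, deliberately unsorted.
heights = [1.84, 1.62, 10.8, 1.76, 1.92, 1.70, 1.88, 1.66,
           1.80, 1.96, 1.72, 1.86, 1.74, 1.82, 1.78]

ordered = sorted(heights)  # ascending order

# The unusual values, if any, sit at one end of the sorted list.
print("Lowest three: ", ordered[:3])
print("Highest three:", ordered[-3:])
```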
Graphing Your Data to Identify Outliers
Boxplots, histograms, and scatterplots can highlight outliers.
Boxplots display asterisks or other symbols on the graph to indicate explicitly when datasets contain outliers. These graphs use the interquartile method with fences to find outliers, which I explain later. The boxplot below displays our example dataset. It’s clear that the outlier is quite different than the typical data value.
You can also use boxplots to find outliers when you have groups in your data. The boxplot below shows a different dataset that has an outlier in the Method 2 group.
Histograms also emphasize the existence of outliers. Look for isolated bars, as shown below. Our outlier is the bar far to the right. The graph crams the legitimate data points on the far left.
Most of the outliers I discuss in this post are univariate outliers. We look at a data distribution for a single variable and find values that fall outside the distribution. However, you can use a scatterplot to detect outliers in a multivariate setting.
In the graph below, we’re looking at two variables, Input and Output. The scatterplot with regression line shows how most of the points follow the fitted line for the model. However, the circled point does not fit the model well.
Interestingly, the Input value (~14) for this observation isn’t unusual at all because the other Input values range from 10 through 20 on the X-axis. Also, notice how the Output value (~50) is similarly within the range of values on the Y-axis (10 – 60). Neither the Input nor the Output values themselves are unusual in this dataset. Instead, it’s an outlier because it doesn’t fit the model.
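To make the idea of “doesn’t fit the model” concrete, here is a rough sketch that fits a least-squares line by hand and reports the point with the largest residual. The Input and Output values are invented for this example (they are not the data behind the graph), and I planted the point (14, 50) to mimic the circled observation.

```python
# Made-up Input/Output pairs; the point (14, 50) is planted to sit off the line.
inputs = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
outputs = [21, 22, 27, 28, 50, 36, 37, 42, 43, 48, 51]

n = len(inputs)
mean_x = sum(inputs) / n
mean_y = sum(outputs) / n

# Ordinary least-squares slope and intercept for a single predictor.
sxx = sum((x - mean_x) ** 2 for x in inputs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed Output minus the Output predicted by the fitted line.
residuals = [y - (intercept + slope * x) for x, y in zip(inputs, outputs)]
worst = max(range(n), key=lambda i: abs(residuals[i]))

print(f"Fitted line: Output = {intercept:.2f} + {slope:.2f} * Input")
print(f"Largest residual: {residuals[worst]:.2f} "
      f"at Input={inputs[worst]}, Output={outputs[worst]}")
```

Neither coordinate of the flagged point is extreme on its own. It stands out only because of its distance from the fitted line.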
This type of outlier can be a problem in regression analysis. Given the multifaceted nature of multivariate regression, there are numerous types of outliers in that realm. In my ebook about regression analysis, I detail various methods and tests for identifying outliers in a multivariate context.
For the rest of this post, we’ll focus on univariate outliers.
Using Z-scores to Detect Outliers
Z-scores can quantify the unusualness of an observation when your data follow the normal distribution. Z-scores are the number of standard deviations above and below the mean that each value falls. For example, a Z-score of 2 indicates that an observation is two standard deviations above the average while a Z-score of -2 signifies it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean.
To calculate the Z-score for an observation, take the raw measurement, subtract the mean, and divide by the standard deviation. Mathematically, the formula for that process is the following:
Z = (X − μ) / σ
Here, X is the raw measurement, μ is the mean, and σ is the standard deviation.
The further away an observation’s Z-score is from zero, the more unusual it is. A standard cut-off for finding outliers is a Z-score of +/-3 or further from zero. The probability distribution below displays the distribution of Z-scores in a standard normal distribution. Z-scores beyond +/- 3 are so extreme you can barely see the shading under the curve.
In a population that follows the normal distribution, Z-score values more extreme than +/- 3 have a probability of 0.0027 (2 * 0.00135), which is about 1 in 370 observations. However, if your data don’t follow the normal distribution, this approach might not be accurate.
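You can check that tail probability with Python’s standard library. This is a small sketch using `statistics.NormalDist`, and it assumes a perfectly normal population, which real data rarely are.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability of landing more than 3 standard deviations from the mean,
# counting both tails.
tail = 2 * (1 - std_normal.cdf(3))

print(f"P(|Z| > 3) = {tail:.4f}")
print(f"About 1 in {1 / tail:.0f} observations")
```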
Z-scores and Our Example Dataset
Below, I display the values in our example dataset along with their Z-scores. This approach identifies the same observation as being an outlier.
Note that Z-scores can be misleading with small datasets because the maximum absolute Z-score is limited to (n−1) / √n when you calculate it with the sample standard deviation. Indeed, our Z-score of ~3.6 is right near the maximum value for a sample size of 15. Sample sizes of 10 or fewer observations cannot have Z-scores that exceed a cutoff value of +/-3.
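Here is a quick sketch of that ceiling for a few sample sizes. The `max_z` function name is my own, and the formula assumes Z-scores computed with the sample standard deviation.

```python
import math

def max_z(n):
    """Largest possible absolute Z-score in a sample of size n: (n - 1) / sqrt(n)."""
    return (n - 1) / math.sqrt(n)

for n in (5, 10, 11, 15, 30):
    print(f"n={n:>2}  maximum |Z| = {max_z(n):.3f}")
```

The ceiling first passes 3 at n = 11, and for n = 15 it is about 3.61, which is why a cutoff of +/-3 can never flag anything in samples of 10 or fewer.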
Also, note that the presence of the outlier throws off the Z-scores because it inflates the mean and standard deviation as we saw earlier. Notice how all the Z-scores are negative except the outlier’s value. If we calculated Z-scores without the outlier, they’d be different! Be aware that if your dataset contains outliers, Z-scores are biased such that they appear to be less extreme (i.e., closer to zero).
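The sketch below illustrates both points with hypothetical heights (not the post’s CSV data). It computes Z-scores with the sample standard deviation, flags anything at or beyond +/-3, and then recomputes the Z-scores after dropping the planted outlier. The `z_scores` helper is a name I made up.

```python
import statistics

def z_scores(values):
    """Z-score of each value: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]

# Hypothetical heights in meters; the last value is a planted outlier.
heights = [1.62, 1.66, 1.70, 1.72, 1.74, 1.76, 1.78, 1.80,
           1.82, 1.84, 1.86, 1.88, 1.92, 1.96, 10.8]

z_with = z_scores(heights)          # outlier included
z_without = z_scores(heights[:-1])  # outlier dropped

flagged = [x for x, z in zip(heights, z_with) if abs(z) >= 3]
print("Flagged at |Z| >= 3:", flagged)

# Compare the Z-score of the same value (1.96 m) under both calculations.
print(f"Z for 1.96 m with the outlier present: {z_with[-2]:.2f}")
print(f"Z for 1.96 m with the outlier removed: {z_without[-1]:.2f}")
```

With the outlier present, every other value gets a small negative Z-score. Once it is removed, the same 1.96 m height sits well above the mean.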
Using the Interquartile Range to Create Outlier Fences
You can use the interquartile range (IQR), the first and third quartiles, and an adjustment factor to calculate boundaries for what constitutes minor and major outliers. Minor and major denote the unusualness of the outlier relative to the overall distribution of values. Major outliers are more extreme. Analysts also refer to these categorizations as mild and extreme outliers.
The IQR is the middle 50% of the dataset. It’s the range of values between the third quartile and the first quartile (Q3 – Q1). We can take the IQR, Q1, and Q3 values to calculate the following outlier fences for our dataset: lower outer, lower inner, upper inner, and upper outer. These fences determine whether data points are outliers and whether they are mild or extreme.
Values that fall inside the two inner fences are not outliers. Let’s see how this method works using our example dataset.
Related post: Percentiles: Interpretations and Calculations
Calculating the Outlier Fences Using the Interquartile Range
Using statistical software, I can determine the interquartile range along with the Q1 and Q3 values for our example dataset. We’ll need these values to calculate the “fences” for identifying minor and major outliers. The output below indicates that our Q1 value is 1.714 and the Q3 value is 1.936. Our IQR is 1.936 – 1.714 = 0.222.
To calculate the outlier fences, do the following:
1. Take your IQR and multiply it by 1.5 and 3. We’ll use these values to obtain the inner and outer fences. For our example, the IQR equals 0.222. Consequently, 0.222 * 1.5 = 0.333 and 0.222 * 3 = 0.666. We’ll use 0.333 and 0.666 in the following steps.
2. Calculate the inner and outer lower fences. Take the Q1 value and subtract the two values from step 1. The two results are the lower inner and lower outer fences. For our example, Q1 is 1.714. So, the lower inner fence = 1.714 – 0.333 = 1.381 and the lower outer fence = 1.714 – 0.666 = 1.048.
3. Calculate the inner and outer upper fences. Take the Q3 value and add the two values from step 1. The two results are the upper inner and upper outer fences. For our example, Q3 is 1.936. So, the upper inner fence = 1.936 + 0.333 = 2.269 and the upper outer fence = 1.936 + 0.666 = 2.602.
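The steps above can be sketched in a few lines of Python. The Q1 and Q3 values come from this post’s output, while the `outlier_fences` function and its dictionary keys are names I chose for illustration.

```python
def outlier_fences(q1, q3):
    """Return the four outlier fences computed from the first and third quartiles."""
    iqr = q3 - q1
    return {
        "lower_outer": q1 - 3.0 * iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3.0 * iqr,
    }

# Q1 and Q3 as reported for the example dataset in this post.
fences = outlier_fences(q1=1.714, q3=1.936)

for name, value in fences.items():
    print(f"{name}: {value:.3f}")
```

Keep in mind that statistical packages use slightly different rules for calculating quartiles, so Q1 and Q3 from your software might not match another tool exactly.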
Using the Outlier Fences with Our Example Dataset
For our example dataset, the values for these fences are the following: 1.048, 1.381, 2.269, 2.602. Almost all of our data should fall between the inner fences, which are 1.381 and 2.269. At this point, we look at our data values and determine whether any qualify as being major or minor outliers. 14 out of the 15 data points fall inside the inner fences—they are not outliers. The 15th data point falls outside the upper outer fence—it’s a major or extreme outlier.
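If you’d rather not compare each value by eye, a small helper can do the sorting into categories. This sketch hard-codes the four fence values from this post. The `classify` function and the three test heights are hypothetical and exist only to show each category.

```python
# Fence values taken from the calculation in this post.
LOWER_OUTER, LOWER_INNER = 1.048, 1.381
UPPER_INNER, UPPER_OUTER = 2.269, 2.602

def classify(value):
    """Label a value using the inner and outer fences."""
    if value < LOWER_OUTER or value > UPPER_OUTER:
        return "major outlier"   # beyond an outer fence
    if value < LOWER_INNER or value > UPPER_INNER:
        return "minor outlier"   # between an inner and an outer fence
    return "not an outlier"      # inside both inner fences

# Hypothetical heights in meters, chosen only to exercise each branch.
for height in (1.80, 2.40, 10.8):
    print(f"{height} -> {classify(height)}")
```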
The IQR method is helpful because it uses percentiles, which do not depend on a specific distribution. Additionally, percentiles are relatively robust to the presence of outliers compared to the other quantitative methods.
Boxplots use the IQR method for calculating the inner fences. Typically, I’ll use boxplots rather than calculating the fences myself when I want to use this approach. Of the quantitative approaches in this post, this is my preferred method.
Finding Outliers with Hypothesis Tests
You can use hypothesis tests to find outliers. Many outlier tests exist, but I’ll focus on one to illustrate how they work. In this post, I demonstrate Grubbs’ test, which tests the following hypotheses:
- Null: All values in the sample were drawn from a single population that follows the same normal distribution.
- Alternative: One value in the sample was not drawn from the same normally distributed population as the other values.
Let’s perform this hypothesis test using our sample dataset. Grubbs’ test assumes your data are drawn from a normally distributed population, and it can detect only one outlier. If you suspect you have additional outliers, use a different test.
Grubbs’ outlier test produced a p-value that displays as 0.000, meaning it is below 0.001. Because it is less than our significance level, we can conclude that our dataset contains an outlier. The output indicates it is the high value we found before.
If you use Grubbs’ test and find an outlier, don’t remove that outlier and perform the analysis again. That process can cause you to remove values that are not outliers.
Challenges of Using Outlier Hypothesis Tests: Masking and Swamping
When performing an outlier test, you either need to choose a procedure based on the number of outliers or specify the number of outliers for a test. Grubbs’ test checks for only one outlier. However, other procedures, such as the Tietjen-Moore Test, require you to specify the number of outliers. That’s hard to do correctly! After all, you’re performing the test to find outliers! Masking and swamping are two problems that can occur when you specify the incorrect number of outliers in a dataset.
Masking occurs when you specify too few outliers. The additional outliers that exist can affect the test so that it detects no outliers. For example, if you specify one outlier, and there are two, the test can miss both outliers.
Conversely, swamping occurs when you specify too many outliers. In this case, the test identifies too many data points as being outliers. For example, if you specify two outliers, and there is only one, the test might determine that there are two outliers.
Because of these problems, I’m not a big fan of outlier tests. More on this in the next section!
My Philosophy about Finding Outliers
As you saw, there are many ways to identify outliers. My philosophy is that when analyzing data, you must go into the analysis with in-depth knowledge about all the variables. Part of this knowledge is knowing what values are typical, unusual, and impossible.
I find that when you have this in-depth knowledge, it’s best to use the more straightforward, visual methods. At a glance, data points that are potential outliers will pop out under your knowledgeable gaze. Consequently, I’ll often use boxplots, histograms, and good old-fashioned data sorting! These simple tools provide enough information for me to find unusual data points for further investigation.
Typically, I don’t use Z-scores and hypothesis tests to find outliers because of their various complications. Using outlier tests can be challenging because they usually assume your data follow the normal distribution, and then there’s masking and swamping. Additionally, the existence of outliers makes Z-scores less extreme. It’s ironic, but these methods for identifying outliers are actually sensitive to the presence of outliers! Fortunately, as long as researchers use a simple method to display unusual values, a knowledgeable analyst is likely to know which values need further investigation.
In my view, the more formal statistical tests and calculations are overkill because they can’t definitively identify outliers. Ultimately, analysts must investigate unusual values and use their expertise to determine whether they are legitimate data points. Statistical procedures don’t know the subject matter or the data collection process and can’t make the final determination. You should not include or exclude an observation based entirely on the results of a hypothesis test or statistical measure.
At this stage of the analysis, we’re only identifying potential outliers for further investigation. It’s just the first step in handling them. If we err, we want to err on the side of investigating too many values rather than too few.
In my next post, I’ll explain what you’re looking for when investigating outliers and how that helps you determine whether to remove them from your dataset. Not all outliers are bad and some should not be deleted. In fact, outliers can be very informative about the subject-area and data collection process. It’s important to understand how outliers occur and whether they might happen again as a normal part of the process or study area.