Outliers are data points that are far from other data points. In other words, they’re unusual values in a dataset. Outliers are problematic for many statistical analyses because they can cause tests to either miss significant findings or distort real results.

Unfortunately, there are no strict statistical rules for definitively identifying outliers. Finding outliers depends on subject-area knowledge and an understanding of the data collection process. While there is no solid mathematical definition, there are guidelines and statistical tests you can use to find outlier candidates.

In this post, I’ll explain what outliers are and why they are problematic, and present various methods for finding them. I close by comparing the different techniques for identifying outliers and sharing my preferred approach.

## Outliers and Their Impact

Outliers are a simple concept—they are values that are notably different from other data points, and they can cause problems in statistical procedures.

To demonstrate how much a single outlier can affect the results, let’s examine the properties of an example dataset. It contains 15 height measurements of human males. One of those values is an outlier. The table below shows the mean height and standard deviation with and without the outlier.

Throughout this post, I’ll be using this example CSV dataset: Outliers.

| | With Outlier | Without Outlier | Difference |
|---|---|---|---|
| Mean | 2.4m (7’ 10.5”) | 1.8m (5’ 10.8”) | 0.6m (~2 feet) |
| Standard deviation | 2.3m (7’ 6”) | 0.14m (5.5 inches) | 2.16m (~7 feet) |

From the table, it’s easy to see how a single outlier can distort reality. A single value changes the mean height by 0.6m (2 feet) and the standard deviation by a whopping 2.16m (7 feet)! Hypothesis tests that use the mean with the outlier are off the mark. And, the much larger standard deviation will severely reduce statistical power!
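To see the same effect in code, here is a minimal Python sketch. The heights are hypothetical values made up for illustration, not the post’s CSV data, so the numbers will only roughly resemble the table above. The helper name `summarize` is also just an assumption for this example.

```python
from statistics import mean, stdev

# Hypothetical heights in meters (NOT the post's CSV file): 14 typical values.
typical = [1.62, 1.68, 1.71, 1.73, 1.75, 1.78, 1.80,
           1.81, 1.83, 1.85, 1.88, 1.90, 1.93, 1.97]
outlier = 10.8  # hypothetical impossible value, e.g. a data-entry error

with_outlier = typical + [outlier]


def summarize(values):
    """Return (mean, sample standard deviation) for a list of numbers."""
    return mean(values), stdev(values)


mean_with, sd_with = summarize(with_outlier)
mean_without, sd_without = summarize(typical)

print(f"Mean: {mean_with:.2f} m with, {mean_without:.2f} m without")
print(f"SD:   {sd_with:.2f} m with, {sd_without:.2f} m without")
```

One value out of 15 is enough to move both statistics far away from anything that describes the other 14 observations.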

Before performing statistical analyses, you should identify potential outliers. That’s the subject of this post. In the next post, we’ll move on to figuring out what to do with them.

There are a variety of ways to find outliers. All these methods employ different approaches for finding values that are unusual compared to the rest of the dataset. I’ll start with visual assessments and then move onto more analytical assessments.

Let’s find that outlier! I’ve got five methods for you to try.

## Sorting Your Datasheet to Find Outliers

Sorting your datasheet is a simple but effective way to highlight unusual values. Simply sort your data sheet for each variable and then look for unusually high or low values.

For example, I’ve sorted the example dataset in ascending order, as shown below. The highest value is clearly different than the others. While this approach doesn’t quantify the outlier’s degree of unusualness, I like it because, at a glance, you’ll find the unusually high or low values.
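If your data live in code rather than a datasheet, the same check takes a few lines. This is a small sketch using hypothetical heights (not the post’s dataset): sort the values and look at both ends.

```python
# Hypothetical heights in meters, in the order they were recorded.
heights = [1.81, 1.68, 1.93, 1.75, 10.8, 1.62, 1.85, 1.71,
           1.97, 1.78, 1.88, 1.73, 1.90, 1.80, 1.83]

ranked = sorted(heights)  # ascending order; the original list is unchanged

# Unusual values, if any, sit at the ends of the sorted list.
print("Lowest three: ", ranked[:3])
print("Highest three:", ranked[-3:])
```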

## Graphing Your Data to Identify Outliers

Boxplots, histograms, and scatterplots can highlight outliers.

Boxplots display asterisks or other symbols on the graph to indicate explicitly when datasets contain outliers. These graphs use the interquartile method with fences to find outliers, which I explain later. The boxplot below displays our example dataset. It’s clear that the outlier is quite different than the typical data value.

You can also use boxplots to find outliers when you have groups in your data. The boxplot below shows a different dataset that has an outlier in the Method 2 group. Click here to learn more about boxplots.

Histograms also emphasize the existence of outliers. Look for isolated bars, as shown below. Our outlier is the bar far to the right. The graph crams the legitimate data points on the far left.

Most of the outliers I discuss in this post are univariate outliers. We look at a data distribution for a single variable and find values that fall outside the distribution. However, you can use a scatterplot to detect outliers in a multivariate setting.

In the graph below, we’re looking at two variables, Input and Output. The scatterplot with regression line shows how most of the points follow the fitted line for the model. However, the circled point does not fit the model well.

Interestingly, the Input value (~14) for this observation isn’t unusual at all because the other Input values range from 10 through 20 on the X-axis. Also, notice how the Output value (~50) is similarly within the range of values on the Y-axis (10 – 60). Neither the Input nor the Output values themselves are unusual in this dataset. Instead, it’s an outlier because it doesn’t fit the model.

This type of outlier can be a problem in regression analysis. Given the multifaceted nature of multivariate regression, there are numerous types of outliers in that realm. In my ebook about regression analysis, I detail various methods and tests for identifying outliers in a multivariate context.

For the rest of this post, we’ll focus on univariate outliers.

## Using Z-scores to Detect Outliers

Z-scores can quantify the unusualness of an observation when your data follow the normal distribution. Z-scores are the number of standard deviations above and below the mean that each value falls. For example, a Z-score of 2 indicates that an observation is two standard deviations above the average while a Z-score of -2 signifies it is two standard deviations below the mean. A Z-score of zero represents a value that equals the mean.

To calculate the Z-score for an observation, take the raw measurement, subtract the mean, and divide by the standard deviation. Mathematically, the formula for that process is the following:

$$Z = \frac{X - \mu}{\sigma}$$

The further away an observation’s Z-score is from zero, the more unusual it is. A standard cutoff for finding outliers is a Z-score of +/-3 or further from zero. The probability distribution below displays the distribution of Z-scores in a standard normal distribution. Z-scores beyond +/- 3 are so extreme you can barely see the shading under the curve.

In a population that follows the normal distribution, Z-score values more extreme than +/- 3 have a probability of 0.0027 (2 * 0.00135), which is about 1 in 370 observations. However, if your data don’t follow the normal distribution, this approach might not be accurate.
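You can verify that probability with Python’s standard library. This sketch relies on the identity P(|Z| > z) = erfc(z / √2) for a standard normal variable; the function name `two_sided_tail` is just a label chosen for this example.

```python
import math


def two_sided_tail(z):
    """P(|Z| > z) for a standard normal variable, via the complementary error function."""
    return math.erfc(z / math.sqrt(2))


p = two_sided_tail(3.0)
print(f"P(|Z| > 3) = {p:.4f}, about 1 in {1 / p:.0f}")
# prints: P(|Z| > 3) = 0.0027, about 1 in 370
```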

### Z-scores and Our Example Dataset

Below, I display the values in our example dataset along with their Z-scores. This approach identifies the same observation as an outlier.

Note that Z-scores can be misleading with small datasets because the maximum Z-score is limited to (*n*−1) / √*n*. Indeed, our Z-score of ~3.6 is right near the maximum value for a sample size of 15. Sample sizes of 10 or fewer observations cannot have Z-scores that exceed a cutoff value of +/-3.

Also, note that the presence of the outlier throws off the Z-scores because it inflates the mean and standard deviation as we saw earlier. Notice how all the Z-scores are negative except the outlier’s value. If we calculated Z-scores without the outlier, they’d be different! Be aware that if your dataset contains outliers, Z-values are biased such that they appear to be less extreme (i.e., closer to zero).
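Here is a rough sketch of the Z-score approach in Python. The heights are hypothetical stand-ins for the example dataset, and I’m assuming the sample standard deviation (the *n*−1 version), which is the one the (*n*−1) / √*n* limit applies to. The function name `z_scores` is an assumption for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical heights in meters (not the post's CSV data); the last one is the outlier.
heights = [1.62, 1.68, 1.71, 1.73, 1.75, 1.78, 1.80,
           1.81, 1.83, 1.85, 1.88, 1.90, 1.93, 1.97, 10.8]


def z_scores(values):
    """Z-score of every value, using the sample mean and sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


zs = z_scores(heights)
flagged = [v for v, z in zip(heights, zs) if abs(z) > 3]

n = len(heights)
max_possible = (n - 1) / sqrt(n)  # largest |Z| any value can reach with n observations

for v, z in zip(heights, zs):
    print(f"{v:6.2f}  z = {z:+.2f}")
print("Flagged:", flagged)
print(f"Maximum possible |Z| for n = {n}: {max_possible:.2f}")
```

With these made-up values, every Z-score except the outlier’s is negative, and the outlier’s own Z-score sits just under the ceiling for 15 observations. That is the same pattern described above.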

**Related posts**: Normal Distribution and Understanding Probability Distributions

## Using the Interquartile Range to Create Outlier Fences

You can use the interquartile range (IQR), several quartile values, and an adjustment factor to calculate boundaries for what constitutes minor and major outliers. Minor and major denote the unusualness of the outlier relative to the overall distribution of values. Major outliers are more extreme. Analysts also refer to these categorizations as mild and extreme outliers.

The IQR is the middle 50% of the dataset. It’s the range of values between the third quartile and the first quartile (Q3 – Q1). We can take the IQR, Q1, and Q3 values to calculate the following outlier fences for our dataset: lower outer, lower inner, upper inner, and upper outer. These fences determine whether data points are outliers and whether they are mild or extreme.

Values that fall inside the two inner fences are not outliers. Let’s see how this method works using our example dataset.

**Related post**: Percentiles: Interpretations and Calculations

### Calculating the Outlier Fences Using the Interquartile Range

Using statistical software, I can determine the interquartile range along with the Q1 and Q3 values for our example dataset. We’ll need these values to calculate the “fences” for identifying minor and major outliers. The output below indicates that our Q1 value is 1.714 and the Q3 value is 1.936. Our IQR is 1.936 – 1.714 = 0.222.

To calculate the outlier fences, do the following:

- Take your IQR and multiply it by 1.5 and 3. We’ll use these values to obtain the inner and outer fences. For our example, the IQR equals 0.222. Consequently, 0.222 * 1.5 = 0.333 and 0.222 * 3 = 0.666. We’ll use 0.333 and 0.666 in the following steps.
- Calculate the inner and outer lower fences. Take the Q1 value and subtract the two values from step 1. The two results are the lower inner and outer outlier fences. For our example, Q1 is 1.714. So, the lower inner fence = 1.714 – 0.333 = 1.381 and the lower outer fence = 1.714 – 0.666 = 1.048.
- Calculate the inner and outer upper fences. Take the Q3 value and add the two values from step 1. The two results are the upper inner and upper outlier fences. For our example, Q3 is 1.936. So, the upper inner fence = 1.936 + 0.333 = 2.269 and the upper outer fence = 1.936 + 0.666 = 2.602.
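The steps above can be sketched in a few lines of Python. I’m plugging in the Q1 and Q3 values reported by the statistical software rather than recomputing them, because different packages calculate quartiles slightly differently. The function names `outlier_fences` and `classify` are hypothetical, chosen just for this example.

```python
def outlier_fences(q1, q3):
    """Return the four outlier fences from the first and third quartiles."""
    iqr = q3 - q1
    return {
        "lower_outer": q1 - 3.0 * iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3.0 * iqr,
    }


def classify(value, fences):
    """Label a value as a major outlier, a minor outlier, or not an outlier."""
    if value < fences["lower_outer"] or value > fences["upper_outer"]:
        return "major"
    if value < fences["lower_inner"] or value > fences["upper_inner"]:
        return "minor"
    return "not an outlier"


# Q1 and Q3 as reported for the example dataset.
fences = outlier_fences(q1=1.714, q3=1.936)
for name, value in fences.items():
    print(f"{name}: {value:.3f}")
```

For instance, `classify(1.8, fences)` returns `"not an outlier"`, while any value above 2.602 comes back as `"major"`.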

### Using the Outlier Fences with Our Example Dataset

For our example dataset, the values for these fences are the following: 1.048, 1.381, 2.269, 2.602. Almost all of our data should fall between the inner fences, which are 1.381 and 2.269. At this point, we look at our data values and determine whether any qualify as being major or minor outliers. Fourteen of the 15 data points fall inside the inner fences, so they are not outliers. The 15th data point falls outside the upper outer fence, making it a major (extreme) outlier.

The IQR method is helpful because it uses percentiles, which do not depend on a specific distribution. Additionally, percentiles are relatively robust to the presence of outliers compared to the other quantitative methods.

Boxplots use the IQR method for calculating the inner fences. Typically, I’ll use boxplots rather than calculating the fences myself when I want to use this approach. Of the quantitative approaches in this post, this is my preferred method.

## Finding Outliers with Hypothesis Tests

You can use hypothesis tests to find outliers. Many outlier tests exist, but I’ll focus on one to illustrate how they work. In this post, I demonstrate Grubbs’ test, which tests the following hypotheses:

- **Null**: All values in the sample were drawn from a single population that follows the same normal distribution.
- **Alternative**: One value in the sample was not drawn from the same normally distributed population as the other values.

If the p-value for this test is less than your significance level, you can reject the null and conclude that one of the values is an outlier. The analysis identifies the value in question.
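For readers who want to see what the software computes, here is the usual textbook form of the two-sided Grubbs’ test. The notation is my own choice for this sketch: $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the sample size, and $\alpha$ is the significance level. The test statistic is the largest absolute deviation from the mean in standard deviation units:

$$G = \frac{\max_i \lvert x_i - \bar{x} \rvert}{s}$$

You reject the null hypothesis at significance level $\alpha$ when

$$G > \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\,n-2}}{n - 2 + t^2_{\alpha/(2n),\,n-2}}}$$

where $t_{\alpha/(2n),\,n-2}$ is the upper critical value of the t-distribution with $n-2$ degrees of freedom at a significance level of $\alpha/(2n)$. Notice that $G$ is simply the largest absolute Z-score in the sample, so it inherits the $(n-1)/\sqrt{n}$ ceiling discussed in the Z-score section.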

Let’s perform this hypothesis test using our sample dataset. Grubbs’ test assumes your data are drawn from a normally distributed population, and it can detect only one outlier. If you suspect you have additional outliers, use a different test.

Grubbs’ outlier test produced a p-value of 0.000. Because it is less than our significance level, we can conclude that our dataset contains an outlier. The output indicates it is the high value we found before.

If you use Grubbs’ test and find an outlier, don’t remove that outlier and perform the analysis again. That process can cause you to remove values that are not outliers.

## Challenges of Using Outlier Hypothesis Tests: Masking and Swamping

When performing an outlier test, you either need to choose a procedure based on the number of outliers or specify the number of outliers for a test. Grubbs’ test checks for only one outlier. However, other procedures, such as the Tietjen-Moore Test, require you to specify the number of outliers. That’s hard to do correctly! After all, you’re performing the test to find outliers! Masking and swamping are two problems that can occur when you specify the incorrect number of outliers in a dataset.

Masking occurs when you specify too few outliers. The additional outliers that exist can affect the test so that it detects no outliers. For example, if you specify one outlier, and there are two, the test can miss both outliers.
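A small sketch can make masking concrete. Instead of a formal test with a p-value, it uses the largest absolute Z-score (the quantity Grubbs’ test is built on) with the ±3 cutoff as a stand-in. The heights are hypothetical, and `max_abs_z` is a name chosen for this example.

```python
from statistics import mean, stdev


def max_abs_z(values):
    """Largest absolute Z-score in the sample (sample standard deviation)."""
    m, s = mean(values), stdev(values)
    return max(abs(v - m) / s for v in values)


# Hypothetical typical heights in meters.
typical = [1.62, 1.68, 1.71, 1.73, 1.75, 1.78, 1.80,
           1.81, 1.83, 1.85, 1.88, 1.90, 1.93, 1.97]

one_outlier = typical + [10.8]
two_outliers = typical + [10.8, 10.5]

print(f"One outlier:  max |Z| = {max_abs_z(one_outlier):.2f}")
print(f"Two outliers: max |Z| = {max_abs_z(two_outliers):.2f}")
```

With one extreme value, the largest |Z| clears 3. Add a second extreme value and the inflated mean and standard deviation pull both Z-scores below 3, so a procedure that looks for a single outlier can come up empty even though two are obviously present.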

Conversely, swamping occurs when you specify too many outliers. In this case, the test identifies too many data points as being outliers. For example, if you specify two outliers, and there is only one, the test might determine that there are two outliers.

Because of these problems, I’m not a big fan of outlier tests. More on this in the next section!

## My Philosophy about Finding Outliers

As you saw, there are many ways to identify outliers. My philosophy is that when analyzing data, you must go into the analysis with in-depth knowledge about all the variables. Part of this knowledge is knowing what values are typical, unusual, and impossible.

I find that when you have this in-depth knowledge, it’s best to use the more straightforward, visual methods. At a glance, data points that are potential outliers will pop out under your knowledgeable gaze. Consequently, I’ll often use boxplots, histograms, and good old-fashioned data sorting! These simple tools provide enough information for me to find unusual data points for further investigation.

Typically, I don’t use Z-scores and hypothesis tests to find outliers because of their various complications. Using outlier tests can be challenging because they usually assume your data follow the normal distribution, and then there’s masking and swamping. Additionally, the existence of outliers makes Z-scores less extreme. It’s ironic, but these methods for identifying outliers are actually sensitive to the presence of outliers! Fortunately, as long as researchers use a simple method to display unusual values, a knowledgeable analyst is likely to know which values need further investigation.

In my view, the more formal statistical tests and calculations are overkill because they can’t definitively identify outliers. Ultimately, analysts must investigate unusual values and use their expertise to determine whether they are legitimate data points. Statistical procedures don’t know the subject matter or the data collection process and can’t make the final determination. You should not include or exclude an observation based entirely on the results of a hypothesis test or statistical measure.

At this stage of the analysis, we’re only identifying potential outliers for further investigation. It’s just the first step in handling them. If we err, we want to err on the side of investigating too many values rather than too few.

In my next post, I’ll explain what you’re looking for when investigating outliers and how that helps you determine whether to remove them from your dataset. Not all outliers are bad and some should not be deleted. In fact, outliers can be very informative about the subject-area and data collection process. It’s important to understand how outliers occur and whether they might happen again as a normal part of the process or study area.

Read my Guidelines for Removing and Handling Outliers.

### Reference

Ronald E. Shiffler (1988) Maximum Z Scores and Outliers, *The American Statistician*, 42:1, 79-80, DOI: 10.1080/00031305.1988.10475530

Suruchi Sarvate says

Hi Jim,

I have a dataset with 11 columns and I have written a common function detect_outliers() to find outliers in the columns.

For the first 6 columns, the function works, but for the remaining 5 columns it returns an empty list even though those columns have outliers. You can see the code below:

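```python
import numpy as np


def detect_outliers(data):
    """Return the values whose absolute Z-score exceeds the threshold."""
    outliers = []
    threshold = 3
    mean = np.mean(data)
    std = np.std(data)  # NumPy's default is the population SD (ddof=0)
    for i in data:
        z_score = (i - mean) / std
        print(z_score)  # debug output: show each Z-score
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers
```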

As you can see, I have set “threshold = 3”. For the first six columns, the function works because z_score>3 for the outliers.

But for the rest of the columns, z_score for the outliers is only greater than 1 (z_score>1), so the threshold would have to be 1 for the remaining five columns.

Here I have only 11 columns in the dataset. But what if I have 1000 columns in my dataset? In that case, I can’t check the threshold for each and every column.

Please!!!! help me and reply at the earliest.

Jim Frost says

Hi Suruchi,

Why would you use a Z-score of 1 to detect outliers? I’m not sure why it’s not working but with such a low threshold you should have more detections. How many observations per column?

Rutvij says

Hi Jim, Thanks for sharing details on outliers. I have one question, happy if you can advise me.

My question is this: I am building a machine learning model, and I have a training dataset and a testing dataset. I removed outliers from the training dataset and built an ML model that performs well. However, the testing dataset (which I have to submit as it is) has a large number of outliers.

As a result, my ML model performs worse when I apply it to the unseen test dataset (with outliers).

Can you please advise me on how to achieve better performance on the test dataset? Plus, I don’t want to lose any observed values in the test dataset.

Thanks.

Rutvij

Narasimha Patro says

How to treat outliers ??? Please help me

Jim Frost says

Hi Narasimha,

Read my follow up post to this one: Guidelines for Removing and Handling Outliers.

Denny Fernandez del Viso says

I usually use a Q-Q plot to detect outliers – just a visualization of what you suggest as using the Z-score.

Jim Frost says

Hi Denny,

Thanks for the suggestion. Unusual Z-scores might stand out more in a plot than a list. Just be aware of the constraints on Z-scores in small samples and the fact that Z-scores themselves are sensitive to outliers.

Brion Hurley says

I haven’t seen this formula before related to Z-scores: (n−1) / √ n

Can you share more details about where this comes from? It’s not intuitive to me at first glance

Jim Frost says

Hi Brion,

I’ve added a reference to this post for this formula. The referenced article discusses this limitation in the context of finding outliers and it includes references to other sources where the limit was derived. In a nutshell, maximizing Z-scores depends on minimizing the standard deviation (or variance). As I showed earlier in this post, the outlier is far from the mean score. While it increases the mean, it drastically increases the standard deviation. The net result of both increases is that it limits the maximum Z-value. In small samples, this limitation is even greater and severely constrains the maximum absolute Z-scores.

In general, an outlier pulls the mean towards it and inflates the standard deviation. Both effects reduce its Z-score. Indeed, our outlier’s Z-score of ~3.6 is greater than 3, but just barely. The Z-score seems to indicate that the value is just across the boundary for being an outlier. However, it’s a severe outlier when you observe how unusual it truly is. Both the boxplot and IQR method make this clear. And, simply comparing the value to reasonable values, it is very far beyond legitimately possible values for human height.

The article uses an example of a dataset with 5 values {0, 0, 0, 0, 1 million}. The Z-score for the value of 1 million is only 1.789! Not an outlier using Z-scores!

To quote the article, “The concept of a Z score as a measure of a value’s position within a data set in terms of standard deviations is intuitively appealing. Unfortunately, the behavior of Z is quite constrained for small data sets.”

To illustrate this constraint, I’m including the table below that lists the maximum absolute Z-scores by sample size. Note how absolute Z-scores can exceed 3 only when the sample size is 11 and greater.
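If you’d like to reproduce these limits yourself, the short Python sketch below computes (n−1) / √n for a range of sample sizes and checks the five-value example from the article. It assumes the sample standard deviation (the *n*−1 version), and the function name `max_abs_z` is just a label for this example.

```python
from math import sqrt
from statistics import mean, stdev


def max_abs_z(n):
    """Largest absolute Z-score possible in a sample of n observations."""
    return (n - 1) / sqrt(n)


print(" n  max |Z|")
for n in range(3, 16):
    print(f"{n:2d}  {max_abs_z(n):.3f}")

# Smallest sample size in which any Z-score can exceed 3.
smallest_n = next(n for n in range(2, 100) if max_abs_z(n) > 3)
print("Smallest n with max |Z| > 3:", smallest_n)

# The article's example: four zeros and one value of 1 million.
example = [0, 0, 0, 0, 1_000_000]
z_million = (example[-1] - mean(example)) / stdev(example)
print(f"Z-score of 1 million: {z_million:.3f}")
```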

I hope this helps!

Fernando Augusto Deheza Zambrana says

Excellent work. Congratulations