Robust statistics provide valid results across a broad variety of conditions, including assumption violations, the presence of outliers, and various other problems. The term “robust statistic” applies both to a statistic (e.g., the median) and to statistical analyses (e.g., hypothesis tests and regression).
Huber (1981) defined these statistics as being “distributionally robust and outlier-resistant.”
Conversely, non-robust statistics are sensitive to less-than-ideal conditions.
In this post, learn about robust statistics and analyses.
Robust Statistics
The mean, median, standard deviation, and interquartile range are sample statistics that estimate their corresponding population values. Ideally, the sample values will be relatively close to the population value and will not be systematically too high or too low (i.e., unbiased).
Unfortunately, outliers and extreme values in the long tail of a skewed distribution can cause some sample statistics to become biased, poor-quality estimates. What does that mean? The sample statistics will be systematically too high or too low and move further away from the correct value.
Conversely, a robust statistic will be efficient, have only a slight bias, and be asymptotically unbiased as the sample size increases when there are outliers and extreme values in long tails.
In plain English, when outliers and long tails are present, a robust statistic will be reasonably close to the correct value given your sample size and will not systematically over- or underestimate the population value. Additionally, as the sample size increases, the bias shrinks toward zero.
Robust statistics resist the influence of outliers and long-tails. They work well in a wide variety of probability distributions, particularly non-normal distributions.
Related post: How to Identify the Distribution of Your Data
The Breakdown Point and Robustness
An intuitive way to understand the robustness of a statistic is to consider how many data points in a sample you can replace with artificial outliers before the sample statistic becomes a poor estimate.
Statisticians refer to this as the breakdown point. That’s the maximum percentage of observations you can replace with outliers before causing unbounded changes in the estimate. Higher breakdown points correspond to more robust statistics.
Let’s work through examples with the mean and median.
The calculations for the mean involve all data points. Consequently, a single outlier can drastically affect the mean. Imagine we have the following dataset: 50, 52, 55, 56, 59, 59, 60. Its mean is about 55.9. If we change the 60 to 1000, the mean jumps to about 190.1! Theoretically, the effect is unbounded because we could force the mean to be any value we choose by adjusting one value in the dataset. The breakdown point for the mean is 1/n, which approaches zero as the sample size grows. The mean is not a robust statistic.
Conversely, the median is a robust statistic because it has a breakdown point of 50%. You can alter up to 50% of the observations before producing unbounded changes. Using the same dataset: 50, 52, 55, 56, 59, 59, 60, if we change the 60 to 1000, the median is entirely unaffected: it’s still 56 in both cases.
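To make that concrete, here’s a quick sketch in Python using the built-in statistics module and the dataset from the examples above:

```python
from statistics import mean, median

data = [50, 52, 55, 56, 59, 59, 60]
contaminated = [50, 52, 55, 56, 59, 59, 1000]  # the 60 replaced with an outlier

print(mean(data), median(data))                  # ~55.9 and 56
print(mean(contaminated), median(contaminated))  # ~190.1 and 56
```

One replaced value drags the mean from about 55.9 to about 190.1, while the median doesn’t budge.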
Consequently, the median is a robust statistic for central tendency while the mean is not. In the graph below, notice that the median is near the most common values while the mean is pulled away by the long tail of the skewed distribution.
The trimmed mean is a measure of central tendency that is more robust than the regular mean but less robust than the median. Learn more about when and how to use it in my article, Trimmed Mean: Definition, Calculating & Benefits.
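As a rough illustration of trimming, here’s a minimal sketch using SciPy’s trim_mean function on the contaminated dataset from above; the 20% trim proportion is an arbitrary choice for this example:

```python
from scipy.stats import trim_mean

contaminated = [50, 52, 55, 56, 59, 59, 1000]

# Drop 20% of the observations from each tail, then average what's left.
# With 7 values, that trims the lowest and highest points, including the outlier.
print(trim_mean(contaminated, 0.2))  # 56.2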
Related posts: Measures of Central Tendency and Five Ways to Find Outliers
Robust Statistics for Variation
There are several common measures of variability, including the standard deviation, range, and interquartile range. Which statistics are robust?
The standard deviation is similar to the mean because its calculations include all values in the data set. A single outlier can drastically affect this statistic. Therefore, it is not robust.
The range is the difference between the highest and lowest value in the dataset. If you have a single unusually high or low value, it can greatly impact the range. It’s also not robust.
The interquartile range (IQR) is the middle half of your dataset. It is similar to the median in that you can replace many values without altering the IQR. It has a breakdown point of 25%. Consequently, of these three measures, the interquartile range is the most robust statistic.
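Here’s a quick sketch with NumPy comparing the standard deviation and the IQR on the same contaminated dataset from earlier (percentile interpolation details vary slightly across implementations):

```python
import numpy as np

data = [50, 52, 55, 56, 59, 59, 60]
contaminated = [50, 52, 55, 56, 59, 59, 1000]

for sample in (data, contaminated):
    q1, q3 = np.percentile(sample, [25, 75])
    print(f"standard deviation = {np.std(sample, ddof=1):7.1f}, IQR = {q3 - q1}")
```

The single outlier inflates the standard deviation from roughly 3.8 to roughly 357, while the IQR stays at 5.5 in both cases.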
Related post: Measures of Variability
What are Robust Statistical Analyses?
Robust statistical analyses can produce valid results even when real-world data fall short of ideal conditions. These analyses perform well when the sample data follow a variety of distributions and contain unusual values. In other words, you can trust the results even when the assumptions are not fully satisfied.
For example, parametric hypothesis tests that assess the mean, such as t-tests and ANOVA, assume the data follow a normal distribution. However, these tests are robust to deviations from the normal distribution when your sample size per group is sufficiently large, thanks to the central limit theorem.
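You can see the central limit theorem at work with a small simulation. The sketch below (the sample size, seed, and exponential distribution are arbitrary choices for illustration) estimates the type I error rate of a two-sample t-test when both groups come from the same strongly skewed population:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_per_group, trials = 50, 10_000

# Both groups come from the same skewed (exponential) population,
# so every rejection below is a false positive.
false_positives = 0
for _ in range(trials):
    a = rng.exponential(scale=1.0, size=n_per_group)
    b = rng.exponential(scale=1.0, size=n_per_group)
    if ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(false_positives / trials)  # stays close to the nominal 0.05 despite the skew
```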
Similarly, nonparametric analyses assess the median and are robust distributionally because they don’t assume the data follow any particular distribution. Additionally, like the median, nonparametric analyses resist the effects of outliers.
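Here’s a hypothetical sketch comparing the two approaches when one group contains a gross outlier (the numbers are invented for illustration):

```python
from scipy.stats import mannwhitneyu, ttest_ind

group_a = [50, 52, 55, 56, 59, 59, 60]
group_b = [61, 63, 64, 66, 68, 70, 1000]  # every value exceeds group_a's maximum

# The t-test's pooled variance is inflated by the outlier, masking the shift.
print(ttest_ind(group_a, group_b).pvalue)                              # ~0.3
# The rank-based Mann-Whitney U test only sees that 1000 is the largest value.
print(mannwhitneyu(group_a, group_b, alternative="two-sided").pvalue)  # <0.001
```

Because the nonparametric test works with ranks, the outlier counts the same as any other large value, so the clear separation between the groups still shows up.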
Related post: Nonparametric vs. Parametric Analyses
Robust regression is a type of regression analysis that statisticians designed to avoid problems associated with ordinary least squares (OLS). Outliers can invalidate OLS results, while robust regression can handle them. It can also deal with heteroscedasticity, which occurs when the residuals have a non-constant variance.
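There are several robust regression methods; as one sketch, the snippet below uses statsmodels’ RLM with Huber’s M-estimator on simulated data (the true slope, noise level, and injected outliers are all made up for the example):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(scale=1.0, size=x.size)
y[-3:] += 40  # inject a few gross outliers

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
rlm_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print("OLS slope:", ols_fit.params[1])  # pulled up by the outliers
print("RLM slope:", rlm_fit.params[1])  # stays near the true value of 3
```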
Related posts: OLS Assumptions and Heteroscedasticity in Regression
Be sure to know which assumption violations each statistical analysis is robust against. For example, while traditional t-tests and ANOVAs can handle violations of the normality assumption, they cannot resist the effects of outliers. Nonparametric tests don’t require a specific distribution, but the various groups in your analysis must have the same dispersion. Hence, nonparametric tests are not robust to violations of the equal variances assumption.
Robust statistical analyses might be resistant to particular assumption violations and yet be sensitive to other breaches.
Reference
Huber, P. J. (1981). Robust Statistics. Wiley, New York. 308 p.
I would like to know if an outlier test needs to be applied to a dataset where I wish to use the median as a robust measure. In other words, if I want to use the median, can I just include all the data, including the potential outliers?
Hi Allan,
To start, I’m not a big fan of outlier tests as a way of finding them. To understand why, read my post How to Identify Outliers.
But on to your question! First, you should reorder your process. Don’t start with the median in mind and then use it to make decisions about handling outliers. Instead, make decisions about handling outliers first, and then determine whether the median or mean is appropriate.
There are basically two scenarios here:
If any of these outliers are valid data that are just unusual, then, yes, you can and should include all of them when calculating the median.
If the outliers are invalid data, which is what makes them unusual, then you should remove them from the dataset even when you’re finding the median. However, the reason for removing them is that they’re invalid and should not be used for anything, including the median.
If ALL the outliers are invalid and you remove all of them, then you might consider using the mean instead of the median. But, if all or some of the outliers are valid data points, then consider using the median.
Generally, when you’re working with outliers, that’s the determination you need to make: do the extreme values represent valid data that are unusual, or are they invalid? From there, you can decide what to do. If at least some are valid and you need to leave them in, consider finding the median. If they’re ALL invalid and you need to remove all of them, consider finding the mean.
I write more about Guidelines for Removing and Handling Outliers, which I think will be helpful.
Great posts, and the comments section helped me get clarity on the doubts that arose in my mind when I read the post.
Sorry Jim, I also have another question. When I carry out a statistical analysis, I often feel like I’ve missed something; sometimes I don’t know exactly what I need to do next. Do you have a full, step-by-step process for analysis? I think a full process would help me a lot in analyzing. Thanks!!
Hey Jim,
Thanks for the post. It sounds like the median is always more robust than the mean, so what’s the point of using the mean? It almost seems like there are no downsides to using the median, whereas with the mean there could be several downsides if your data are skewed.
Hi Michael,
That’s a great question!
If you’re just reporting the mean or median for a sample or population, I’d say you’re right. You won’t go wrong with the median. However, most statisticians prefer the mean when it is a valid measure because its calculations involve all the data points, whereas the median throws most of them out! I’d lean in your direction, though. The downside is that reporting the median might confuse some people. I guess the takeaway is, even though the median always works, stick with the mean when it’s valid because it’s the convention and it’s an equally good measure. And it’s easy enough to tell when it’s valid: just graph your data using a histogram.
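(If it helps, here’s a bare-bones sketch in Python with Matplotlib; the data are made up:)

```python
import matplotlib.pyplot as plt

data = [50, 52, 55, 56, 59, 59, 60, 61, 63, 72, 95]  # hypothetical sample
plt.hist(data, bins=6)  # roughly symmetric? the mean is fine. long tail? lean toward the median
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
```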
However, in other situations, the mean does have advantages. When it comes to hypothesis testing, parametric tests for means have advantages over nonparametric tests for the median. Consequently, many studies and journal articles will need to talk about means rather than medians because they’re performing specific hypothesis tests.
hi Jim
I’m confused that you say, “In other words, you can trust the results even when the assumptions are not fully satisfied,” but also, “Be sure to know which assumption violations each statistical analysis is robust against.” Please give me more information (sorry if I got lost somewhere).
Hi Nam,
A statistical analysis can be robust against one assumption violation but not another. For example, the traditional F-test ANOVA is robust against violations of the normality assumption when your group sample sizes are large enough. However, it is NOT robust against violations of the equal variances assumption, even with very large samples. Consequently, it’s important to know how an analysis is robust because it might not be robust against everything!
Hei Jim,
I just want to thank you for your posts. I work in fetal medicine with a 50 percent research position. These posts have opened my eyes several times and stimulated me to read more and push my knowledge further.
Of all the emails coming in, yours are the ones I really look forward to getting. For me, even repetitions of themes with only small extras are nice to read. I might, for once, even feel statistically advanced. So thank you for sharing, stay safe, and please keep it going👍
Hi Alexander,
Thanks so much for the kind words. They keep me motivated to keep going! 🙂
Many thanks for sharing such statistical clarity. I’m always reading your posts. STAY SAFE
Thank you so much, Berhanu! You stay safe as well.