Robust statistics provide valid results across a broad variety of conditions, including assumption violations, the presence of outliers, and various other problems. The term “robust statistic” applies both to a statistic (e.g., the median) and to statistical analyses (e.g., hypothesis tests and regression).
Huber (1982) defined these statistics as being “distributionally robust and outlier-resistant.”
Conversely, non-robust statistics are sensitive to less-than-ideal conditions.
In this post, learn about robust statistics and analyses.
The mean, median, standard deviation, and interquartile range are sample statistics that estimate their corresponding population values. Ideally, the sample values will be relatively close to the population value and will not be systematically too high or too low (i.e., unbiased).
Unfortunately, outliers and extreme values in the long tail of a skewed distribution can cause some sample statistics to become biased, poor-quality estimates. What does that mean? The sample statistics will be systematically too high or too low and move further away from the correct value.
Conversely, a robust statistic will be efficient, have only a slight bias, and be asymptotically unbiased as sample size increases when there are outliers and extreme values in long tails.
In plain English, when outliers and long tails are present, robust statistics will be reasonably close to the correct value given your sample size, and they will not systematically over- or under-estimate the population value. Additionally, as the sample size increases, the statistic approaches becoming fully unbiased.
Robust statistics resist the influence of outliers and long tails. They work well in a wide variety of probability distributions, particularly non-normal distributions.
Related post: How to Identify the Distribution of Your Data
The Breakdown Point and Robustness
An intuitive way to understand the robustness of a statistic is to consider how many data points in a sample you can replace with artificial outliers before the sample statistic becomes a poor estimate.
Statisticians refer to this as the breakdown point. That’s the maximum percentage of observations you can replace with outliers before causing unbounded changes in the estimate. Higher breakdown points correspond to more robust statistics.
Let’s work through examples with the mean and median.
The calculations for the mean involve all data points. Consequently, a single outlier can drastically affect the mean. Imagine we have the following dataset: 50, 52, 55, 56, 59, 59, 60. If we change one of the values to 1000, it’ll have a huge impact on the mean! Theoretically, the effect is unbounded because we could force the mean to be any value we choose by adjusting one value in the dataset. The breakdown point for the mean is 1/n, where n is the sample size, so it shrinks toward 0% as the sample grows. The mean is not a robust statistic.
Conversely, the median is a robust statistic because it has a breakdown point of 50%. You can alter almost half of the observations before producing unbounded changes. Using the same dataset: 50, 52, 55, 56, 59, 59, 60, if we change the 60 to 1000, the median is entirely unaffected. It’s still 56 in both cases.
Consequently, the median is a robust statistic for central tendency while the mean is not. In the graph below, notice that the median is near the most common values while the mean is getting pulled away by the long tail of the skewed distribution.
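To make the mean versus median comparison concrete, here is a minimal Python sketch that uses the same seven values from the example above. The helper name median_after_replacing and the stand-in outlier value BIG are illustrative choices for this sketch, not part of any standard method.

```python
from statistics import mean, median

original = [50, 52, 55, 56, 59, 59, 60]
with_outlier = [50, 52, 55, 56, 59, 59, 1000]  # the 60 is replaced by 1000

mean_before, mean_after = mean(original), mean(with_outlier)
median_before, median_after = median(original), median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")

# Hypothetical helper: replace the k largest values with an arbitrarily
# large number (BIG) and see when the median finally gives way.
BIG = 10**9

def median_after_replacing(data, k):
    kept = sorted(data)[: len(data) - k]  # keep the n - k smallest values
    return median(kept + [BIG] * k)

breakdown = {k: median_after_replacing(original, k) for k in range(len(original) + 1)}
for k, value in breakdown.items():
    print(f"{k} of {len(original)} values replaced -> median = {value}")
```

In this sketch, the median stays at 56 while three of the seven values (under half) are replaced, and it jumps to the outlier value once four are replaced. The mean moves as soon as a single value changes.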
Robust Statistics for Variation
There are several common measures of variability, including the standard deviation, range, and interquartile range. Which statistics are robust?
The standard deviation is similar to the mean because its calculations include all values in the data set. A single outlier can drastically affect this statistic. Therefore, it is not robust.
The range is the difference between the highest and lowest value in the dataset. If you have a single unusually high or low value, it can greatly impact the range. It’s also not robust.
The interquartile range (IQR) measures the spread of the middle half of your dataset. It is the distance between the first and third quartiles. It is similar to the median in that you can replace many values without altering the IQR. It has a breakdown point of 25%. Consequently, of these three measures, the interquartile range is the most robust statistic.
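The same dataset can illustrate how the three measures of variability react to one outlier. This is a rough sketch that assumes the default quartile method in Python's statistics.quantiles; other software uses slightly different quartile conventions, so IQR values can differ a little between tools. The function name spread_measures is made up for this example.

```python
from statistics import quantiles, stdev

def spread_measures(data):
    # quantiles(data, n=4) returns the three quartile cut points (Q1, Q2, Q3)
    q1, _, q3 = quantiles(data, n=4)
    return {
        "sd": stdev(data),                 # sample standard deviation
        "range": max(data) - min(data),
        "iqr": q3 - q1,
    }

clean = spread_measures([50, 52, 55, 56, 59, 59, 60])
contaminated = spread_measures([50, 52, 55, 56, 59, 59, 1000])

for name in ("sd", "range", "iqr"):
    print(f"{name:>5}: {clean[name]:.2f} -> {contaminated[name]:.2f}")
```

With this quartile convention, the IQR is identical in both versions of the data, while the standard deviation and the range both grow dramatically after the single change.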
Related post: Measures of Variability
What are Robust Statistical Analyses?
Robust statistical analyses can produce valid results even when the ideal conditions do not exist with real-world data. These analyses perform well when the sample data follow a variety of distributions and have unusual values. In other words, you can trust the results even when the assumptions are not fully satisfied.
For example, parametric hypothesis tests that assess the mean, such as t-tests and ANOVA, assume the data follow a normal distribution. However, these tests are robust to deviations from the normal distribution when your sample size per group is sufficiently large, thanks to the central limit theorem.
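A small simulation can hint at why larger samples help. The sketch below does not run a t-test. It only checks one ingredient of the argument: the sampling distribution of the mean becomes more symmetric as the sample size grows, even when the raw data are strongly skewed. The exponential distribution, the sample sizes of 5 and 100, the number of repetitions, and the seed are all arbitrary assumptions chosen for illustration.

```python
import random

def skewness(values):
    # Moment-based skewness: 0 for a perfectly symmetric distribution
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / n
    return sum((v - m) ** 3 for v in values) / n / var ** 1.5

def skewness_of_sample_means(n, reps=4000, seed=1):
    # Draw `reps` samples of size n from a skewed (exponential) population
    # and measure how skewed the resulting sample means are.
    rng = random.Random(seed)
    means = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]
    return skewness(means)

skew_small = skewness_of_sample_means(5)
skew_large = skewness_of_sample_means(100)
print(f"skewness of sample means, n = 5:   {skew_small:.2f}")
print(f"skewness of sample means, n = 100: {skew_large:.2f}")
```

The sample means for n = 100 should be much closer to symmetric than those for n = 5, which is the behavior the central limit theorem predicts.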
Similarly, nonparametric analyses assess the median and are robust distributionally because they don’t assume the data follow any particular distribution. Additionally, like the median, nonparametric analyses resist the effects of outliers.
Related post: Nonparametric vs. Parametric Analyses
Robust regression is a type of regression analysis that statisticians designed to avoid problems associated with ordinary least squares (OLS). Outliers can invalidate OLS results, while robust regression can handle them. It can also deal with heteroscedasticity, which occurs when the residuals have a non-constant variance.
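As a rough illustration of the outlier problem, the sketch below compares an OLS fit with the Theil-Sen estimator, which takes the median of all pairwise slopes. Theil-Sen is just one robust regression method, picked here because it fits in a few lines; the post does not single out a specific method. The data are invented (points on the line y = 2x + 1 with one corrupted value), and the example covers outliers only, not heteroscedasticity.

```python
from itertools import combinations
from statistics import median

def ols_fit(x, y):
    # Ordinary least squares for a single predictor
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def theil_sen_fit(x, y):
    # Slope = median of the slopes between every pair of points;
    # intercept = median of the residual offsets y - slope * x
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[i] != x[j]
    ]
    slope = median(slopes)
    intercept = median([yi - slope * xi for xi, yi in zip(x, y)])
    return slope, intercept

x = list(range(10))
y = [2 * xi + 1 for xi in x]  # points exactly on y = 2x + 1
y[-1] = 100                   # corrupt the last observation

ols_slope, ols_intercept = ols_fit(x, y)
ts_slope, ts_intercept = theil_sen_fit(x, y)
print(f"OLS:       slope = {ols_slope:.2f}, intercept = {ols_intercept:.2f}")
print(f"Theil-Sen: slope = {ts_slope:.2f}, intercept = {ts_intercept:.2f}")
```

In this constructed example, the single outlier pulls the OLS slope far above 2, while the Theil-Sen fit recovers the original slope and intercept.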
Be sure you know which assumption violations each statistical analysis is robust to. For example, while traditional t-tests and ANOVAs can manage violations of the normality assumption, they cannot resist the effects of outliers. Nonparametric tests don’t require a specific distribution, but the various groups in your analysis must have the same dispersion. Hence, nonparametric tests are not robust to violations of the equal variances assumption.
Robust statistical analyses might be resistant to particular assumption violations and yet be sensitive to other breaches.