
Statistics By Jim

Making statistics intuitive


What are Robust Statistics?

By Jim Frost 10 Comments

Robust statistics provide valid results across a broad variety of conditions, including assumption violations, the presence of outliers, and various other problems. The term “robust statistic” applies both to individual statistics (e.g., the median) and to statistical analyses (e.g., hypothesis tests and regression).

Huber (1982) defined these statistics as being “distributionally robust and outlier-resistant.”

Conversely, non-robust statistics are sensitive to less than ideal conditions.

In this post, learn about robust statistics and analyses.

Robust Statistics

The mean, median, standard deviation, and interquartile range are sample statistics that estimate their corresponding population values. Ideally, the sample values will be relatively close to the population value and will not be systematically too high or too low (i.e., unbiased).

Unfortunately, outliers and extreme values in the long tail of a skewed distribution can cause some sample statistics to become biased, poor-quality estimates. What does that mean? The sample statistics will be systematically too high or too low and will move further away from the correct value.

Conversely, when there are outliers and extreme values in long tails, a robust statistic will be efficient, have only a slight bias, and be asymptotically unbiased as the sample size increases.

In plain English, when outliers and long tails are present, a robust statistic will be reasonably close to the correct value given your sample size, and it will not systematically over- or underestimate the population value. Additionally, as the sample size increases, the statistic approaches being fully unbiased.

Robust statistics resist the influence of outliers and long-tails. They work well in a wide variety of probability distributions, particularly non-normal distributions.

Related post: How to Identify the Distribution of Your Data

The Breakdown Point and Robustness

An intuitive way to understand the robustness of a statistic is to consider how many data points in a sample you can replace with artificial outliers before the sample statistic becomes a poor estimate.

Statisticians refer to this as the breakdown point. That’s the maximum percentage of observations you can replace with outliers before causing unbounded changes in the estimate. Higher breakdown points correspond to more robust statistics.

Let’s work through examples with the mean and median.

The calculations for the mean involve all data points. Consequently, a single outlier can drastically affect the mean. Imagine we have the following dataset: 50, 52, 55, 56, 59, 59, 60. If we change one of the values to 1000, it’ll have a huge impact on the mean! Theoretically, the effect is unbounded because we could force the mean to be any value we choose by adjusting one value in the dataset. The breakdown point for the mean is 1/n. The mean is not a robust statistic.
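To see this numerically, here's a quick Python sketch using the dataset above (illustrative code, not from the original post):

```python
from statistics import mean

data = [50, 52, 55, 56, 59, 59, 60]
contaminated = [50, 52, 55, 56, 59, 59, 1000]  # one value replaced by an outlier

print(mean(data))          # about 55.9
print(mean(contaminated))  # about 190.1, a single point dragged the mean far away
```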

Conversely, the median is a robust statistic because it has a breakdown point of 50%. You can alter up to 50% of the observations before producing unbounded changes. Using the same dataset: 50, 52, 55, 56, 59, 59, 60, if we changed the 60 to 1000, the median is entirely unaffected. It’s still 56 in both cases.
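Running the same check for the median (again, just an illustrative sketch) confirms it shrugs off the contamination:

```python
from statistics import median

data = [50, 52, 55, 56, 59, 59, 60]
contaminated = [50, 52, 55, 56, 59, 59, 1000]  # the 60 replaced by an outlier

print(median(data))          # 56
print(median(contaminated))  # 56, the outlier has no effect at all
```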

Consequently, the median is a robust statistic for central tendency while the mean is not. In the graph below, notice that the median is near the most common values while the mean is pulled away by the long tail of the skewed distribution.

Graph showing how the median is a robust statistic because it finds the center better than the mean in a skewed distribution.

Related posts: Measures of Central Tendency and Five Ways to Find Outliers

Robust Statistics for Variation

There are several common measures of variability, including the standard deviation, range, and interquartile range. Which statistics are robust?

The standard deviation is similar to the mean because its calculations include all values in the data set. A single outlier can drastically affect this statistic. Therefore, it is not robust.

The range is the difference between the highest and lowest value in the dataset. If you have a single unusually high or low value, it can greatly impact the range. It’s also not robust.

The interquartile range (IQR) covers the middle half of your dataset. It is similar to the median in that you can replace many values without altering the IQR. It has a breakdown point of 25%. Consequently, of these three measures, the interquartile range is the most robust statistic.
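A small sketch comparing the three measures on the same contaminated dataset from earlier. Note one assumption: the quartiles here use Python's 'inclusive' method (equivalent to NumPy's default linear interpolation); other quartile definitions give slightly different values.

```python
from statistics import stdev, quantiles

def iqr(xs):
    # Quartiles via the 'inclusive' method; other methods differ slightly.
    q1, _, q3 = quantiles(xs, n=4, method='inclusive')
    return q3 - q1

data = [50, 52, 55, 56, 59, 59, 60]
contaminated = [50, 52, 55, 56, 59, 59, 1000]

print(max(data) - min(data), max(contaminated) - min(contaminated))  # 10 vs 950
print(round(stdev(data), 1), round(stdev(contaminated), 1))          # explodes
print(iqr(data), iqr(contaminated))                                  # unchanged
```

The range and standard deviation balloon with the single outlier, while the IQR does not move at all.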

Related post: Measures of Variability

What are Robust Statistical Analyses?

Robust statistical analyses can produce valid results even when the ideal conditions do not exist with real-world data. These analyses perform well when the sample data follow a variety of distributions and have unusual values. In other words, you can trust the results even when the assumptions are not fully satisfied.

For example, parametric hypothesis tests that assess the mean, such as t-tests and ANOVA, assume the data follow a normal distribution. However, these tests are robust to deviations from the normal distribution when your sample size per group is sufficiently large, thanks to the central limit theorem.
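A quick simulation can illustrate the central limit theorem at work. The exponential population and the sample sizes below are my choices for illustration, not from the post: the sampling distribution of the mean loses its skew as n grows, which is why tests for means tolerate non-normal data at larger sample sizes.

```python
import random
from statistics import mean, stdev

random.seed(1)

def skewness(xs):
    # Standardized third moment of a sample.
    m, s = mean(xs), stdev(xs)
    return mean([((x - m) / s) ** 3 for x in xs])

def sample_mean_skew(n, sims=5000):
    # Skewness of the distribution of the sample mean of n draws
    # from a strongly skewed (exponential) population.
    means = [mean(random.expovariate(1.0) for _ in range(n))
             for _ in range(sims)]
    return skewness(means)

print(sample_mean_skew(5))    # noticeably skewed
print(sample_mean_skew(100))  # much closer to symmetric
```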

Similarly, nonparametric analyses assess the median and are robust distributionally because they don’t assume the data follow any particular distribution. Additionally, like the median, nonparametric analyses resist the effects of outliers.

Related post: Nonparametric vs. Parametric Analyses

Robust regression is a type of regression analysis that statisticians designed to avoid problems associated with ordinary least squares (OLS). Outliers can invalidate OLS results, while robust regression can handle them. It can also deal with heteroscedasticity, which occurs when the residuals have a non-constant variance.
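The post doesn't name a specific robust regression method (statisticians use several, such as M-estimators). As one illustrative sketch, here's the Theil–Sen estimator (the median of all pairwise slopes) applied to a made-up dataset with a single outlier, compared against ordinary least squares:

```python
from itertools import combinations
from statistics import mean, median

# Made-up data: y = 2x exactly, except one wild outlier at the end.
x = list(range(10))
y = [2 * xi for xi in x]
y[-1] = 100  # outlier (the true value would be 18)

def ols_slope(x, y):
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

def theil_sen_slope(x, y):
    # Median of all pairwise slopes resists the outlier.
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)]
    return median(slopes)

print(round(ols_slope(x, y), 2))  # dragged far from the true slope of 2
print(theil_sen_slope(x, y))      # 2.0
```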

Related posts: OLS Assumptions and Heteroscedasticity in Regression

Be sure to know which conditions each statistical analysis is robust against. For example, while traditional t-tests and ANOVA can handle violations of the normality assumption, they cannot resist the effects of outliers. Nonparametric tests don’t require a specific distribution, but the various groups in your analysis must have the same dispersion. Hence, nonparametric tests are not robust to violations of the equal variances assumption.

Robust statistical analyses might be resistant to particular assumption violations and yet be sensitive to other breaches.

Reference

Huber, P. J. (1982). Robust Statistics. Wiley, New York.


Comments

  1. Jereesh K Elias says

    December 16, 2021 at 11:32 am

    Great posts, and the comments section helped me get clarity on the doubts that arose in my mind when I read the post.

  2. Nam Duong says

    September 15, 2021 at 1:35 am

    Sorry Jim, I also have another question: when I carry out a statistical analysis, I often feel that I’ve lost something, meaning that sometimes I don’t know exactly what to do next. Do you have a full, step-by-step analysis process? I think a full process would help me a lot in analyzing. Thanks!!

  3. Michael says

    September 14, 2021 at 7:52 pm

    Hey Jim,

    Thanks for the post. It sounds like the median is always more robust than the mean—so what’s the point of using the mean then? It almost seems like there are no downsides to using the median, whereas the mean could have several downsides if your data are skewed.

    • Jim Frost says

      September 14, 2021 at 8:09 pm

      Hi Michael,

      That’s a great question!

      If you’re just reporting the mean or median for a sample or population, I’d say you’re right. You won’t go wrong with the median. However, most statisticians prefer the mean when it is a valid measure because its calculations involve all the data points, whereas the median throws most of them out! I’d lean in your direction though. The downside is that reporting the median might confuse some people. I guess the takeaway is, even though the median always works, stick with the mean when it’s valid because it’s the convention and it’s an equally good measure. And it’s easy enough to tell when it’s valid—just graph your data using a histogram.

      However, in other situations, the mean does have advantages. When it comes to hypothesis testing, parametric tests for means have advantages over nonparametric tests for the median. Consequently, many studies and journal articles will need to talk about means rather than medians because they’re performing specific hypothesis tests.

  4. Nam Duong says

    September 10, 2021 at 8:55 am

    hi Jim

    I’m confused that you say “In other words, you can trust the results even when the assumptions are not fully satisfied.” but also “Be sure to know for which properties each statistical analysis is robust.” I’m really confused here. Please give me more information (sorry if I got lost somewhere).

    • Jim Frost says

      September 12, 2021 at 12:43 am

      Hi Nam,

      A statistical analysis can be robust against one assumption violation but not another. For example, the traditional F-test ANOVA is robust against violations of the normality assumption when your group sample sizes are large enough. However, it is NOT robust against violations of the equal variances assumption even with very large samples. Consequently, it’s important to know in which ways an analysis is robust because it might not be robust against everything!

  5. ALEXANDER VIETHEER says

    September 10, 2021 at 2:33 am

    Hi Jim,
    I just want to thank you for your posts. I am working in fetal medicine with a 50 percent research appointment. These posts have opened my eyes several times and stimulated me to read more to push my knowledge further.
    Of all the emails coming in, yours are the ones I really look forward to getting. For me, even repetitions of themes with only small extras are nice to read. I might for once even feel statistically advanced. So thank you for sharing, stay safe and please keep it going👍

    • Jim Frost says

      September 12, 2021 at 12:44 am

      Hi Alexander,

      Thanks so much for the kind words. They keep me motivated to keep going! 🙂

  6. Berhanu Gebo says

    September 8, 2021 at 1:47 am

    Many thanks for sharing statistical clarity. I always read your posts. STAY SAFE

    • Jim Frost says

      September 8, 2021 at 11:05 pm

      Thank you so much, Berhanu! You stay safe as well.



Copyright © 2023 · Jim Frost