Welch’s ANOVA is an alternative to the traditional analysis of variance (ANOVA) and it offers some serious benefits. One-way ANOVA determines whether differences between the means of at least three groups are statistically significant. For decades, introductory statistics classes have taught the classic Fishers one-way ANOVA that uses the F-test. It’s a standard statistical analysis, and you might think it’s pretty much set in stone by now. Surprise, there’s a significant change occurring in the world of one-way ANOVA!

There is a new kid on the ANOVA block. Well, not a new kid, but an old kid who’s gaining in popularity.

Let me acquaint you with Welch’s ANOVA. You use it for the same reasons as the classic statistical test, to assess the means of three or more groups. However, Welch’s ANOVA provides critical benefits and protections because you can use it even when your groups have unequal variances. In fact, you read it here first; Welch’s ANOVA might knock out the classic version.

In this post, I’ll explain the dangers of using the Classic ANOVA with unequal variances, the benefits of using Welch’s ANOVA, and I interpret a Welch’s ANOVA example with the Games-Howell post hoc test.

## One-Way ANOVA Assumptions

Welch’s ANOVA enters the discussion because it can help you get out of a tricky situation with an assumption. Like all statistical tests, one-way ANOVA has some assumptions. If you fail to satisfy the assumptions, you might not be able to trust the results. Simulation studies have been crucial in revealing which assumptions are strict requirements and which are more lenient.

The Classic one-way test assumes that all groups share a common standard deviation (or variance) even when their means are different. Unfortunately, simulation studies find that this assumption is a strict requirement. If your groups have unequal variances, your results can be incorrect if you use the classic test. On the other hand, Welch’s ANOVA isn’t sensitive to unequal variances.

Before I delve into the importance of this assumption, I’ll briefly describe how the simulation study tested it.

## Comparing Welch’s ANOVA to Fisher’s

For all hypothesis tests, you specify the significance level. Ideally, the significance level equals the probability of rejecting a null hypothesis that is true (Type I error). This error is basically a false positive because the test results (a small p-value) lead you to believe *incorrectly* that some of the group means are different. When tests produce valid results, the Type I error rate equals the significance level. For example, if your significance level is 0.05, then 5% of tests should have this error when the null is true.

The investigators who perform a simulation study know when the null hypothesis is true or false. They can use this knowledge to determine whether the proportion of tests with a Type I error matches the significance level, which is the target. The researchers can generate data that violate an assumption to determine whether it affects the results. The larger the difference between the significance level and the Type I error rate, the more critical it becomes to satisfy the assumption.

## Simulation Results for Unequal Variances

The simulation study assessed 50 different conditions related to unequal variances. For each state, the computer drew 10,000 random samples and statistically analyzed them using both Welch’s ANOVA and the traditional one-way test.

For the Classic ANOVA, the simulation study found that unequal standard deviations cause the Type I error rate to shift away from the significance level target. If the group sizes are equal and the significance level is 0.05, the actual error rate falls between 0.02 and 0.08. However, if the groups have different sizes, the error rates can be as large as 0.22!

## Welch’s ANOVA to the Rescue

If you determine that your groups have standard deviations that are unequal, what should you do? Use Welch’s ANOVA! The same simulation study found that Welch’s ANOVA is unaffected by unequal variances. In fact, Welch’s ANOVA explicitly does not assume that the variances are equal.

Let’s compare the simulation study results for the two types of ANOVA when standard deviations are unequal, and the significance level is 0.05.

- Classic ANOVA error rates extend from 0.02 to 0.22.
- Welch’s ANOVA error rates have a much smaller range of 0.046 to 0.054.

In fact, it’s fine to use Welch’s ANOVA even when your groups *do* have equal variances because its statistical power is nearly equivalent to that of the Classic test. Welch’s ANOVA is an excellent analysis that you can use *all* the time for one-way ANOVA. It completely wipes away the need to worry about the assumption of homogeneous variances.

## Welch’s ANOVA Example with a Post Hoc Test

In this example, our data are the ground reaction forces that are generated by jumping from steps of different heights. You can download the CSV data file for the WelchsANOVAExample.

First, I’ll graph the data to give us a good sense of the situation. The chart below is an interval plot that displays the group means and 95% confidence intervals.

The ranges are based on the individual standard deviations for each group, and they look different. So, Welch’s ANOVA is a good choice for these data.

Next, I’ll perform the hypothesis test. Depending on your statistical software, the Welch’s procedure might be a separate command, or you may need to tell the software to not assume equal variances. The Welch’s ANOVA output is below.

The output for Welch’s ANOVA is relatively similar to the Classic test. Although, you’ll notice that it does not contain the usual analysis of variance table. Like interpreting any hypothesis test, compare the p-value to your significance level to determine whether the differences between the means are statistically significant. For our example results, the very low P-value indicates that these results are statistically significant.

For analysis of variance, significant results indicate that not all group means are equal. However, these results don’t tell you precisely which group means are different. To identify statistically significant differences between specific groups, you need to perform a pairwise comparisons post hoc test. When you use Welch’s ANOVA, you can use the Games-Howell multiple comparisons method.

The Games-Howell post hoc test is most like Tukey’s method for Classic ANOVA. Both procedures do the following:

- Control the joint error rate for the entire series of comparisons.
- Compare all possible pairs of groups within a collection of groups.

The Games-Howell post hoc test, like Welch’s ANOVA, does not require the groups to have equal standard deviations. Conversely, Tukey’s method does require equal standard deviations.

The Games-Howell post hoc test results are below:

None of the confidence intervals for the differences between group means contain zero. Consequently, these confidence intervals indicate that the differences between all pairs of groups are statistically significant.

I hope you’ll consider using Welch’s ANOVA anytime you need to perform a one-way test of the means!

Michael Thornton says

Hi Jim,

Thanks for this very straight-forward explanation of Welch’s ANOVA. What are the assumptions for this test? I have 3 samples; one is negatively skewed while the others are positively skewed. The former is a large sample with large expected cell sizes; the two latter are small samples, with some expected cell sizes below 5. Resampling isn’t possible. Any suggestions? Thanks for your thoughts, and for the great site.

Jim Frost says

Hi Michael, I think the small cell sizes will be a problem. Skewness isn’t necessarily a problem for parametric analyses, like ANOVA, when you have a sufficiently large sample size. However, some of yours are below that threshold. I also think that bootstrapping wouldn’t be such a good approach, particularly based on a the small sample sizes. Unfortunately, I think the cells < 5 would be also a problem for nonparametric alternatives because that's their minimum threshold. I wish I had a better answer for you but I don't know of any solutions offhand.Here is a post where I write about the sample size guidelines and normality for one-way ANOVA and other tests in a post about parametric vs nonparametric analyses.

Michael Thornton says

Jim – I see that I wasn’t quite clear about resampling, so allow me to restate my comment: I cannot collect new samples from the field. I can, however, use computerized resampling methods (e.g., bootstrapping).

Woojae Kim says

People often forget the theoretical reasoning behind the equal variance assumption in ANOVA and regression analysis. It is there because it makes comparison of population means sensible and parsimonious. Welch’s test certainly corrects the bias in testing results due to unequal variances, but it is ultimately at the a researcher’s discretion whether comparing population means with unequal variances makes sense to her/him. If variances are very different, even if the test tells significance (e.g., due to a large sample), simply stating that the two means are statistically significant will not comprise the whole story. As always recommended, any statistical technique requires background, theoretical understandings to be used properly.

Jim Frost says

Hi Woojae,

Thank you for raising an excellent point. I agree 100% with everything you say. Understanding the subject-area, the data in all of its details, and the specifics of the analyses are crucial for performing meaningful statistical analysis. Analysts should tell the full story.

I think Welch’s ANOVA is an excellent analysis when analysts have data with unequal variances and they want to determine whether the differences between the means are statistically significant.