Welch’s ANOVA is an alternative to the traditional analysis of variance (ANOVA) and it offers some serious benefits. One-way ANOVA determines whether differences between the means of at least three groups are statistically significant. For decades, introductory statistics classes have taught the classic Fisher’s one-way ANOVA that uses the F-test. It’s a standard statistical analysis, and you might think it’s pretty much set in stone by now. Surprise, there’s a significant change occurring in the world of one-way ANOVA!

There is a new kid on the ANOVA block. Well, not a new kid, but an old kid who’s gaining in popularity.

Let me acquaint you with Welch’s ANOVA. You use it for the same reasons as the classic statistical test, to assess the means of three or more groups. However, Welch’s ANOVA provides critical benefits and protections because you can use it even when your groups have unequal variances. In fact, you read it here first; Welch’s ANOVA might knock out the classic version.

In this post, I’ll explain the dangers of using the Classic ANOVA with unequal variances, describe the benefits of using Welch’s ANOVA, and interpret a Welch’s ANOVA example with the Games-Howell post hoc test.

## One-Way ANOVA Assumptions

Welch’s ANOVA enters the discussion because it can help you get out of a tricky situation with an assumption. Like all statistical tests, one-way ANOVA has some assumptions. If you fail to satisfy the assumptions, you might not be able to trust the results. Simulation studies have been crucial in revealing which assumptions are strict requirements and which are more lenient.

The Classic one-way test assumes that all groups share a common standard deviation (or variance) even when their means are different. Unfortunately, simulation studies find that this assumption is a strict requirement. If your groups have unequal variances, your results can be incorrect if you use the classic test. On the other hand, Welch’s ANOVA isn’t sensitive to unequal variances.

Before I delve into the importance of this assumption, I’ll briefly describe how the simulation study tested it.

## Comparing Welch’s ANOVA to Fisher’s

For all hypothesis tests, you specify the significance level. Ideally, the significance level equals the probability of rejecting a null hypothesis that is true (Type I error). This error is basically a false positive because the test results (a small p-value) lead you to believe *incorrectly* that some of the group means are different. When tests produce valid results, the Type I error rate equals the significance level. For example, if your significance level is 0.05, then 5% of tests should have this error when the null is true.

The investigators who perform a simulation study know when the null hypothesis is true or false. They can use this knowledge to determine whether the proportion of tests with a Type I error matches the significance level, which is the target. The researchers can generate data that violate an assumption to determine whether it affects the results. The larger the difference between the significance level and the Type I error rate, the more critical it becomes to satisfy the assumption.
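To make the idea concrete, here is a minimal Python sketch of this kind of simulation. It is not the study’s actual code. It assumes three groups of 20 normally distributed observations that share the same mean and standard deviation, so the null hypothesis is true by construction and every significant result is a false positive. The function names, the seed, and the 4,000 repetitions are illustrative choices. With exactly three groups, the F-test has 2 numerator degrees of freedom, which allows a closed-form p-value and keeps the example free of external libraries.

```python
import random


def classic_f_pvalue_3groups(groups):
    """Classic (Fisher) one-way ANOVA p-value for exactly three groups."""
    k = len(groups)
    if k != 3:
        raise ValueError("the closed-form p-value below is valid only for 3 groups")
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df1, df2 = k - 1, n_total - k
    f_stat = (ss_between / df1) / (ss_within / df2)
    # For an F distribution with 2 numerator df: P(F > f) = (1 + 2f/df2)^(-df2/2)
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)


def type_i_error_rate(n_sims=4000, n_per_group=20, alpha=0.05, seed=1):
    """Proportion of simulated tests that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        # Same mean (0) and same SD (1) in every group: the null is true.
        groups = [[rng.gauss(0, 1) for _ in range(n_per_group)] for _ in range(3)]
        if classic_f_pvalue_3groups(groups) < alpha:
            false_positives += 1
    return false_positives / n_sims


rate = type_i_error_rate()
print(f"Estimated Type I error rate: {rate:.3f}")
```

When the assumptions hold, the estimated rate should land close to the significance level, apart from simulation noise.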

## Simulation Results for Unequal Variances

The simulation study assessed 50 different conditions related to unequal variances. For each condition, the computer drew 10,000 random samples and statistically analyzed them using both Welch’s ANOVA and the traditional one-way test.

For the Classic ANOVA, the simulation study found that unequal standard deviations cause the Type I error rate to shift away from the significance level target. If the group sizes are equal and the significance level is 0.05, the actual error rate falls between 0.02 and 0.08. However, if the groups have different sizes, the error rates can be as large as 0.22!

## Welch’s ANOVA to the Rescue

If you determine that your groups have standard deviations that are unequal, what should you do? Use Welch’s ANOVA! The same simulation study found that Welch’s ANOVA is unaffected by unequal variances. In fact, Welch’s ANOVA explicitly does not assume that the variances are equal.

Let’s compare the simulation study results for the two types of ANOVA when standard deviations are unequal, and the significance level is 0.05.

- Classic ANOVA error rates extend from 0.02 to 0.22.
- Welch’s ANOVA error rates have a much smaller range of 0.046 to 0.054.
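You can see the same pattern in a small simulation of your own. The sketch below is not the study’s code and won’t reproduce its exact numbers. It assumes one hypothetical condition: three normally distributed groups with equal means, sizes of 10, 20, and 40, and the smallest group having a standard deviation four times larger than the others. Those settings, the function names, and the 4,000 repetitions are my assumptions. The p-value shortcut works only because three groups give 2 numerator degrees of freedom.

```python
import random


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def p_from_f(f_stat, df2):
    # Upper-tail probability of an F distribution with 2 numerator df.
    return (1 + 2 * f_stat / df2) ** (-df2 / 2)


def classic_p(groups):
    """Classic one-way ANOVA p-value (three groups, pooled variance)."""
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    n_total = sum(n)
    grand = sum(ni * mi for ni, mi in zip(n, m)) / n_total
    ms_between = sum(ni * (mi - grand) ** 2 for ni, mi in zip(n, m)) / 2
    df2 = n_total - 3
    ms_within = sum((ni - 1) * variance(g) for ni, g in zip(n, groups)) / df2
    return p_from_f(ms_between / ms_within, df2)


def welch_p(groups):
    """Welch's ANOVA p-value (three groups, no equal-variance assumption)."""
    k = 3
    n = [len(g) for g in groups]
    m = [mean(g) for g in groups]
    w = [ni / variance(g) for ni, g in zip(n, groups)]  # weight = n / variance
    w_total = sum(w)
    weighted_mean = sum(wi * mi for wi, mi in zip(w, m)) / w_total
    numerator = sum(wi * (mi - weighted_mean) ** 2 for wi, mi in zip(w, m)) / (k - 1)
    a = sum((1 - wi / w_total) ** 2 / (ni - 1) for wi, ni in zip(w, n))
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * a
    df2 = (k ** 2 - 1) / (3 * a)
    return p_from_f(numerator / denominator, df2)


def simulate(sizes, sds, n_sims=4000, alpha=0.05, seed=1):
    """Return (classic, Welch) Type I error rates when all true means are equal."""
    rng = random.Random(seed)
    classic_hits = welch_hits = 0
    for _ in range(n_sims):
        groups = [[rng.gauss(0, sd) for _ in range(n)] for n, sd in zip(sizes, sds)]
        classic_hits += classic_p(groups) < alpha
        welch_hits += welch_p(groups) < alpha
    return classic_hits / n_sims, welch_hits / n_sims


classic_rate, welch_rate = simulate(sizes=(10, 20, 40), sds=(4, 1, 1))
print(f"Classic ANOVA: {classic_rate:.3f}   Welch's ANOVA: {welch_rate:.3f}")
```

Pairing the smallest group with the largest standard deviation is the kind of condition where the Classic test tends to produce too many false positives, while Welch’s ANOVA stays near the 0.05 target.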

In fact, it’s fine to use Welch’s ANOVA even when your groups *do* have equal variances because its statistical power is nearly equivalent to that of the Classic test. Welch’s ANOVA is an excellent analysis that you can use *all* the time for one-way ANOVA. It completely wipes away the need to worry about the assumption of homogeneous variances.

## Welch’s ANOVA Example with a Post Hoc Test

In this example, our data are the ground reaction forces that are generated by jumping from steps of different heights. You can download the CSV data file for the WelchsANOVAExample.

First, I’ll graph the data to give us a good sense of the situation. The chart below is an interval plot that displays the group means and 95% confidence intervals.

The ranges are based on the individual standard deviations for each group, and they look different. So, Welch’s ANOVA is a good choice for these data.

Next, I’ll perform the hypothesis test. Depending on your statistical software, the Welch’s procedure might be a separate command, or you may need to tell the software to not assume equal variances. The Welch’s ANOVA output is below.

The output for Welch’s ANOVA is relatively similar to the Classic test, although you’ll notice that it does not contain the usual analysis of variance table. As with any hypothesis test, compare the p-value to your significance level to determine whether the differences between the means are statistically significant. For our example results, the very low p-value indicates that these results are statistically significant.

For analysis of variance, significant results indicate that not all group means are equal. However, these results don’t tell you precisely which group means are different. To identify statistically significant differences between specific groups, you need to perform a pairwise comparisons post hoc test. When you use Welch’s ANOVA, you can use the Games-Howell multiple comparisons method.

The Games-Howell post hoc test is most like Tukey’s method for Classic ANOVA. Both procedures do the following:

- Control the joint error rate for the entire series of comparisons.
- Compare all possible pairs of groups within a collection of groups.

The Games-Howell post hoc test, like Welch’s ANOVA, does not require the groups to have equal standard deviations. Conversely, Tukey’s method does require equal standard deviations.
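For reference, the Games-Howell comparison of groups $i$ and $j$ is usually written as shown below. The notation is an assumption for illustration and does not come from any particular software output: $\bar{x}$, $s^2$, and $n$ are each group’s sample mean, variance, and size, $k$ is the number of groups, and $q_{\alpha,k,\nu}$ is the critical value of the studentized range distribution.

$$
t_{ij} = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{\dfrac{s_i^2}{n_i} + \dfrac{s_j^2}{n_j}}},
\qquad
\nu_{ij} = \frac{\left(\dfrac{s_i^2}{n_i} + \dfrac{s_j^2}{n_j}\right)^2}{\dfrac{(s_i^2/n_i)^2}{n_i - 1} + \dfrac{(s_j^2/n_j)^2}{n_j - 1}}
$$

$$
(\bar{x}_i - \bar{x}_j) \;\pm\; \frac{q_{\alpha,k,\nu_{ij}}}{\sqrt{2}} \sqrt{\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}}
$$

Each pair gets its own standard error and its own degrees of freedom rather than a pooled variance, which is why the method does not need equal standard deviations. A pair differs significantly when its interval excludes zero.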

The Games-Howell post hoc test results are below:

None of the confidence intervals for the differences between group means contain zero. Consequently, these confidence intervals indicate that the differences between all pairs of groups are statistically significant.

I hope you’ll consider using Welch’s ANOVA anytime you need to perform a one-way test of the means!

Michael Thornton says

Hi Jim,

Thanks for this very straight-forward explanation of Welch’s ANOVA. What are the assumptions for this test? I have 3 samples; one is negatively skewed while the others are positively skewed. The former is a large sample with large expected cell sizes; the two latter are small samples, with some expected cell sizes below 5. Resampling isn’t possible. Any suggestions? Thanks for your thoughts, and for the great site.

Jim Frost says

Hi Michael, I think the small cell sizes will be a problem. Skewness isn’t necessarily a problem for parametric analyses, like ANOVA, when you have a sufficiently large sample size. However, some of yours are below that threshold. I also think that bootstrapping wouldn’t be such a good approach, particularly given the small sample sizes. Unfortunately, I think the cells < 5 would also be a problem for nonparametric alternatives because that’s their minimum threshold. I wish I had a better answer for you, but I don’t know of any solutions offhand. I write about the sample size guidelines and normality for one-way ANOVA and other tests in my post about parametric vs nonparametric analyses.

Michael Thornton says

Jim – I see that I wasn’t quite clear about resampling, so allow me to restate my comment: I cannot collect new samples from the field. I can, however, use computerized resampling methods (e.g., bootstrapping).

Woojae Kim says

People often forget the theoretical reasoning behind the equal variance assumption in ANOVA and regression analysis. It is there because it makes comparison of population means sensible and parsimonious. Welch’s test certainly corrects the bias in testing results due to unequal variances, but it is ultimately at the researcher’s discretion whether comparing population means with unequal variances makes sense to her/him. If variances are very different, even if the test indicates significance (e.g., due to a large sample), simply stating that the two means are significantly different will not tell the whole story. As always recommended, any statistical technique requires background and theoretical understanding to be used properly.

Jim Frost says

Hi Woojae,

Thank you for raising an excellent point. I agree 100% with everything you say. Understanding the subject-area, the data in all of its details, and the specifics of the analyses are crucial for performing meaningful statistical analysis. Analysts should tell the full story.

I think Welch’s ANOVA is an excellent analysis when analysts have data with unequal variances and they want to determine whether the differences between the means are statistically significant.

ruwini says

Hello sir,

Can you describe the equation of the Welch test in a simple manner, in comparison with the ANOVA equation?

Kevin says

I’ll give you a verbal explanation/breakdown, step by step, of what the equation represents. It will be a little wordy, but I want to be clear, so bear with me.

1) Determine a weight for each group by dividing each group’s sample size by its respective variance.

2) Multiply each group’s mean by its corresponding weight as determined in step 1, add these products together, and divide this total by the sum of all the group weights. The result is the weighted grand mean.

3) Take the difference between each group’s mean and the weighted grand mean, square this difference, and multiply this result by the corresponding group weight from step 1. Add all these results together, and divide by (number of groups – 1). The result is the numerator for Welch’s F.

4) For each group, calculate (group weight from step 1/total of all group weights). Subtract this fraction from 1, square the difference, and divide this square by (group size – 1). Add these results together from each group and call the final number A.

5) Multiply A by 2, then by the fraction (k – 2)/(k^2 – 1), where k is the number of groups compared. Finally, add 1 to this result. The ending number is the denominator of Welch’s F.

6) Do the division to calculate Welch’s F. As in the standard ANOVA, the numerator degrees of freedom remain at (# of groups minus 1). For Welch’s ANOVA, the denominator degrees of freedom are calculated as (k^2 – 1)/(3A), where k is the number of groups compared and A is defined above in step 4.
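The six steps above can be sketched in a few lines of Python. This is only an illustration of the verbal recipe, not output from any statistics package. The function name `welch_anova` and the three small example groups are made up for the demonstration.

```python
def welch_anova(groups):
    """Return (F, df1, df2) for Welch's ANOVA, following steps 1 to 6 above."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [sum(g) / len(g) for g in groups]
    variances = [
        sum((x - m) ** 2 for x in g) / (len(g) - 1) for g, m in zip(groups, means)
    ]

    # Step 1: weight = group size / group variance
    weights = [n / v for n, v in zip(sizes, variances)]
    total_weight = sum(weights)

    # Step 2: weighted mean of the group means
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_weight

    # Step 3: numerator of Welch's F
    numerator = sum(
        w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)

    # Step 4: the quantity A
    a = sum((1 - w / total_weight) ** 2 / (n - 1) for w, n in zip(weights, sizes))

    # Step 5: denominator of Welch's F
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * a

    # Step 6: F ratio and degrees of freedom
    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * a)
    return f_stat, df1, df2


# Made-up example data with visibly different spreads.
example = [[1, 2, 3], [2, 4, 6], [4, 5, 6, 7, 8]]
f_stat, df1, df2 = welch_anova(example)
print(f"Welch's F = {f_stat:.3f}, df1 = {df1}, df2 = {df2:.3f}")
```

The p-value is the upper-tail probability of an F distribution with these degrees of freedom. If SciPy is available, `scipy.stats.f.sf(f_stat, df1, df2)` is one way to get it. Note that df2 is usually not a whole number.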

kim behr says

I used a Welch’s ANOVA since three groups have unequal variances. Can I safely say in the discussion that heteroscedasticity could cause increased Type I error, but this has been corrected because Welch’s ANOVA was used?

Kevin says

Hi Kim,

I’m not intending for this reply to speak for Jim, this is just my take on it. In my view, what you said is pretty accurate. The Welch procedure adjusts both the F ratio and the denominator degrees of freedom to protect against Type I error, so in that sense it is considered to be a conservative test, where the alpha level may actually end up being a bit lower than what you specify. I would say that in most circumstances this would come with a loss of statistical power as a tradeoff, but apparently the Welch F procedure has a negligible difference in power compared to the standard ANOVA. About the only disadvantage I can think of with Welch’s ANOVA, besides the reduced denominator degrees of freedom, is that the test generally does not perform as well if the data are strongly skewed or otherwise non-normal. But in that situation, would comparing means necessarily make much sense anyway?

Jim Frost says

Hi Kevin,

I worked with the folks who did the simulation study. Their conclusion was that analysts should always use Welch’s ANOVA! Apparently there is only a very slight loss of power.

As for nonnormal data, if you have a large enough sample size per group, that’s not a problem in terms of it being a valid test. The group sizes are the same as those I describe in my post about parametric vs nonparametric hypothesis tests. Although, as you mention, if the data are sufficiently skewed, the mean might not represent the center of the groups adequately.

Thanks for your thoughts on this!

Jim

Jim Frost says

Hi Kim,

Basically, yes, you can state that. There is one minor change to it. In the simulation studies, they found that if you set alpha to 0.05, the actual Type I error rate could be higher or lower than that: 0.02 to 0.08 with equal sized groups, or 0.02 to 0.22 with unequal sized groups. So, the Type I error rate was not necessarily increased because sometimes it was lower. A lower rate sounds good until you realize that it reduces the power of your test!

But, it is entirely accurate to say that using Welch’s ANOVA produces an actual Type I error rate that is very close to the target value (alpha) even when the variances are unequal.

I hope this helps!

Jim

Kevin says

Just a curiosity of mine, Jim – maybe you have an answer. In the standard ANOVA, the F ratio is basically a “signal to noise” ratio – a measure of variation between groups due to a treatment effect (signal) to random variation within groups (noise). If we define our weight as (group size/group variance) in Welch ANOVA, then I can see the numerator of Welch F as being more or less a weighted average of the total variation between groups, so that makes sense. I thought I read somewhere that the denominator of Welch F represented not so much a measure of variation within groups, but more of a correction factor of sorts, based on the “expected value” of the variance or something similar. Do you have any insight on that?