Welch’s ANOVA is an alternative to the traditional analysis of variance (ANOVA), and it offers some serious benefits. One-way analysis of variance determines whether differences between the means of at least three groups are statistically significant. For decades, introductory statistics classes have taught the classic Fisher’s one-way ANOVA, which uses the F-test. It’s a standard statistical analysis, and you might think it’s pretty much set in stone by now. Surprise: there’s a significant change occurring in the world of one-way analysis of variance!

There is a new kid on the ANOVA block. Well, not a new kid, but an old kid who’s gaining in popularity.

Let me acquaint you with Welch’s ANOVA. You use it for the same reason as the classic statistical test: to assess the means of three or more groups. However, Welch’s analysis of variance provides critical benefits and protections because you can use it even when your groups have unequal variances. In fact, you read it here first: Welch’s ANOVA might knock out the classic version.

In this post, I’ll explain the dangers of using the Classic analysis of variance with unequal variances and the benefits of using Welch’s ANOVA, and I’ll interpret a Welch’s ANOVA example with the Games-Howell post hoc test.

## One-Way ANOVA Assumptions

Welch’s ANOVA enters the discussion because it can help you get out of a tricky situation with an assumption. Like all statistical tests, one-way ANOVA has some assumptions. If you fail to satisfy the assumptions, you might not be able to trust the results. Simulation studies have been crucial in revealing which assumptions are strict requirements and which are more lenient.

The Classic one-way test assumes that all groups share a common standard deviation (or variance) even when their means are different. Unfortunately, simulation studies find that this assumption is a strict requirement. If your groups have unequal variances, your results can be incorrect if you use the classic test. On the other hand, Welch’s ANOVA isn’t sensitive to unequal variances.

Before I delve into the importance of this assumption, I’ll briefly describe how the simulation study tested it.

## Comparing Welch’s ANOVA to Fisher’s

For all hypothesis tests, you specify the significance level. Ideally, the significance level equals the probability of rejecting a null hypothesis that is true (Type I error). This error is basically a false positive because the test results (a small p-value) lead you to believe *incorrectly* that some of the group means are different. When tests produce valid results, the Type I error rate equals the significance level. For example, if your significance level is 0.05, then 5% of tests should have this error when the null is true.

The investigators who perform a simulation study know when the null hypothesis is true or false. They can use this knowledge to determine whether the proportion of tests with a Type I error matches the significance level, which is the target. The researchers can generate data that violate an assumption to determine whether it affects the results. The larger the difference between the significance level and the Type I error rate, the more critical it becomes to satisfy the assumption.

**Related post**: Types of Error in Hypothesis Testing

## Simulation Results for Unequal Variances

The simulation study assessed 50 different conditions related to unequal variances. For each condition, the computer drew 10,000 random samples and statistically analyzed them using both Welch’s ANOVA and the traditional one-way test.

For the Classic ANOVA, the simulation study found that unequal standard deviations cause the Type I error rate to shift away from the significance level target. If the group sizes are equal and the significance level is 0.05, the actual error rate falls between 0.02 and 0.08. However, if the groups have different sizes, the error rates can be as large as 0.22!

## Welch’s ANOVA to the Rescue

If you determine that your groups have standard deviations that are unequal, what should you do? Use Welch’s ANOVA! The same simulation study found that Welch’s analysis of variance is unaffected by unequal variances. In fact, Welch’s ANOVA explicitly does not assume that the variances are equal.

Let’s compare the simulation study results for the two types of analysis of variance when standard deviations are unequal, and the significance level is 0.05.

- Classic ANOVA error rates extend from 0.02 to 0.22.
- Welch’s ANOVA error rates have a much smaller range of 0.046 to 0.054.
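You can get a feel for how a simulation like this works with a short sketch. Assuming Python with NumPy and SciPy, the code below repeatedly draws null datasets in which a small group has a large standard deviation and counts how often the classic F-test rejects. The group sizes and standard deviations are my own illustrative picks, not the conditions from the study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_sims = 5000

# The null hypothesis is true: all three group means are 0.
# Unequal spreads and sizes: the small group has the large SD,
# a combination that tends to make the classic F-test liberal.
sizes = [10, 30, 30]
sds = [10.0, 1.0, 1.0]

rejections = 0
for _ in range(n_sims):
    groups = [rng.normal(0.0, sd, n) for n, sd in zip(sizes, sds)]
    _, p = stats.f_oneway(*groups)  # classic one-way ANOVA
    if p < alpha:
        rejections += 1

type_i_rate = rejections / n_sims
print(f"Classic ANOVA Type I error rate: {type_i_rate:.3f}")
```

With these settings, the observed Type I error rate lands well above the 0.05 target, which is exactly the kind of distortion the simulation studies document.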

In fact, it’s fine to use Welch’s ANOVA even when your groups *do* have equal variances because its statistical power is nearly equivalent to that of the Classic test. Welch’s analysis of variance is an excellent analysis that you can use *all* the time for one-way analysis of variance. It completely wipes away the need to worry about the assumption of homogeneous variances.

## Welch’s ANOVA Example

In this example, our data are the ground reaction forces that are generated by jumping from steps of different heights. You can download the CSV data file for the WelchsANOVAExample.

First, I’ll graph the data to give us a good sense of the situation. The chart below is an interval plot that displays the group means and 95% confidence intervals.

The ranges are based on the individual standard deviations for each group, and they look different. So, Welch’s analysis of variance is a good choice for these data.

Next, I’ll perform the hypothesis test. Depending on your statistical software, Welch’s procedure might be a separate command, or you may need to tell the software not to assume equal variances. The Welch’s ANOVA output is below.
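For instance, Python’s statsmodels library exposes Welch’s ANOVA through `anova_oneway` with `use_var="unequal"`. The data below are simulated purely for illustration (they are not the jumping dataset), so treat this as a sketch of the command rather than a reproduction of this post’s analysis.

```python
import numpy as np
from statsmodels.stats.oneway import anova_oneway

rng = np.random.default_rng(1)
# Illustrative data only: three groups with unequal spreads.
g1 = rng.normal(10, 1, 20)
g2 = rng.normal(11, 3, 20)
g3 = rng.normal(13, 6, 20)

# use_var="unequal" requests Welch's ANOVA instead of the classic F-test.
res = anova_oneway((g1, g2, g3), use_var="unequal", welch_correction=True)
print(f"Welch's F = {res.statistic:.2f}, p-value = {res.pvalue:.4f}")
```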

The output for Welch’s ANOVA is relatively similar to the Classic test’s, although you’ll notice that it does not contain the usual analysis of variance table. As with any hypothesis test, compare the p-value to your significance level to determine whether the differences between the means are statistically significant. For our example, the very low p-value indicates that these results are statistically significant. Our sample provides sufficient evidence to conclude that the group means are not all equal in the population.

## Using Post Hoc Tests with Welch’s ANOVA

While the overall results above indicate that not all group means are equal, we don’t know which differences between group means are statistically significant. To identify significant differences between specific groups, you need to perform a pairwise comparisons post hoc test. When you use Welch’s ANOVA, you can use the Games-Howell multiple comparisons method.

For more information about this process, read my post about Using Post Hoc Tests with ANOVA.

The Games-Howell post hoc test is most like Tukey’s method for Classic ANOVA. Both procedures do the following:

- Control the joint error rate for the entire series of comparisons.
- Compare all possible pairs of groups within a collection of groups.

The Games-Howell post hoc test, like Welch’s analysis of variance, does not require the groups to have equal standard deviations. In contrast, Tukey’s method does require equal standard deviations.
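As a sketch of how a single Games-Howell interval is computed, here is a minimal implementation, assuming SciPy 1.7+ for its studentized range distribution. The function name and the data are my own illustrative choices.

```python
import numpy as np
from scipy.stats import studentized_range

def games_howell_ci(x, y, k, alpha=0.05):
    """Games-Howell confidence interval for mean(x) - mean(y),
    where k is the total number of groups in the comparison."""
    nx, ny = len(x), len(y)
    vx = np.var(x, ddof=1) / nx
    vy = np.var(y, ddof=1) / ny
    diff = np.mean(x) - np.mean(y)
    se = np.sqrt(vx + vy)
    # Welch-Satterthwaite degrees of freedom for this pair.
    df = (vx + vy) ** 2 / (vx**2 / (nx - 1) + vy**2 / (ny - 1))
    # Critical value from the studentized range distribution.
    q = studentized_range.ppf(1 - alpha, k, df)
    margin = q / np.sqrt(2) * se
    return diff - margin, diff + margin

rng = np.random.default_rng(7)
a = rng.normal(100, 5, 25)   # tighter group
b = rng.normal(110, 15, 25)  # more variable group
lo, hi = games_howell_ci(a, b, k=3)
print(f"95% Games-Howell CI for mean(a) - mean(b): ({lo:.1f}, {hi:.1f})")
```

Note that each pair gets its own standard error and degrees of freedom, which is why the method tolerates unequal spreads.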

The Games-Howell post hoc test results are below:

None of the confidence intervals for the differences between group means contain zero. Consequently, these confidence intervals indicate that the differences between all pairs of groups are statistically significant.

I hope you’ll consider using Welch’s ANOVA anytime you need to perform a one-way test of the means!

Ariel Balter says

Hi Sarah. I just posted a comment with a link to a master’s thesis that did similar simulations. Not quite peer-reviewed, but might be helpful.

Ariel Balter says

Great post as always. Here is a master’s thesis that did similar simulations to the ones you mentioned and arrived at similar conclusions.

https://scholarscompass.vcu.edu/cgi/viewcontent.cgi?article=5026&context=etd

Personally, given that my nice but not extraordinary laptop could probably perform all of those simulations overnight, why don’t we have volumes of these sorts of simulation-based assessments of statistical tests, their weaknesses, strengths, limitations, etc.? It seems like a no-brainer and would provide clear answers to all of those myriad “this or that” posts on cross-validated, Quora, Research Gate, etc.

Jim Frost says

Hi Ariel,

Thanks for sharing the link to that excellent thesis! I think there are more and more simulation studies being conducted all the time. And, they’ve yielded very interesting results. For example, the normality assumption used to be king but now it’s widely recognized that many parametric tests are robust to departures from normality. I cover that as one of the points in my post about parametric vs. nonparametric tests. I also found a neat study that answers a longstanding question about whether you should use a 2-sample t-test or a Mann-Whitney test to analyze Likert scale data. Turns out that it doesn’t matter in most cases!

I think this type of information will continue to grow and spread. I love that type of simulation study!

Shawn J says

Hi Jim,

Your blog is one of my favorites and always a great read! Just picked up my copy of your new book. Any chance of it in hardcopy, perhaps a print-on-demand option?

This post and the replies of your commentators got me questioning assumptions from when I learned about ANOVA & t-tests.

Regarding the assumption of normality–I was originally taught (way back when) to examine the dependent variable (DV) for a normal distribution, which could be eyeballed with a histogram (not ideal) or a QQ plot (better). We could also use computational checks like the Kolmogorov-Smirnov or the Shapiro-Wilk test. Yet, many things around us are not normally distributed, which complicates that assumption check.

As I explored more stats resources, I found that many authors say we should examine the normality of the residuals, requiring that we fit the ANOVA model first and then use one of several ways to examine the residuals. Others extend the normality-of-the-DV principle to checking for a normal distribution at each group level–so QQ plots or KS & SW tests for each group. Yet another site went so far as to pretty much say don’t worry about it since the sampling distribution is normal.

Where do you stand on this and how do you suggest we check this assumption?

Thanks,

Shawn

Some source links:

https://stats.stackexchange.com/questions/6350/anova-assumption-normality-normal-distribution-of-residuals

http://www2.psychology.uiowa.edu/faculty/mordkoff/GradStats/part%201/I.07%20normal.pdf

https://stats.stackexchange.com/questions/12262/what-if-residuals-are-normally-distributed-but-y-is-not

https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide-2.php

Jim Frost says

Hi Shawn,

I’m so happy that you’ve found my blog to be helpful. And thanks for buying my regression ebook. I appreciate that greatly!

In regression analysis, the assumption of normality actually applies to the distribution of the residuals. And, satisfying that assumption is technically optional. You only need to satisfy it when you want to use hypothesis testing (e.g., coefficient p-values and confidence intervals). Which, ok, is almost all the time! The idea is that when the residuals are normally distributed, the sampling distributions for the coefficients are also normally distributed. Hence, it’s ok to trust the p-values. If the residuals are not normally distributed, you can’t assume that those sampling distributions are either. They’re connected in that fashion, and that explains why we check the distribution of the residuals. However, I’d expect the central limit theorem to come into play with a large enough sample size, but I don’t have good numbers/information about that on hand.

While the regression assumption applies to the distribution of the residuals rather than any of the variables (DV or the IVs), it is harder, but not impossible, to obtain normal residuals when the DV is severely non-normal. Sometimes with a severely non-normal DV, you’ll need to transform it. So, that’s how the distribution of the DV often gets thrown into this discussion. But, it’s the distribution of the residuals that really matters. If you have a nonnormal DV but the residuals are normally distributed, you’re good! You’ll see more about this in my ebook! 🙂

For ANOVA, there is more attention placed on the distribution of the groups themselves rather than just the overall residuals. That’s a little different than in regression. The distribution of the groups is a factor both for parametric tests (t-tests and ANOVA) and nonparametric tests (e.g., Kruskal-Wallis). Fortunately, given relatively small sample sizes, the parametric analyses are robust to departures from normality thanks to the central limit theorem. That must be the basis behind the site saying not to worry about it due to the normal sampling distributions! I’d agree, but with the caveat that you need to meet minimum sample size requirements before the CLT can kick in! Meanwhile, the nonparametric tests have different assumptions that can be more stringent than those for the parametric version. For more information about all of that, read my post about parametric vs. nonparametric tests!

And, yes, I 100% agree that QQ plots (which I refer to as probability plots) are much better for assessing normality than histograms. I write about this in a post that compares histograms to probability plots for assessing normality, although not in the context of regression residuals.

Sarah F. says

Thank you for this helpful blog. I have a note from a reviewer asking for a citation to support the use of Welch’s anova over other non-parametric methods. Do you know of anything that would be a good citation for this? Thank you!

Jim Frost says

Hi Sarah,

Thank you so much for your kind words!

Unfortunately, I don’t have a published reference for you. However, in this post, I link to a white paper produced by a software company that compares Welch’s ANOVA to the traditional F-test ANOVA. I know the people who performed the analysis behind that paper, and it’s solid. However, I’m not sure whether that will satisfy your reviewer.

On another note, you mention nonparametric analyses. Both the traditional ANOVA and Welch’s ANOVA are parametric analyses. If you’re using Welch’s because your variances are unequal, you probably cannot use most nonparametric tests. While nonparametric tests don’t assume that the data follow a particular distribution, most assume that the groups within the analysis follow distributions that have the same shape and spread. This assumption isn’t too widely known. Textbooks that cover nonparametric analyses in-depth should provide a good reference for it. For more information about that aspect and more, see my post about parametric vs. nonparametric hypothesis tests.

Consequently, if your group variances are dramatically different, you should not use nonparametric analyses. Welch’s is your best bet.

Best of luck with your study/article!

Sreeja says

How do you read the coefficient table in regression analysis?

Jim Frost says

Hi, please read this blog post to understand how to interpret regression coefficients and the p-values.

Nuri says

Can I learn where the example comes from? Thank you!

Jim Frost says

Hi Nuri,

These data are from a university experiment that I worked on long ago. The goal of our study was to determine whether an exercise intervention could increase the bone mineral density (BMD) in middle-school-aged girls. The idea was that increasing BMD at a younger age might decrease the chances of osteoporosis later in life. The intervention was jumping off steps to produce a target impact of 6 body weights. These data come from our efforts to determine how high the steps should be to achieve the targeted impact. I think we went with 18 inches, but I’m not sure.

The data for this example represent the impacts in newtons for steps of different heights. Unsurprisingly, higher steps produce higher impacts. The results aren’t particularly illuminating. However, because the variability also increases with the height, it’s a good dataset for illustrating when to use Welch’s ANOVA.

Dr. Sreeja Sukumar K. says

Thank you Jim

MaríaR says

Hi Jim, thank you for your answer. My case is the second one, so I’ll transform my data to obtain clearer results. Thank you again!

Jim Frost says

You’re welcome! Hopefully a data transformation does the trick for you!

MaríaR says

Hello, congratulations on the blog! I have a question. I am comparing cellular behavior on three different ceramic materials. I have 4 variables of cellular metric parameters; N=49. None of them are normally distributed, and 3 have heteroscedasticity (so I applied Welch’s ANOVA), but the 4th variable has homoscedasticity yet does not have the same distribution across the 3 groups, or in other words, I can’t apply Kruskal-Wallis. Which statistical analysis should I apply? How can I calculate effect size? Thank you very much.

Jim Frost says

Hi Maria,

If you have 49 observations per group, you’re in luck. For one-way ANOVA (including Welch’s) with four groups, when you have at least 15 per group, nonnormal data are not a problem. And, you’re using Welch’s, which handles the heteroscedasticity. Read my post about nonparametric vs. parametric analyses for more information about that.

However, if you’re talking about 49 observations for your entire study that are divided between the four groups, you’re a bit below that threshold. And, as you note, the distributions need to have the same shape to use nonparametric methods. In that case, you might try a data transformation to resolve the nonnormality and then continue with the Welch’s ANOVA.

If that doesn’t work, I’m afraid you’ll need some real dedicated statistical assistance to figure out how to proceed.

DIV says

Kim behr asked: “Can I safely say in the discussion that heteroscedasticity could cause increased Type I error, but this has been corrected because Welch’s ANOVA was used?”

Personally, I would not use the word “corrected” in connection with Welch’s ANOVA. I would say something more like: “Heteroscedasticity can cause increased Type I error with conventional ANOVA, but this has been AVOIDED/MINIMISED because Welch’s ANOVA was used.”

In my thinking a “correction” would be something like subtracting a known (or well-estimated) bias, or perhaps applying a (well-justified) variance transformation.

Chris V. says

Dear Jim,

So, I’ve read a variety of opinions on just what constitutes a situation where variances between data sets are unequal. Obviously, standard deviations or variances are never identical between groups, so clearly some level of difference is allowed between groups in which the variances are still considered the same. In your opinion, what is that cutoff? Thanks.

Jim Frost says

Hi Chris,

First, I’ll assume that when you state “variances between data sets are unequal” that you’re actually referring to the groups within your analysis. That is, they’re the groups within your dataset.

The answer to your question depends on whether your group sizes are equal or unequal. A good rule of thumb for equal-sized groups is that if the standard deviation of one group is twice that of another group, you should start to worry. With equal-sized groups, problems are small but starting to show up at this point.

With unequal-sized groups, you probably already have severe problems if one group has twice the SD of another group. You’d need a smaller cutoff than twice the SD with unequal groups. And, the problems can be exacerbated depending on whether a smaller group is the one with the unusual SD. Consequently, with unequal groups, it’s a bit more complicated than a single cutoff value.

If in doubt, just use Welch’s ANOVA, and you don’t have to worry about it!

raji says

Hi,

How do you calculate effect size of the difference when Welch’s ANOVA is used?

Jim Frost says

Hi Raji,

It’s the same as regular ANOVA. The effect size is simply the difference between group means. You can also use post-hoc tests, such as Games-Howell, to create confidence intervals of those differences (CIs of effect sizes).

Paola says

Hi Jim, I have a question. In the case of 2 independent groups (10 subjects in each one) in which I want to measure 1 variable at 3 different times, could Welch’s ANOVA be suitable (with fewer than 3 groups)?

Kevin says

Just a curiosity of mine, Jim – maybe you have an answer. In the standard ANOVA, the F ratio is basically a “signal to noise” ratio – a measure of variation between groups due to a treatment effect (signal) to random variation within groups (noise). If we define our weight as (group size/group variance) in Welch ANOVA, then I can see the numerator of Welch F as being more or less a weighted average of the total variation between groups, so that makes sense. I thought I read somewhere that the denominator of Welch F represented not so much a measure of variation within groups, but more of a correction factor of sorts, based on the “expected value” of the variance or something similar. Do you have any insight on that?

kim behr says

I used a Welch’s ANOVA since my three groups have unequal variances. Can I safely say in the discussion that heteroscedasticity could cause increased Type I error, but this has been corrected because Welch’s ANOVA was used?

Kevin says

Hi Kim,

I’m not intending for this reply to speak for Jim, this is just my take on it. In my view, what you said is pretty accurate. The Welch procedure adjusts both the F ratio and the denominator degrees of freedom to protect against Type I error, so in that sense it is considered to be a conservative test, where the alpha level may actually end up being a bit lower than what you specify. I would say that in most circumstances this would come with a loss of statistical power as a tradeoff, but apparently the Welch F procedure has a negligible difference in power compared to the standard ANOVA. About the only disadvantage I can think of with Welch’s ANOVA, besides the reduced denominator degrees of freedom, is that the test generally does not perform as well if the data are strongly skewed or otherwise non-normal. But in that situation, would comparing means necessarily make much sense anyway?

Jim Frost says

Hi Kevin,

I worked with the folks who did the simulation study. Their conclusion was that analysts should always use Welch’s ANOVA! Apparently there is only a very slight loss of power.

As for nonnormal data, if you have a large enough sample size per group, that’s not a problem in terms of it being a valid test. The group sizes are the same as those I describe in my post about parametric vs nonparametric hypothesis tests. Although, as you mention, if the data are sufficiently skewed, the mean might not represent the center of the groups adequately.

Thanks for your thoughts on this!

Jim

Jim Frost says

Hi Kim,

Basically, yes, you can state that. There is one minor change to it. In the simulation studies, they found that if you set alpha to 0.05, the actual Type I error rate could be higher or lower than that: 0.02 to 0.08 with equal-sized groups, or 0.02 to 0.22 with unequal-sized groups. So, the Type I error rate was not necessarily increased, because sometimes it was lower. A lower rate sounds good until you realize that it reduces the power of your test!

But, it is entirely accurate to say that using Welch’s ANOVA produces an actual Type I error rate that is very close to the target value (alpha) even when the variances are unequal.

I hope this helps!

Jim

Kevin says

I’ll give you a verbal explanation/breakdown, step by step, of what the equation represents. It will be a little wordy, but I want to be clear, so bear with me.

1) Determine a weight for each group by dividing each group’s sample size by its respective variance.

2) Multiply each group’s mean by its corresponding weight as determined in step 1, add these products together, and divide this total by the sum of all the group weights. The result is the weighted grand mean.

3) Take the difference between each group’s mean and the weighted grand mean, square this difference, and multiply this result by the corresponding group weight from step 1. Add all these results together, and divide by (number of groups – 1). The result is the numerator for Welch’s F.

4) For each group, calculate (group weight from step 1/total of all group weights). Subtract this fraction from 1, square the difference, and divide this square by (group size – 1). Add these results together from each group and call the final number A.

5) Multiply A by 2, then by the fraction (k – 2)/(k^2 – 1), where k is the number of groups compared. Finally, add 1 to this result. The ending number is the denominator of Welch’s F.

6) Do the division to calculate Welch’s F. As in the standard ANOVA, the numerator degrees of freedom remain at (# of groups minus 1). For Welch’s ANOVA, the denominator degrees of freedom are calculated as (k^2 – 1)/(3A), where k is the number of groups compared and A is defined above in step 4.
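Kevin’s six steps translate almost line for line into code. Here is a sketch in Python (NumPy for the arithmetic, SciPy only for the final p-value); the function name is my own.

```python
import numpy as np
from scipy.stats import f as f_dist

def welch_anova(*groups):
    """Welch's F-test following the six steps described above."""
    k = len(groups)
    n = np.array([len(g) for g in groups])
    m = np.array([np.mean(g) for g in groups])
    v = np.array([np.var(g, ddof=1) for g in groups])

    # Step 1: weight = group size / group variance.
    w = n / v
    # Step 2: weighted grand mean.
    grand = np.sum(w * m) / np.sum(w)
    # Step 3: numerator of Welch's F.
    num = np.sum(w * (m - grand) ** 2) / (k - 1)
    # Step 4: the quantity A.
    A = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    # Step 5: denominator of Welch's F.
    den = 1 + 2 * (k - 2) / (k**2 - 1) * A
    # Step 6: F statistic, degrees of freedom, and p-value.
    F = num / den
    df1 = k - 1
    df2 = (k**2 - 1) / (3 * A)
    p = f_dist.sf(F, df1, df2)
    return F, df1, df2, p

rng = np.random.default_rng(0)
groups = [rng.normal(mu, sd, 15) for mu, sd in [(5, 1), (6, 3), (9, 6)]]
F, df1, df2, p = welch_anova(*groups)
print(f"Welch's F({df1}, {df2:.1f}) = {F:.2f}, p = {p:.4f}")
```

A handy sanity check: with only two groups, the step-5 correction vanishes and Welch’s F reduces to the square of Welch’s t-statistic, with matching p-values.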

ruwini says

Hello sir,

Can you describe the equation of the Welch test in a simple manner, in comparison with the classic ANOVA equation?

Woojae Kim says

People often forget the theoretical reasoning behind the equal variance assumption in ANOVA and regression analysis. It is there because it makes comparison of population means sensible and parsimonious. Welch’s test certainly corrects the bias in testing results due to unequal variances, but it is ultimately at the researcher’s discretion whether comparing population means with unequal variances makes sense. If variances are very different, even if the test indicates significance (e.g., due to a large sample), simply stating that the means are significantly different will not tell the whole story. As always recommended, any statistical technique requires background, theoretical understanding to be used properly.

Jim Frost says

Hi Woojae,

Thank you for raising an excellent point. I agree 100% with everything you say. Understanding the subject-area, the data in all of its details, and the specifics of the analyses are crucial for performing meaningful statistical analysis. Analysts should tell the full story.

I think Welch’s ANOVA is an excellent analysis when analysts have data with unequal variances and they want to determine whether the differences between the means are statistically significant.

Michael Thornton says

Jim – I see that I wasn’t quite clear about resampling, so allow me to restate my comment: I cannot collect new samples from the field. I can, however, use computerized resampling methods (e.g., bootstrapping).

Michael Thornton says

Hi Jim,

Thanks for this very straight-forward explanation of Welch’s ANOVA. What are the assumptions for this test? I have 3 samples; one is negatively skewed while the others are positively skewed. The former is a large sample with large expected cell sizes; the two latter are small samples, with some expected cell sizes below 5. Resampling isn’t possible. Any suggestions? Thanks for your thoughts, and for the great site.

Jim Frost says

Hi Michael, I think the small cell sizes will be a problem. Skewness isn’t necessarily a problem for parametric analyses, like ANOVA, when you have a sufficiently large sample size. However, some of yours are below that threshold. I also think that bootstrapping wouldn’t be such a good approach, particularly given the small sample sizes. Unfortunately, I think the cells < 5 would also be a problem for nonparametric alternatives because that’s their minimum threshold. I wish I had a better answer for you, but I don’t know of any solutions offhand. I write about the sample size guidelines and normality for one-way ANOVA and other tests in my post about parametric vs. nonparametric analyses.