Welch’s ANOVA is an alternative to the traditional analysis of variance (ANOVA), and it offers some serious benefits. One-way analysis of variance determines whether differences between the means of at least three groups are statistically significant. For decades, introductory statistics classes have taught the classic Fisher’s one-way ANOVA that uses the F-test. It’s a standard statistical analysis, and you might think it’s pretty much set in stone by now. Surprise, there’s a significant change occurring in the world of one-way analysis of variance!

There is a new kid on the ANOVA block. Well, not a new kid, but an old kid who’s gaining in popularity.

Let me acquaint you with Welch’s ANOVA. You use it for the same reasons as the classic statistical test, to assess the means of three or more groups. However, Welch’s analysis of variance provides critical benefits and protections because you can use it even when your groups have unequal variances. In fact, you read it here first; Welch’s ANOVA might knock out the classic version.

In this post, I’ll explain the dangers of using the Classic analysis of variance with unequal variances, describe the benefits of using Welch’s ANOVA, and interpret a Welch’s ANOVA example with the Games-Howell post hoc test.

## One-Way ANOVA Assumptions

Welch’s ANOVA enters the discussion because it can help you get out of a tricky situation with an assumption. Like all statistical tests, one-way ANOVA has some assumptions. If you fail to satisfy the assumptions, you might not be able to trust the results. Simulation studies have been crucial in revealing which assumptions are strict requirements and which are more lenient.

The Classic one-way test assumes that all groups share a common standard deviation (or variance) even when their means are different. Unfortunately, simulation studies find that this assumption is a strict requirement. If your groups have unequal variances, your results can be incorrect if you use the classic test. On the other hand, Welch’s ANOVA isn’t sensitive to unequal variances.

Before I delve into the importance of this assumption, I’ll briefly describe how the simulation study tested it.

## Comparing Welch’s ANOVA to Fisher’s

For all hypothesis tests, you specify the significance level. Ideally, the significance level equals the probability of rejecting a null hypothesis that is true (Type I error). This error is basically a false positive because the test results (a small p-value) lead you to believe *incorrectly* that some of the group means are different. When tests produce valid results, the Type I error rate equals the significance level. For example, if your significance level is 0.05, then 5% of tests should have this error when the null is true.

The investigators who perform a simulation study know when the null hypothesis is true or false. They can use this knowledge to determine whether the proportion of tests with a Type I error matches the significance level, which is the target. The researchers can generate data that violate an assumption to determine whether it affects the results. The larger the difference between the significance level and the Type I error rate, the more critical it becomes to satisfy the assumption.

**Related post**: Types of Error in Hypothesis Testing

## Simulation Results for Unequal Variances

The simulation study assessed 50 different conditions related to unequal variances. For each condition, the computer drew 10,000 random samples and statistically analyzed them using both Welch’s ANOVA and the traditional one-way test.

For the Classic ANOVA, the simulation study found that unequal standard deviations cause the Type I error rate to shift away from the significance level target. If the group sizes are equal and the significance level is 0.05, the actual error rate falls between 0.02 and 0.08. However, if the groups have different sizes, the error rates can be as large as 0.22!
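To make the idea concrete, here is a minimal Python sketch of this kind of simulation. It is not the study’s code: the group sizes, standard deviations, number of simulated samples, and the helper name `classic_type1_rate` are all assumptions chosen for illustration, so the rates it prints will not match the study’s figures. Every group is drawn from a population with the same mean, so the null hypothesis is true and every rejection is a Type I error.

```python
import numpy as np
from scipy import stats


def classic_type1_rate(sizes, sds, n_sims=2000, alpha=0.05, seed=0):
    """Estimate the Type I error rate of the classic one-way F-test.

    All groups share the same population mean, so the null hypothesis
    is true and every rejection is a false positive.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        samples = [rng.normal(loc=100.0, scale=sd, size=n)
                   for n, sd in zip(sizes, sds)]
        _, p_value = stats.f_oneway(*samples)
        rejections += int(p_value < alpha)
    return rejections / n_sims


# Hypothetical conditions: same unequal group sizes, different spreads.
rate_equal_sd = classic_type1_rate(sizes=(30, 30, 6), sds=(10, 10, 10))
rate_unequal_sd = classic_type1_rate(sizes=(30, 30, 6), sds=(10, 10, 30))

print(f"Equal SDs:   Type I error rate = {rate_equal_sd:.3f}")
print(f"Unequal SDs: Type I error rate = {rate_unequal_sd:.3f}")
```

With equal standard deviations, the estimated rate should land near the 0.05 target. In the second condition, the smallest group has the largest standard deviation, and the classic test rejects a true null hypothesis far more often than it should.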

## Welch’s ANOVA to the Rescue

If you determine that your groups have standard deviations that are unequal, what should you do? Use Welch’s ANOVA! The same simulation study found that Welch’s analysis of variance is unaffected by unequal variances. In fact, Welch’s ANOVA explicitly does not assume that the variances are equal.
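For readers who want to see what the calculation involves, here is a minimal Python sketch of the standard Welch’s ANOVA formulas. The function name `welch_anova` and the three small datasets are made up for illustration, and statistical packages may report the results in a different layout. The key idea is that each group is weighted by its sample size divided by its own variance, so no common variance is ever estimated.

```python
import numpy as np
from scipy import stats


def welch_anova(*groups):
    """Welch's one-way ANOVA for two or more groups of raw observations.

    Returns (F, df1, df2, p_value). Each group is weighted by n / s^2,
    so no common variance is assumed.
    """
    k = len(groups)
    sizes = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])

    weights = sizes / variances
    total_weight = weights.sum()
    weighted_mean = (weights * means).sum() / total_weight

    # Between-group variability, using the precision weights.
    numerator = (weights * (means - weighted_mean) ** 2).sum() / (k - 1)

    # Correction term that also drives the denominator degrees of freedom.
    lam = (((1 - weights / total_weight) ** 2) / (sizes - 1)).sum()
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * lam

    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    p_value = stats.f.sf(f_stat, df1, df2)
    return f_stat, df1, df2, p_value


# Made-up data with clearly different spreads (illustration only).
low = [4.1, 4.3, 3.9, 4.0, 4.2, 4.1]
mid = [5.0, 5.6, 4.4, 5.9, 4.7, 5.2]
high = [6.5, 8.9, 5.1, 9.4, 7.2, 6.0]

f_stat, df1, df2, p_value = welch_anova(low, mid, high)
print(f"Welch F({df1}, {df2:.2f}) = {f_stat:.3f}, p = {p_value:.4f}")
```

A handy sanity check on a function like this: with only two groups, the F-statistic equals the square of Welch’s t-statistic, and the denominator degrees of freedom equal the Welch-Satterthwaite degrees of freedom.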

Let’s compare the simulation study results for the two types of analysis of variance when standard deviations are unequal, and the significance level is 0.05.

- Classic ANOVA error rates extend from 0.02 to 0.22.
- Welch’s ANOVA error rates have a much smaller range of 0.046 to 0.054.

In fact, it’s fine to use Welch’s ANOVA even when your groups *do* have equal variances because its statistical power is nearly equivalent to that of the Classic test. Welch’s analysis of variance is an excellent analysis that you can use *all* the time for one-way analysis of variance. It completely wipes away the need to worry about the assumption of homogeneous variances.

## Welch’s ANOVA Example

In this example, our data are the ground reaction forces that are generated by jumping from steps of different heights. You can download the CSV data file for the WelchsANOVAExample.

First, I’ll graph the data to give us a good sense of the situation. The chart below is an interval plot that displays the group means and 95% confidence intervals.

The ranges are based on the individual standard deviations for each group, and they look different. So, Welch’s analysis of variance is a good choice for these data.

Next, I’ll perform the hypothesis test. Depending on your statistical software, the Welch’s procedure might be a separate command, or you may need to tell the software to not assume equal variances. The Welch’s ANOVA output is below.

The output for Welch’s ANOVA is relatively similar to the Classic test, although you’ll notice that it does not contain the usual analysis of variance table. As with any hypothesis test, compare the p-value to your significance level to determine whether the differences between the means are statistically significant. For our example results, the very low p-value indicates that these results are statistically significant. Our sample provides sufficient evidence to conclude that not all group means are equal in the population.

## Using Post Hoc Tests with Welch’s ANOVA

While the overall results above indicate that not all group means are equal, we don’t know which differences between group means are statistically significant. To identify significant differences between specific groups, you need to perform a pairwise comparisons post hoc test. When you use Welch’s ANOVA, you can use the Games-Howell multiple comparisons method.

For more information about this process, read my post about Using Post Hoc Tests with ANOVA.

The Games-Howell post hoc test is most like Tukey’s method for Classic ANOVA. Both procedures do the following:

- Control the joint error rate for the entire series of comparisons.
- Compare all possible pairs of groups within a collection of groups.

The Games-Howell post hoc test, like Welch’s analysis of variance, does not require the groups to have equal standard deviations. Conversely, Tukey’s method does require equal standard deviations.

The Games-Howell post hoc test results are below:

None of the confidence intervals for the differences between group means contain zero. Consequently, these confidence intervals indicate that the differences between all pairs of groups are statistically significant.

I hope you’ll consider using Welch’s ANOVA anytime you need to perform a one-way test of the means!

Matt says

good to receive some feedback and know I am thinking along the correct lines (because of your blog). thanks again. matt

Matt says

Thanks for the input, Jim. I was thinking that comparisons at this scale (within reef) are not really worth a statistic because of the different distributions, which, as you said, puts me in a tough spot with KW.

So, I will just discuss the densities at within-reef scales and relate density to the habitat complexity. I will omit any statistic; it is only a short comm., so it needs to be brief.

For between-reef comparisons, I pooled the 20 transects (density data) and used Welch’s ANOVA because in almost all situations one of the 5 groups will fail normality and/or equal variances. Your blog helped me come to this conclusion for the pooled density site comparisons. Thanks again. Matt

Jim Frost says

That sounds like a good approach!

Matt says

Hi Jim, thanks for responding on FB. I have a fairly simple problem, I think anyway. I am studying some urchin densities within reef scales to assess habitat use. I measured and counted urchins along 20 band transects (10x2m) spatially distributed across the coral community in 4 different habitats, that is, 5 transects per habitat (2 nearshore and 2 back of the lagoon habitats; depth is not a factor).

So, that is 5 transects randomly stratified per habitat (4 per reef), and the normal distribution and/or equal variance tests fail in most groups, which I expected given the small amount of data for comparisons.

To compare urchin densities in habitats a, b, c, and d, with only 5 samples (i.e., densities from transects) in each of the 4 habitats across the reef, would it be best to run Welch’s ANOVA (Games-Howell post hoc) or the Kruskal-Wallis test (Dunn’s post hoc)? I am actually interested in the mean, and overall, I think it reflects the true urchin densities because some transects had 0 urchins and thus low means. Any suggestions would be greatly appreciated.

Thanks Matt

Jim Frost says

Hi Matt,

Welch’s ANOVA makes failing the equal variances assumption irrelevant. Unfortunately, the normality assumption is still a concern. If you have > 15 samples per group, the central limit theorem can take care of that for you. But you are pretty short of that sample size guideline.

Kruskal Wallis doesn’t assume any particular distribution, but the groups must have the same distribution. If they have different distributions, then the K-W results are invalid. So, that’s something to check.

If the groups have different distributions, then you’re in kind of a tough spot. You might be able to do some sort of bootstrap test. That’s not my forte but they can be more flexible. Click the link to read my introduction on that topic if you’re not familiar with it.

Hopefully, the different groups have similar shaped distributions. If so, you can use K-W. Unfortunately, the nonnormality combined with the small sample size rules out Welch’s (or the regular) ANOVA.

I hope that helps!

Ariel Balter says

In general, I wonder if switching to a linear model would be helpful. Would this be helpful?

https://journal.r-project.org/archive/2017/RJ-2017-049/index.html


Jim Frost says

Hi Ariel, thanks for sharing! Yes, I think that would be helpful! Although technically both Welch’s and the F-test are linear models. This package seems to offer the functionality Juri needs.

Chloe You says

Hi Jim, thank you for your amazing explanation!

I am curious about how the simulation study was done. Is there an official paper or some other kind of resource?

I am also wondering how the investigators who perform a simulation study know when the null hypothesis is true or false. I mean, the data were generated randomly, right? Then how could we know whether the null hypothesis is true or false? I thought the randomness will influence the truth of hypothesis. Maybe this thought was wrong? It is such a difficult problem for me that I can not figure out by my own.

Thank you so much for reading this!

Jim Frost says

Hi Chloe,

In this post, I have a link to the simulation study results white paper. Look for the section in it titled “Simulation Results for Unequal Variances.” That describes their simulation process. I noticed the link was broken and fixed it, in case you tried it before and it didn’t work.

Here’s how they describe their method:

For your question about the randomly generated data and null hypothesis, keep in mind that when generating the random data, they set the population parameters. For example, they might define one of their populations as having a mean of 100 and standard deviation of 15 and the other one having mean=100 and sdv=25. In this case the means are equal but the variances are unequal, which is the condition they’re testing.

Randomly generated data in this context means that they draw simulated samples from populations with those parameters. What they did was set the mean for both populations to be the same, so they know for sure that the null hypothesis is always true. However, type I errors will still occur, and they should occur at a rate equal to your significance level. If there’s a difference, you know the test isn’t performing correctly.
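A minimal Python sketch of that setup is below. It uses the two populations from the example (mean 100 with standard deviations of 15 and 25), but the sample sizes, the number of simulated samples, and the random seed are assumptions made for illustration, and this is not the code from the study. Because there are only two groups, the classic F-test is equivalent to the pooled-variance t-test and Welch’s ANOVA is equivalent to Welch’s t-test, so SciPy’s `ttest_ind` can stand in for both.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, alpha = 5000, 0.05

# Both populations have mean 100, so the null hypothesis is true by design.
# The sample sizes (40 and 10) are assumptions made for this illustration.
group_a = rng.normal(loc=100, scale=15, size=(n_sims, 40))
group_b = rng.normal(loc=100, scale=25, size=(n_sims, 10))

# One test per simulated sample (row); only the variance assumption differs.
classic_p = stats.ttest_ind(group_a, group_b, axis=1, equal_var=True).pvalue
welch_p = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False).pvalue

classic_rate = np.mean(classic_p < alpha)
welch_rate = np.mean(welch_p < alpha)
print(f"Classic Type I error rate: {classic_rate:.3f}")
print(f"Welch Type I error rate:   {welch_rate:.3f}")
```

Every rejection here is a false positive because the population means were set to be equal. The proportion of rejections for each test is its estimated Type I error rate, which you can compare to the 0.05 significance level.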

By the way, I used to work with the people who did this study, and I can vouch for their work!

Hafiz Saad says

Hi Jim,

Thanks for the explanation. I have a question regarding ANOVA.

Levene test for homogeneity of variances – is it a prerequisite test before deciding to proceed for classic Anova or Welch Anova test?

Thank you

Jim Frost says

Hi Hafiz,

It probably seems natural to use a homogeneity test to test that ANOVA assumption. However, it’s not usually necessary and can, in fact, lead you astray!

First, here are the potential problems with the test. Like any hypothesis test, sample size affects its power to detect unequal variances. If you have a large sample, the test will have more statistical power and it’ll flag small differences in variances as being statistically significant. In these cases, the degree of unequalness can be too small to be problematic. So, the test isn’t helpful in that case.

On the other side of the coin, if you have a small sample size, the test will have low power and not be able to detect unequal variances that are problematic.

So, testing isn’t always helpful.

Instead, I recommend looking at the group variances and if any are double that of another group, use Welch’s. The variances don’t have to be perfectly equal. Problems don’t start to appear until one group has twice the variance of another. And if you’re in doubt, just use Welch’s regardless. There’s really no penalty for using Welch’s ANOVA even when your variances are fairly equal. Then you don’t even have to worry about that problem. I’d almost go so far as to say Welch’s ANOVA should be the default, which is one of my points in this blog post. Welch’s ANOVA makes it so it’s a non-issue. Just one less thing to worry about! 🙂
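That rule of thumb is easy to check by hand or in a few lines of code. Here is a small Python sketch; the helper name `largest_variance_ratio` and the measurements are made up for illustration.

```python
import statistics


def largest_variance_ratio(*groups):
    """Ratio of the largest sample variance to the smallest across groups."""
    variances = [statistics.variance(g) for g in groups]
    return max(variances) / min(variances)


# Made-up measurements for three groups (illustration only).
group_a = [10.1, 9.8, 10.4, 10.0, 9.7]
group_b = [10.9, 9.2, 11.5, 8.8, 10.6]
group_c = [10.3, 10.0, 9.9, 10.2, 10.1]

ratio = largest_variance_ratio(group_a, group_b, group_c)
use_welch = ratio >= 2  # rule of thumb: one variance is double another
print(f"Largest variance ratio: {ratio:.2f}; use Welch's ANOVA: {use_welch}")
```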

Scotty Craig says

Thank you for this well articulated description of one-way ANOVA. I have been attempting to find better methods for more advanced ANOVA techniques (mixed ANOVA, factorial ANOVA) when variances are not equal. I have always been told to perform log transformations to normalize the variance. Do you know of anything similar to Welch for them?

Jim Frost says

Hi Scotty,

Log transformations can normalize the variance. However, that complicates the interpretation because then your analysis uses transformed data units. I’d highly recommend using Welch’s ANOVA where it’s not necessary to transform the data. It keeps everything in the natural data units.

RABIA NOUSHEEN says

Hi Jim

I greatly appreciate this nice and very useful post. I have a question: can we use Tukey instead of Games-Howell as a post hoc test for Welch’s ANOVA?

I am getting NaNs in the Games-Howell post hoc test and am looking for an alternative.

Thank you

Sagar says

Hi Jim,

If the data violates the assumptions of normality and homogeneity then what is the alternative test for anova?

Jim Frost says

Hi Sagar,

The answer to your question depends on your sample sizes for your groups and whether you’re violating one or both assumptions, as I discuss below.

If you’re only violating the normality assumption, ANOVA is robust to violations when the sample sizes in all groups are large enough. To see those samples sizes, and the alternatives to ANOVA, read my post about Nonparametric vs Parametric tests. There’s a table with the sample size requirements. And another with alternate tests.

For violations of homogeneity only, you can use Welch’s ANOVA, as I discuss in this post.

If you’re violating both assumptions and you have sufficiently large group sizes, you can still use Welch’s ANOVA.

If you’re violating both assumptions and your data do not satisfy the group size requirements, you cannot use the normal ANOVA, Welch’s ANOVA, or the nonparametric alternatives. I’m not sure what you can use. Perhaps a bootstrap method, but I’m not sure. Probably your best bet for this worst-case scenario would be to try transforming your data to achieve normality that way. That might fix your unequal variances too. Transforming your data should be the last resort.

Itzhak Yogev says

Thank you very much Jim! It is really awesome that you find the time to respond to all of us.

You got it correctly, she has three types of problems. I’m not sure I understand the meaning of not having a good estimate for one category, when the goal is studying if (or to what extent) the categorization itself has an explaining power. I guess I’m stuck here on something pretty basic. Can you help me with that also?

Thanks a lot

Itzhak

Itzhak Yogev says

Hi Jim,

I just found your site, and it is thrilling! Many years ago, I had the privilege to learn statistics on intuitive level, and since then I only find statistics explained either highly technically, or over simplified. Your teaching of intuitions without compromising depth and accuracy is impressive!

I’m trying to help my wife with her research project. She tries to study what makes multiplication problems harder (or easier) to adults. The difficulty is a continuous variable. She found a very cool categorization of the problems, which appears to explain a good deal of the variance in difficulty. I say “appears to” because she only has three categories, and the number of problems that fall into each is very different. So she can’t do regression with only three values of the IV, nor ANOVA with the different number of problems under each category. Can you suggest a proper test? She would also like to do something like two-way ANOVA (to see, for example, if the categorization she comes up with works also under pressure, so she will have with or without pressure as another IV).

Thank you so much!

Jim Frost says

Hi Itzhak,

Thanks so much for your kind words! They particularly mean a lot to me because I strive to have simple, intuitive explanations yet have them be 100% accurate and not gloss over the details. That can be hard to do. So, I really appreciate you taking the time to write that! 🙂

I’m not sure that I fully understand your wife’s research. I understand the continuous DV of difficulty. Does she have a categorical IV with three categories? Three types of problems?

If that’s the case, it’s ok to have differing numbers of observations in each category. It’s not the most efficient in terms of statistical power but it doesn’t necessarily invalidate your results. If you have a very small number in one category, you won’t get a good estimate for that category.

If what I describe is correct, she could use either ANOVA or regression. If she includes Type of Problem and Pressure as IVs, it would be two-way ANOVA or multiple regression.

I wish her (and you) the best of luck with the research!

Brittany K says

Did you ever figure out how to plot Games-Howell comparisons via R?

shawnjanzen says

@Sukanya, maybe this can be of help. I wasn’t able to find a plot option for the Games-Howell results, but I managed to tweak the TukeyHSD function code to work with Games-Howell results.

First, I had to install the userfriendlyscience package to access the posthocTGH function: https://rpubs.com/aaronsc32/games-howell-test

Then, I obtained the TukeyHSD code from: https://rpubs.com/brouwern/plotTukeyHSD2

I modified the TukeyHSD to extract the pair-names, means, and CI values from the Games-Howell results and it seems to make similar post-hoc plots.

Here is a link to my modified code: https://github.com/shawnjanzen/plot-for-games-howell-anova-post-hoc-test/blob/master/games-howell-post-hoc-plot-custom-function.R

Thanks @Jim for the great posts and the chance to engage on something I’ve never done before.

Jim Frost says

Hi Shawn, thanks so much for posting that information! 🙂

Sukanya Ravinder says

Please tell me how you plotted the Games-Howell post hoc test results. I have to do it in R. Thank you

Jim Frost says

Hi Sukanya, sorry, I don’t know how to do it in R. I used Minitab to graph it.

Kevin says

Chris,

http://Www.vassarstats.net has a calculator for the studentized range statistic q, and it only requires the total number of groups and degrees of freedom.

Juri Tori says

Hi thank you for the explanations about Welch ANOVA. Much, much clearer.

Is there a way to incorporate Welch in a mixed model that has within-group repeated measurements?

I have 3 groups with repeated measurements. One of the groups does not result from a randomized assignment and has an unequal variance (expected) compared to the two other groups that have been randomized and have equal variance.

Jim Frost says

Hi Juri,

I haven’t used it that way myself, but I believe the answer is yes. Theoretically, you should be able to perform most things using Welch’s ANOVA that you can with the F-test version. The trick is finding the software/code that performs it (and, unfortunately, I don’t know the answer to that part).

mili says

Dear Mr. Frost, I really like your “Introduction to statistics” book that helped me get interested in statistics 🙂 I have a simple question: is there a test that would be immune to linear modifications of my data? If I have a matrix with 3 columns of identical data and I run an analysis of variance test on it, sure, it fails to reject H_null. But if I multiply one column by a constant factor, or add a constant to it, the outcome could be that H_null is rejected. Is there a test that could still recognise these data as coming from the same distribution, just linearly modified?

Jim Frost says

Hi Mili,

Thanks so much for supporting my ebook. I really appreciate it! And, I’m so glad that it was helpful!

I don’t totally understand what you need. But, if you add a constant to all the values in a dataset, the standard deviation won’t change. You’ll see that it has the same value. If you multiply the values by a number, the standard deviation is also a multiple of the original standard deviation. I don’t think there’s a test that will check that, but you can see if the actual standard deviations are the same or a multiple of the original.
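Here is a quick Python check of those two facts, using made-up numbers and only the standard library:

```python
import math
import statistics

values = [2.0, 4.0, 4.0, 5.0, 7.0]   # made-up data
shifted = [v + 10 for v in values]   # add a constant to every value
scaled = [v * 3 for v in values]     # multiply every value by a constant

sd = statistics.stdev(values)
sd_shifted = statistics.stdev(shifted)
sd_scaled = statistics.stdev(scaled)

print(f"Original SD: {sd:.4f}")
print(f"Shifted SD:  {sd_shifted:.4f}  (same as original)")
print(f"Scaled SD:   {sd_scaled:.4f}  (3 times the original)")
```

Note that adding a constant leaves the standard deviation alone but does move the mean, which is why an analysis of variance can reject the null hypothesis after that change.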

Kevin says

Chris,

Here’s how you would do the Games-Howell test. You’d need access to a table for critical values of the “q” statistic, also called the studentized range statistic. First, since we are assuming unequal variances, we have to calculate our degrees of freedom using the ugly Satterthwaite formula you may be familiar with. For each pairwise comparison of groups A and B, take (variance A/group A size + variance B/group B size)^2 as the numerator, and (variance A/group size A)^2/(group size A-1) + (variance B/group size B)^2/(group size B-1) as the denominator. In practice, this would be rounded to the nearest whole number to give the desired degrees of freedom.

Next, look in the q-table for the desired degrees of freedom, significance level, and TOTAL number of groups being compared (not just the 2 being pairwise tested) to get the critical q-value. Now, for groups A and B being compared, take (group A variance/group A size) + (group B variance/group B size), divide this by 2, and take the square root. Multiply the answer by the critical q-value found to give the minimum significant difference for the two groups; that is, if the mean difference between groups A and B meets or exceeds this value, you can say the two groups differ at the chosen alpha level.

SPSS can calculate Welch and Games-Howell as standard options, so it may be worth a shot for you, as the calculations can be very tedious even on a spreadsheet. Hope this helps!
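The steps above can be sketched in Python for anyone without SPSS. This is a minimal illustration rather than a vetted implementation: the function name `games_howell` and the summary statistics are hypothetical, and it assumes SciPy 1.7 or later, which provides the `studentized_range` distribution in place of a printed q-table. Because the critical value comes from software rather than a table, the sketch keeps the fractional degrees of freedom instead of rounding them.

```python
import math
from itertools import combinations
from scipy import stats


def games_howell(means, variances, sizes, alpha=0.05):
    """Games-Howell pairwise comparisons from group summary statistics.

    Returns one dict per pair of groups with the mean difference, the
    Welch-Satterthwaite degrees of freedom, the minimum significant
    difference, and whether the pair differs at `alpha`.
    """
    k = len(means)  # total number of groups, not just the pair tested
    results = []
    for a, b in combinations(range(k), 2):
        va = variances[a] / sizes[a]
        vb = variances[b] / sizes[b]

        # Welch-Satterthwaite degrees of freedom for this pair.
        df = (va + vb) ** 2 / (
            va ** 2 / (sizes[a] - 1) + vb ** 2 / (sizes[b] - 1)
        )

        # Critical value of the studentized range statistic q.
        q_crit = float(stats.studentized_range.ppf(1 - alpha, k, df))

        min_sig_diff = q_crit * math.sqrt((va + vb) / 2)
        diff = means[a] - means[b]
        results.append({
            "pair": (a, b),
            "diff": diff,
            "df": df,
            "min_sig_diff": min_sig_diff,
            "significant": abs(diff) >= min_sig_diff,
        })
    return results


# Hypothetical summary statistics for three groups (illustration only).
comparisons = games_howell(
    means=[10.0, 12.5, 18.0],
    variances=[4.0, 9.0, 25.0],
    sizes=[12, 15, 10],
)
for row in comparisons:
    print(row)
```

A pair of groups differs at the chosen alpha level when the absolute mean difference meets or exceeds `min_sig_diff`. With only two groups in total, the procedure reduces to Welch’s t-test, which makes a convenient check on the arithmetic.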

Chris says

Hi Jim and Kevin,

Thank you both! I was able to run the Welch’s ANOVA. I really appreciate it. I was wondering if you can give me a step by step for Games Howell post hoc. That’s the last piece I need and I’ll be good to go. Thank you both for your help!

Chris

Kevin says

Hi Chris,

If you have the means, variances, and sample sizes for each group, then that’s all you will need. I posted a reply about a year or so ago that broke down the process for calculating the Welch F step by step – you should see it if you scroll down. Let me know if that helps.

Chris says

Hi Jim, Thank you so much for this incredible information, you have no idea how much it’s helping me. I need to run a Welch’s ANOVA. How do I do that with means and standard deviations? I don’t have all the raw data, just the summary data. I would love to be pointed to a program or website.

Thank you so much!

Chris

Jim Frost says

Hi Chris,

As Kevin mentioned, he posted a method for calculating Welch F. Here’s a link to Kevin’s comment to make it easier to find!

Kevin, thanks so much posting the step by step procedure earlier! 🙂

Ariel Balter says

Hi Sarah. I just posted a comment with a link to a master’s thesis that did similar simulations. Not quite peer-reviewed, but might be helpful.

Ariel Balter says

Great post as always. Here is a master’s thesis that did similar simulations to the ones you mentioned, and arrived at similar conclusions.

https://scholarscompass.vcu.edu/cgi/viewcontent.cgi?article=5026&context=etd

Personally, given that my nice but not extraordinary laptop could probably perform all of those simulations overnight, why don’t we have volumes of these sorts of simulation-based assessments of statistical tests, their weaknesses, strengths, limitations, etc.? It seems like a no-brainer and would provide clear answers to all of those myriad “this or that” posts on cross-validated, Quora, Research Gate, etc.

Jim Frost says

Hi Ariel,

Thanks for sharing the link to that excellent thesis! I think there are more and more simulation studies being conducted all the time. And, they’ve yielded very interesting results. For example, the normality assumption used to be king but now it’s widely recognized that many parametric tests are robust to departures from normality. I cover that as one of the points in my post about parametric vs. nonparametric tests. I also found a neat study that answers a longstanding question about whether you should use a 2-sample t-test or a Mann-Whitney test to analyze Likert scale data. Turns out that it doesn’t matter in most cases!

I think this type of information will continue to grow and spread. I love that type of simulation study!

Shawn J says

Hi Jim,

Your blog is one of my favorites and always a great read! Just picked up my copy of your new book. Any chance of it in hardcopy, perhaps a print-on-demand option?

This post and the replies of your commentators got me questioning assumptions from when I learned about ANOVA & t-tests.

Regarding the assumption of normality–I was originally taught (way back when) to examine the dependent variable (DV) for a normal distribution, which could be eye-ball inspected with a histogram (not ideal) or a QQ plot (better). We could also use computational checks like the Kolmogorov-Smirnov or the Shapiro-Wilk Test. Yet, many things around us are not normally distributed, which troubles that assumption check.

As I explored more stats resources, many authors say we should examine the normality of the residuals, requiring we model the ANOVA first then use one of several ways to examine the residuals. While others extend the normality of the DV principle to ensure we check for a normal distribution for each group level–so QQ plots or KS & SW tests for each group. Yet another site went so far as to pretty much say don’t worry about it since the sampling distribution is normal.

Where do you stand on this and how do you suggest we check this assumption?

Thanks,

Shawn

Some source links:

https://stats.stackexchange.com/questions/6350/anova-assumption-normality-normal-distribution-of-residuals

http://www2.psychology.uiowa.edu/faculty/mordkoff/GradStats/part%201/I.07%20normal.pdf

https://stats.stackexchange.com/questions/12262/what-if-residuals-are-normally-distributed-but-y-is-not

https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide-2.php

Jim Frost says

Hi Shawn,

I’m so happy that you’ve found my blog to be helpful. And thanks for buying my regression ebook. I appreciate that greatly!

In regression analysis, the assumption of normality actually applies to the distribution of the residuals. And, satisfying that assumption is technically optional. You only need to satisfy that assumption when you want to use hypothesis testing (e.g., coefficient p-values and confidence intervals). Which, ok, is most all the time! The idea is that when the residuals are normally distributed, the sampling distributions for the coefficients are also normally distributed. Hence, it’s ok to trust the p-values. If the residuals are not normally distributed, you can’t assume that those sampling distributions are either. They’re connected in that fashion, and it explains why we check the distribution of the residuals. However, I’d expect the central limit theorem would come into play with a large enough sample size, but I don’t have good numbers/information about that on hand.

While the regression assumption applies to the distribution of the residuals rather than any of the variables (DV or the IVs), it is harder, but not impossible, to obtain normal residuals when the DV is severely non-normal. Sometimes with a severely non-normal DV, you’ll need to transform it. So, that’s how the distribution of the DV often gets thrown into this discussion. But, it’s the distribution of the residuals that really matters. If you have a nonnormal DV but the residuals are normally distributed, you’re good! You’ll see more about this in my ebook! 🙂

For ANOVA, there is more attention placed on the distribution of the groups themselves rather than just the overall residuals. That’s a little different than in regression. The distribution of the groups is a factor both for parametric tests (t-tests and ANOVA) and nonparametric tests (e.g., Kruskal Wallis). Fortunately, even with relatively small sample sizes, the parametric analyses are robust to departures from normality thanks to the central limit theorem. That must be the basis behind the site saying not to worry about it due to the normal sampling distributions! I’d agree but with the caveat that you need to meet minimum sample size requirements before the CLT can kick in! Meanwhile, the nonparametric tests have different assumptions that can be more stringent than those for the parametric version. For more information about all of that, read my post about parametric vs. nonparametric tests!

And, yes, I 100% agree that QQ plots (which I refer to as probability plots) are much better for assessing normality than histograms. I write about this in a post that compares histograms to probability plots for assessing normality, although not in the context of regression residuals.

Sarah F. says

Thank you for this helpful blog. I have a note from a reviewer asking for a citation to support the use of Welch’s anova over other non-parametric methods. Do you know of anything that would be a good citation for this? Thank you!

Jim Frost says

Hi Sarah,

Thank you so much for your kind words!

Unfortunately, I don’t have a published reference for you. However, in this post, I link to a white paper produced by a software company that compares Welch’s ANOVA to the traditional F-test ANOVA. I know the people who performed the analysis behind that paper and it’s solid. However, I’m not sure if that will satisfy your reviewer.

On another note, you mention nonparametric analyses. Both the traditional ANOVA and Welch’s ANOVA are parametric analyses. If you’re using Welch’s because your variances are unequal, you probably cannot use most nonparametric tests. While nonparametric tests don’t assume that the data follow a particular distribution, most assume that the groups within the analysis follow distributions that have the same shape and spread. This assumption isn’t too widely known. Textbooks that cover nonparametric analyses in-depth should provide a good reference for it. For more information about that aspect and more, see my post about parametric vs. nonparametric hypothesis tests.

Consequently, if your group variances are dramatically different, you should not use nonparametric analyses. Welch’s is your best bet.

Best of luck with your study/article!

Sreeja says

How to read coefficient table in Regression analysis

Jim Frost says

Hi, please read this blog post to understand how to interpret regression coefficients and the p-values.

Nuri says

Can I learn where the example comes from? Thank you.

Jim Frost says

Hi Nuri,

These data are from a university experiment that I worked on long ago. The goal of our study was to determine whether an exercise intervention could increase the bone mineral density (BMD) in middle school aged girls. The idea was that increasing BMD at a younger age might decrease the chances of osteoporosis later in life. The intervention was jumping off steps to produce a target impact of 6 body weights. These data come from our efforts to determine how high the steps should be to achieve the targeted impact. I think we went with 18 inches, but I’m not sure.

The data for this example represents the impacts in Newtons by different height steps. Unsurprisingly, higher steps produce higher impacts. The results aren’t particularly illuminating. However, because the variability also increases with the height, it’s a good dataset for illustrating when to use Welch’s ANOVA.

Dr. Sreeja Sukumar K. says

Thank you Jim

MaríaR says

Hi Jim, Thank you for your answer. My case is the second one, so I’ll convert my data to obtain clearer results. Thank you again!

Jim Frost says

You’re welcome! Hopefully a data transformation does the trick for you!

MaríaR says

Hello, congratulations on the blog! I have a question. I am comparing the cellular behavior over three different ceramic materials. I have 4 variables of cellular metric parameters; N=49. None of them are parametric, and 3 have heteroscedasticity (so I applied Welch’s ANOVA), but the 4th variable has homoscedasticity but does not have the same distribution over the 3 groups. In other words, I can’t apply Kruskal-Wallis. Which statistical analysis should I apply? How can I calculate effect size? Thank you very much.

Jim Frost says

Hi Maria,

If you have 49 observations per group, you’re in luck. For one-way ANOVA (including Welch’s) with four groups, when you have at least 15 per group, nonnormal data are not a problem. And, you’re using Welch’s, which handles the heteroscedasticity. Read my post about nonparametric vs. parametric analyses for more information about that.

However, if you’re talking about 49 observations for your entire study that are divided between the four groups, you’re a bit below that threshold. And, as you note, the distributions need to have the same shape to use nonparametric methods. In that case, you might try a data transformation to resolve the nonnormality and then continue with the Welch’s ANOVA.

If that doesn’t work, I’m afraid you’ll need some real dedicated statistical assistance to figure out how to proceed.

DIV says

Kim behr asked: “Can I safely say in the discussion that heteroscedasticity could cause increased Type I error, but this has been corrected because Welch’s ANOVA was used?”

Personally, I would not use the word “corrected” in connection with Welch’s ANOVA. I would say something more like: “Heteroscedasticity can cause increased Type I error with conventional ANOVA, but this has been AVOIDED/MINIMISED because Welch’s ANOVA was used.”

In my thinking a “correction” would be something like subtracting a known (or well-estimated) bias, or perhaps applying a (well-justified) variance transformation.

Chris V. says

Dear Jim,

So, I’ve read a variety of opinions on just what constitutes a situation where variances between data sets are unequal. Obviously, standard deviations or variances are never identical between groups, so clearly some level of difference is allowed between groups in which the variances are still considered the same. In your opinion, what is that cut off? Thanks.

Jim Frost says

Hi Chris,

First, I’ll assume that when you state “variances between data sets are unequal” that you’re actually referring to the groups within your analysis. That is, they’re the groups within your dataset.

The answer to your question depends on whether your group sizes are equal or unequal. A good rule of thumb for equal sized groups is that if the standard deviation of one group is twice that of another group, you should start to worry. With equal sized groups, problems are small but starting to show up at this point.

With unequal sized groups, you probably already have severe problems if you have one group with twice the SD as another group. You’d need a smaller cutoff than twice the SD with unequal groups. And, the problems can be exacerbated depending on whether a smaller group is one that has an unusual SD. Consequently, with unequal groups, it’s a bit more complicated than just having a single cutoff value.
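As a quick way to check that rule of thumb on your own data, here is a minimal Python sketch. The function name `sd_ratio` and the numbers in the example are made up for illustration; the "twice the SD" cutoff is only the rough guideline for equal sized groups described above, not a formal test.

```python
import numpy as np


def sd_ratio(*groups):
    """Return (largest SD / smallest SD, whether all groups have equal size)."""
    sds = [np.std(g, ddof=1) for g in groups]  # sample standard deviations
    sizes = {len(g) for g in groups}
    return max(sds) / min(sds), len(sizes) == 1


# Hypothetical data: three equal sized groups with increasing spread.
low = [10.1, 11.4, 9.8, 10.9, 10.3]
mid = [12.0, 14.5, 10.2, 13.8, 11.1]
high = [15.2, 21.0, 9.7, 18.4, 12.6]

ratio, equal_sizes = sd_ratio(low, mid, high)
print(f"Largest SD is {ratio:.2f} times the smallest; equal sized groups: {equal_sizes}")
```

With equal sized groups, a ratio near or above 2 is where I'd start to worry. With unequal sized groups, treat even smaller ratios with caution.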

If in doubt, just use Welch’s ANOVA, and you don’t have to worry about it!

raji says

Hi,

How do you calculate effect size of the difference when Welch’s ANOVA is used?

Jim Frost says

Hi Raji,

It’s the same as regular ANOVA. The effect size is simply the difference between group means. You can also use post-hoc tests, such as Games-Howell, to create confidence intervals of those differences (CIs of effect sizes).
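For anyone who wants to see the arithmetic, here is a rough Python sketch of Games-Howell style confidence intervals for the pairwise mean differences. It assumes SciPy 1.7 or later (for `scipy.stats.studentized_range`). The function name `games_howell_intervals` and the step-height data are hypothetical, and statistical software may differ in details, so treat it as an illustration rather than a reference implementation.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def games_howell_intervals(groups, alpha=0.05):
    """groups: dict of label -> observations.

    Returns a list of (label_a, label_b, mean difference, lower, upper).
    """
    k = len(groups)
    results = []
    for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
        a = np.asarray(a, dtype=float)
        b = np.asarray(b, dtype=float)
        va = a.var(ddof=1) / a.size  # squared standard error of mean a
        vb = b.var(ddof=1) / b.size  # squared standard error of mean b
        diff = a.mean() - b.mean()   # the effect size: difference in means
        se = np.sqrt(va + vb)
        # Welch-Satterthwaite degrees of freedom for this pair
        df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
        # Critical value from the studentized range distribution
        q_crit = stats.studentized_range.ppf(1 - alpha, k, df)
        margin = q_crit / np.sqrt(2) * se
        results.append((name_a, name_b, diff, diff - margin, diff + margin))
    return results


# Hypothetical data, made up for illustration only.
data = {
    "low": [10.1, 11.4, 9.8, 10.9, 10.3, 10.6],
    "mid": [12.0, 14.5, 10.2, 13.8, 11.1, 12.9],
    "high": [15.2, 21.0, 9.7, 18.4, 12.6, 17.3],
}
for name_a, name_b, diff, lower, upper in games_howell_intervals(data):
    print(f"{name_a} - {name_b}: {diff:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

If an interval excludes zero, that pair of means differs significantly at the chosen family-wise level.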

Paola says

Hi Jim, I have a question. In the case of 2 independent groups (10 subjects in each one) in which I want to measure 1 variable at 3 different times, could Welch’s ANOVA be suitable (fewer than 3 groups)?

Kevin says

Just a curiosity of mine, Jim – maybe you have an answer. In the standard ANOVA, the F ratio is basically a “signal to noise” ratio – a measure of variation between groups due to a treatment effect (signal) to random variation within groups (noise). If we define our weight as (group size/group variance) in Welch ANOVA, then I can see the numerator of Welch F as being more or less a weighted average of the total variation between groups, so that makes sense. I thought I read somewhere that the denominator of Welch F represented not so much a measure of variation within groups, but more of a correction factor of sorts, based on the “expected value” of the variance or something similar. Do you have any insight on that?

kim behr says

I used a Welch’s ANOVA since three groups have unequal variances. Can I safely say in the discussion that heteroscedasticity could cause increased Type I error, but this has been corrected because Welch’s ANOVA was used?

Kevin says

Hi Kim,

I’m not intending for this reply to speak for Jim, this is just my take on it. In my view, what you said is pretty accurate. The Welch procedure adjusts both the F ratio and the denominator degrees of freedom to protect against Type I error, so in that sense it is considered to be a conservative test, where the alpha level may actually end up being a bit lower than what you specify. I would say that in most circumstances this would come with a loss of statistical power as a tradeoff, but apparently the Welch F procedure has a negligible difference in power compared to the standard ANOVA. About the only disadvantage I can think of with Welch’s ANOVA, besides the reduced denominator degrees of freedom, is that the test generally does not perform as well if the data are strongly skewed or otherwise non-normal. But in that situation, would comparing means necessarily make much sense anyway?

Jim Frost says

Hi Kevin,

I worked with the folks who did the simulation study. Their conclusion was that analysts should always use Welch’s ANOVA! Apparently there is only a very slight loss of power.

As for nonnormal data, if you have a large enough sample size per group, that’s not a problem in terms of it being a valid test. The group sizes are the same as those I describe in my post about parametric vs nonparametric hypothesis tests. Although, as you mention, if the data are sufficiently skewed, the mean might not represent the center of the groups adequately.

Thanks for your thoughts on this!

Jim

Jim Frost says

Hi Kim,

Basically, yes, you can state that. There is one minor change to it. In the simulation studies, they found that if you set alpha to 0.05, the actual Type I error rate for the classic F-test with unequal variances could be higher or lower than that: 0.02 to 0.08 with equal sized groups, or 0.02 to 0.22 with unequal sized groups. So, the Type I error rate was not necessarily increased because sometimes it was lower. A lower rate sounds good until you realize that it reduces the power of your test!

But, it is entirely accurate to say that using Welch’s ANOVA produces an actual Type I error rate that is very close to the target value (alpha) even when the variances are unequal.
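If you’d like to see this kind of behavior yourself, here is a small Python simulation sketch. It is not the design of the simulation study mentioned above. The group sizes, standard deviations, number of simulated datasets, and seed are all assumptions I picked for illustration. Every group has the same population mean, so every rejection is a Type I error.

```python
import numpy as np
from scipy import stats


def welch_p_value(groups):
    """P-value for Welch's ANOVA on a list of 1-D samples."""
    k = len(groups)
    n = np.array([len(g) for g in groups])
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])
    w = n / variances                              # group weights
    grand_mean = np.sum(w * means) / np.sum(w)     # weighted grand mean
    a = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    numerator = np.sum(w * (means - grand_mean) ** 2) / (k - 1)
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * a
    df2 = (k ** 2 - 1) / (3 * a)
    return stats.f.sf(numerator / denominator, k - 1, df2)


def type_i_error_rates(sizes, sds, n_sims=2000, alpha=0.05, seed=1):
    """Share of null datasets rejected by the classic F-test and by Welch's."""
    rng = np.random.default_rng(seed)
    classic = 0
    welch = 0
    for _ in range(n_sims):
        # All groups share mean 0; only the spread and size differ.
        groups = [rng.normal(0.0, sd, size) for size, sd in zip(sizes, sds)]
        classic += stats.f_oneway(*groups).pvalue < alpha
        welch += welch_p_value(groups) < alpha
    return classic / n_sims, welch / n_sims


# Assumed scenario: the smallest group has the largest standard deviation.
classic_rate, welch_rate = type_i_error_rates(sizes=(10, 40, 40), sds=(4.0, 1.0, 1.0))
print(f"Classic F-test Type I error rate: {classic_rate:.3f}")
print(f"Welch's ANOVA Type I error rate:  {welch_rate:.3f}")
```

In a setup like this, where the smallest group is the most variable, I’d expect the classic rate to land well above 0.05 while Welch’s stays close to it. Try swapping the standard deviations around to see the classic test become too conservative instead.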

I hope this helps!

Jim

Kevin says

I’ll give you a verbal explanation/breakdown, step by step, of what the equation represents. It will be a little wordy, but I want to be clear, so bear with me.

1) Determine a weight for each group by dividing each group’s sample size by its respective variance.

2) Multiply each group’s mean by its corresponding weight as determined in step 1, add these products together, and divide this total by the sum of all the group weights. The result is the weighted grand mean.

3) Take the difference between each group’s mean and the weighted grand mean, square this difference, and multiply this result by the corresponding group weight from step 1. Add all these results together, and divide by (number of groups – 1). The result is the numerator for Welch’s F.

4) For each group, calculate (group weight from step 1/total of all group weights). Subtract this fraction from 1, square the difference, and divide this square by (group size – 1). Add these results together from each group and call the final number A.

5) Multiply A by 2, then by the fraction (k – 2)/(k^2 – 1), where k is the number of groups compared. Finally, add 1 to this result. The ending number is the denominator of Welch’s F.

6) Do the division to calculate Welch’s F. As in the standard ANOVA, the numerator degrees of freedom remain at (# of groups minus 1). For Welch’s ANOVA, the denominator degrees of freedom are calculated as (k^2 – 1)/(3A), where k is the number of groups compared and A is defined above in step 4.
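The six steps above can be sketched in Python as follows. This is a minimal illustration that assumes NumPy and SciPy are available. The function name `welch_anova` and the example numbers are made up, and the p-value line (using the F distribution with the degrees of freedom from step 6) is my addition to the verbal description.

```python
import numpy as np
from scipy import stats


def welch_anova(*groups):
    """Welch's one-way ANOVA.

    Returns (F, numerator df, denominator df, p-value).
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n = np.array([g.size for g in groups])
    means = np.array([g.mean() for g in groups])
    variances = np.array([g.var(ddof=1) for g in groups])

    # Step 1: weight = group size / group variance
    w = n / variances
    # Step 2: weighted average of the group means
    weighted_mean = np.sum(w * means) / np.sum(w)
    # Step 3: numerator of Welch's F
    numerator = np.sum(w * (means - weighted_mean) ** 2) / (k - 1)
    # Step 4: the quantity called A
    a = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    # Step 5: denominator of Welch's F
    denominator = 1 + 2 * (k - 2) / (k ** 2 - 1) * a
    # Step 6: F statistic and degrees of freedom
    f_stat = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * a)
    p_value = stats.f.sf(f_stat, df1, df2)
    return f_stat, df1, df2, p_value


# Hypothetical data: three groups whose spread grows with the mean.
low = [10.1, 11.4, 9.8, 10.9, 10.3, 10.6]
mid = [12.0, 14.5, 10.2, 13.8, 11.1, 12.9]
high = [15.2, 21.0, 9.7, 18.4, 12.6, 17.3]

f_stat, df1, df2, p_value = welch_anova(low, mid, high)
print(f"Welch's F = {f_stat:.3f}, df = ({df1}, {df2:.2f}), p = {p_value:.4f}")
```

One way to sanity check the sketch: with only two groups, the result should match Welch’s t-test, because F equals t squared and the denominator degrees of freedom reduce to the Welch-Satterthwaite value.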

ruwini says

Hello sir,

can you describe me the equation of the welch test in simple manner with the comparison of ANOVA equation

Woojae Kim says

People often forget the theoretical reasoning behind the equal variance assumption in ANOVA and regression analysis. It is there because it makes comparison of population means sensible and parsimonious. Welch’s test certainly corrects the bias in testing results due to unequal variances, but it is ultimately at the researcher’s discretion whether comparing population means with unequal variances makes sense to her/him. If variances are very different, even if the test indicates significance (e.g., due to a large sample), simply stating that the difference between the two means is statistically significant will not tell the whole story. As always, any statistical technique requires background and theoretical understanding to be used properly.

Jim Frost says

Hi Woojae,

Thank you for raising an excellent point. I agree 100% with everything you say. Understanding the subject-area, the data in all of its details, and the specifics of the analyses are crucial for performing meaningful statistical analysis. Analysts should tell the full story.

I think Welch’s ANOVA is an excellent analysis when analysts have data with unequal variances and they want to determine whether the differences between the means are statistically significant.

Michael Thornton says

Jim – I see that I wasn’t quite clear about resampling, so allow me to restate my comment: I cannot collect new samples from the field. I can, however, use computerized resampling methods (e.g., bootstrapping).

Michael Thornton says

Hi Jim,

Thanks for this very straightforward explanation of Welch’s ANOVA. What are the assumptions for this test? I have 3 samples; one is negatively skewed while the others are positively skewed. The former is a large sample with large expected cell sizes; the two latter are small samples, with some expected cell sizes below 5. Resampling isn’t possible. Any suggestions? Thanks for your thoughts, and for the great site.

Jim Frost says

Hi Michael, I think the small cell sizes will be a problem. Skewness isn’t necessarily a problem for parametric analyses, like ANOVA, when you have a sufficiently large sample size. However, some of yours are below that threshold. I also think that bootstrapping wouldn’t be such a good approach, particularly given the small sample sizes. Unfortunately, I think the cells < 5 would also be a problem for nonparametric alternatives because that's their minimum threshold. I wish I had a better answer for you but I don't know of any solutions offhand. I write about the sample size guidelines and normality for one-way ANOVA and other tests in my post about parametric vs nonparametric analyses.