Analysis of variance (ANOVA) uses F-tests to statistically assess the equality of means when you have three or more groups. In this post, I’ll answer several common questions about the F-test.

- How do F-tests work?
- Why do we analyze *variances* to test *means*?

I’ll use concepts and graphs to answer these questions about F-tests in the context of a one-way ANOVA example. I’ll use the same approach that I use to explain how t-tests work. If you need a primer on the basics, read my hypothesis testing overview.

## Introducing F-tests and F-statistics!

The term F-test is based on the fact that these tests use the F-statistic to test the hypotheses. An F-statistic is the ratio of two variances and it was named after Sir Ronald Fisher. Variances measure the dispersal of the data points around the mean. Higher variances occur when the individual data points tend to fall further from the mean.

It’s difficult to interpret variances directly because they are in squared units of the data. If you take the square root of the variance, you obtain the standard deviation, which is easier to interpret because it uses the data units. While variances are hard to interpret directly, some statistical tests use them in their equations.

An F-statistic is the ratio of two variances, or technically, two mean squares. Mean squares are simply variances that account for the degrees of freedom (DF) used to estimate the variance.

Think of it this way. A variance starts from the sum of the squared deviations from the mean. If you have a bigger sample, there are more squared deviations to add up, so the sum becomes larger and larger as you add in more observations. Dividing the sum by the DF produces a mean square, which accounts for the differing numbers of measurements behind each estimate of the variance. Otherwise, the variances are not comparable, and the ratio for the F-statistic is meaningless.
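To make the role of the DF concrete, here is a minimal sketch using made-up numbers (the samples `small` and `large` are hypothetical and not part of the example dataset). It shows the raw sum of squares growing with sample size while the mean square stays on a comparable scale.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the sample mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def mean_square(values):
    """Sum of squares divided by its degrees of freedom (n - 1)."""
    return sum_of_squares(values) / (len(values) - 1)


small = [8, 10, 12]   # hypothetical sample
large = small * 3     # same values repeated: same spread, three times the data

print(sum_of_squares(small), mean_square(small))  # 8.0 4.0
print(sum_of_squares(large), mean_square(large))  # 24.0 3.0
```

The sum of squares triples along with the sample size, but the mean squares remain close to each other, which is what makes them comparable in a ratio.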

Given that F-tests evaluate the ratio of two variances, you might think it’s only suitable for determining whether the variances are equal. Actually, it can do that and a lot more! F-tests are surprisingly flexible because you can include different variances in the ratio to test a wide variety of properties. F-tests can compare the fits of different models, test the overall significance in regression models, test specific terms in linear models, and determine whether a set of means are all equal.

**Related post**: Measures of Variability: Range, Interquartile Range, Variance, and Standard Deviation

## The F-test in One-Way ANOVA

We want to determine whether a set of means are all equal. To evaluate this with an F-test, we need to use the proper variances in the ratio. Here’s the F-statistic ratio for one-way ANOVA:

F = between-groups variance / within-groups variance

To see how F-tests work, I’ll go through a one-way ANOVA example. You can download the CSV data file: OneWayExample. The numeric results are below, and I’ll reference them as I illustrate how the test works. This one-way ANOVA assesses the means of four groups.

## F-test Numerator: Between-Groups Variance

The one-way ANOVA procedure calculates the average of each of the four groups: 11.203, 8.938, 10.683, and 8.838. The means of these groups spread out around the global mean (9.915) of all 40 data points. The further the groups are from the global mean, the larger the variance in the numerator becomes.

It’s easier to say that the group means are different when they are further apart. That’s pretty self-evident, right? In our F-test, this corresponds to having a higher variance in the numerator.

The dot plot illustrates how this works by comparing two sets of group means. This graph represents each group mean with a dot. The between-group variance increases as the dots spread out.

Looking back at the one-way ANOVA output, which statistic do we use for the between-group variance? The value we use is the adjusted mean square for Factor (Adj MS 14.540). The meaning of this number is not intuitive because it is the sum of the squared distances from the global mean divided by the factor DF. The relevant point is that this number increases as the group means spread further apart.
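As a rough check, the between-groups mean square can be rebuilt from the four group means quoted earlier. This is only a sketch: it assumes 10 observations in every group (40 observations across four groups), and the variable names are mine. Because the group means are rounded to three decimals, the result should land near 14.54 rather than match the software output exactly.

```python
group_means = [11.203, 8.938, 10.683, 8.838]  # group means from the example
n_per_group = 10                               # assumed: 40 observations / 4 groups

# With equal group sizes, the global mean is the plain average of the group means.
grand_mean = sum(group_means) / len(group_means)

# Each group contributes n * (group mean - global mean)^2 to the sum of squares.
ss_between = n_per_group * sum((m - grand_mean) ** 2 for m in group_means)
df_between = len(group_means) - 1
ms_between = ss_between / df_between

print(f"global mean = {grand_mean:.4f}")
print(f"between-groups mean square = {ms_between:.2f}")
```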

## F-test Denominator: Within-Groups Variance

Now we move on to the denominator of the F-test, which factors in the variances within each group. This variance measures the distance between each data point and its group mean. Again, it is the sum of the squared distances divided by the error DF.

This variance is small when the data points within each group are closer to their group mean. As the data points within each group spread out further from their group mean, the within-group variance increases.

The graph compares low within-group variability to high within-group variability. The distributions represent how tightly the data points within each group cluster around the group mean. The F-statistic denominator, or the within-group variance, is higher for the right panel because the data points tend to be further from the group average.

To conclude that the group means are not equal, you want low within-group variance. Why? The within-group variance represents the variance that the model does not explain. Statisticians refer to this as random error. As the error increases, it becomes more likely that the observed differences between group means are caused by the error rather than by actual differences at the population level. Obviously, you want low amounts of error!

Let’s refer to the ANOVA output again. The within-group variance appears in the output as the adjusted mean squares for error (Adj MS for Error): 4.402.

## The F-Statistic: Ratio of Between-Groups to Within-Groups Variances

F-statistics are the ratio of two variances that are approximately the same value when the null hypothesis is true, which yields F-statistics near 1.
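For readers who prefer notation, the ratio can be written out as follows. The symbols are my own shorthand rather than anything from the software output: $k$ is the number of groups, $n_j$ the size of group $j$, $N$ the total sample size, $\bar{x}_j$ a group mean, and $\bar{x}$ the global mean.

$$
F = \frac{MS_{\text{between}}}{MS_{\text{within}}}, \qquad
MS_{\text{between}} = \frac{\sum_{j=1}^{k} n_j \,(\bar{x}_j - \bar{x})^2}{k - 1}, \qquad
MS_{\text{within}} = \frac{\sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}{N - k}
$$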

We looked at the two different variances used in a one-way ANOVA F-test. Now, let’s put them together to see which combinations produce low and high F-statistics. In the graphs, look at how the spread of the group means compares to the spread of the data points within each group.

**Low F-value graph**: The group means cluster together more tightly than the within-group variability. The distance between the means is small relative to the random error within each group. You can’t conclude that these groups are truly different at the population level.

**High F-value graph**: The group means spread out more than the variability of the data within groups. In this case, it becomes more likely that the observed differences between group means reflect differences at the population level.
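The same contrast can be sketched numerically. The two datasets below are hypothetical and chosen so that both have group means of 9, 10, and 11; only the spread within each group differs. The function `one_way_f` is an illustrative helper, not the routine any statistical package uses.

```python
def one_way_f(groups):
    """Return (MS between, MS within, F) for a list of groups of observations."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Between groups: spread of the group means around the global mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within groups: spread of each observation around its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (len(all_values) - len(groups))
    return ms_between, ms_within, ms_between / ms_within


# Hypothetical data: identical group means, different within-group spread.
tight = [[8.5, 9, 9.5], [9.5, 10, 10.5], [10.5, 11, 11.5]]
wide = [[4, 9, 14], [5, 10, 15], [6, 11, 16]]

print(one_way_f(tight))  # small denominator, high F
print(one_way_f(wide))   # large denominator, low F
```

The numerator is 3 in both cases, but the within-group mean square rises from 0.25 to 25, so the F-value falls from 12 to 0.12.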

## How to Calculate our F-value

Going back to our example output, we can use our F-ratio numerator and denominator to calculate our F-value like this:

F = 14.540 / 4.402 ≈ 3.30

To be able to conclude that not all group means are equal, we need a large F-value to reject the null hypothesis. Is ours large enough?

A tricky thing about F-values is that they are a unitless statistic, which makes them hard to interpret. Our F-value of 3.30 indicates that the between-groups variance is 3.3 times the size of the within-group variance. We know that the ratio of variances doesn’t equal the null hypothesis value because this F-value doesn’t equal one. Is our F-value large enough to reject the null hypothesis?

We don’t know exactly how uncommon our F-value is if the null hypothesis is correct. To interpret individual F-values, we need to place them in a larger context. F-distributions provide this broader context and allow us to calculate probabilities.

## How F-tests Use F-distributions to Test Hypotheses

A single F-test produces a single F-value. However, imagine we perform the following process.

First, let’s assume that the null hypothesis is true for the population. At the population level, all four group means are equal. Now, we repeat our study many times by drawing many random samples from this population using the same one-way ANOVA design (four groups with 10 observations per group). Next, we perform one-way ANOVA on all of the samples and plot the distribution of the F-values. This distribution is known as a sampling distribution, which is a type of probability distribution.

**Related post**: Understanding Probability Distributions

If we follow this procedure, we produce a graph that displays the distribution of F-values for a population where the null hypothesis is true. We use sampling distributions to calculate probabilities for how unlikely our sample statistic is if the null hypothesis is true. F-tests use the F-distribution.
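The repeated-sampling thought experiment can be sketched as a small simulation. Several details here are assumptions made for illustration: the population is taken to be normal with mean 10 and standard deviation 2, the number of repeated studies is 10,000, and the seed is fixed so the run is repeatable. Only the design (four groups of 10) comes from the example.

```python
import random


def f_value(groups):
    """F-statistic for one-way ANOVA: MS between / MS within."""
    values = [x for g in groups for x in g]
    grand = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ms_between = sum(
        len(g) * (m - grand) ** 2 for g, m in zip(groups, means)
    ) / (len(groups) - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    ) / (len(values) - len(groups))
    return ms_between / ms_within


rng = random.Random(1)   # fixed seed (arbitrary choice)
n_studies = 10_000       # assumed number of repeated studies

# Null hypothesis is true: every group is drawn from the same population.
f_values = [
    f_value([[rng.gauss(10, 2) for _ in range(10)] for _ in range(4)])
    for _ in range(n_studies)
]

share = sum(f >= 3.3 for f in f_values) / n_studies
print(f"Share of simulated F-values at or above 3.3: {share:.3f}")
```

A histogram of `f_values` approximates the sampling distribution described above, and the printed share is a simulation-based estimate of how often an F-value of 3.3 or more turns up when the null hypothesis is true.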

Fortunately, we don’t need to go to the trouble of collecting numerous random samples to create this graph! Statisticians understand the properties of F-distributions so we can estimate the sampling distribution using the F-distribution and the details of our one-way ANOVA design.

Our goal is to evaluate whether our sample F-value is so rare that it justifies rejecting the null hypothesis for the entire population. We’ll calculate the probability of obtaining an F-value that is at least as high as the value that our study obtained (3.3).

This probability has a name—the P value! A low probability indicates that our sample data are unlikely when the null hypothesis is true.

## Graphing the F-test for Our One-Way ANOVA Example

For one-way ANOVA, the degrees of freedom in the numerator and the denominator define the F-distribution for a design. There is a different F-distribution for each study design. I’ll create a probability distribution plot based on the DF indicated in the statistical output example. Our study has 3 DF in the numerator and 36 in the denominator.

**Related post**: Degrees of Freedom in Statistics

The distribution curve displays the likelihood of F-values for a population where the four group means are equal at the population level. I shaded the region that corresponds to F-values that are greater than or equal to our study’s F-value (3.3). When the null hypothesis is true, F-values fall in this area approximately 3.1% of the time. Using a significance level of 0.05, our sample data are unusual enough to warrant rejecting the null hypothesis. The sample evidence suggests that not all of the group means are equal.
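If you want to reproduce the shaded area without statistical software, one option is to integrate the F density numerically. The sketch below uses Simpson's rule and only the standard library. The function names, the upper integration limit of 200, and the step count are my own choices; cutting off at 200 relies on the density being negligible that far out for this design.

```python
import math


def f_pdf(x, df_num, df_den):
    """Density of the F-distribution at x > 0, computed on the log scale."""
    log_beta = (math.lgamma(df_num / 2) + math.lgamma(df_den / 2)
                - math.lgamma((df_num + df_den) / 2))
    log_pdf = (0.5 * df_num * math.log(df_num)
               + 0.5 * df_den * math.log(df_den)
               + (0.5 * df_num - 1) * math.log(x)
               - 0.5 * (df_num + df_den) * math.log(df_num * x + df_den)
               - log_beta)
    return math.exp(log_pdf)


def upper_tail_area(f_obs, df_num, df_den, upper=200.0, steps=20_000):
    """Simpson's rule estimate of the area under the density from f_obs to upper."""
    h = (upper - f_obs) / steps  # steps must be even for Simpson's rule
    total = f_pdf(f_obs, df_num, df_den) + f_pdf(upper, df_num, df_den)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * f_pdf(f_obs + i * h, df_num, df_den)
    return total * h / 3


p_value = upper_tail_area(3.3, df_num=3, df_den=36)
print(f"P(F >= 3.3) is about {p_value:.4f}")
```

The result should agree with the roughly 3.1% shaded in the plot. In practice, a statistics library's F-distribution functions are the more convenient route.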

Learn how to interpret P values correctly and avoid a common mistake.

## Why We Analyze Variances to Test Means

Let’s return to the question about why we analyze variances to determine whether the group means are different. Focus on the “means are different” aspect. This part explicitly involves the variation of the group means. If there is no variation in the means, they can’t be different, right? Similarly, the larger the differences between the means, the more variation must be present.

ANOVA and F-tests assess the amount of variability between the group means in the context of the variation within groups to determine whether the mean differences are statistically significant. While statistically significant ANOVA results indicate that not all means are equal, they don’t identify which particular differences between pairs of means are significant. To make that determination, you’ll need to use post hoc tests to supplement the ANOVA results.

If you’d like to learn about t-tests using the same general approach, read:

- How t-Tests Work: 1-Sample, 2-Sample, and Paired t-Tests
- How t-Tests Work: t-Values, t-Distributions, and Probabilities

To see an alternative to traditional hypothesis testing that does not use probability distributions and test statistics, learn about bootstrapping in statistics!

Gwen B says

Hi Jim!

How many F statistics do we examine and compare to critical values in a two-way ANOVA? Two-way seems more complicated than one-way ANOVA

Jim Frost says

Hi Gwen,

You’d have up to three F-tests. You’d definitely have two because it’s two-way ANOVA, one for each categorical variable [A B]. And, if you fit an interaction effect, you’d have a third F-test [A*B].

It does get more complicated, but the ideas are similar.

For the F-test for variable A, the F-ratio is: MS between groups for A/MS within groups.

For variable B, the F-ratio is: MS between groups for B/MS within groups.

For the interaction term: MS (A*B)/MS within groups.

For each test, you’d figure out the degrees of freedom for the numerator and the denominator using the same principles as in one-way ANOVA.

It’s really the same process, but obviously has to cover the additional terms in the model.

I hope this helps. Thanks for writing!

Dr. Sreeja Sukumar K. says

Sir,

How do we evaluate ranks? For example if we want to examine the brands of mobile phones which is most in demand and the attributes people give importance to like brand name,quality,price, etc…Where the attributes are ranked as 1,2,3,4,by the respondents.

Jim Frost says

Hi,

To follow the assumptions of the analyses exactly, you should use a nonparametric test. Many nonparametric tests can accept rank/ordinal data. I talk about this as one of the factors for deciding between parametric and nonparametric analyses.

However, a simulation study that assessed using t-tests vs Mann-Whitney tests for 5-point Likert scale data suggests that it’s OK to use either test for that type of ordinal data. If your ordinal data are different and/or you’re not using a t-test, these results might not apply to your study. If in doubt, use the appropriate nonparametric test.

Best of luck with your analysis!

MARISETTY GOPI Kishore says

Hi Jim,

Thanks for your clear posts.

Could you please throw some light on ANCOVA. If you already explained it then I am sorry please guide me to that post.

Thanks

Gopi

Jim Frost says

Hi Gopi,

Unfortunately, I haven’t written a post on that topic, but that’s a great one for the future!

ANCOVA is basically ANOVA but adds in at least one covariate. The acronym is short for analysis of covariance. A covariate is a continuous predictor. An ANCOVA model will have at least one categorical factor and at least one continuous variable to model changes in the response variable. Use ANCOVA when you want to determine whether the response variable means differ across the categorical levels while controlling for a continuous variable. In an experiment, a covariate is often a nuisance variable that you can’t control but want to account for.

For example, suppose you measure the quality of a product in an experiment that compares three manufacturing techniques. The response is a measure of quality. Manufacturing process is the categorical factor that has three levels. However, you know that humidity in the production area affects the quality of output but you can’t control it. Consequently, you include humidity in the model as a covariate.

ANCOVA can give your experiment more statistical power by reducing the amount of error (unexplained variance) within the groups. In the context of this post, imagine that you’re reducing the value in the denominator of the F ratio, which produces a larger F-value.

I hope this helps!

Sue A. says

Thanks Jim! I saw some studies using GLM. I used to use Tukey’s in graduate school 🙂 It’s been a long while. Glad you mentioned it. Thank you so much!

Sue A. says

Hi Jim, I work with healthcare patient data. I’m looking at an Outcome M by variables X and Y. Outcome M is a ratio = 1/p, 0<p<=1. Which test would you recommend to use? Is CI (Confidence Interval) better than ANOVA, etc.? Thank you.

Jim Frost says

Hi Sue,

Unfortunately, I don’t have experience fitting a model to ratio data. I did some quick looking around and it appears that you might need to use generalized linear model, but I’m afraid I don’t know more than that off-hand.

As for CIs vs ANOVA, that’s not a decision you need to make! ANOVA provides p-values and differences between means. Then, you can use post-hoc analysis (Tukey’s, etc), which are commonly used in conjunction with ANOVA to see how the groups compare, to obtain CIs for the differences between the means. Typically, I find CIs provide much more information than a simple point estimate of the effect.

Virendra says

Sir, Thank you very much for such beautiful clarification.

Virendra says

Sir, Thank you for responding me nicely.

I have 5 parameters, each with five levels. As i work on Response Surface Methodology in Minitab, response surface design uses ANOVA. I want to know which ANOVA is being used by Minitab although i have neither opted ONE WAY ANOVA nor TWO WAY ANOVA . Does Minitab decide itself ?

Jim Frost says

Hi Virendra,

Response surface uses linear models. It is more akin to regression than ANOVA because you can use continuous and/or categorical predictor variables. It can also model curvature, which ANOVA cannot do with categorical levels. Typically, you’ll perform factorial DOE first and then use response surface designs when your factorial models suggest there is curvature present in the data. So, typically, you will have data that you want to treat as numerical levels to fit the curvature–which is not done in ANOVA.

When you say “5 parameters, each with five levels”, I assume you actually mean you have five categorical factors and each one has five factor levels. If you did not create an experiment design using Minitab, you must first import your existing design into Minitab. If you want to do this, choose Define Custom Response Surface Design in the DOE > Response Surface menu path.

You can also analyze it using Fit Regression Model in the Regression menu without importing the design. You’ll need to identify which predictors are categorical versus numeric. You can fit curvature only with the numeric predictors.

As for the one-way versus two-way question, there seems to be a big misunderstanding about the differences between those analyses. They are essentially the same type of analysis (ordinary least squares linear models). The difference is the number of independent variables/categorical factors that you have. In one-way ANOVA, you can have only one. In two-way, you can only have two. Because you have 5, you can’t use either type but you can use either regression or GLM in ANOVA to fit that type of model. Those analyses use the same methods as one-way and two-way but allow you to include more predictors.

virendra says

Sir, in Mini-tab, which ANOVA is used, ONE WAY or TWO WAY ANOVA ? Why ?

Jim Frost says

Hi Virendra,

Minitab can perform a variety of different types of ANOVA including both one-way and two-way ANOVA. One-way ANOVA is a separate item in the menu under ANOVA. To perform two-way ANOVA, you’ll need to use General Linear Model (GLM), which is also in the ANOVA menu. Both one-way and two-way ANOVA are actually forms of the same type of linear model. The proper ANOVA to use depends on your data.

Keen Amy says

Hello,

Thank you for your clear explanation. I have one question: I know that the larger the F-value is, the more my results tend to be explained by the model (our intervention). Is there a range for the F-value? I have a test where my F-value is around 500 000. My groups are quite different, but is this F-value problematic?

Thank you

Jim Frost says

Hi Keen,

There is no theoretical upper-limit to the F-value. F-values can range from zero to positive infinity. The range for the critical region depends on your design, sample size, and significance level. For the example in this post (3 and 36 DF) at a significance level of 0.05, the critical region extends from approximately F = 2.87 to positive infinity.

By itself, an extremely large F-value doesn’t indicate a problem. But, you can graph your data, rethink the data collection method, etc just to be sure there isn’t a problem somewhere along the way. There are legitimate reasons for obtaining F-values that large but it could also represent something like a data entry error or outlier.

Typically, you don’t interpret the F-values directly but instead assess the p-value. Check to see that it is less than your significance level.

I hope this helps!

Amartya Prem says

SUMMARY

| Groups | Count | Sum | Average | Variance |
|---|---|---|---|---|
| Boutique | 135 | 774560.6747 | 5737.486479 | 169745173.8 |
| Corporate | 22 | 176771.9202 | 8035.087282 | 244695275.4 |
| Institutional | 41 | 311245.7625 | 7591.360061 | 122256354.8 |

ANOVA

| Source of Variation | SS | df | MS | F | P-value | F crit |
|---|---|---|---|---|---|---|
| Between Groups | 176157005.5 | 2 | 88078502.73 | 0.524041523 | 0.5929541252 | 3.042229897 |
| Within Groups | 32774708260 | 195 | 168075427 | | | |

Sir, here F < F critical, which means we fail to reject the null hypothesis, but the averages of the groups are not equal. Is this situation possible, or is there an error that I might have committed?

Jim Frost says

Hi Amartya,

For the sake of argument, let’s assume that the *population* means for your three groups are equal. Now, suppose you draw random samples from these three populations. Even though the population means are equal, you’d expect that the sample means will be at least somewhat different thanks to sampling error. How different are the sample means likely to be? That depends on the sample size and the variability in the data.

For your data, the insignificant P-value is telling you that it would not be surprising to observe these sample differences between the groups even if the population means are actually equal. In other words, these observed differences between the sample means are not large enough to suggest that the population means are different.

So, there’s not necessarily any error. In fact, the insignificant p-value protects you from jumping to conclusions based on these observed differences in your sample data.

I hope this helps!

James Simoko Phiri says

Jim, I’ve just run through and you seem to present statistics in simpler way

I’ll give further feedback

Jim Frost says

Thank you, James! I always strive to make statistical concepts as easy and intuitive to understand as possible.

Shaarang says

The concept is crystal clear to me now, thanks to your simple explanation.

Chiara Gasperoni says

Hello,

in “Graphing the F-test for Our One-Way ANOVA Example” how do you obtain the value 0.03116? (With which table)

Thank you

Jim Frost says

Hi Chiara,

The p-value of 0.03116 doesn’t come from a table for this example. Although, the normal one-way ANOVA output would provide this value. Instead, I use the F-distribution to plot a probability distribution for the F-value for our design. On a probability distribution plot, the portion of the area under the curve that corresponds to a specific range of values represents the probability that the value will fall within that range. On the graph, I had the software shade and calculate the probability for F-values that are greater than or equal to 3.3. The software indicates that for the F-distribution that corresponds to our design, this area equals 0.03116.

Typically, you’d just get the p-value from the ANOVA table, but I wanted to show how this test works graphically because I think it helps make it more intuitive.

In the section How F-tests Use F-distributions to Test Hypotheses, I include a link to a post I wrote about probability distributions. That post provides more information about how probability distributions work. That background information should make the example in this post easier to understand.

I hope this helps!

Sewinet Belete says

How do you know there are 3 DF in the numerator and 36 in the denominator?

Jim Frost says

Hi Sewinet,

That’s a great question!

In one-way ANOVA, the degrees of freedom for the numerator are for the between-group variation and equal (k-1), where k equals the number of factor levels. The design in this blog post has four factor levels, hence the degrees of freedom for the numerator is 4 – 1 = 3. The degrees of freedom for the denominator are for the within-group variation and equal (N-k), where N equals the total sample size across all groups and k again equals the number of factor levels. Our design has 40 observations and 4 factor levels, hence the denominator DF is 40 – 4 = 36.

As you can see, the degrees of freedom, and hence the shape of the F-distribution, varies based on the specifics of your design.

Other types of analyses calculate degrees of freedom differently. In the section where I graph the F-distribution for this example, I include a link to my post about degrees of freedom where you can learn more about this concept.

I hope this helps!

Omkar says

Hi sir, I’m not getting, how the variance talks about significance, i.e. what is the relationship between variance and significance?

Jim Frost says

Hi Omkar, the F-test in ANOVA is testing to determine whether the means are different. So, the more different the means are, the stronger the evidence. A different way to state “the more different the means are” is “a higher variance amongst the group means.” So, for significant results you want the group means to be different, or a high variance amongst the means. The variance amongst the means is the numerator in the F-test. Consequently, a large value tends to produce larger F-values. You need a sufficiently large F-value to obtain significant results. The precise F-value you require depends on your design.

I hope this clarifies matters!

Jim