In my house, we love the Mythbusters TV show on the Discovery Channel. The Mythbusters conduct scientific investigations in their quest to test myths and urban legends. In the process, the show provides some fun examples of when and how you should use statistical hypothesis tests to analyze data.

Previously, I’ve written about how the Mythbusters should have used hypothesis tests for their study about the contagiousness of yawning. Spoiler alert! They didn’t use any hypothesis tests, and that led them to an incorrect conclusion. They also didn’t have a good understanding of the amount of data required to test their theory.

In this post, I look at a different set of myths that the Mythbusters studied as the basis for more examples of using hypothesis tests. The episode “Battle of the Sexes – Round Two” provides great examples of when you should use statistical hypothesis tests. The battle of the sexes episode occurs later in the series than the yawning episode, and I want to see if their thoughts about data analysis have improved. A good sign happens during this episode when Adam Savage says:

“Sample size is everything in science; the more you have, the better your results.”

To paraphrase the show, I don’t just talk about the hypotheses; I put them to the test!

We’ll be testing the following hypotheses statistically:

- Women multitask better than men.
- Men parallel park better than women.

The Mythbusters present the data they collect, and I’ve put them in this CSV data file: BattleSexes.

## Hypothesis Test Example: Do Women Multitask Better Than Men?

For our first example of a hypothesis test, we’ll test the myth that women multitask better than men. To determine whether this is true, ten men and ten women perform a standard set of tasks that require multitasking. The Mythbusters create a scoring system that measures how well each subject performs the tasks. The scores can range from 0 to 100.

The average score for women is 72 and for men it is 64. The Mythbusters state that this 8 point difference confirms that women are better at multitasking.

This study is an excellent example of where you should use a statistical hypothesis test to draw conclusions from data—except they didn’t use one. Let’s see if the proper hypothesis test agrees with their conclusion!

## Why We Should Use Hypothesis Tests

Let’s take a step back for a moment and restate why we should use a hypothesis test rather than just compare the sample means. A sample is a subset of the entire population, so sample statistics, such as the mean, are unlikely to equal the population values exactly. The difference between the sample value and the population value is called sampling error.

An effect observed in any sample can be caused by sampling error rather than by a true difference between the population means. For this test, we need to determine whether the 8 point difference between women and men reflects sampling error or a real effect. If the difference is due to sampling error, the next time someone performs this experiment the results could be different.

Hypothesis tests incorporate estimates of the sampling error to help us make the correct decision.
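To get a feel for how large sampling error can be with only ten subjects per group, here is a minimal simulation sketch. It assumes, purely for illustration, that men and women come from the same normally distributed population with a mean of 68 and a standard deviation of 15. Both numbers are my assumptions rather than values from the show, and the sketch ignores the 0 to 100 limits on the scores. It draws two groups of ten over and over and counts how often their means differ by 8 points or more even though no real difference exists.

```python
import random
import statistics

def simulate_mean_differences(n_per_group=10, pop_mean=68.0, pop_sd=15.0,
                              n_trials=10_000, seed=1):
    """Draw two samples from the SAME population and record the difference in means.

    pop_mean and pop_sd are illustrative assumptions, not Mythbusters values.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_trials):
        group_a = [rng.gauss(pop_mean, pop_sd) for _ in range(n_per_group)]
        group_b = [rng.gauss(pop_mean, pop_sd) for _ in range(n_per_group)]
        diffs.append(statistics.fmean(group_a) - statistics.fmean(group_b))
    return diffs

diffs = simulate_mean_differences()
# Share of trials where sampling error alone produced a gap of 8+ points.
share_large = sum(abs(d) >= 8 for d in diffs) / len(diffs)
print(f"Share of trials with |difference| >= 8 points: {share_large:.3f}")
```

Under these assumptions, a gap of 8 points turns up in a noticeable share of trials by chance alone, which is exactly why the difference between two sample means can’t settle the question by itself.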

For more details about the role of hypothesis tests, I’ve written these posts:

- Hypothesis Testing Overview
- How to Interpret P-values Correctly
- How Hypothesis Tests Work: Significance Levels and P-values

## Choosing the Correct Hypothesis Test

The multitasking study seems to call for a 2-sample t-test because we want to compare the means of two groups. However, I performed a normality test, which indicates that these data don’t follow the normal distribution.

**Related post**: How to Identify the Distribution of Your Data

Nonnormal data aren’t always a show-stopper for parametric tests like the t-tests. However, because the sample size is less than 15 per group, nonnormal data are a problem for us. We can’t trust the results from a 2-sample t-test. To learn more, read my post about how to choose between parametric and nonparametric tests.

Instead, we’ll need to use a nonparametric test to compare the medians. The Mann-Whitney test allows us to compare the medians for two groups. I’ll use a one-tailed test because we want to determine whether women have a higher median multitasking score than men.
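If you’d like to see the mechanics behind a one-tailed Mann-Whitney test, the sketch below computes the U statistic and an exact p-value using only Python’s standard library. It ranks the pooled scores, sums the ranks for the first group, and then enumerates every possible way of assigning those ranks to a group of the same size. This exact enumeration is a stand-in for what statistical software does (for example, SciPy offers `scipy.stats.mannwhitneyu`), not the procedure I ran for this post. The scores in the example are made-up placeholders, not the Mythbusters data, and the function names are my own.

```python
from itertools import combinations

def midranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def mann_whitney_greater(x, y):
    """Exact one-tailed test of H1: values in x tend to be larger than in y.

    Returns (U statistic for x, p-value). Practical for small groups only,
    because it enumerates every split of the pooled ranks.
    """
    ranks = midranks(list(x) + list(y))
    n_x = len(x)
    observed = sum(ranks[:n_x])
    count = total = 0
    for combo in combinations(ranks, n_x):
        total += 1
        if sum(combo) >= observed - 1e-9:  # at least as extreme as observed
            count += 1
    u_x = observed - n_x * (n_x + 1) / 2
    return u_x, count / total

# Placeholder scores for illustration only (NOT the Mythbusters data).
women = [60, 65, 70, 72, 75, 78, 80, 82, 68, 70]
men = [50, 55, 60, 62, 64, 66, 70, 72, 74, 67]
u_stat, p_value = mann_whitney_greater(women, men)
print(f"U = {u_stat}, one-tailed p-value = {p_value:.4f}")
```

With ten subjects per group there are 184,756 possible splits, so the enumeration is quick. For larger samples, software switches to a normal approximation instead.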

## The Mann-Whitney Hypothesis Test Results

The p-value is 0.1271 and the confidence interval contains zero. Both conditions indicate that the test results are not statistically significant. We have insufficient evidence to conclude that the women’s median score is greater than the men’s. The confidence interval also contains negative values, which tells us that we should not be surprised if a replicate study found that men had a higher median!

The Mythbusters saw the 8 point difference between the sample means and “confirmed” the myth. Unfortunately, the hypothesis test reveals that the sample evidence is not strong enough to draw this conclusion. The effect is not large enough to be distinguishable from random sampling error.

## Power Analysis to Estimate a Good Sample Size

The Mythbusters recruited ten subjects per group. Let’s use a power and sample size analysis to estimate a good sample size. I’ll perform the power analysis using the following assumptions:

- The mean difference should be at least 10 points to be meaningful in a practical sense.
- If a real difference of 10 points exists, I want the test to have an 80% probability of detecting it (80% power).
- The sample standard deviation is the estimate of the population standard deviation.
- I’ll specify a 1-tailed hypothesis test to determine whether women are better than men at multitasking.

The statistical output is below.

To have a reasonable likelihood of identifying a difference of 10 points, we’d need 29 per group, for a total of 58 subjects. With this sample size, we can use a 2-sample t-test even if the data are nonnormally distributed.
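For readers who want to try a similar calculation in Python, here is a rough sketch based on the standard normal approximation for comparing two means: n per group = 2 × ((z for the significance level + z for the power) × SD / difference)². It is not the exact t-based calculation that statistical software performs. The standard deviation of 15 and the significance level of 0.05 are my assumptions for illustration, so substitute your own estimates. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(difference, sd, alpha=0.05, power=0.80, one_tailed=True):
    """Approximate subjects needed per group to detect `difference` between two means.

    Uses the normal approximation; t-based software results tend to be
    slightly larger. `sd` is the assumed common standard deviation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha if one_tailed else 1 - alpha / 2)
    z_power = z.inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / difference) ** 2
    return math.ceil(n)

# Assumed inputs: 10 point difference, SD of 15, 80% power, 1-tailed test.
print(n_per_group(difference=10, sd=15))
# The same inputs with a 2-tailed test require more subjects.
print(n_per_group(difference=10, sd=15, one_tailed=False))
```

With these assumed inputs, the approximation gives 28 per group for the 1-tailed test, which is in the same neighborhood as the result above. The formula also shows why small effects are expensive to detect: halving the difference you want to detect roughly quadruples the required sample size.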

**Related post**: Estimating a Good Sample Size for Your Study Using Power Analysis

## Hypothesis Test Example: Are Men Better at Parallel Parking?

For our next example of a hypothesis test, we’ll assess the myth that men are better at parallel parking. The Mythbusters again have ten subjects per group and use a test that produces scores between 0 and 100. This scenario sounds very similar to the multitasking example. It appears that we might need to use a hypothesis test to compare means or medians. However, the descriptive statistics table shows that the means and medians are nearly equal between genders. The differences are not significant using any hypothesis test.

There is something else going on with this study. While the Mythbusters were testing the subjects, the hosts observed that the parallel parking skills of the women appeared to be much more variable than the men’s. The graph below shows how women are either great or abysmal at parallel parking. Men are in between those extremes. We need to assess the variability of these samples.

**Related post**: Measures of Variability

Let’s use a hypothesis test to determine whether the sample provides sufficient evidence to conclude that women are more variable at parallel parking than men. To test this, we’ll use the two variances test. The two variances test is a hypothesis test with the following hypotheses:

- Null hypothesis: Both groups have equal variances.
- Alternative hypothesis: The variances for the two groups are not equal.

A small p-value indicates that we can reject the null hypothesis.

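The p-value is displayed as 0.000, which means it rounds to zero at three decimal places and is less than any reasonable significance level. The sample evidence is strong enough to reject the null hypothesis. Because the women’s scores are the more spread out of the two samples, we can conclude that women are more variable at parallel parking than men.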
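To illustrate the idea of testing variability rather than averages, here is a minimal sketch of a permutation test for spread. It is a swapped-in technique, not the two variances procedure from my statistical software. It measures each subject’s absolute deviation from their own group’s median, then repeatedly shuffles those deviations between the groups to see how often chance alone produces a gap in average deviation as large as the observed one. The scores are made-up placeholders shaped like the pattern described above (women at the extremes, men in the middle), not the Mythbusters data, and the function names are my own.

```python
import random
import statistics

def abs_deviations(sample):
    """Absolute distance of each value from its own group's median."""
    center = statistics.median(sample)
    return [abs(v - center) for v in sample]

def permutation_spread_test(a, b, n_permutations=10_000, seed=1):
    """Two-sided Monte Carlo permutation test for a difference in spread.

    Returns (observed difference in mean absolute deviation, p-value).
    """
    dev_a, dev_b = abs_deviations(a), abs_deviations(b)
    observed = statistics.fmean(dev_a) - statistics.fmean(dev_b)
    pooled = dev_a + dev_b
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = (statistics.fmean(pooled[:len(dev_a)])
                - statistics.fmean(pooled[len(dev_a):]))
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Placeholder scores for illustration only (NOT the Mythbusters data).
women = [5, 10, 8, 12, 95, 90, 92, 88, 6, 94]
men = [45, 50, 55, 48, 52, 47, 53, 49, 51, 50]
print("Means:", statistics.fmean(women), statistics.fmean(men))
observed, p_value = permutation_spread_test(women, men)
print(f"Difference in mean absolute deviation: {observed:.1f}, p-value: {p_value:.4f}")
```

In this placeholder data, both groups have the same mean, so a test of means finds nothing, while the spread test produces a very small p-value. That mirrors the situation in the episode: the interesting difference is in consistency, not in the average.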

The Mythbusters made the right call and busted this myth that men are better at parallel parking. More accurately, we’d state that there is insufficient evidence to conclude that they are better.

However, thanks to hypothesis testing, we can state that men are more consistent parallel parkers than women.

Learn how to perform a Two-Sample Variances Test in Excel.

## Science Is Hard to Do Correctly

I hope these examples illustrate the importance of using hypothesis tests to draw conclusions from your data. To their credit, Adam and Jamie explain in an online video that they appreciate the importance of having a large sample size. Adam indicates that they put more energy into creating experimental conditions that produce valid results. I have to admit it; they do a great job at that. They go to great lengths to reduce variation, take valid measurements, etc.

Adam goes on to explain that, as a TV show, they don’t have the resources to obtain large sample sizes. I can accept that … for a television show.

However, if you’re a real scientist performing real science, you need to devote sufficient time, attention, and resources to *all* of the following:

- Design and implement a valid experiment
- Obtain a sufficiently large sample size
- Collect valid measurements
- Use the correct hypothesis test

If you miss the mark on any of these points, you won’t be able to trust your results. Science is difficult to get right!

## Comments

Jason Miller says

Thank you very much Professor Jim!

Jim Frost says

You’re very welcome, Jason!

Jason Miller says

Thank you, Professor Frost! In the example of sexes, I have a study with n=150 and three groups: m, f, and nb (yet nb only has 4). Would I exclude them from the analysis? I want to analyze scores on two Likert scales (continuous score) and show whether m, f, or nb have higher scores. Is this still a t-test? And do you have a suggestion for doing the same for racial identity?

Jim Frost says

Hi Jason, unfortunately you really should exclude the group with only 4. That’s just too few.

As for which test to use, there is some debate over whether you can use a 2-sample t-test or use a nonparametric alternative. Read this post about that decision: Analyzing Likert Scale Data.

John Xie says

Don’t we know that male and female are different cohorts by definition/by nature? Hence, the null hypothesis, H0: no difference between male and female, is false by definition. In this case, the p-value is zero! Statistical data analysis should answer questions like how male and female differ rather than whether they are truly different. Actually, hypothesis testing is unable to answer such questions, as agreed by the majority of statisticians. More hypothesis testing in statistical data analysis practice makes for less statistical thinking in scientific research. It is statistical thinking, NOT hypothesis testing, that makes good science.

Bongani Ndlovu says

Hi Jim, what software are you using and how can I replicate the statistical power calculations in Python?

Jim Frost says

Hi Bongani,

I’m using Minitab statistical software to perform the power and sample size analysis. Unfortunately, I don’t know how to do that using Python.

Rakesh says

Thanks

Robert says

Thank you Jim! Nice article to help one slip into the mindset of scientific analysis. I will refer back to this on occasion.

Jim Frost says

Thanks! I’m glad you found it helpful!

K.V.S.Sarma says

Very nicely explained. The conventional ‘p-value’ lovers would understand the true spirit of tests of hypothesis. Appreciation to Jim.

Jim Frost says

Thank you!