When it comes to hypothesis testing, statistics help you avoid relying on opinions about whether an effect is large and how many samples you need to collect. Feelings about these things can be *way* off, even among those who regularly perform experiments and collect data! These hunches can lead you to incorrect conclusions. Always perform the correct hypothesis tests so you understand the strength of your evidence.

In my house, we’re all big fans of the Mythbusters. This fun show tests whether different myths and urban legends could have really happened. Along the way, they perform experiments in a controlled and repeatable manner and collect data. This involves lots of planning, custom equipment, reducing potential sources of variation, and a large number of explosions. All good stuff. However, they’re not always the best when it comes to statistical analysis and hypothesis testing.

Don’t get me wrong. I think the Mythbusters are great because they make science fun and place a high value on using data to make decisions. It’s a great way to bring science to life for kids! Unfortunately, they occasionally draw incorrect conclusions from their data because they don’t use statistics.

One of the things I love about statistics is that hypothesis testing helps you objectively evaluate the evidence. You set the significance level before the study, analyze the data, and then decide using the p-value. You don’t have to worry about a subjective assessment about whether an effect appears large enough while simultaneously trying to factor in the sample size and sample variability!

In this post, I’ll detail their investigation into the myth that yawns are contagious and show how they would have benefited by using hypothesis testing and estimating an adequate sample size.

## Are Yawns Contagious?

I think we’ve all heard that yawns are contagious. If you see someone yawn, it sure seems like you’re more likely to yawn too. The Mythbusters decided they were going to test this myth. They recruited 50 people under the pretense that they were looking for people to appear on the show.

The recruiter spoke to each subject one-on-one and intentionally either yawned or did not yawn during the session. After listening to the recruiter, the subjects sat in a small room for a fixed amount of time. The Mythbusters secretly observed the subjects and recorded whether they yawned.

The Mythbusters recorded these data:

- Recruiter did not yawn (control group): 4 out of 16 (25%) of the subjects yawned.
- Recruiter did yawn (treatment group): 10 out of 34 (29%) of the subjects yawned.

When it came time to determine the results of their experiment, Jamie Hyneman said that the data confirmed the myth. Yawns are contagious. He stated that the difference of 4 percentage points is significant thanks to the large sample size (n=50). Unfortunately, Jamie based this conclusion on intuition rather than a statistical test. I’m going to analyze this more meticulously to see if hypothesis testing agrees with Jamie!

## Using the Two Proportions Hypothesis Test to Assess Yawns

The data contain proportions for two groups, so we’ll use the two proportions hypothesis test. Specifically, we’ll use a one-tailed test to determine whether the treatment group has a higher proportion than the control group. You can do this in your own preferred statistical software without a dataset. Just use the summary statistics for the two groups—10/34 and 4/16.

For the yawn data, the two proportions hypothesis test output contains two P values, and we’ll use the one for Fisher’s exact test. This test is designed for small samples, and a note in the output indicates that our sample is small. The P value of 0.513 is well above any standard significance level.
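If you'd like to check the Fisher's exact P value without statistical software, here is a minimal sketch in Python that uses only the standard library. It sums the hypergeometric probabilities for outcomes at least as extreme as the observed one (10 or more yawners in the treatment group). The function and variable names are my own choices for illustration, not part of any package.

```python
from math import comb

def fisher_exact_greater(yawn_treat, n_treat, yawn_ctrl, n_ctrl):
    """One-sided Fisher's exact test.

    Returns P(treatment yawners >= observed), holding the group sizes
    and the total number of yawners fixed (hypergeometric distribution).
    """
    total = n_treat + n_ctrl
    total_yawners = yawn_treat + yawn_ctrl
    max_k = min(n_treat, total_yawners)
    denom = comb(total, total_yawners)
    # Count the ways to place k yawners in the treatment group
    # and the remaining yawners in the control group.
    tail = sum(
        comb(n_treat, k) * comb(n_ctrl, total_yawners - k)
        for k in range(yawn_treat, max_k + 1)
    )
    return tail / denom

# Summary data from the experiment: 10/34 (treatment) and 4/16 (control).
p_value = fisher_exact_greater(10, 34, 4, 16)
print(f"One-sided Fisher's exact P value: {p_value:.3f}")
```

The result rounds to the 0.513 reported above.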

We fail to reject the null hypothesis. The sample does **not** contain sufficient evidence to conclude that the subjects exposed to yawns tended to yawn more frequently. Additionally, the output indicates that the sample size is **small!** When working with categorical data, you often need larger sample sizes than are typical for continuous data.

Unfortunately, Jamie was wrong about both the statistical significance and having a large sample size!

## Assess Statistical Power to Estimate the Correct Sample Size

When the Mythbusters conclude that a myth isn’t true, they often find the extreme conditions that can force the myth to occur. Usually, this involves an explosion. I’d love to include an explosion in this blog post, but I don’t want to damage your device!

Instead, I’ll produce a figurative bang by estimating how many subjects the Mythbusters should have recruited. I’ll perform a power and sample size calculation to determine the sample size necessary for the test to have a decent chance of detecting an effect if one actually exists. Hint: The answer is bound to prompt Adam Savage to wave his arms around in his characteristic manner!

In many fields, a good benchmark power value to aim for is 0.8. At this level, a hypothesis test has an 80% probability of detecting a difference if it exists.

The study estimated an effect of 0.04, which was not statistically significant. For the power analysis, I’m going to find the sample size that yields a statistical power of 0.8 for a difference of 0.10 (rather than 0.04) for a two proportions hypothesis test. After all, if the difference really is 0.04, that’s so tiny that it’s not practically significant in the real world even if a study found it to be statistically significant. I’ll calculate power using a one-tailed test.

If the actual population difference between the groups is 10 percentage points (25% vs. 35%), the Mythbusters need to recruit 329 subjects per group (658 total)! Well, they were only off by about 600!

The sample size is so large because the effect size is still fairly small and hypothesis tests for categorical data require larger samples than tests for continuous data.
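For readers who want to experiment with these numbers, here is a rough sketch of a sample size calculation using a common textbook normal approximation for comparing two proportions. The significance level of 0.05 is my assumption (the post doesn't state one), and the function name `n_per_group` is hypothetical. Statistical software may use a different formula or a continuity correction, so treat the output as a ballpark figure.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(p1, p2, power=0.8, alpha=0.05, one_sided=True):
    """Approximate per-group sample size for a two proportions test.

    Uses the normal approximation with a pooled variance under the
    null hypothesis and unpooled variances under the alternative.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha if one_sided else 1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p_bar = (p1 + p2) / 2  # average proportion with equal group sizes
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

n_two_sided = n_per_group(0.25, 0.35, one_sided=False)
n_one_sided = n_per_group(0.25, 0.35, one_sided=True)
print(n_two_sided, n_one_sided)
```

Under this particular approximation, the two-sided version gives 329 per group, and the one-sided version gives a somewhat smaller number. Either way, the required sample is several times larger than the 50 subjects in the experiment.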

**Related post**: Estimating a Good Sample Size for Your Study Using Power Analysis

## The Mythbusters Need Statistics and Hypothesis Testing!

Using the two proportions hypothesis test and power calculation, we learned a couple of things:

- The sample data do not support the hypothesis that yawns are contagious.
- The sample size was too small to provide adequate statistical power.
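To put a rough number on the second point, the sketch below approximates the power of the experiment as it was actually run (16 control subjects and 34 treatment subjects) for the 25% vs. 35% scenario. It uses a normal approximation for a one-sided test with an assumed significance level of 0.05. That approximation is crude for groups this small, and the function name `approx_power` is hypothetical, so read the output as an order-of-magnitude estimate only.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_ctrl, p_treat, n_ctrl, n_treat, alpha=0.05):
    """Approximate power of a one-sided two proportions z-test."""
    z = NormalDist()
    diff = p_treat - p_ctrl
    # Pooled proportion and standard error under the null hypothesis.
    p_pool = (n_ctrl * p_ctrl + n_treat * p_treat) / (n_ctrl + n_treat)
    se_null = sqrt(p_pool * (1 - p_pool) * (1 / n_ctrl + 1 / n_treat))
    # Standard error under the alternative hypothesis.
    se_alt = sqrt(
        p_ctrl * (1 - p_ctrl) / n_ctrl + p_treat * (1 - p_treat) / n_treat
    )
    z_crit = z.inv_cdf(1 - alpha)
    return z.cdf((diff - z_crit * se_null) / se_alt)

power = approx_power(0.25, 0.35, n_ctrl=16, n_treat=34)
print(f"Approximate power with 50 subjects: {power:.2f}")
```

Under these assumptions, the estimated power falls far below the 0.8 benchmark.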

I have a lot of research experience working in labs at a university. Based on this experience, I don’t see the Mythbusters experiment as a failure at all. Instead, I see it as a pilot study. For an experiment, you often need to conduct a small pilot study to work the kinks out and develop the initial estimates. It helps you avoid costly mistakes by not going straight to a large-scale experiment where things might not go as planned.

That’s how the scientific method works. You state the hypothesis, design and set up the controlled conditions for an experiment, and then evaluate the data with a statistical hypothesis test. You assess those results and, if necessary, make adjustments to improve the next study.

If this study occurred in the research arena, the researchers would ask themselves whether it’s worth conducting additional research on the subject. Are the potential benefits worth the costs? In this case, the benefits of learning whether yawns are contagious are small compared to the costs associated with a study of 650 subjects. It’s probably not going to happen!

Even though the results of this study are not statistically significant, we still learned something important!

We are still big fans of the Mythbusters! This study just reconfirms that science, research, and statistical analysis are tricky. Sometimes your intuition can lead you astray. Statistics can help keep you grounded by providing an objective assessment of your data with hypothesis testing. After all, the Mythbusters went to a lot of effort to collect their data. They ought to know what the data are really telling them!

Do you have any stories of surprising results or tricky data?

Be sure to read my other Mythbusters-related post where I use hypothesis tests to bust myths about the battle of the sexes!

Jagar Omar Doski says

Dear Jim

How can I detect where the significant changes occur in a series of readings over different times? For example, I want to know which part of my lecture is most boring by counting the number of audience members leaving the hall every 10 minutes.

| Time (minutes) | Audience | Leavers |
| --- | --- | --- |
| 0 | 100 | 0 |
| 10 | 99 | 1 |
| 20 | 96 | 3 |
| 30 | 92 | 4 |
| 40 | 86 | 6 |
| 50 | 66 | 20 |
| 60 | 63 | 3 |

Jim Frost says

Hi Jagar,

I’m not sure that this is a case for inferential statistics. After all, you have the entire population in your class. I suppose you could consider this particular class as a random sample of all possible classes that could happen going forward. You can clearly say that for this particular class, you have a spike at 50 minutes.

If you do try to use inferential statistics and calculate a significance point, you’d need to set a null value. Would that be zero? As in, you’re looking for the first significant difference from no one leaving. My first thought would be a series of Poisson rate tests. However, you then need to worry about the family-wise error rate because you’re performing a series of tests, so perhaps use a Bonferroni correction with that.
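As a rough illustration of that idea, here is a minimal Python sketch of one-sided exact Poisson tests for each 10-minute interval with a Bonferroni-adjusted significance level. The null rate of 3.5 leavers per interval is purely hypothetical and chosen only for illustration (a rate of exactly zero would make any departure "significant"), and the overall significance level of 0.05 is also an assumption.

```python
from math import exp, factorial

# Leavers per 10-minute interval, keyed by the interval's end time in minutes
# (taken from the data in the question).
leavers = {10: 1, 20: 3, 30: 4, 40: 6, 50: 20, 60: 3}

def poisson_upper_tail(count, rate):
    """P(X >= count) for X ~ Poisson(rate)."""
    below = sum(exp(-rate) * rate**i / factorial(i) for i in range(count))
    return 1 - below

BASELINE_RATE = 3.5  # hypothetical null rate, an assumption for illustration
ALPHA = 0.05         # assumed family-wise significance level

# Bonferroni correction: divide alpha by the number of tests performed.
adjusted_alpha = ALPHA / len(leavers)

flagged = [
    minute
    for minute, count in leavers.items()
    if poisson_upper_tail(count, BASELINE_RATE) < adjusted_alpha
]
print("Intervals flagged after Bonferroni correction:", flagged)
```

With these assumed values, only the 50-minute interval is flagged. A different null rate could change which intervals stand out, so that choice needs a real justification.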

Another possibility would be a P chart, which is a type of control chart for count data. However, those usually use a subgroup randomly drawn from the population. The difference in your case is that you’re drawing without replacement from the same original sample (if you count your class as one sample from the population of all possible students). I think this issue would also apply to the other possibility. (Thinking more about the “without replacement” aspect, I’m wondering if you need to use a test that is based on the hypergeometric distribution. However, I’m not aware of such a test.)

I’ll have to think about whether there is a better method, but that’s what I can come up with now!

One final point. Whatever test you go with wouldn’t be proving they’re leaving because your lectures are boring. A statistical test won’t tell you the reason they’re leaving unless you also record information about why they left. So, even if you use an appropriate test, it will just tell you when the amount leaving is significantly different from zero. Perhaps the spike near the end of the class is just people leaving early after the bulk of the information has been presented but they need to get a jump on getting to their next destination? Not necessarily boredom!

erol says

What is also problematic about this study is that the subjects might have suppressed their yawning since they were being recruited. Generally, the effect should also be observed immediately (maybe not minutes after the first yawn). Another issue is the genuineness of the yawn. To my understanding, the recruiters faked the yawning. We should also define contagiousness well.