Repeated measures designs, also known as within-subjects designs, can seem like oddball experiments. When you think of a typical experiment, you probably picture an experimental design that uses mutually exclusive, independent groups. These experiments have a control group and treatment groups that have clear divisions between them. Each subject is in only one of these groups.

These rules for experiments seem crucial, but repeated measures designs regularly violate them! For example, a subject is often in all the experimental groups. Far from causing problems, repeated measures designs can yield significant benefits.

In this post, I’ll explain how repeated measures designs work along with their benefits and drawbacks. Additionally, I’ll work through a repeated measures ANOVA example to show you how to analyze this type of design and interpret the results.

## Drawbacks of Independent Groups Designs

To understand the benefits of repeated measures designs, let’s first look at the independent groups design to highlight a problem. Suppose you’re conducting an experiment on drugs that might improve memory. In a typical independent groups design, each subject is in one experimental group. They’re either in the control group or one of the treatment groups. After the experiment, you score them on a memory test and then compare the group means.

In this design, you obtain only one score from each subject. You don’t know whether a subject scores higher or lower on the test because of an inherently better or worse memory. Some portion of the observed scores reflects the memory traits of the subjects rather than the drug. This example illustrates how people introduce an uncontrollable factor into the study.

Imagine that a person in the control group scores high while someone else in a treatment group scores low, not due to the treatment, but due to differing baseline memory capabilities. This “fuzziness” makes it harder to assess differences between the groups.

If only there were some way to know whether subjects tend to measure high or low. We need some way of incorporating each person’s variability into the model. Oh wait, that’s what we’re talking about—repeated measures designs!

## How Repeated Measures Designs Work

As the name implies, you need to measure each subject multiple times in a repeated measures design. Shocking! However, there’s more to it. The subjects usually experience all of the experimental conditions, which allows them to serve as experimental blocks or as their own control. What does that mean? Let me break this down one piece at a time.

The effects of the controllable factors in an experiment are what you really want to learn. However, as we saw in our example above, there can also be uncontrolled sources of variation that make it harder to learn about those things that we can control.

Experimental blocks explain some of the uncontrolled variability in an experiment. While you can’t control the blocks, you can include them in the model to reduce the amount of unexplained variability. By accounting for more of the uncontrolled variability, you can learn more about the controllable variables that are the entire point of your experiment.

Let’s go back to our drug test. We saw how subjects are an uncontrolled factor that makes it harder to assess the effects of the drugs. However, if we take multiple measurements from each person, we gain more information about their personal outcome measures under a variety of conditions. We might see that some subjects tend to score high or low on the memory tests. Then, we can compare their scores for each treatment group to their general baseline.

And, that’s how repeated measures designs work. You understand each person better so that you can place their personal reaction to each experimental condition into their particular context.
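To make the idea of a personal baseline concrete, here is a minimal sketch. The subjects and scores are made up for illustration, and `center_on_baseline` is a hypothetical helper name. It subtracts each subject's own average from their scores so that only the within-subject change remains.

```python
from statistics import mean

# Hypothetical scores, invented for illustration: subject -> {condition: score}
scores = {
    "S1": {"control": 60, "drug": 66},
    "S2": {"control": 80, "drug": 85},
    "S3": {"control": 70, "drug": 77},
}


def center_on_baseline(subject_scores):
    """Subtract the subject's own mean so only within-subject change remains."""
    baseline = mean(subject_scores.values())
    return {cond: score - baseline for cond, score in subject_scores.items()}


centered = {subj: center_on_baseline(s) for subj, s in scores.items()}

for subj, values in centered.items():
    print(subj, values)
```

The raw scores differ a lot between subjects, but after centering, every subject shows a similar bump under the drug condition. That is the pattern a repeated measures analysis is designed to pick up.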

## Benefits of Repeated Measures Designs

In statistical terms, we say that experimental blocks reduce the variance and bias of the model’s error by controlling for factors that cause variability between subjects. The error term contains only the variability within subjects and *not* the variability between subjects. The result is that the error term tends to be smaller, which produces the following benefits:

**Greater statistical power**: By controlling for differences between subjects, this type of design can have much more statistical power. If an effect exists, your statistical test is more likely to detect it.

**Requires a smaller number of subjects:** Because of the increased power, you can recruit fewer people and still have a good probability of detecting an effect that truly exists. If you’d need 20 people in each group for a design with independent groups, you might only need a total of 20 for repeated measures.

**Faster and less expensive:** The time and costs associated with administering repeated measures designs can be much lower because there are fewer people to recruit, train, and compensate.

**Time-related effects:** As we saw, an independent groups design collects only one measurement from each person. By collecting data from multiple points in time for each subject, repeated measures designs can assess effects over time. This tracking is particularly useful when there are potential time effects, such as learning or fatigue.
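The smaller error term behind these benefits can be written out as a partition of the sums of squares. The notation here is my own assumption rather than something taken from a particular software package: $y_{ij}$ is the score of subject $i$ under condition $j$, with $n$ subjects and $k$ conditions in a balanced design where every subject experiences every condition.

$$
\begin{aligned}
SS_{\text{total}} &= \sum_{i=1}^{n}\sum_{j=1}^{k} \left(y_{ij} - \bar{y}_{..}\right)^2 \\
SS_{\text{treatment}} &= n \sum_{j=1}^{k} \left(\bar{y}_{.j} - \bar{y}_{..}\right)^2 \\
SS_{\text{subjects}} &= k \sum_{i=1}^{n} \left(\bar{y}_{i.} - \bar{y}_{..}\right)^2 \\
SS_{\text{error}} &= SS_{\text{total}} - SS_{\text{treatment}} - SS_{\text{subjects}}
\end{aligned}
$$

$$
F = \frac{SS_{\text{treatment}} / (k - 1)}{SS_{\text{error}} / \left[(n - 1)(k - 1)\right]}
$$

If the subject term were left out of the model, $SS_{\text{subjects}}$ would stay inside the error term. Removing it shrinks the denominator of the F-statistic, which is where the extra power comes from.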

## Managing the Challenges of Repeated Measures Designs

Repeated measures designs have some great benefits, but there are a few drawbacks that you should consider. The largest downside is the problem of order effects, which can happen when you expose subjects to multiple treatments. These effects are associated with the treatment order but are not caused by the treatment.

Order effects can impede the ability of the model to estimate the effects correctly. For example, in a wine taste test, subjects might give a dry wine a lower score if they sample it after a sweet wine.

You can use different strategies to minimize this problem. These approaches include randomizing or reversing the treatment order and providing sufficient time between treatments. Don’t forget, using an independent groups design is an efficient way to eliminate order effects.
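As a rough sketch of the randomization strategy, the snippet below gives every subject an independently shuffled treatment order. The subject labels, the wine names, and the `randomized_orders` helper are all hypothetical.

```python
import random


def randomized_orders(subjects, treatments, seed=None):
    """Give each subject an independently shuffled treatment order."""
    rng = random.Random(seed)  # seeded so the schedule can be reproduced
    orders = {}
    for subject in subjects:
        order = list(treatments)
        rng.shuffle(order)
        orders[subject] = order
    return orders


subjects = ["S1", "S2", "S3", "S4"]
treatments = ["dry", "sweet", "sparkling"]  # hypothetical wines
orders = randomized_orders(subjects, treatments, seed=1)

for subject, order in orders.items():
    print(subject, order)
```

Each subject still tastes every wine, but no single wine systematically follows another across the whole sample.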

## Crossover Repeated Measures Designs

I’ve diagramed a crossover repeated measures design, which is a very common type of experiment. Study volunteers are assigned randomly to one of the two groups. Everyone in the study receives all of the treatments, but the order is reversed for the second group to reduce the problems of order effects. In the diagram, there are two treatments, but the experimenter can add more treatment groups.
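Here is a minimal sketch of that assignment step, assuming two treatments and an even split. The volunteer labels, treatment names, and `assign_crossover` function are hypothetical.

```python
import random


def assign_crossover(volunteers, treatments=("A", "B"), seed=None):
    """Randomly split volunteers into two groups with mirrored treatment orders."""
    rng = random.Random(seed)
    shuffled = list(volunteers)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    forward = tuple(treatments)
    backward = tuple(reversed(forward))
    schedule = {}
    for person in shuffled[:half]:
        schedule[person] = forward   # first group: original order
    for person in shuffled[half:]:
        schedule[person] = backward  # second group: reversed order
    return schedule


volunteers = ["V1", "V2", "V3", "V4", "V5", "V6"]
schedule = assign_crossover(volunteers, treatments=("placebo", "memory pill"), seed=42)

for person in volunteers:
    print(person, "->", " then ".join(schedule[person]))
```

A real study would also build in a washout period between the two phases, which this sketch does not model.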

Studies from a diverse array of subject areas use crossover designs. These areas include weight loss plans, marketing campaigns, and educational programs among many others. Even our theoretical memory pill study can use it.

Repeated measures designs come in many flavors, and it’s impossible to cover them all here. You need to look at your study area and research goals to determine which type of design best meets your requirements. Weigh the benefits and challenges of repeated measures designs to decide whether you can use one for your study.

## Repeated Measures ANOVA Example

Let’s imagine that we used a repeated measures design to study our hypothetical memory drug. For our study, we recruited five people, and we tested four memory drugs. Everyone in the study tried all four drugs and took a memory test after each one. We obtained the data below. You can also download the CSV file for the Repeated_measures_data.

In the dataset, you can see that each subject has an ID number so we can associate each person with all of their scores. We also know which drug they took for each score. Together, this allows the model to develop a baseline for each subject and then compare the drug-specific scores to that baseline.

How do we fit this model? In your preferred statistical software package, you need to fit an ANOVA model like this:

- Score is the response variable.
- Subject and Drug are the factors.
- Subject should be a random factor.

Subject is a random factor because we randomly selected the subjects from the population and we want them to represent the entire population. If we were to include Subject as a fixed factor, the results would apply only to these five people and would not be generalizable to the larger population.

Drug is a fixed factor because we picked these drugs intentionally and we want to estimate the effects of these four drugs particularly.
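If you want to see what the software is doing under the hood, here is a bare-bones sketch that computes the F-statistic for Drug in a balanced design where every subject takes every drug. The data are invented for illustration (three subjects and three drugs, not the values in the CSV file), and `repeated_measures_anova` is a hypothetical helper. It does not compute a P-value, because that needs the F distribution from a statistics package.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format rows, invented for illustration: (subject, drug, score).
# These are NOT the values from the post's dataset.
data = [
    (1, 1, 10), (1, 2, 12), (1, 3, 14),
    (2, 1, 20), (2, 2, 21), (2, 3, 25),
    (3, 1, 30), (3, 2, 33), (3, 3, 33),
]


def repeated_measures_anova(rows):
    """F-statistic for Drug after removing between-subject variability.

    Assumes a balanced design: every subject has exactly one score per drug.
    """
    by_subject = defaultdict(list)
    by_drug = defaultdict(list)
    for subject, drug, score in rows:
        by_subject[subject].append(score)
        by_drug[drug].append(score)

    n, k = len(by_subject), len(by_drug)
    scores = [score for _, _, score in rows]
    grand = mean(scores)

    ss_total = sum((y - grand) ** 2 for y in scores)
    ss_drug = n * sum((mean(v) - grand) ** 2 for v in by_drug.values())
    ss_subject = k * sum((mean(v) - grand) ** 2 for v in by_subject.values())
    ss_error = ss_total - ss_drug - ss_subject  # within-subject error only

    df_drug, df_error = k - 1, (n - 1) * (k - 1)
    f_stat = (ss_drug / df_drug) / (ss_error / df_error)
    return {
        "ss_drug": ss_drug,
        "ss_subject": ss_subject,
        "ss_error": ss_error,
        "df_drug": df_drug,
        "df_error": df_error,
        "F": f_stat,
    }


result = repeated_measures_anova(data)
print(result)
```

For this made-up data, most of the total variation comes from differences between subjects. Once that is removed, the error sum of squares is small and the F-statistic for Drug is large. In practice, a package such as statsmodels (its `AnovaRM` class) would fit the model and report the P-value for you.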

## Repeated Measures ANOVA Results

After we fit the repeated measures ANOVA model, we obtain the following results.

The P-value for Drug is displayed as 0.000, which means it is less than 0.0005. This low P-value indicates that not all four group means are equal. Because the model includes Subject, we know that the Drug effect and its P-value account for the variability between subjects.

Below is the main effects plot for Drug, which displays the fitted mean for each drug.

Clearly, Drug 4 is the best. Tukey’s multiple comparisons (not shown) indicate that the differences Drug 4 – Drug 3 and Drug 4 – Drug 2 are statistically significant.

Have you used a repeated measures design for your study?

Mara says

Hi Jim!

Thank you for another great post! I am doing a study protocol and the primary hypothesis is that a VR intervention will show improvement in postural control (4 CoP parameters), comparing the experimental and inactive control group (post-intervention). I was advised to use a repeated measures ANOVA to test the primary hypothesis but reading your post made me realize that might not be correct because my study subjects are not experiencing all the experimental conditions. Do you recommend another type of ANOVA?

Thanks in advance.

Jim Frost says

Hi Mara,

I should probably clarify this better in the post. Subjects don’t have to experience all the treatment conditions. Many studies use these designs for that reason, but it’s not a requirement. If you’ve measured your subjects multiple times, you probably do need to use a repeated measures design.

Laura says

Hi Jim,

Thank you so much for your helpful posts about statistics! I’ve tried doing a repeated measures analysis but have gotten a bit confused. I administered 3 different questionnaires on social behavior (all continuous outcomes, but on different scales [two ranging 0-50, the third 0-90]) on 4 different time points. The questionnaires are correlated to each other so I would prefer to put them in the same analysis. I was planning on doing this by making one within subject variable “time” and one within subject variable “questionnaire”. I would like to know what the effect is of time on social behavior and whether this effect is different depending on the specific questionnaire used. Is it ok to add these questionnaires in the same analysis even though they do not have the same range of scores or should I first center the total scores of the questionnaires?

Many thanks,

Laura

Jim Frost says

Hi Laura,

ANOVA can handle DVs that use different measurement units/scales without problems. However, if you want to determine which DV/survey is more important, you might consider standardizing them. Read more about that in my post about identifying the most important variables in your model. It discusses it in the regression context but the same applies to ANOVA.

You’ll obtain valid and consistent results using either standardized or unstandardized values. It just depends on what you want to learn.
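If you do decide to standardize, a minimal sketch looks like this. The scores and the `standardize` helper are made up for illustration. Each questionnaire is converted to z-scores so that the 0-50 and 0-90 scales end up in the same units.

```python
from statistics import mean, stdev


def standardize(values):
    """Convert raw scores to z-scores (mean 0, sample standard deviation 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical total scores from two questionnaires on different scales
questionnaire_a = [12, 25, 31, 40]  # 0-50 scale
questionnaire_c = [20, 45, 61, 82]  # 0-90 scale

z_a = standardize(questionnaire_a)
z_c = standardize(questionnaire_c)
print(z_a)
print(z_c)
```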

I hope that helps!

Elijah says

Hi Jim,

Thanks for your effort and time to make statistics understandable to the wider public. Your style of teaching is quite simple.

I didn’t see any questions or responses from 2019 to date, but I hope you’re still there anyway.

I have this stat problem I need your opinion on.

There are 8 drinking water wells clustered at different distances around an injection well. To simulate direction and concentration of contaminant within subsurface around the well area, a contaminant was injected/pumped continuously into the subsurface through the injection well. This happened for 6 weeks; pH samples were taken from the 8 wells daily for the 6 weeks. I need to test for 2 things, namely:

1. Is there any statistically significant difference in pH within the wells within the 6 weeks (6 weeks as a single time period)?

2. Is there any statistically significant difference in pH for each well within the weeks (6 weeks time step)?

Which statistical test best captures this analysis? I think of repeated measure ANOVA, what do you think please?

Thanks in advance.

Jim Frost says

Yes, because you’re looking at the same subjects (wells) over time, you need repeated measures ANOVA.

Vidya Kulkarni says

Shall appreciate a reply. My friend has performed experiments with rats in 3 groups by administering a certain drug. Group 1 is not given any drug, Group 2 is given 50 mg, and Group 3 is given 100 mg. In each group there are 3 rats, and for each of these rats their tumor volume has been recorded for 9 consecutive days. Thus for each group we have 27 observations. We want to show the difference in their means is significantly different at some confidence level. Please let me know what statistical test we should use, and if you can send a link to some similar example, that would be a great help. Looking forward to quick help. Thanks

Mary says

Hi Jim,

I wanted to thank you for your post! It was really helpful for me.

In my design I have 30 subjects with 10 readings (from different electrodes on the scalp) for each subject in two sessions (immediate test, post test). I used repeated measures ANOVA and I found a significant main effect of sessions and also a significant interaction of sessions and electrodes. Main effect means I have a significant difference between session 1 data and session 2 data, but I am not sure about the interaction effect. I would appreciate it if you could help me with that.

Thanks,

Mary

Jim Frost says

Hi Mary,

I’m not sure what your outcome variable is or what the electrodes variable measures precisely. But, here’s how you’d interpret the results generally.

The relationship between sessions and your outcome variable depends on the value of your electrodes variable. While there is a significant difference between sessions, that difference depends on the value of electrodes. If you create an interactions plot, it should be easier to see what is going on! For more information, see my post about interaction effects.
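As a rough illustration of an “it depends” effect, here is a sketch with made-up cell means for two sessions and two electrodes. The labels and numbers are hypothetical. The interaction shows up as a session effect that differs from one electrode to the other.

```python
# Hypothetical cell means, invented for illustration: (session, electrode) -> mean outcome
cell_means = {
    ("session1", "electrode_A"): 4.0,
    ("session2", "electrode_A"): 6.0,
    ("session1", "electrode_B"): 5.0,
    ("session2", "electrode_B"): 5.5,
}


def session_effect(electrode):
    """Change from session 1 to session 2 at a single electrode."""
    return cell_means[("session2", electrode)] - cell_means[("session1", electrode)]


effect_a = session_effect("electrode_A")
effect_b = session_effect("electrode_B")

# A nonzero difference of differences is the pattern an interaction term captures.
interaction_contrast = effect_a - effect_b
print(effect_a, effect_b, interaction_contrast)
```

If the session effect were the same at every electrode, this contrast would be zero and the lines on an interaction plot would be parallel.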

I hope that helps!

Elias ANDREADAKIS says

Hello Jim !

I am very pleased to meet you and I greatly appreciate your work !

The Repeated Measures ANOVA that I have encountered in my study is as follows :

A number of subject groups, of n – people each, selected e.g by age, are tested repeatedly for the same number of times all, with the same drug ! I.e there is only one drug !

The score is the effectiveness of the drug on a specific body parameter, e.g on blood pressure.

And the question is to assess the effectiveness of the drug.

Subjects group is not a random factor, as it is an age group

Score also is not an independent r.v as it reflects the effect of the previous day of the drug

Do you have any notes on this type of problems or recommend a literature I can access from web ?

My best regards

Elias

Athens / Greece

Jim Frost says

Hi Elias,

It’s OK to not have more than one drug. You just need to be able to compare the one drug to not taking the drug. You can do that both in a traditional control group/treatment group setting or by using repeated measures. However, given that you talk about repeated measures and everyone taking the drug, my guess is that it is some type of crossover design, which I describe in this post.

In this scenario, everyone would eventually take the same drug over the course of the study, but some subjects might start out by not taking the drug while the other subjects do. Then, the subjects switch.

You can include Subjects as a random factor if you randomly selected them from the population. Then, include Age as an additional fixed factor if you’re specifying the age groups or as a covariate if you’re using their actual age (rather than dividing them into groups based on age ranges).

I hope this helps!

Jaime says

Hi Jim,

I am getting conflicting advice.

I ran a pre-test, intervention, post-test study with 4 groups (3 experimental and one control). I tested hamstring strength. In my repeated measures ANOVA I had an effect of time but NO interaction effect. I have been told that due to no interaction effect I do NOT run a post-hoc analysis. Is this correct, as someone else has told me the complete opposite (I only run a post-hoc analysis when I do not have an interaction effect)?

Jim Frost says

Hi Jaime,

The correct action to do depends on the specifics of your study, which might be why you’re getting conflicting advice!

As a general statistical principle, it’s perfectly fine to perform post-hoc tests regardless of whether the interaction effect is significant or not. The only time that it makes no sense to perform a post hoc test is when no terms in your model are statistically significant. Although, even in that case, post hoc tests can sometimes detect statistical significance–but that’s another story. But, in a nutshell, you can perform post hoc tests whether or not your interaction term is significant.

However, I suspect that the real question is whether it makes sense given the pre-test post-test nature of your study. You have measurements before and after the intervention. If the intervention is effective, you’d expect the differences to show up after the intervention but not before. Consequently, that is an interaction effect because it depends on the time of measurement. Read my blog post about interaction effects to see how these are “it depends” effects. So, if your interaction effect is not significant, it might not make sense to analyze your data further.

If the main effect for the treatment group variable is significant but not the interaction effect, it’s a bit difficult because it says that the treatment groups cause a difference between group means even in the pre-test measurement! That might represent only the differences between the subjects within those groups–it’s hard to say. You really want that interaction term to be significant!

If only the time effect is significant and nothing else, it’s probably not worth further investigation.

One thing I can say definitively is that the person who said that you can only perform a post-hoc analysis when the interaction is not significant is wrong! As a general principle, it’s OK to perform post-hoc analyses when an interaction term is significant. For your study, you particularly want a significant interaction term!