What is the Wilcoxon Signed Rank Test?
The Wilcoxon signed rank test is a nonparametric hypothesis test that can do the following:
- Evaluate the median difference between two paired samples.
- Compare the median of a single sample to a reference value.
In other words, it is the nonparametric alternative to both the 1-sample t-test and the paired t-test.
To perform the 1-sample test, analyze the raw data values. For the paired version, calculate the differences between the paired values and analyze them.
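As a rough illustration of what "analyzing" the values involves, the sketch below ranks the absolute differences and sums the ranks separately for positive and negative differences. The function name `signed_rank_sums` and the scores are made up for this example. Dropping zero differences and averaging tied ranks are common conventions, assumed here rather than taken from the text above.

```python
def signed_rank_sums(values, reference=0.0):
    """Return (W_plus, W_minus), the rank sums of positive and negative differences."""
    # Differences from the reference value; zero differences are dropped (assumed convention).
    diffs = [v - reference for v in values if v != reference]
    # Order by absolute size so that rank 1 goes to the smallest absolute difference.
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j across a run of tied absolute differences.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        avg_rank = (i + j + 2) / 2  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return w_plus, w_minus


# Paired version: compute the differences first, then analyze them against zero.
before = [8, 7, 9, 6, 10, 7, 8, 9]  # hypothetical pre-test scores
after = [6, 7, 5, 7, 7, 5, 7, 6]    # hypothetical post-test scores
paired_diffs = [a - b for a, b in zip(after, before)]
w_plus, w_minus = signed_rank_sums(paired_diffs)
print(w_plus, w_minus)
```

For the 1-sample version, the same function can be called on the raw values with `reference` set to the benchmark. A rank sum that is much smaller on one side than the other suggests the differences lean in one direction.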
Most frequently, analysts use the Wilcoxon signed rank test to evaluate paired samples, such as before and after treatment scores. For example, a medical study might assess medication effectiveness by comparing the pre-test and post-test median symptom scores.
Like all hypothesis tests, this one uses samples to draw conclusions about populations. Learn more about Populations vs. Samples.
If you need a nonparametric test for two independent groups, learn about the Mann Whitney U Test. For three or more groups, consider the Kruskal Wallis Nonparametric Test.
Learn more about Parametric vs. Nonparametric Tests and Hypothesis Testing Overview.
Wilcoxon Signed Rank Test Assumptions
Statisticians often use the Wilcoxon signed rank test when their data do not follow the normal distribution. However, it has other advantages over t-tests, including the ability to analyze ordinal data and reduce the impact of outliers.
While the data don’t need to be normally distributed, they must follow a symmetrical distribution. When using the paired form, the distribution of the differences between the paired values must be symmetrical.
If the distribution is asymmetric, consider using the sign test. This nonparametric test is like the Wilcoxon signed rank test but can handle asymmetric distributions. However, the sign test is less powerful because it uses only the direction of each difference and ignores its size.
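For comparison, here is a minimal sketch of an exact two-sided sign test. It counts positive and negative differences and asks how surprising the split would be if each direction were equally likely. The function name `sign_test_p_value` and the data are hypothetical, and dropping zero differences is an assumed convention.

```python
from math import comb


def sign_test_p_value(diffs):
    """Exact two-sided sign test p-value for H0: the median difference is zero."""
    pos = sum(1 for d in diffs if d > 0)
    neg = sum(1 for d in diffs if d < 0)
    n = pos + neg  # zero differences carry no direction, so they are dropped
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # Probability of a split at least this lopsided in one direction under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


diffs = [-2, 0, -4, 1, -3, -2, -1, -3]  # hypothetical paired differences
p_value = sign_test_p_value(diffs)
print(p_value)
```

Note that the sizes of the differences never enter the calculation, only their signs.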
Null and Alternative Hypotheses
The Wilcoxon signed rank test has two sets of hypotheses. The correct set depends on whether you perform the paired or the one-sample test.
The paired form determines whether the median difference between paired observations differs from zero. The one-sample form determines whether the median differs from a benchmark value.
Paired Test
The following are the hypotheses for the paired Wilcoxon signed rank test:
- Null hypothesis: The median of the paired differences equals zero in the population.
- Alternative hypothesis: The median of the paired differences does not equal zero in the population.
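In the paired Wilcoxon signed rank test, a median difference of zero indicates no effect or difference between the paired observations. For example, when the median change from pre-test to post-test does not differ significantly from zero, you fail to reject the null hypothesis. Your sample does not provide enough evidence to conclude that the treatment affected the outcomes, but that is not proof that the treatment has no effect.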
However, if your p-value is less than or equal to your significance level, the results are statistically significant, and you reject the null hypothesis. You can conclude that the median difference is not zero. In other words, an effect exists in the population.
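The decision rule can be sketched for a small paired sample by computing an exact p-value. Under the null hypothesis, each rank is equally likely to belong to a positive or a negative difference, so the sketch enumerates every sign pattern. The function name `exact_paired_signed_rank`, the scores, and the 0.05 significance level are assumptions for illustration. The code also assumes no tied absolute differences and is practical only for small samples.

```python
from itertools import product


def exact_paired_signed_rank(before, after):
    """Return (W_plus, exact two-sided p-value); assumes no tied absolute differences."""
    # Paired differences, with zero differences dropped (assumed convention).
    diffs = [a - b for a, b in zip(after, before) if a != b]
    ordered = sorted(diffs, key=abs)  # position + 1 is the rank
    n = len(ordered)
    w_plus = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)
    center = n * (n + 1) / 4  # expected W_plus under the null hypothesis
    observed = abs(w_plus - center)
    # Count sign patterns whose W_plus is at least as far from the center as observed.
    extreme = sum(
        1
        for signs in product((False, True), repeat=n)
        if abs(sum(r for r, s in enumerate(signs, start=1) if s) - center) >= observed
    )
    return w_plus, extreme / 2 ** n


before = [20, 18, 25, 15, 22, 19, 24, 27, 16, 30]  # hypothetical pre-test scores
after = [15, 15, 17, 17, 16, 15, 17, 18, 15, 20]   # hypothetical post-test scores
alpha = 0.05                                       # assumed significance level
w_plus, p_value = exact_paired_signed_rank(before, after)
print(w_plus, p_value, "reject H0" if p_value <= alpha else "fail to reject H0")
```

Statistical software typically switches to a normal approximation for larger samples or when ties are present, so its p-values can differ slightly from an exact enumeration like this one.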
To better understand the paired test, read about the Paired T-Test, which evaluates the mean rather than the median.
One-Sample Test
The following are the hypotheses for the one-sample Wilcoxon signed rank test:
- Null hypothesis: The population median equals the benchmark value.
- Alternative hypothesis: The population median does not equal the benchmark value.
The one-sample form compares the sample median to a hypothetical population median. The hypothetical value can be a target or benchmark. When the results are statistically significant, reject the null hypothesis and conclude that the population median does not equal your benchmark value.
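For the one-sample form, a common large-sample shortcut is a normal approximation to the signed rank statistic. The sketch below is one way to write it, assuming no tied absolute differences and applying no continuity correction. The function name `one_sample_signed_rank_normal`, the measurements, and the benchmark of 50 are hypothetical.

```python
from math import erfc, sqrt


def one_sample_signed_rank_normal(values, benchmark):
    """Return (W_plus, z, two-sided p-value) using a normal approximation."""
    # Differences from the benchmark; values equal to it are dropped (assumed convention).
    diffs = [v - benchmark for v in values if v != benchmark]
    if not diffs:
        raise ValueError("all values equal the benchmark")
    ordered = sorted(diffs, key=abs)  # assumes no tied absolute differences
    n = len(ordered)
    w_plus = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)
    mean_w = n * (n + 1) / 4                     # mean of W_plus under H0
    sd_w = sqrt(n * (n + 1) * (2 * n + 1) / 24)  # standard deviation under H0
    z = (w_plus - mean_w) / sd_w
    p_value = erfc(abs(z) / sqrt(2))             # two-sided normal tail probability
    return w_plus, z, p_value


values = [53.1, 47.8, 55.4, 51.2, 56.9, 49.5, 54.3, 58.0, 52.6, 57.2, 46.1, 59.5]
w_plus, z, p_value = one_sample_signed_rank_normal(values, benchmark=50)
print(w_plus, round(z, 3), round(p_value, 4))
```

With a sample this small, an exact p-value is usually preferable. The approximation is shown only to make the structure of the calculation visible.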
Consider this robust, nonparametric alternative when your data are misbehaving. Whether dealing with nonnormal distributions, ordinal data, or outliers, it’s a handy tool in your data analysis toolbox.
Ali says
Hi Jim,
I’m a graduate student working on my dissertation in the field of English. I administered a frighteningly large survey that I am now having a harder time analyzing than I had originally expected. (I greatly overestimated my memory of the many statistics classes I took 5 to 10 years ago.) I’m very sorry to bother you, but your website has been the most helpful resource I have found so far (and I’ve read three separate statistics textbooks, a training manual on SPSS, and innumerable websites); however, I have a series of questions to make sure I’m not messing things up catastrophically.
First, necessary background information: my survey asked several demographic questions then analysis questions. The analysis questions involved showing respondents a policy (Policy A), asking 2-5 questions about that policy, showing an alternative policy (Policy B), asking matching 2-5 questions about that policy, then asking respondents for their preference between Policy A or B. This was then repeated for several policies. The questions about the policies were almost always 5-point Likert scale questions (strongly disagree – strongly agree).
My questions:
1. For example, if question 1 asked respondents their level of agreement to the statement “Policy A is reasonable” and question 2 asked respondents their level of agreement to the statement “Policy B is reasonable” where Policy B is a revision of Policy A, these questions could be treated as a matched pair, yes? I should use the paired Wilcoxon signed rank test to determine if there is a significant difference in response between questions 1 and 2?
2. I’ll keep using the same example. If respondents are asked to choose their preference between Policy A and Policy B, I can compare how respondents’ preference for a specific policy relates to their perceptions of reasonableness for both policies, yes? (For example, did respondents who preferred Policy B find Policy A to be unreasonable and vice versa.) Would I still use Wilcoxon signed rank test or is another test more appropriate? I had originally intended to use chi-square tests for independence, but found “expected counts” were often too small for too many boxes (very few respondents found either policy very unreasonable).
3. For demographics analysis, I can and should use chi-square tests to compare most demographics (age, race, gender, native language for example) to policy preference (choice between Policy A and Policy B), yes? (If no, then what should I be using?)
4. For demographics compared to Likert scale responses, how do I test for interactions here? For example, comparing respondent gender and perception of reasonableness of Policy A? Cross tabulations are really interesting here (for example, women are tending to have stronger opinions – either strongly agreeing or strongly disagreeing – while men are more likely to disagree or agree), but I’m struggling with how to prove whether or not “interesting” equates to “significant” or is just interesting.
5. Similarly, can I just discuss findings in cross tabulations? For example, in one comparison (of respondents’ confidence in their ability to complete Task A or Alternative Task A), 10.6% of respondents “had concerns” about their ability to complete Task A, but, of those 10.6%, only 35.6% had similar concerns for Alternative Task A. Respondents who expected to struggle with Task A were often reasonably confident in their ability to successfully complete Alternative Task A (42.4% confident in their ability to complete the task after getting clarifying information and 20.3% confident they could complete the task without asking for clarifying information). Inversely, of the 15.5% of respondents who expected to struggle with Alternative Task A, 24.4% had the same amount of concern for Task A while 75.6% were more confident in their ability to complete Task A with clarifying information, yet no one in this group moved back to “I could definitely not complete this task” or forward to “I could complete Task A without clarifying information.” This is just a discussion of descriptive statistics, yes? Does there need to be some form of statistical significance for this? If so, how would I find it?
6. Would it be appropriate to group data to perform chi-square tests? For example, I’m primarily investigating the difference between positive opinions and everything else. Could I group (strongly disagree, disagree, and neutral) together and (agree and strongly agree) together to see if there are interactions or would this be excessive manipulation of the data? (I’m leaning towards excessive manipulation, but I’m beginning to grasp at straws since I’m struggling with figuring out which tests *are* actually appropriate.)
This is a lot of big questions. I definitely understand if you don’t have time to address any or all of them. I thank you for even looking at this comment and greatly appreciate whatever feedback you are able to provide me.
Sincerely,
Ali