Graphing your data before performing statistical analysis is a crucial step. Graphs bring your data to life in a way that statistical measures do not because they display the relationships and patterns. In this blog post, you’ll learn about using boxplots and individual value plots to compare distributions of continuous measurements between groups. You’ll also learn why you need to pair these plots with hypothesis tests when you want to make inferences about a population.
Use boxplots and individual value plots when you have a categorical grouping variable and a continuous outcome variable. The levels of the categorical variable form the groups in your data, and the researchers measure the continuous variable.
Both of these graphs allow you to compare the distribution of the continuous values between the groups in your sample data. You can assess properties such as the shapes of the distributions, central tendencies, and variability while looking out for outliers. These types of graphs are often precursors to hypothesis tests, such as 2-sample t-tests and ANOVA.
The sample datasheet below shows how researchers record data for both types of charts. Material is the categorical variable while Strength is the continuous variable. We’ll use this dataset for the individual value plot example that follows.
Note that the graphs in this post are best for comparing distributions between groups. When you need to assess a single continuous distribution, histograms and probability distribution plots are often better choices. For more information about histograms, see Using Histograms to Understand Your Data.
Individual Value Plots
As the name suggests, individual value plots display the value of each observation. This graph is best when you have fewer than 50 data points per group. With a larger number of samples, the data points can become packed close together, jumbled, and hard to evaluate.
- Assess the central tendency by noting the vertical position of each group’s center.
- Assess the variability by gauging the vertical range of data points within each group.
Let’s take a look at the example below. This chart displays the strengths of four different materials. To create this graph yourself, download the CSV data file: IndividualValuePlot. Material type is our categorical grouping variable and Strength is the continuous outcome variable that the researchers measured.
Several things stand out in this graph. Comparing the central tendencies of the groups, Material 1 has the highest of the four while Material 3 has the lowest. Regarding variability, Material 3 has a broader range than the other groups.
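If you want numbers to go with the visual impression, here is a minimal sketch that summarizes each group's center and vertical range. The records are hypothetical values made up for illustration, not the contents of the post's CSV file, and `summarize_by_group` is a helper name invented for this example.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (Material, Strength) records, invented for illustration only.
records = [
    ("1", 41.2), ("1", 43.5), ("1", 42.8), ("1", 44.1),
    ("2", 38.9), ("2", 40.3), ("2", 39.6), ("2", 41.0),
    ("3", 30.1), ("3", 37.4), ("3", 33.2), ("3", 26.5),
    ("4", 39.0), ("4", 40.8), ("4", 38.2), ("4", 41.5),
]


def summarize_by_group(rows):
    """Return {group: (median, range)} for an iterable of (group, value) pairs."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {
        group: (median(values), max(values) - min(values))
        for group, values in groups.items()
    }


summary = summarize_by_group(records)
for group, (center, spread) in sorted(summary.items()):
    print(f"Material {group}: median = {center:.2f}, range = {spread:.2f}")
```

The median stands in for the "vertical position of each group's center" and the range for the "vertical range of data points." The plot remains the better tool for spotting shapes and outliers.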
Boxplots
Like individual value plots, use boxplots to compare the shapes of distributions, find central tendencies, assess variability, and identify outliers. Boxplots are also known as box and whisker diagrams. While boxplots have the same goals as individual value plots, they look very different.
Instead of displaying the raw data points, boxplots break your sample data down into quartiles, present ranges of values based on those quartiles, and display asterisks for outliers that fall outside the whiskers. When your sample size is too small, the quartile estimates might not be meaningful. Consequently, these graphs work best when you have at least 20 data points per group.
Let’s take a look at the anatomy of a boxplot before getting to an example. Notice how it divides your data into quarters. The division is only approximate because the upper and lower whiskers do not include outliers, which the chart displays separately.
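To make the anatomy concrete, here is a rough sketch of the numbers behind a single box. It assumes the common convention that whiskers reach the most extreme observations within 1.5 times the interquartile range of the box; statistical packages differ in how they estimate quartiles, so your software may report slightly different values. The function name `boxplot_stats` and the scores are hypothetical.

```python
from statistics import quantiles


def boxplot_stats(values, whisker_factor=1.5):
    """Compute the pieces of a boxplot under the 1.5 x IQR whisker convention."""
    data = sorted(values)
    # "inclusive" interpolates between the sample's own minimum and maximum.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value that is not an outlier
        "whisker_high": inside[-1],  # largest value that is not an outlier
        "outliers": outliers,        # drawn separately, e.g. as asterisks
    }


# Hypothetical scores for one group (20 observations).
scores = [61, 64, 66, 67, 68, 69, 70, 70, 71, 72,
          72, 73, 74, 74, 75, 76, 77, 78, 80, 98]

stats = boxplot_stats(scores)
for name, value in stats.items():
    print(f"{name}: {value}")
```

In this made-up sample, the value 98 falls beyond the upper whisker, so it appears in `outliers` rather than stretching the whisker.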
The image below shows how boxplots compare to the probability density function for a normal distribution. Notice how each whisker contains 24.65% of the distribution rather than an exact 25%. The small remainder lies beyond the whiskers, and boxplots consider those observations to be outliers.
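You can check the 24.65% figure yourself. The sketch below assumes the whiskers extend to 1.5 times the interquartile range beyond the box, which is a common convention that is not stated explicitly above. It uses only Python's standard library.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Quartiles of the standard normal distribution form the edges of the box.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Assumed whisker limits: 1.5 x IQR beyond each edge of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Share of the distribution covered by the lower whisker.
lower_whisker_share = std_normal.cdf(q1) - std_normal.cdf(lower_fence)

# Share of the distribution beyond both whiskers combined.
outlier_share = std_normal.cdf(lower_fence) + (1 - std_normal.cdf(upper_fence))

print(f"Each whisker covers {lower_whisker_share:.2%} of the distribution")
print(f"Beyond the whiskers: {outlier_share:.2%}")
```

Under that assumption, each whisker covers about 24.65% of a normal distribution, and roughly 0.7% of it falls beyond the two whiskers combined.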
Learn more about outliers, including how boxplots detect them, in my post 5 Ways to Find Outliers in Your Data.
Using Boxplots to Assess Distributions
If you’re assessing a single distribution, using a histogram or a probability distribution plot is probably better. However, for comparing multiple distributions, boxplots are a great method. I find that they’re easier to interpret than individual value plots when you have a sufficiently large sample size.
To compare central tendencies, use the median line in the boxes.
For the variability, remember that half your data for each group falls within the interquartile box. The longer the box and whiskers, the greater the variability of the distribution.
To determine whether a distribution is skewed, look at where the data fall compared to the median. For symmetric distributions, the length of the box and whiskers on both sides of the median should be approximately equal. If the two sides are not roughly equivalent, your distribution is skewed.
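If you want a number to accompany the visual check, one option is the quartile (Bowley) skewness coefficient, which compares the two halves of the box. This is a technique brought in for illustration rather than something the boxplot itself reports, and it ignores the whiskers, so treat it as a rough companion to the graph. The helper name and the data are hypothetical.

```python
from statistics import quantiles


def quartile_skewness(values):
    """Bowley's coefficient: ((Q3 - median) - (median - Q1)) / (Q3 - Q1).

    Near 0 suggests a symmetric box, positive values a longer upper half
    (right skew), and negative values a longer lower half (left skew).
    """
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return ((q3 - med) - (med - q1)) / (q3 - q1)


# Hypothetical samples chosen to show each case.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 2, 2, 3, 3, 4, 6, 9, 15]
left_skewed = [1, 7, 10, 12, 13, 13, 14, 14, 15]

for name, sample in [("symmetric", symmetric),
                     ("right-skewed", right_skewed),
                     ("left-skewed", left_skewed)]:
    print(f"{name}: {quartile_skewness(sample):+.2f}")
```

For these made-up samples, the coefficient is 0 for the symmetric data, +0.5 for the right-skewed data, and -0.5 for the left-skewed data.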
Example of Using a Boxplot to Compare Groups
Suppose we have four groups of scores and we want to compare them by teaching method. To create this graph yourself, download the CSV data file: Boxplot. Teaching method is our categorical grouping variable and Score is the continuous outcome variable that the researchers measured.
Methods 1 and 2 have nearly identical medians, but Method 1 has somewhat more variability. Method 2 also has an outlier that we should investigate. Method 3 has the highest variability in scores and is potentially left-skewed. Method 4 has the highest median.
Using Individual Value Plots and Boxplots in Conjunction with Hypothesis Tests
Graphing your data is an excellent way to get an intuitive feel for it and to see whether the groups in your sample appear to differ. However, if you want to go beyond describing your sample and draw inferences about the population, use hypothesis tests in conjunction with these graphs. If you go this route, you’ll also need to use a sampling method, such as random sampling, to obtain a sample that represents the population.
Do the differences in your sample represent differences in the population? You might see patterns in the graphs of your sample data that are just flukes based on random sampling rather than denoting a real relationship in the population. On boxplots and individual value plots, random error in your sample can produce apparent differences between the central tendency and variability of the groups. Additionally, arbitrary graph factors such as the scale of the Y-axis can exaggerate the appearance of differences.
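A small simulation illustrates how easily random sampling alone produces apparent differences. In this sketch, both groups are drawn from the same population, so any gap between their medians is pure noise. The population mean, standard deviation, group size, and the 3-point threshold are arbitrary assumptions chosen for illustration.

```python
import random
from statistics import median


def simulate_median_gaps(n_per_group=20, n_trials=1000,
                         mean=70.0, sd=10.0, seed=42):
    """Draw two groups from the SAME normal population many times and
    record the absolute gap between their sample medians."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        group_a = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        group_b = [rng.gauss(mean, sd) for _ in range(n_per_group)]
        gaps.append(abs(median(group_a) - median(group_b)))
    return gaps


gaps = simulate_median_gaps()
share_large = sum(gap > 3 for gap in gaps) / len(gaps)
print(f"Typical gap between medians: {median(gaps):.2f}")
print(f"Share of trials with a gap above 3 points: {share_large:.1%}")
```

Even though the two groups have no real difference, a boxplot of any single trial can show medians that look noticeably apart, which is why the graph alone cannot settle the question.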
Hypothesis tests play a critical role in separating the signal (real effects in the population) from the noise (random sampling error). This protective function helps prevent you from mistaking random error for a real effect. If the appropriate hypothesis test is not statistically significant, your sample provides insufficient evidence for concluding that the pattern on your graph represents a real effect at the population level. In other words, you might be looking at noise in the sample.
Hypothesis Tests for Boxplots and Individual Value Plots
The following are hypothesis tests you can use in conjunction with these graphs:
- 2-sample t-test: Assess the equality of two group means.
- ANOVA: Test the equality of three or more group means.
- Mann-Whitney: Assess the equality of two group medians.
- Kruskal-Wallis and Mood’s Median: Test the equality of three or more group medians.
- Test of Equal Variances: Assess the equality of group variances or standard deviations.
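Statistical packages provide the tests listed above. To show the underlying logic with nothing but Python's standard library, the sketch below uses a permutation test for a difference in two group means. To be clear, this is a different technique from the tests in the list, swapped in because it is easy to write out in full. The function name and the scores are hypothetical.

```python
import random


def permutation_test_means(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Repeatedly shuffles the group labels and counts how often the shuffled
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if shuffled_diff >= observed:
            extreme += 1
    # Add 1 to both counts so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical scores for two teaching methods.
method_1 = [72, 75, 78, 74, 71, 77, 73, 76]
method_4 = [85, 88, 83, 90, 86, 84, 89, 87]

p_value = permutation_test_means(method_1, method_4)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value suggests that the gap you see on the graph would rarely arise from random assignment alone. As with any hypothesis test, the conclusion extends to the population only if the sample represents it.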
Boxplots and individual value plots are great ways to explore your data. Just be sure to use a representative sampling methodology and an appropriate hypothesis test if you want to go beyond the sample and draw inferences about the entire population.