What is a Stem and Leaf Plot?
Stem and leaf plots display the shape and spread of a continuous data distribution. These graphs are similar to histograms, but instead of using bars, they show digits. They’re particularly valuable during exploratory data analysis because they can help you identify the central tendency, variability, and skewness of your distribution, as well as any outliers. Stem and leaf plots are also known as stemplots.
Stem and leaf plots have one advantage over histograms: they display the original data values, while histograms only summarize them.
Stem and leaf plots have been around for a long time. They were popular in the early 1900s because you could easily make these graphs by hand or with typewriters. In the early days of computers, primitive monitors and printers were able to display these simple graphs. However, as computer graphics capabilities grew, the popularity of stem and leaf plots declined.
I have personally known statisticians who have waxed nostalgic over these graphs! In fact, when I worked at a statistical software company, we removed this graph from the menu, and there were complaints. We put it back in! The stem and leaf plot might not be the most common graph, but it has devoted followers.
In this post, you’ll learn how to make and read a stem and leaf plot.
How to Make a Stem and Leaf Plot
Stem and leaf plots are good choices for a medium amount of data. If you have fewer than 15 data points, you have too few data to produce a meaningful distribution. In this case, you’ll probably want to make a dot plot. Conversely, these graphs can become cluttered with more than 100 data points. Instead, you can use a histogram or boxplot.
In stem and leaf plots, you split each data point into a stem value and a leaf value. The stem values divide the data points into groups. The stem contains all the digits of a data point except the final digit, which is the leaf.
For example, if a data point is 42, the stem is 4 and the leaf is 2. When your data have more digits, you’ll need a longer stem. For instance, 238 has a stem of 23 and a leaf of 8.
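As a minimal sketch of that split, assuming non-negative whole numbers, integer division by 10 gives the stem and the remainder gives the leaf. The function name `split_stem_leaf` is just an illustrative choice.

```python
def split_stem_leaf(value):
    """Split a non-negative whole number into (stem, leaf)."""
    # divmod(value, 10) returns (value // 10, value % 10):
    # everything except the final digit, then the final digit.
    return divmod(value, 10)


print(split_stem_leaf(42))   # (4, 2)
print(split_stem_leaf(238))  # (23, 8)
```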
You’ll need to round the values to a consistent decimal place. That decimal place becomes your leaf value. You can round to a fractional value (e.g., 0.1), but frequently you’ll round the final digit to a whole number. For very large values, you might round to the 10s or 100s place.
For the example in the next section, I’ve rounded the values to the 1s place.
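Here is one way the rounding step might look in code. It is a sketch rather than how any particular package does it. The hypothetical `to_stem_leaf` function expresses a value as a whole number of leaf units and then splits off the final digit. It assumes non-negative values, and note that Python’s built-in `round` sends exact halves to the nearest even number.

```python
def to_stem_leaf(value, leaf_unit=1):
    """Round value to the nearest leaf unit, then split into (stem, leaf)."""
    units = round(value / leaf_unit)  # whole number of leaf units
    return divmod(units, 10)          # final digit of that count is the leaf


print(to_stem_leaf(23.7))         # rounds to 24   -> (2, 4)
print(to_stem_leaf(23.74, 0.1))   # rounds to 23.7 -> (23, 7)
print(to_stem_leaf(4382, 100))    # rounds to 4400 -> (4, 4)
```

With a leaf unit of 100, the leaf is the 100s digit and the stem holds the 1000s, so 4382 lands on stem 4 with a leaf of 4.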
Step-by-Step Instructions for Making a Stem and Leaf Plot
To make a stem and leaf plot, do the following:
- Sort your data in ascending order and round the values.
- Divide your raw data into stem and leaf values.
- Write down your stem values to set up the groups.
- Add the leaf values in numerical order next to each stem value to build the row for that group.
The example below shows the progression from raw data to stem and leaf values, and finally, the graph.
This stem and leaf plot displays a symmetric distribution with no apparent outliers. Additionally, if we had only the graph and not the original data, we could reconstruct the data values from it. In fact, after deriving all the original data, we can calculate all the usual sample statistics.
Here are a couple of tips. First, add the leaf values to each stem in numerical order, which makes the plot easier to read. You can see that in the example graph.
Second, if you have a stem with no leaves, include it on the plot anyway to preserve the scale of the stem axis and highlight the lack of values. That can be important when looking for outliers.
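The steps and tips above can be sketched in a few lines of Python. This is an illustration under simple assumptions (non-negative values rounded to the 1s place), not the output format of any statistics package. The `stem_and_leaf` function and the `data` list are made up for the example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the text lines of a basic stem and leaf plot."""
    groups = defaultdict(list)
    # Sort first so leaves are appended to each stem in numerical order.
    for v in sorted(round(x) for x in values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)

    lines = []
    # Walk every stem from smallest to largest so empty stems still get a row.
    for stem in range(min(groups), max(groups) + 1):
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines


# Made-up data for illustration only.
data = [12, 15, 21, 23, 23, 27, 30, 34, 38, 52]
print("\n".join(stem_and_leaf(data)))
#  1 | 25
#  2 | 1337
#  3 | 048
#  4 |
#  5 | 2
```

The empty row for stem 4 makes the gap before 52 obvious, which is exactly why you keep stems that have no leaves.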
You can learn a lot about a data distribution by graphing it. The principles for interpreting a stem and leaf plot are the same as those for a histogram. To learn more, read my post about Interpreting Histograms.
How to Read a Stem and Leaf Plot
These days, it’s unlikely you’ll need to create a stem and leaf plot by hand. However, you might see one made by statistical software, and it will have several more features than a handmade one. Let’s learn about them!
The stem and leaf plot below displays the body fat percentage values I obtained during a study. I often use this dataset to illustrate a right-skewed, nonnormal distribution. You can download the dataset yourself: body_fat.
At first glance, you can see that there are 92 observations, the data are right-skewed, and the peak occurs in the row for 22 and 23. Let’s look at some of the other features because they’ll allow us to draw additional conclusions.
Here’s how to read a stem and leaf plot.
Related post: Skewed Distributions
Leaf Unit or Key
The leaf unit or key allows us to interpret the value of each leaf. This stem and leaf plot uses a leaf unit, but others have a key, which provides similar information.
Our graph says the leaf unit = 1.0. That’s simple because a leaf of 1 = 1, 2 = 2, and so on. If the unit had been 10, the leaves would’ve been 10, 20, 30, etc. Or, if it had been 0.1, leaves would represent 0.1, 0.2, and so on. This unit depends on how you or your software rounds the data.
Because the leaf unit is 1, we know the stem values must start in the 10s place. Therefore, the stem values of 1, 2, 3, and 4 correspond to 10, 20, 30, and 40. Using this information, you can determine the value of every data point on this graph!
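To make the arithmetic concrete, here is a small sketch of how a stem, a leaf, and the leaf unit combine into a data value: the stem counts tens of leaf units and the leaf counts single leaf units. The function name and the row of leaves are hypothetical, not taken from the body fat graph.

```python
def values_in_row(stem, leaves, leaf_unit=1):
    """Rebuild the data values for one row of a stem and leaf plot."""
    # Each value is (stem * 10 + leaf) leaf units.
    return [(stem * 10 + int(digit)) * leaf_unit for digit in leaves]


# A made-up row "1 | 6789" with leaf unit = 1
print(values_in_row(1, "6789"))       # [16, 17, 18, 19]

# The same row if the leaf unit were 10
print(values_in_row(1, "6789", 10))   # [160, 170, 180, 190]
```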
Multiple Stem Rows
Statistical software packages use an algorithm to improve the appearance of stem and leaf plots by using multiple rows for each stem value based on the data’s properties, which is the case with the body fat percentage graph. It has two rows for stem 1, five rows for stem 2, five rows for stem 3, and four rows for stem 4.
For the body fat percentage data, the graph divides each stem value into five rows. Each row holds only two possible leaf values (e.g., 0 and 1, 2 and 3, etc.). The rows stop at the minimum and maximum values of the dataset. Consequently, the extreme stem values can have fewer rows than the other stem values. In our graph, 1 and 4 are the extreme stem values, and they both have fewer rows than the middle values (2 and 3).
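The sketch below shows one way to produce five rows per stem, assuming whole-number data with a leaf unit of 1. Because each row covers two consecutive values, `value // 2` works as a row index. The function name and the data are made up. Only the minimum (16) and maximum (46) match the body fat graph, which is enough to reproduce its row counts.

```python
def split_stem_rows(values):
    """Return (stem, leaves) pairs with five rows per stem, two leaf values each."""
    vals = sorted(values)
    # Rows run from the one holding the minimum to the one holding the maximum.
    rows = {idx: [] for idx in range(vals[0] // 2, vals[-1] // 2 + 1)}
    for v in vals:
        rows[v // 2].append(v % 10)
    # idx * 2 is the smallest value in the row; dropping its last digit gives the stem.
    return [(idx * 2 // 10, "".join(map(str, leaves))) for idx, leaves in rows.items()]


# Made-up data sharing only the minimum and maximum of the body fat example.
data = [16, 19, 22, 23, 23, 26, 27, 31, 38, 46]
rows = split_stem_rows(data)
for stem, leaves in rows:
    print(f"{stem} | {leaves}")

stems = [stem for stem, _ in rows]
print([stems.count(s) for s in (1, 2, 3, 4)])  # [2, 5, 5, 4]
```

Stem 1 gets only the rows for 6 and 7 and for 8 and 9, and stem 4 stops at the row for 6 and 7, so the extreme stems have fewer rows than the middle ones.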
In these rows, the minimum data point is 16 and the maximum is 46. The range of this dataset is 30.
For stem and leaf plots, statistical software often highlights the median in some fashion. This software indicates where the median occurs by placing parentheses around the count for that row. For these data, the marked row holds the leaf values 6 and 7 under stem 2, so we know the median lies between 26 and 27.
The first column contains cumulative counts. The format of these counts might not be intuitive at first. For each row, the counts sum that row and all rows further away from the median out to the distribution’s tail.
For example, the stem = 2 row with the leaf values of 4 and 5 has a count of 39. This number indicates there are 39 observations in this row and lower (towards the left tail). On the higher side of the median, the stem = 2 row with values of 8 and 9 has a count of 43. This count indicates there are 43 observations in that row and higher (towards the right tail).
The purpose behind this funny way of counting is to present a kind of distribution density. Where do most values fall? Higher counts correspond to more frequently occurring data values. For these data, the counts indicate that the majority of the values are between 22 and 29.
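If the counting scheme still feels odd, this sketch may help. Given the number of leaves in each row, it accumulates counts from both tails toward the median and puts parentheses around the plain count of the row that contains the median. The function name is hypothetical, and the rule for locating the median row is a simplifying assumption. Real software may handle edge cases differently.

```python
def depth_column(row_counts):
    """Return the count column for rows listed from lowest to highest values."""
    n = sum(row_counts)
    middle = (n + 1) / 2  # position of the median in the sorted data
    column = []
    from_low = 0
    for count in row_counts:
        from_low += count                  # this row plus all lower rows
        from_high = n - from_low + count   # this row plus all higher rows
        if from_low >= middle and from_high >= middle:
            column.append(f"({count})")    # the median falls in this row
        else:
            column.append(str(min(from_low, from_high)))
    return column


# Made-up row counts for illustration.
print(depth_column([2, 3, 5, 4, 3, 1]))  # ['2', '5', '(5)', '8', '4', '1']
```

Applied to the body fat counts quoted above, the 92 observations split into 39 in the rows below the median row and 43 in the rows above it, which leaves 92 - 39 - 43 = 10 in the median row. Passing those three groups as `[39, 10, 43]` returns `['39', '(10)', '43']`.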
Have you become a stem and leaf plot devotee? I like how they present the same distribution properties as histograms, but you can also pull out some or all of the data values.