What is the 5 Number Summary?
The 5 number summary is an exploratory data analysis tool that provides insight into the distribution of values for one variable. Collectively, this set of statistics describes where data values occur, their central tendency, variability, and the general shape of their distribution.
The five number summary provides this information using various descriptive statistics. These statistics are all order statistics—each one describes where a particular value falls in the distribution. The five statistics in this summary are the following, from highest to lowest data values:
- Highest value in the dataset.
- Third quartile (Q3): greater than 75% of the values in the dataset.
- Median or second quartile (Q2): splits the dataset in half.
- First quartile (Q1): greater than 25% of the values in the dataset.
- Lowest value in the dataset.
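To make the list above concrete, here is a minimal Python sketch that computes the five statistics with the standard library. The function name `five_number_summary` and the sample values are my own illustrations, not the body fat data. Quartile conventions also differ between software packages, so Q1 and Q3 from other tools may not match these exactly.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" is one of several quartile conventions; other
    # conventions can give slightly different Q1 and Q3 for small samples.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

# Hypothetical sample, not the body fat data used later in this post.
sample = [2, 4, 4, 5, 7, 9, 10, 12, 15]
print(five_number_summary(sample))  # prints (2, 4.0, 7.0, 10.0, 15)
```

The function returns the values from lowest to highest, which is the reverse of the order in the list above.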
Before we interpret an example, let’s briefly look at why the 5 number summary contains these particular statistics.
Why Are These Statistics in the 5 Number Summary?
Why does the 5 number summary contain these statistics instead of more familiar ones, such as the mean and standard deviation? These five statistics provide the same kinds of information as the familiar ones while having advantages over them.
Keep in mind that the purpose of the five number summary is to provide a preliminary sense of your data during the exploratory phase of analysis. At this point, you probably don’t know much about the dataset, its distribution, or whether it contains outliers. Statisticians picked these five statistics because they are less sensitive to skewed distributions and outliers. The statistics in the 5 number summary are more robust than the mean and standard deviation.
In other words, you can trust the five number summary with a wider variety of distributions and before you’ve had a chance to identify and remove outliers. That’s extremely helpful for analyses you perform when you’re just starting to understand your data.
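A quick way to see this robustness is to add one extreme value to a small sample and watch which statistics move. The sketch below uses made-up numbers and a helper name of my own (`center_and_spread`), so treat it as an illustration rather than a formal test.

```python
import statistics

def center_and_spread(values):
    """Return mean, standard deviation, median, and IQR for comparison."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

clean = [2, 4, 4, 5, 7, 9, 10, 12, 15]   # hypothetical sample
with_outlier = clean + [150]             # same sample plus one made-up extreme value

before = center_and_spread(clean)
after = center_and_spread(with_outlier)
for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

In this example, the mean and standard deviation jump substantially, while the median and IQR shift only slightly.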
Additionally, as I mentioned, these are all order statistics, which means that they are valid with continuous and ordinal data—giving you greater flexibility.
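As a small illustration of the ordinal case, the sketch below finds the lowest, middle, and highest responses on a rating scale. The scale and the responses are hypothetical. It uses `statistics.median_low`, which always returns an actual data value, so there is no need to average two categories.

```python
import statistics

# Hypothetical ordered rating scale and survey responses.
LEVELS = ["poor", "fair", "good", "very good", "excellent"]
responses = ["good", "fair", "excellent", "good", "poor", "very good", "good"]

# Convert each label to its rank so ordering follows the scale, not the alphabet.
ranks = [LEVELS.index(label) for label in responses]

lowest = LEVELS[min(ranks)]
middle = LEVELS[statistics.median_low(ranks)]  # middle response by rank
highest = LEVELS[max(ranks)]
print(lowest, middle, highest)  # prints: poor good excellent
```

A mean would require treating the gaps between categories as equal, which ordinal data do not guarantee. Order statistics only need the ranking.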
In short, the 5 number summary contains a good, solid set of robust statistics you can use with a variety of distribution shapes and data types before identifying characteristics that could adversely affect other statistics.
To learn more about the concept of robust statistics, what makes them robust, and examples, read my post, What are Robust Statistics?
Interpreting the Five Number Summary
Let’s look at what these statistics tell you by working through an example dataset. For this example, we’ll look at body fat percentages in middle school girls. You might not be familiar with this subject area, but the 5 number summary can quickly help you get your bearings. You can download this CSV dataset: body_fat.
Many of the statistics in the five-number summary are quartiles, which are special percentiles. Learn more about Percentiles.
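Here is the 5 number summary for these data:
- Maximum: 46.8%
- Third quartile (Q3): 33.63%
- Median (Q2): 27.35%
- First quartile (Q1): 23.05%
- Minimum: 16.8%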
Median
The median is the second quartile and, like the mean, it is a measure of central tendency. It finds the center of your distribution, which is the middle value when you sort the data from lowest to highest. Notably, the median is more robust to skewed distributions and outliers than the mean. Read more about The Median.
For the body fat percentage data, the median is 27.35%. Half the values in the sample are above this value, while half are below it.
Related post: Measures of Central Tendency
Minimum and Maximum Values
The minimum and maximum values in the 5 number summary indicate where all the data values occur.
For these data, these values tell us that all body fat percentages in this sample are between 16.8% and 46.8%. Additionally, if you take the maximum and subtract the minimum, you get the statistical range of the data, 46.8 – 16.8 = 30. The range is a measure of variability where larger values indicate that the data are more spread out.
Related post: Range of the Data
Interquartile Range (Q3 – Q1)
The Interquartile Range (IQR) is the distance between the third and first quartiles, and it is an integral part of the 5 number summary. The interval from Q1 to Q3 indicates where the middle 50% of the data fall. Conversely, you also know that 50% fall outside this interval, 25% above Q3 and 25% below Q1. Like the range, the IQR is also a measure of variability. However, the IQR is a more robust statistic than either the range or standard deviation. Again, larger values represent greater sample variability. Read more about The Interquartile Range.
In our example dataset, the middle half of the data extends from 23.05% to 33.63%, which gives an IQR of 33.63 – 23.05 = 10.58. Half the data values are in this interval. Furthermore, 25% are less than 23.05, and 25% are greater than 33.63.
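Both measures of variability come straight from the summary values with simple subtraction. The sketch below uses the five values reported in this post for the body fat data. The variable names are just my own labels.

```python
# Summary values reported in this post for the body fat data (percent).
minimum, q1, median, q3, maximum = 16.8, 23.05, 27.35, 33.63, 46.8

data_range = maximum - minimum  # spread of all the values
iqr = q3 - q1                   # spread of the middle 50% of the values

print(f"Range: {data_range:.2f}")  # prints Range: 30.00
print(f"IQR:   {iqr:.2f}")         # prints IQR:   10.58
```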
Related post: Measures of Variability
Shape of the Distribution
The five number summary can give you a general sense of whether the distribution is symmetrical or skewed. To make this determination, compare the median to Q1 and Q3. When the median is:
- Approximately halfway between Q1 and Q3, your data are symmetrical.
- Closer to Q1, your data are right-skewed.
- Closer to Q3, your data are left-skewed.
The median body fat percentage of 27.35% is closer to Q1 (23.05) than Q3 (33.63). Therefore, the distribution of values is right-skewed.
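The comparison above can be sketched as a small function. The function name `describe_shape` and the `tolerance` cutoff for "approximately halfway" are assumptions of mine, not a standard rule, so adjust the cutoff to suit your data.

```python
def describe_shape(q1, median, q3, tolerance=0.1):
    """Classify shape by comparing the median's distance to Q1 and to Q3.

    tolerance is a hypothetical cutoff: if the two distances differ by no
    more than this fraction of the IQR, the data are called roughly symmetrical.
    """
    lower_gap = median - q1   # distance from Q1 up to the median
    upper_gap = q3 - median   # distance from the median up to Q3
    iqr = q3 - q1
    if iqr == 0 or abs(upper_gap - lower_gap) <= tolerance * iqr:
        return "roughly symmetrical"
    return "right-skewed" if lower_gap < upper_gap else "left-skewed"

# Quartiles reported in this post for the body fat data.
print(describe_shape(23.05, 27.35, 33.63))  # prints right-skewed
```

For the body fat data, the median sits 4.30 above Q1 and 6.28 below Q3, so the function reports a right skew.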
To gain a clearer picture of the distribution of these data, you should graph them with a histogram or stem-and-leaf plot. Additionally, I’ve determined the specific distribution these data follow in my post about identifying the distribution of your data.
Related post: Skewed Distributions
Boxplots Graphically Display the 5 Number Summary
Conveniently, boxplots display the 5 number summary in graphical form. The image below displays the boxplot for the body fat example. Notice how the five parts of the boxplot correspond to the summary values!
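If you want a rough picture without graphing software, the sketch below draws a one-line text boxplot from the five summary values. The function `text_boxplot` and its `width` parameter are my own illustration. It draws the whiskers all the way to the minimum and maximum, and it is no substitute for a proper graph.

```python
def text_boxplot(minimum, q1, median, q3, maximum, width=60):
    """Draw a one-line text boxplot from a five number summary."""
    span = maximum - minimum
    if span <= 0:
        raise ValueError("maximum must be greater than minimum")

    def column(value):
        # Map a data value onto a character position from 0 to width - 1.
        return round((value - minimum) / span * (width - 1))

    cells = [" "] * width
    for i in range(column(minimum), column(maximum) + 1):
        cells[i] = "-"           # whiskers run from minimum to maximum
    for i in range(column(q1), column(q3) + 1):
        cells[i] = "="           # box runs from Q1 to Q3
    cells[column(minimum)] = "|"
    cells[column(maximum)] = "|"
    cells[column(median)] = "M"  # median marker inside the box
    return "".join(cells)

# Summary values reported in this post for the body fat data.
plot = text_boxplot(16.8, 23.05, 27.35, 33.63, 46.8)
print(plot)
```

The `M` lands nearer the left edge of the box than the right, which matches the right skew discussed earlier.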
Related post: Boxplots