Percentiles indicate the percentage of scores that fall below a particular value. They tell you where a score stands relative to other scores. For example, a person with an IQ of 120 is at the 91st percentile, which indicates that their IQ is higher than 91 percent of other scores.
Percentiles are a great tool to use when you need to know the relative standing of a value. Where does a value fall within a distribution of values? While the concept behind percentiles is straightforward, there are different mathematical methods for calculating them. In this post, learn about percentiles, special percentiles and their surprisingly flexible uses, and the various procedures for calculating them.
Using Percentiles to Understand a Score or Value
Percentiles tell you how a value compares to other values. The general rule is that if value X is at the kth percentile, then X is greater than k% of the values. Let’s see how this information can be helpful.
Often the units for raw test scores are not informative. When you obtain a score on the SAT, ACT, or GRE, the units are meaningless by themselves. A total SAT score of 1340 is not inherently meaningful. Instead, you really want to know the percentage of test takers that you scored better than. For the SAT, a total score of 1340 is approximately the 90th percentile. Congratulations, you scored better than 90% of the other test takers. Only 10% scored better than you. Now that’s helpful!
Sometimes measurement units are meaningful, but you still would like to know the relative standing. For example, if your one-month-old baby weighs five kilograms, you might wonder how that weight compares to other babies. For a one-month-old baby girl, that equates to the 77th percentile. Your little girl weighs more than 77% of other girls her age, while 23% weigh more than her. You know right where she fits in with her cohort!
Special Names and Uses for Percentiles
We give names to special percentiles. The 50th percentile is the median. This value splits a dataset in half. Half the values are below the 50th percentile, and half are above it. The median is a measure of central tendency in statistics.
Quartiles are values that divide your data into quarters, and they are based on percentiles.
- The first quartile, also known as Q1 or the lower quartile, is the value of the 25th percentile. The bottom quarter of the scores fall below this value, while three-quarters fall above it.
- The second quartile, also known as Q2 or the median, is the value of the 50th percentile. Half the scores are above and half below.
- The third quartile, also known as Q3 or the upper quartile, is the value of the 75th percentile. The top quarter of the scores fall above this value, while three-quarters fall below it.
The interquartile range (IQR) is a measure of dispersion in statistics. This range corresponds to the distance between the first quartile and the third quartile (IQR = Q3 – Q1). Larger IQRs indicate that the data are more spread out. The interquartile range spans the middle half of the data. One-quarter of the values fall below Q1, the lower end of this range, while another quarter of the values are above Q3, its upper end.
Percentiles are surprisingly versatile because you can use them not only to obtain a relative standing, but also for dividing your dataset into portions, identifying the central tendency, and measuring the dispersion of a distribution. Consequently, percentiles, in the form of quartiles, are at the core of the five-number summary, which is an exploratory data analysis tool for descriptive statistics.
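To make the five-number summary concrete, here is a minimal Python sketch. The dataset is hypothetical and invented purely for illustration. The quartiles come from the standard library's `statistics.quantiles`, which uses one particular calculation method by default, so other software may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical dataset (not from this post), used only for illustration.
data = [10, 12, 15, 17, 20, 25, 30, 35, 40, 45, 50]

# n=4 asks for the three cut points that split the data into quarters.
# The default method is 'exclusive'; other methods can give other values.
q1, q2, q3 = statistics.quantiles(data, n=4)

five_number_summary = {
    "minimum": min(data),
    "Q1": q1,
    "median": q2,
    "Q3": q3,
    "maximum": max(data),
}
iqr = q3 - q1  # spread of the middle half of the data

print(five_number_summary)
print("IQR:", iqr)
```

Notice that the second quartile and the median are the same number, which is exactly what the definitions above say should happen.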
Calculating Percentiles Using Values in a Dataset
Percentile is a fairly common word. Surprisingly, there isn’t a single standard definition for it. Consequently, there are multiple methods for calculating percentiles. In this post, I cover four procedures. The first three are methods that analysts use to calculate percentiles when looking at the actual data values in relatively small datasets. These three definitions define the kth percentile in the following different ways:
- The smallest value that is greater than k percent of the values.
- The smallest value that is greater than or equal to k percent of values.
- An interpolated value between the two closest ranks.
While the first two definitions might not seem drastically different, they can produce significantly different results, mainly when you are working with a small dataset. As you will see, this difference occurs because the first two definitions use different ranks that correspond to different scores. The third definition mitigates this concern by interpolating between two ranks to estimate a percentile value that falls between two values.
To calculate percentiles using these three approaches, start by ranking your dataset from the lowest to highest values.
Let’s use these three methods with the following dataset (n=11) to find the 70th percentile.
Definition 1: Greater Than
Using the first definition, we need to find the value that is greater than 70% of the values, and there are 11 values. Take 70% of 11, which is 7.7. Then, round 7.7 up to 8. Using the first definition, the value for the 70th percentile must be greater than eight values. Consequently, we pick the 9th ranked value in the dataset, which is 40.
Definition 2: Greater Than or Equal To
Using the second definition, we need to find the value that is greater than or equal to 70% of the values. Thanks to the “equal to” portion of the definition, we can use the 8th ranked value, which is 35.
Using the first two definitions, we have found two values for the 70th percentile: 35 and 40.
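The two rank-based definitions can be sketched in Python as follows. This is an illustration under stated assumptions, not a standard library routine: the function names are made up, the dataset is hypothetical (chosen only so that the 8th and 9th ranked values are 35 and 40, matching the example), and the ranks are clamped to stay inside the dataset, which the definitions themselves do not address. Like the walkthrough above, the sketch works on ranks and assumes the values are distinct.

```python
import math

def percentile_greater_than(values, k):
    """Definition 1: smallest value greater than k percent of the values."""
    ordered = sorted(values)
    n = len(ordered)
    count_below = math.ceil(k * n / 100)  # e.g. 70% of 11 = 7.7, rounded up to 8
    rank = min(count_below + 1, n)        # assumption: clamp to the highest rank
    return ordered[rank - 1]              # ranks start at 1, list indexes at 0

def percentile_greater_or_equal(values, k):
    """Definition 2: smallest value greater than or equal to k percent of the values."""
    ordered = sorted(values)
    n = len(ordered)
    rank = max(math.ceil(k * n / 100), 1)  # assumption: never go below rank 1
    return ordered[rank - 1]

# Hypothetical data (n = 11); only the 8th and 9th values mirror the example.
data = [10, 12, 15, 17, 20, 25, 30, 35, 40, 45, 50]

print(percentile_greater_than(data, 70))      # 40 (9th ranked value)
print(percentile_greater_or_equal(data, 70))  # 35 (8th ranked value)
```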
Definition 3: Using an Interpolation Approach
As you saw above, using either “greater” or “greater than or equal to” changes the results. Depending on the nature and size of your dataset, this difference can be substantial. Consequently, a third approach interpolates between two data values.
To calculate an interpolated percentile, do the following:
1. Calculate the rank to use for the percentile. Use: rank = p(n + 1), where p = the percentile and n = the sample size. For our example, to find the rank for the 70th percentile, we take 0.7 × (11 + 1) = 8.4.
2. If the rank in step 1 is an integer, find the data value that corresponds to that rank and use it for the percentile.
3. If the rank is not an integer, you need to interpolate between the two closest observations. For our example, 8.4 falls between 8 and 9, which corresponds to the data values of 35 and 40.
4. Take the difference between these two observations and multiply it by the fractional portion of the rank. For our example, this is: (40 – 35) × 0.4 = 2.
5. Take the lower-ranked value in step 3 and add the value from step 4 to obtain the interpolated value for the percentile. For our example, that value is 35 + 2 = 37.
Using three common calculations for percentiles, we find three different values for the 70th percentile: 35, 37, and 40.
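The interpolation steps above can be sketched in Python like this. The function name and the dataset are hypothetical (the data are chosen only so that ranks 8 and 9 hold 35 and 40), and the handling of ranks that fall outside the dataset is my own assumption, since the steps do not cover that case.

```python
def percentile_interpolated(values, k):
    """Interpolated kth percentile using rank = p(n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    rank = k * (n + 1) / 100          # step 1: 70 * 12 / 100 = 8.4

    # Assumption: ranks outside 1..n fall back to the smallest/largest value.
    if rank <= 1:
        return ordered[0]
    if rank >= n:
        return ordered[-1]

    lower_rank = int(rank)            # 8
    fraction = rank - lower_rank      # about 0.4 (0 when the rank is an integer)
    lower_value = ordered[lower_rank - 1]   # value at rank 8
    upper_value = ordered[lower_rank]       # value at rank 9

    # steps 4 and 5: add the fractional share of the gap to the lower value
    return lower_value + (upper_value - lower_value) * fraction

# Hypothetical data (n = 11); only the 8th and 9th values mirror the example.
data = [10, 12, 15, 17, 20, 25, 30, 35, 40, 45, 50]

print(round(percentile_interpolated(data, 70), 6))  # 37.0
```

Floating-point arithmetic can leave a tiny rounding error in the fraction, which is why the result is rounded before printing.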
I’ve recently learned about yet another way to calculate a type of percentile called the percentile rank. Like the interpolation method, the percentile rank is a different way to split the difference between definition 1 and definition 2 above. Analysts frequently use this measure for standardized test scores because they have many repeated scores on an integer distribution. For example, millions of students take the SAT and yet their scores on a single section can only be the integers from 200 to 800. There will be a vast number of repeated scores!
The percentile rank of a score is the percentage of scores that are less than the score of interest plus half the percentage of scores that are equal to it. This method literally splits the middle for the block of repeated values for the score of interest. For more information, read the Wikipedia article about Percentile Ranks.
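As a rough illustration of the percentile rank calculation, here is a short Python sketch. The function name and the scores are hypothetical and invented for this example only.

```python
def percentile_rank(scores, score):
    """Percent of scores below `score` plus half the percent equal to it."""
    below = sum(1 for s in scores if s < score)
    equal = sum(1 for s in scores if s == score)
    return 100 * (below + 0.5 * equal) / len(scores)

# Hypothetical section scores with repeated values.
scores = [500, 550, 600, 600, 600, 650, 700, 700, 750, 800]

# 2 scores are below 600 and 3 equal it: 100 * (2 + 1.5) / 10
print(percentile_rank(scores, 600))  # 35.0
```

Counting only the two lower scores would give 20%, and counting all three tied scores would give 50%. The percentile rank lands halfway between those two answers.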
Next, I’ll show you one more method for calculating percentiles that does not directly use the values in the dataset.
Using a Probability Distribution Function to Estimate Percentiles
If you know the probability distribution function (PDF) that a population of values follows, you can use the PDF to calculate percentiles. Perhaps the population follows the normal distribution? Or, you might have collected a sample and then identified the PDF that provides the best fit.
Read my post about identifying the distribution of your data. This approach identifies the population distribution that has the highest probability (i.e., maximum likelihood) of producing the distribution that you observe in a random sample from that population.
After you identify the distribution for your sample, you can use your statistical software to calculate the percentage of values in the distribution that falls below a value. I’ll use graphs to show two examples to make the ideas crystal clear. I’m using Minitab statistical software to generate these graphs. The data for one example follows a normal distribution while the other follows a skewed lognormal distribution. Both of these variables were collected from the same sample of middle school girls.
Related post: Understanding Probability Distribution Functions
Using the Normal Distribution to Estimate Height Percentiles
Height tends to follow the normal distribution, which is the case for our sample data. The heights for this population follow a normal distribution with a mean of 1.512 meters and a standard deviation of 0.0741 meters. For normally distributed populations, you can use Z-scores to calculate percentiles. This method is convenient when you have only summary information about a sample and access to a table of Z-scores. I talk about Z-scores and show how to use them to calculate percentiles in my post, Z-score: Definition, Formula, and Uses.
However, for this post, I’ll use the probability density function (PDF) to calculate and graph the percentile. In this type of probability density plot, the proportion of the shaded area under the curve indicates the percentage of the distribution that falls within that range of values. For this graph, I shade the region that contains the lower 70% of the values, and the software calculates the height that corresponds with this percentage, which is the 70th percentile.
The plot above shows that a height of 1.551 meters is at the 70th percentile for this population of middle school girls.
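If you don't have plotting software handy, a small Python sketch can reproduce this number from the mean and standard deviation given above. It uses `NormalDist` from the standard library (Python 3.8 or later), and it also shows the Z-score route for comparison. The variable names are my own.

```python
from statistics import NormalDist

mean_height = 1.512   # meters, from the fitted distribution above
sd_height = 0.0741    # meters

# Inverse CDF: the height below which 70% of the distribution falls.
heights = NormalDist(mu=mean_height, sigma=sd_height)
p70 = heights.inv_cdf(0.70)

# Equivalent Z-score route: find z for the 70th percentile of the
# standard normal distribution, then convert it back to meters.
z70 = NormalDist().inv_cdf(0.70)
p70_via_z = mean_height + z70 * sd_height

print(round(p70, 3))        # 1.551
print(round(p70_via_z, 3))  # 1.551
```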
Related post: Understanding the Normal Distribution
Using the Lognormal Distribution to Estimate Body Fat Percentiles
Not all data follow the normal distribution. In this vein, the body fat percentage data for the same sample are skewed. In my post about identifying the distribution of your data, I determined that these data follow a lognormal distribution with a location of 3.32317 and a scale of 0.24188.
The graph below clearly shows the right-skew. Below, I use the same process to calculate the 70th percentile for body fat percentage as I did for height. I only need to specify the correct distribution for the software. Using this approach, we’re sure to factor in the skewness of our data when obtaining percentiles.
The plot above shows that having 31.5% body fat is at the 70th percentile for this population of middle school girls.
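The same check works for the lognormal case. This sketch assumes that the location and scale reported above are the mean and standard deviation of the natural log of the data, which is how Minitab parameterizes the lognormal distribution. Under that assumption, you find the percentile on the log scale and then exponentiate it.

```python
import math
from statistics import NormalDist

location = 3.32317  # assumed: mean of ln(body fat %)
scale = 0.24188     # assumed: standard deviation of ln(body fat %)

# 70th percentile on the log scale, then transform back to percent body fat.
log_p70 = NormalDist(mu=location, sigma=scale).inv_cdf(0.70)
body_fat_p70 = math.exp(log_p70)

print(round(body_fat_p70, 1))  # 31.5
```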
Cumulative distribution functions (CDFs) are related to probability distribution plots but instead use percentiles for the Y-axis, making them a great way to find percentiles! Learn more about Cumulative Distribution Functions: Uses, Graphs & vs PDF.
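As a quick sketch of that idea, the CDF goes in the opposite direction from the examples above: you supply a value and get back its percentile. Here it is applied to the height distribution from earlier in this post, using Python's standard library. The variable names are my own.

```python
from statistics import NormalDist

# Height distribution from the normal example above (meters).
heights = NormalDist(mu=1.512, sigma=0.0741)

# The CDF returns the proportion of the distribution at or below a value.
height = 1.551
percentile = 100 * heights.cdf(height)

print(f"A height of {height} m is at about the {percentile:.0f}th percentile")
```

The result comes back at roughly 70, which agrees with the 70th percentile height found from the probability plot.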
Empirical cumulative distribution function plots, or eCDF plots, are a special type of graph that compares the observed cumulative distribution in your sample data to a fitted distribution. To learn more about this type of graph and why you’d use it, read my Guide to Empirical CDF plots.
Percentiles are a very intuitive way to understand where a value falls within a distribution of values. However, if you need to calculate a percentile, you’ll need to decide which method to use!