Percentiles indicate the percentage of scores that fall below a particular value. They tell you where a score stands relative to other scores. For example, a person with an IQ of 120 is at the 91^{st} percentile, which indicates that their IQ is higher than 91 percent of other scores.

Percentiles are a great tool to use when you need to know the relative standing of a value. Where does a value fall within a distribution of values? While the concept behind percentiles is straightforward, there are different mathematical methods for calculating them. In this post, learn about percentiles, special percentiles and their surprisingly flexible uses, and the various procedures for calculating them.

## Using Percentiles to Understand a Score or Value

Percentiles tell you how a value compares to other values. The general rule is that if value X is at the k^{th} percentile, then X is greater than k% of the values. Let’s see how this information can be helpful.

Often the units for raw test scores are not informative. When you obtain a score on the SAT, ACT, or GRE, the units are meaningless by themselves. A total SAT score of 1340 is not inherently meaningful. Instead, you really want to know the percentage of test takers that you scored better than. For the SAT, a total score of 1340 is approximately the 90^{th} percentile. Congratulations, you scored better than 90% of the other test takers. Only 10% scored better than you. Now that’s helpful!

Sometimes measurement units are meaningful, but you still would like to know the relative standing. For example, if your one-month-old baby weighs five kilograms, you might wonder how that weight compares to other babies. For a one-month-old baby girl, that equates to the 77^{th} percentile. Your little girl weighs more than 77% of other girls her age, while 23% weigh more than she does. You know right where she fits in with her cohort!
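As a rough illustration of the general rule, the sketch below computes the percentage of values that fall strictly below a given score. The function name `percent_below` and the scores are made up for this example; they do not come from any real test.

```python
def percent_below(values, x):
    """Return the percentage of values that are strictly less than x."""
    if not values:
        raise ValueError("values must not be empty")
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


# Hypothetical test scores, invented purely for illustration.
scores = [52, 61, 67, 70, 74, 78, 81, 85, 90, 96]

standing = percent_below(scores, 85)
print(f"A score of 85 is higher than {standing:.0f}% of the scores.")
```

This counts only values strictly below the score, which matches the "greater than" wording of the general rule. Other conventions count ties differently.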

## Special Names and Uses for Percentiles

We give names to special percentiles. The 50^{th} percentile is the median. This value splits a dataset in half. Half the values are below the 50^{th} percentile, and half are above it. The median is a measure of central tendency in statistics.

Quartiles are values that divide your data into quarters, and they are based on percentiles.

- The first quartile, also known as Q1 or the lower quartile, is the value of the 25^{th} percentile. The bottom quarter of the scores fall below this value, while three-quarters fall above it.
- The second quartile, also known as Q2 or the median, is the value of the 50^{th} percentile. Half the scores are above and half below.
- The third quartile, also known as Q3 or the upper quartile, is the value of the 75^{th} percentile. The top quarter of the scores fall above this value, while three-quarters fall below it.

The interquartile range (IQR) is a measure of dispersion in statistics. This range corresponds to the distance between the first quartile and the third quartile (IQR = Q3 – Q1). Larger IQRs indicate that the data are more spread out. The interquartile range spans the middle half of the data. One-quarter of the values fall below Q1 while another quarter of the values are above Q3.
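Here is a minimal sketch of quartiles and the IQR using Python's standard-library `statistics.quantiles`. The data are hypothetical. The function's default method is only one of several conventions, so other software may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical, already-sorted data (n = 11), invented for illustration.
data = [10, 12, 15, 20, 25, 28, 30, 35, 40, 45, 50]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
print(f"IQR = Q3 - Q1 = {iqr}")
```

For these values, Q2 equals the median, and the IQR covers the spread of the middle half of the data.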

Percentiles are surprisingly versatile because you can use them not only to obtain a relative standing, but also for dividing your dataset into portions, identifying the central tendency, and measuring the dispersion of a distribution.

**Related posts**: Measures of Central Tendency and Measures of Dispersion

## Calculating Percentiles Using Values in a Dataset

Percentile is a fairly common word. Surprisingly, there isn’t a single standard definition for it. Consequently, there are multiple methods for calculating percentiles. In this post, I cover four procedures. The first three are methods that analysts use to calculate percentiles when looking at the actual data values in relatively small datasets. These three definitions define the k^{th} percentile in the following different ways:

- The smallest value that is greater than k percent of the values.
- The smallest value that is greater than or equal to k percent of values.
- An interpolated value between the two closest ranks.

While the first two definitions might not seem drastically different, they can produce significantly different results, mainly when you are working with a small dataset. As you will see, this difference occurs because the first two definitions use different ranks that correspond to different scores. The third definition mitigates this concern by interpolating between two ranks to estimate a percentile value that falls between two values.

To calculate percentiles using these three approaches, start by ranking your dataset from the lowest to highest values.

Let’s use these three methods with the following dataset (n=11) to find the 70^{th} percentile.

### Definition 1: Greater Than

Using the first definition, we need to find the value that is greater than 70% of the values, and there are 11 values. Take 70% of 11, which is 7.7. Then, round 7.7 up to 8. Using the first definition, the value for the 70^{th} percentile must be greater than eight values. Consequently, we pick the 9^{th} ranked value in the dataset, which is 40.
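The rank arithmetic for this definition can be sketched as follows. The function name and the dataset are assumptions made for illustration: the values are invented, except that the 8^{th} and 9^{th} ranked values are 35 and 40, as in the example.

```python
def percentile_greater_than(sorted_values, k):
    """k-th percentile under Definition 1, following the rank steps in the text.

    Take k% of n, round up, then use the next rank so the chosen value
    sits above that many observations. k is a whole-number percent.
    """
    n = len(sorted_values)
    count_needed = -(-k * n // 100)  # ceiling of k% of n, in integer math
    rank = count_needed + 1          # 1-based rank of the value to report
    if rank > n:
        raise ValueError("no value in the dataset is high enough")
    return sorted_values[rank - 1]


# Hypothetical sorted dataset (n = 11); only ranks 8 and 9 match the text.
data = [10, 12, 15, 20, 25, 28, 30, 35, 40, 45, 50]

p70_def1 = percentile_greater_than(data, 70)
print(p70_def1)
```

This sketch works purely from ranks, so it does not treat tied values specially.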

### Definition 2: Greater Than or Equal To

Using the second definition, we need to find the value that is greater than or equal to 70% of the values. Thanks to the “equal to” portion of the definition, we can use the 8^{th} ranked value, which is 35.
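A comparable sketch for the "greater than or equal to" definition uses the rounded-up rank directly. As before, the function name and the dataset are assumptions for illustration; only the 8^{th} and 9^{th} ranked values (35 and 40) come from the example.

```python
def percentile_greater_or_equal(sorted_values, k):
    """k-th percentile under Definition 2, following the rank steps in the text.

    Take k% of n, round up, and report the value at that rank.
    k is a whole-number percent.
    """
    n = len(sorted_values)
    rank = -(-k * n // 100)  # ceiling of k% of n, in integer math
    rank = max(rank, 1)      # assumption: the 0th percentile maps to the minimum
    return sorted_values[rank - 1]


# Hypothetical sorted dataset (n = 11); only ranks 8 and 9 match the text.
data = [10, 12, 15, 20, 25, 28, 30, 35, 40, 45, 50]

p70_def2 = percentile_greater_or_equal(data, 70)
print(p70_def2)
```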

Using the first two definitions, we have found two values for the 70^{th} percentile: 35 and 40.

### Definition 3: Using an Interpolation Approach

As you saw above, using either “greater than” or “greater than or equal to” changes the results. Depending on the nature and size of your dataset, this difference can be substantial. Consequently, a third approach interpolates between two data values.

To calculate an interpolated percentile, do the following:

1. Calculate the rank to use for the percentile. Use: rank = p(n+1), where p = the percentile and n = the sample size. For our example, to find the rank for the 70^{th} percentile, we take 0.7 × (11 + 1) = 8.4.
2. If the rank in step 1 is an integer, find the data value that corresponds to that rank and use it for the percentile.
3. If the rank is not an integer, you need to interpolate between the two closest observations. For our example, 8.4 falls between 8 and 9, which corresponds to the data values of 35 and 40.
4. Take the difference between these two observations and multiply it by the fractional portion of the rank. For our example, this is: (40 – 35) × 0.4 = 2.
5. Take the lower-ranked value in step 3 and add the value from step 4 to obtain the interpolated value for the percentile. For our example, that value is 35 + 2 = 37.

Using three common calculations for percentiles, we find three different values for the 70^{th} percentile: 35, 37, and 40.
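The interpolation steps can be sketched in a few lines. The function name, the dataset, and the handling of ranks that fall outside 1 to n (clamping to the smallest or largest value) are assumptions added for illustration. Only the 8^{th} and 9^{th} ranked values (35 and 40) come from the example.

```python
def percentile_interpolated(sorted_values, k):
    """k-th percentile using rank = p(n + 1) with linear interpolation.

    k is a whole-number percent, so p = k / 100.
    """
    n = len(sorted_values)
    # Split the rank into its integer part and fractional part (in hundredths).
    whole, remainder = divmod(k * (n + 1), 100)
    if whole < 1:       # assumption: clamp ranks below 1 to the minimum
        return sorted_values[0]
    if whole >= n:      # assumption: clamp ranks at or beyond n to the maximum
        return sorted_values[-1]
    lower = sorted_values[whole - 1]  # value at the lower rank
    upper = sorted_values[whole]      # value at the next rank
    return lower + (upper - lower) * remainder / 100


# Hypothetical sorted dataset (n = 11); only ranks 8 and 9 match the text.
data = [10, 12, 15, 20, 25, 28, 30, 35, 40, 45, 50]

p70_interp = percentile_interpolated(data, 70)
print(p70_interp)
```

When the rank is an integer, the fractional part is zero and the function returns the value at that rank, which covers step 2.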

Next, I’ll show you one more method for calculating percentiles that does not directly use the values in the dataset.

## Using a Probability Distribution Function to Estimate Percentiles

If you know the probability distribution function (PDF) that a population of values follows, you can use the PDF to calculate percentiles. Perhaps the population follows the normal distribution? Or, you might have collected a sample and then identified the PDF that provides the best fit.

Read my post about identifying the distribution of your data. This approach identifies the population distribution that has the highest probability (i.e., maximum likelihood) of producing the distribution that you observe in a random sample from that population.

After you identify the distribution for your sample, you can use your statistical software to calculate the percentage of values in the distribution that falls below a value. I’ll use graphs to show two examples to make the ideas crystal clear. I’m using Minitab statistical software to generate these graphs. The data for one example follows a normal distribution while the other follows a skewed lognormal distribution. Both of these variables were collected from the same sample of middle school girls.

**Related post**: Understanding Probability Distribution Functions

### Using the Normal Distribution to Estimate Height Percentiles

Height tends to follow the normal distribution, which is the case for our sample data. The heights for this population follow a normal distribution with a mean of 1.512 meters and a standard deviation of 0.0741 meters. For normally distributed populations, you can use Z-scores to calculate percentiles. This method is convenient when you have only summary information about a sample and access to a table of Z-scores. I talk about Z-scores and show how to use them to calculate percentiles in my blog post about the Normal Distribution.

However, for this post, I’ll use the probability density function to calculate and graph the percentile. In this type of probability density plot, the proportion of the shaded area under the curve indicates the percentage of the distribution that falls within that range of values. For this graph, I shade the region that contains the lower 70% of the values, and the software calculates the height that corresponds with this percentage, which is the 70^{th} percentile.

The plot above shows that a height of 1.551 meters is at the 70^{th} percentile for this population of middle school girls.
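If you want to reproduce this number without Minitab, one option is Python's standard-library `statistics.NormalDist`. This is a sketch that takes the mean and standard deviation quoted above as given. The second calculation shows the equivalent Z-score route.

```python
from statistics import NormalDist

# Parameters quoted in the text for the height data (meters).
mean_height = 1.512
sd_height = 0.0741

# Direct route: invert the cumulative distribution at 0.70.
heights = NormalDist(mu=mean_height, sigma=sd_height)
p70_height = heights.inv_cdf(0.70)

# Z-score route: find the standard normal Z for 0.70, then rescale.
z = NormalDist().inv_cdf(0.70)
p70_from_z = mean_height + z * sd_height

print(f"70th percentile height: {p70_height:.3f} m")
print(f"Via Z-score ({z:.3f}): {p70_from_z:.3f} m")
```

Both routes should agree with each other and land close to the 1.551 meters reported above.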

### Using the Lognormal Distribution to Estimate Body Fat Percentiles

Not all data follows the normal distribution. In this vein, the body fat percentage data for the same sample are skewed. In my post about identifying the distribution of your data, I determined that these data follow a lognormal distribution with a location of 3.32317 and a scale of 0.24188.

The graph below clearly shows the right-skew. Below, I use the same process to calculate the 70^{th} percentile for body fat percentage as I did for height. I only need to specify the correct distribution for the software. Using this approach, we’re sure to factor in the skewness of our data when obtaining percentiles.

The plot above shows that having 31.5% body fat is at the 70^{th} percentile for this population of middle school girls.
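The same idea works for the lognormal case in Python. This sketch assumes the location and scale are the mean and standard deviation of the natural log of body fat percentage, which is the usual lognormal parameterization. Under that assumption, you find the percentile on the log scale and then exponentiate.

```python
import math
from statistics import NormalDist

# Parameters quoted in the text; assumed to describe ln(body fat %).
location = 3.32317
scale = 0.24188

# 70th percentile of the underlying normal distribution (log scale).
log_p70 = NormalDist(mu=location, sigma=scale).inv_cdf(0.70)

# Transform back to the original units (percent body fat).
p70_body_fat = math.exp(log_p70)

print(f"70th percentile body fat: {p70_body_fat:.1f}%")
```

Because exponentiation preserves order, the 70^{th} percentile on the log scale maps to the 70^{th} percentile in the original units.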

Percentiles are a very intuitive way to understand where a value falls within a distribution of values. However, if you need to calculate a percentile, you’ll need to decide which method to use!

Jose says

Hi Jim,

Many thanks for providing these explanations in a language that is actually easy to understand!

My question is similar to that asked by J R Jenks above. I have a dataset of 500 individuals that have visited a neighbour x times per month. And I want to translate this information into a Likert scale = (1) never, (2) seldomly, (3) often, and (4) always. The problem is that I have way too many zeros in my dataset. This means, many individuals expressed they simply don’t visit their neighbours.

When using Excel to bring this data into percentiles, 0-times responses cover the 10th, 20th and even the 40th percentile. My question is, does it make sense to group all those zero observations into one category (i.e. effectively ignoring them) and use a quartile ranking for the remaining observations? Or should they be included with all other responses in a quartile or decile ranking calculation? Any help will be much appreciated!

J R Jenks says

I am working on some Census datasets and am somewhat mystified by the practice of ignoring zeros when computing percentiles (I know how to do it, but intuitively it seems … wrong).

Say you were looking at “households with children”. You have total households for each area (county, tract, whatever), and the number of households containing children. So you calculate the percentage for each area. Then using the range of those percentages, you can calculate the percentile each falls into (and subsequently group them into deciles, etc.).

In following some recommended procedures, they say to first eliminate any “zero” values from the percentile calculation. If for some particular area, no households have children, its percentage would be zero, and that entry in the table should be excluded when calculating percentiles.

Seems to me that skews the result. If some areas have “zero” somethings, and others have various amounts, why would the zero-valued area be excluded? If an area had only one “household with children”, it would be included, but the one with none would not.

Thanks for all your helpful elucidation on stats!

Sachin Pullil says

Hi Jim,

Here’s a scenario that doesn’t make sense to me. Take the following set:

2, 13, 33, 33, 51, 99, 100, 100

If the question asks me for the percentile of the value of 51, I would do:

4/8 * 100 = 50th percentile.

If the question asks me to find the value of the 50th percentile in the set, I would do:

Rank = p(n+1) = 0.50*(8+1) = 4.5

The value corresponding to this rank would be 33 + (51-33)*0.5 = 42.

So, I would find 42 to be the 50th percentile.

I’m aware that the second question above uses the third (interpolation) definition of percentile that you described. If I use the first definition, then my answer to both questions will be 51.

Is there a way to use the interpolation definition in both questions above and arrive at the same answer? Or is the interpolation scenario doomed to fail in the above scenario?

Jim Frost says

Hi Sachin,

That’s a great question! There are several things at play here. The first is that, as you point out, you’re using different methods. And, it’s not surprising that different methods will come up with different answers. Additionally, it’s a very small sample, so the precision of the estimates will be low. And, the 50th percentile is the median. There are large gaps between some of the numbers, which means the precise method you use to calculate the median can produce fairly different answers. If we drew a larger random sample from the same population, we’d start filling in those gaps and get more complete information about the distribution. The differences between the various methods would decrease.

For your dataset, I would say that the interpolation method is not doomed because I think it’s giving the best answer. However, it’s “doomed” in the sense that it is destined to give a different answer for these data.

Again, think of it as the median. The method for calculating the median with an even number of observations is to move inwards until you reach the center two numbers. The middle two numbers are 33 and 51. You then take the average of those two to calculate the median, which comes to 42. That’s a different way of doing the interpolation method. For this dataset, there’s just a relatively large gap between 33 and 51. The 50th percentile is most likely in there somewhere. Given the small dataset, 42 is the best estimate that we have.

My sense is that 51 is a bit on the high side. And, there are in fact only 3 values above it and 4 below it. So, saying it’s the 50th percentile doesn’t feel quite right to me. Indeed, definition 1, greater than, gives you 51 because you need to use the 5th ranked value. The second definition, greater than or equal to, gives you 33 because you can use the 4th ranked value. But, neither of those are in the middle of the data set. One is ranked too high and the other is a rank too low. With a small dataset, that makes a difference.

Both the interpolation method and the median method find a better answer that falls between the actual values in the dataset. I think the underlying problems of the first two methods are twofold; the small dataset and being forced to use an actual value in the dataset. Using the interpolation method, you’re still stuck with the small dataset, but at least you’re not stuck with using an existing value.

Carl says

Hi Jim,

Not sure if you can help me? I am looking at UK salaries for different roles. I have the overall UK 25th, 50th, 75th and 100th percentile values. I also have the UK average (mean).

I also have the regional average and median.

Is there a way that I can work out the regional percentiles from the data I have?

Not sure if I have posted on the correct thread, but I am puzzled about whether it’s possible.

Steven Philips says

I appreciate the feedback Jim. The math is there to flip it around but I just wanted to see if there was a precedent for such scenarios.

Best regards,

-Steven

Steven Philips says

Hello Jim,

Thanks for the article on this page.

I am curious if there is a prevailing opinion on whether percentiles should follow the direction of the overall measure?

For example, I have a compliance measure and that graph needs to trend upward to compliance of 100%, so theoretically your percentile for top performers would be above the 95th percentile. However, on a graph of harm incidents you want to trend downward to 0%. In which case, do you want to be striving to be in the 5th percentile?

I have clients that want to be in the top 5 percent. Mathematically I see it is feasible to flip the calculation but there is the consideration of trending direction.

Thoughts?

Jim Frost says

Hi Steven,

It seems like there’s no contradiction for your compliance measure. If clients want to be in the top 5%, they’d have to be at or higher than the 95th percentile. But, yes, for the harm measure you’d want to be at the 5th percentile at most. Is this just a perception thing among your clients? They want to be in the top 5% versus the bottom 5%. If so, I don’t see any reason not to flip it as you say. As long as the system works given your needs.

If I’ve misunderstood your question, please let me know. But, I don’t see a problem with what you’re proposing. Percentiles are based on ranks. All you’re really doing is changing the ranking criteria from low is bad to low is good. Given your scenario, that sounds completely legitimate.

Saif says

Hello Jim, you have shared a very nice article full of useful information!

In my opinion, percentiles are vital statistical tools. Percentiles provide an indication of how the data values are spread over the interval from the smallest value to the largest value.

Shukhrat Shokirov says

Hi Jim,

Thanks for the wonderful explanation. I am trying to interpret the results of my data analysis. I have LiDAR point clouds that were collected from terrestrial and UAV-based sensors over the same landscape. Basically, terrestrial LiDAR collects much denser points than the UAV LiDAR. When I calculate the 5th, 50th and 90th percentiles, the UAV height percentile values are always higher than the terrestrial height percentiles. I am not sure how to interpret this. Does it mean the terrestrial sensor is collecting more data in the lower layer than the UAV sensor, and the UAV collects more points in the upper layer of the landscape? I would greatly appreciate it if you could help me with the interpretation.

Jennifer says

How to assign ranks to 2 or 3 individuals having the same score in ungrouped data?

Calisie Jane says

How do I interpret each result after I get all the final answers?

Jim Frost says

Hi Calisie,

Do you mean the different ways of calculating percentiles? The interpretation is the same, which I know is confusing! Problems arise with large differences between results when you have smaller sample sizes. It’s just harder to get good estimates for anything, including percentiles, with small samples. Small samples tend to have more erratic estimates in general. However, once you decide which approach to use, the interpretation is the usual one for percentiles.

Nat Kitaw says

How would you calculate percentile rank for something with an underlying exponential distribution?

Jim Frost says

Hi Nat,

You can use any of the methods I discuss in this post. You can calculate the percentiles based on the values in your dataset using one of those three methods. Or, find the distribution that best fits your data (presumably the exponential distribution in your case) and use that to calculate percentiles. In this post, I use a lognormal distribution to illustrate this method, but you can use the exponential distribution.

suttonfelty says

Why are there different percentile calculations?

Jim Frost says

Hi, there are several different reasons. For one thing, you’re starting with slightly different definitions. For whatever reason, there’s not one standard definition. The calculations depend on how you define it (greater than versus greater than or equal to). This problem is exacerbated with smaller datasets where the difference in definition has a larger impact on the end result. There’s also the fact that you can calculate percentiles for values in a dataset or you can use probability distributions to calculate percentiles based on estimates of the population parameters. In short, there are different calculations because of different definitions and different goals (i.e., for values in a dataset vs. for a population).

Appy says

How do I calculate the percentile ranks for data where a lower score means better performance?

Jim Frost says

Hi Appy,

What you need to do is start by ranking the scores accordingly. Put the higher data values with lower ranks and lower data values with higher ranks. That’s the opposite of what I show in this post. Then, where I talk about values being “greater than,” you need to substitute “less than.” I believe with those changes you can proceed as I show in this post.

When you report the results, be sure to clarify how you’re using percentiles in this manner. For example, “70% of the scores are worse than X, where high values indicate worse performance.” Something like that because I think it would be easy to get confused given the normal usage of percentiles.

I hope this helps!

Takele says

nice and clear. tell us about logistic and Bayesian analysis

Jim Frost says

Thanks! I’d like to address those in future posts. So many potential topics to cover!