Percentiles indicate the percentage of scores that fall below a particular value. They tell you where a score stands relative to other scores. For example, a person with an IQ of 120 is at the 91st percentile, which indicates that their IQ is higher than 91 percent of other scores.
Percentiles are a great tool to use when you need to know the relative standing of a value. Where does a value fall within a distribution of values? While the concept behind percentiles is straightforward, there are different mathematical methods for calculating them. In this post, learn about percentiles, special percentiles and their surprisingly flexible uses, and the various procedures for calculating them.
Using Percentiles to Understand a Score or Value
Percentiles tell you how a value compares to other values. The general rule is that if value X is at the kth percentile, then X is greater than k% of the values. Let’s see how this information can be helpful.
Often the units for raw test scores are not informative. When you obtain a score on the SAT, ACT, or GRE, the units are meaningless by themselves. A total SAT score of 1340 is not inherently meaningful. Instead, you really want to know the percentage of test takers that you scored better than. For the SAT, a total score of 1340 is approximately the 90th percentile. Congratulations, you scored better than 90% of the other test takers. Only 10% scored better than you. Now that’s helpful!
Sometimes measurement units are meaningful, but you still would like to know the relative standing. For example, if your one-month-old baby weighs five kilograms, you might wonder how that weight compares to other babies. For a one-month old baby girl, that equates to the 77th percentile. Your little girl weighs more than 77% of other girls her age, while 23% weigh more than her. You know right where she fits in with her cohort!
Special Names and Uses for Percentiles
We give names to special percentiles. The 50th percentile is the median. This value splits a dataset in half. Half the values are below the 50th percentile, and half are above it. The median is a measure of central tendency in statistics.
Quartiles are values that divide your data into quarters, and they are based on percentiles.
- The first quartile, also known as Q1 or the lower quartile, is the value of the 25th percentile. The bottom quarter of the scores fall below this value, while three-quarters fall above it.
- The second quartile, also known as Q2 or the median, is the value of the 50th percentile. Half the scores are above and half below.
- The third quartile, also known as Q3 or the upper quartile, is the value of the 75th percentile. The top quarter of the scores fall above this value, while three-quarters fall below it.
The interquartile range (IQR) is a measure of dispersion in statistics. This range corresponds to the distance between the first quartile and the third quartile (IQR = Q3 – Q1). Larger IQRs indicate that the data are more spread out. The interquartile range represents the middle half of the data. One-quarter of the values fall below the IQR while another quarter of the values are above it.
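As a quick sketch, Python’s built-in statistics module can compute quartiles and the IQR. The small dataset below is made up for illustration; the “exclusive” method corresponds to the rank = p(n + 1) interpolation convention covered later in this post.

```python
import statistics

data = [7, 15, 36, 39, 40, 41]

# Quartiles via the "exclusive" convention: rank = p(n + 1), interpolated.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

iqr = q3 - q1  # the spread of the middle half of the data
print(q1, q2, q3, iqr)  # 13.0 37.5 40.25 27.25
```

Note that `method="inclusive"` would give different cut points, which is exactly the “no single standard definition” issue discussed below.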
Percentiles are surprisingly versatile because you can use them not only to obtain a relative standing, but also for dividing your dataset into portions, identifying the central tendency, and measuring the dispersion of a distribution. Consequently, percentiles, in the form of quartiles, are at the core of the five-number summary, which is an exploratory data analysis tool for descriptive statistics.
Related posts: Quartiles, Measures of Central Tendency, and Measures of Dispersion
Calculating Percentiles Using Values in a Dataset
Percentile is a fairly common word. Surprisingly, there isn’t a single standard definition for it, so there are multiple methods for calculating percentiles. In this post, I cover four procedures. The first three are methods that analysts use to calculate percentiles when looking at the actual data values in relatively small datasets. These methods define the kth percentile in the following different ways:
- The smallest value that is greater than k percent of the values.
- The smallest value that is greater than or equal to k percent of values.
- An interpolated value between the two closest ranks.
While the first two definitions might not seem drastically different, they can produce significantly different results, mainly when you are working with a small dataset. As you will see, this difference occurs because the first two definitions use different ranks that correspond to different scores. The third definition mitigates this concern by interpolating between two ranks to estimate a percentile value that falls between two values.
To calculate percentiles using these three approaches, start by ranking your dataset from the lowest to highest values.
Let’s use these three methods with the following dataset (n=11) to find the 70th percentile.
Definition 1: Greater Than
Using the first definition, we need to find the value that is greater than 70% of the values, and there are 11 values. Take 70% of 11, which is 7.7. Then, round 7.7 up to 8. Using the first definition, the value for the 70th percentile must be greater than eight values. Consequently, we pick the 9th ranked value in the dataset, which is 40.
Definition 2: Greater Than or Equal To
Using the second definition, we need to find the value that is greater than or equal to 70% of the values. Thanks to the “equal to” portion of the definition, we can use the 8th ranked value, which is 35.
Using the first two definitions, we have found two values for the 70th percentile—35 and 40.
Definition 3: Using an Interpolation Approach
As you saw above, using either “greater” or “greater than or equal to” changes the results. Depending on the nature and size of your dataset, this difference can be substantial. Consequently, a third approach interpolates between two data values.
To calculate an interpolated percentile, do the following:
- Calculate the rank to use for the percentile. Use: rank = p(n+1), where p = the percentile and n = the sample size. For our example, to find the rank for the 70th percentile, we take 0.7*(11 + 1) = 8.4.
- If the rank in step 1 is an integer, find the data value that corresponds to that rank and use it for the percentile.
- If the rank is not an integer, you need to interpolate between the two closest observations. For our example, 8.4 falls between 8 and 9, which corresponds to the data values of 35 and 40.
- Take the difference between these two observations and multiply it by the fractional portion of the rank. For our example, this is: (40 – 35) × 0.4 = 2.
- Take the lower-ranked value in step 3 and add the value from step 4 to obtain the interpolated value for the percentile. For our example, that value is 35 + 2 = 37.
Using three common calculations for percentiles, we find three different values for the 70th percentile: 35, 37, and 40.
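The three definitions can be sketched in Python. The dataset below is illustrative (the post’s original data isn’t reproduced here); it is chosen so that the 8th and 9th ranked values are 35 and 40, matching the worked example above.

```python
import math

def pct_greater(data, p):
    """Def. 1: smallest value greater than p percent of the values."""
    s = sorted(data)
    k = math.ceil(p * len(s))     # number of values the result must exceed
    return s[k]                   # 0-based index k is the (k+1)th ranked value

def pct_greater_equal(data, p):
    """Def. 2: smallest value greater than or equal to p percent of the values."""
    s = sorted(data)
    k = math.ceil(p * len(s))
    return s[k - 1]               # the kth ranked value

def pct_interpolated(data, p):
    """Def. 3: rank = p * (n + 1), interpolating between the two closest ranks."""
    s = sorted(data)
    rank = p * (len(s) + 1)       # 8.4 for p = 0.7, n = 11
    lo, frac = int(rank), rank % 1
    lo = min(max(lo, 1), len(s))  # clamp to valid ranks
    if frac == 0 or lo == len(s):
        return s[lo - 1]
    return s[lo - 1] + frac * (s[lo] - s[lo - 1])

# Illustrative data: the 8th ranked value is 35, the 9th is 40.
data = [5, 10, 15, 20, 25, 30, 32, 35, 40, 45, 50]
print(pct_greater(data, 0.7))        # 40
print(pct_greater_equal(data, 0.7))  # 35
print(pct_interpolated(data, 0.7))   # 37.0
```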
Percentile Ranks
I’ve recently learned about yet another way to calculate a type of percentile called percentile ranks. Like the interpolation method, percentile ranks are a way to split the difference between definition 1 and definition 2 above. Analysts frequently use this measure for standardized test scores because those tests have many repeated scores on an integer distribution. For example, millions of students take the SAT, yet their scores on a single section can only be integers from 200 to 800. There will be a vast number of repeated scores!
A percentile rank is the percentage of scores that are less than the score of interest, plus half the percentage of scores that are equal to the score of interest. This method literally splits the middle of the block of repeated values for the score of interest. For more information, read the Wikipedia article about Percentile Ranks.
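A minimal sketch of that calculation (the scores here are made up to show a block of repeated values):

```python
def percentile_rank(scores, x):
    """Percent of scores below x, plus half the percent of scores equal to x."""
    below = sum(s < x for s in scores)
    equal = sum(s == x for s in scores)
    return 100 * (below + 0.5 * equal) / len(scores)

# Ten illustrative test scores with a block of repeats at 600.
scores = [500, 550, 600, 600, 600, 600, 650, 700, 750, 800]
print(percentile_rank(scores, 600))  # 40.0: 20% below plus half of the 40% equal
```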
Next, I’ll show you one more method for calculating percentiles that does not directly use the values in the dataset.
Using a Probability Distribution Function to Estimate Percentiles
If you know the probability distribution function (PDF) that a population of values follows, you can use the PDF to calculate percentiles. Perhaps the population follows the normal distribution? Or, you might have collected a sample and then identified the PDF that provides the best fit.
Read my post about identifying the distribution of your data. This approach identifies the population distribution that has the highest probability (i.e., maximum likelihood) of producing the distribution that you observe in a random sample from that population.
After you identify the distribution for your sample, you can use your statistical software to calculate the percentage of values in the distribution that falls below a value. I’ll use graphs to show two examples to make the ideas crystal clear. I’m using Minitab statistical software to generate these graphs. The data for one example follows a normal distribution while the other follows a skewed lognormal distribution. Both of these variables were collected from the same sample of middle school girls.
Related post: Understanding Probability Distribution Functions
Using the Normal Distribution to Estimate Height Percentiles
Height tends to follow the normal distribution, which is the case for our sample data. The heights for this population follow a normal distribution with a mean of 1.512 meters and a standard deviation of 0.0741 meters. For normally distributed populations, you can use Z-scores to calculate percentiles. This method is convenient when you have only summary information about a sample and access to a table of Z-scores. I talk about Z-scores and show how to use them to calculate percentiles in my post, Z-score: Definition, Formula, and Uses.
However, for this post, I’ll use the probability density function (PDF) to calculate and graph the percentile. In this type of probability density plot, the proportion of the shaded area under the curve indicates the percentage of the distribution that falls within that range of values. For this graph, I shade the region that contains the lower 70% of the values, and the software calculates the height that corresponds with this percentage, which is the 70th percentile.
The plot above shows that a height of 1.551 meters is at the 70th percentile for this population of middle school girls.
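You don’t need specialized software for this particular calculation. Here’s a sketch using Python’s standard library with the distribution parameters from above:

```python
from statistics import NormalDist

# Heights follow a normal distribution with mean 1.512 m and SD 0.0741 m.
heights = NormalDist(mu=1.512, sigma=0.0741)

p70 = heights.inv_cdf(0.70)  # the 70th percentile
print(round(p70, 3))         # 1.551 meters, matching the plot
```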
Related post: Understanding the Normal Distribution
Using the Lognormal Distribution to Estimate Body Fat Percentiles
Not all data follow the normal distribution. For example, the body fat percentage data for the same sample are skewed. In my post about identifying the distribution of your data, I determined that these data follow a lognormal distribution with a location of 3.32317 and a scale of 0.24188.
The graph below clearly shows the right-skew. Below, I use the same process to calculate the 70th percentile for body fat percentage as I did for height. I only need to specify the correct distribution for the software. Using this approach, we’re sure to factor in the skewness of our data when obtaining percentiles.
The plot above shows that having 31.5% body fat is at the 70th percentile for this population of middle school girls.
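The lognormal calculation is just the normal one on the log scale—the location and scale parameters are the mean and standard deviation of the underlying normal distribution—so the standard library suffices here too:

```python
import math
from statistics import NormalDist

# Body fat follows a lognormal distribution: location 3.32317, scale 0.24188.
location, scale = 3.32317, 0.24188

# A lognormal percentile is the exponentiated percentile of the underlying normal.
z70 = NormalDist().inv_cdf(0.70)
bodyfat_p70 = math.exp(location + scale * z70)
print(round(bodyfat_p70, 1))  # 31.5 percent body fat, matching the plot
```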
Cumulative distribution functions (CDFs) are related to probability distribution plots but instead use percentiles for the Y-axis, making them a great way to find percentiles! Learn more about Cumulative Distribution Functions: Uses, Graphs & vs PDF.
Empirical cumulative distribution function (eCDF) plots are a special type of graph that compares the observed cumulative distribution in your sample data to a fitted distribution. To learn more about this type of graph and why you’d use it, read my Guide to Empirical CDF plots.
Percentiles are a very intuitive way to understand where a value falls within a distribution of values. However, if you need to calculate a percentile, you’ll need to decide which method to use!
Dear Jim, there is no calculation error. What he means is that Rank = 0.95*(11+1) = 11.4, but there is no item in position 11.4, of course… So how to proceed?
Hi Rui,
It’s a bit vague who you’re referring to when you say “he.” If you’re referring to Kevin, no one said any calculations were in error. The discussion was about how there are different ways to calculate percentiles and none of them are standard. Kevin was unsure how Excel calculates percentiles.
As to your question, that commonly happens, particularly with smaller datasets. The same thing happens with the median, which is the 50th percentile. It doesn’t always correspond to a value in the dataset. It’s not really a problem, it’s just how it works out. If you really want to have a percentile that applies to a specific data point in the dataset, you’ll need to adjust the percentile value until it hits a data point.
Hi Jim,
If a student scores at the 23rd national percentile rank, and the following year the student scores at the 23rd national percentile rank on the same assessment, can you conclude the student made a year’s growth but did not close the achievement gap?
Hi Trudie,
It depends on the specifics, but it’s probably safe to interpret it that way.
If the percentiles are based on the relative standing within each of the two years AND growth is expected over that time, then yes, it suggests that the student grew at a typical rate for all students over that year. The student’s outcome is greater than 23 percent of their peers in year one and greater than 23 percent of their peers in year two. Student outcomes improved overall during those two years and the specific student’s outcomes kept pace. In other words, s/he grew in absolute terms but not in relative terms.
However, if improved outcomes are not generally expected over those two years, then the student didn’t grow but also didn’t lose ground. In this case, the student wouldn’t have grown in either absolute or relative terms. But the cohort as a whole didn’t grow either.
If you’re talking about an educational setting, it’s safe to assume that growth/higher achievement is expected. So, your interpretation is probably correct.
I’m trying to teach myself some really basic stats and I’m not sure that I’m understanding a basic concept. For a score at, say, the 4th percentile, how many people out of 100 would have that same score? How many would have a lower score? How many would have a higher score?
Hi Divan,
Percentiles don’t tell you how many have a particular value. Instead, percentiles indicate the percentage of scores that are less than a particular value. For example, if the value of X is the 75th percentile, then you know that 75% of scores (or 75 out of 100) are less than the value of X. However, percentiles don’t tell you how many scores are equal to X. To find the percentage of higher scores, you just need simple subtraction: take 100 minus the percentile. For example, if X is at the 75th percentile, then 100 - 75 = 25%. Hence, 25% of scores are greater than X.
I was wondering how you calculate the percentile of a dataset that has multiple repeated values. For example: 100 participants, 50 of them score 50, 25 score less than 50, and 25 score more than 50. If I have scored 50, am I in the 25th percentile or the 75th percentile? Do we always go with the highest (last index of the provided value in the set), the lowest (first index of the provided value in the set), or some sort of average of the last and first index?
Hi Jim, I have the same issue. In the interpolation example….0.95(11+1)=11.4. I am trying to replicate what excel is doing with the percentile.inc formula. None of these examples, even the percentile rank, are working for me to replicate the excel result. Do you happen to know the formula that excel uses? Thank you.
Hi Kevin,
Offhand, I’m not sure what method Excel uses. I’ll need to take a look into it. Unfortunately, there is no universal standard for which method is best. The differences between methods usually become more noticeable with smaller datasets.
Hi, thanks for this informative article. My question is whether it would make sense to use percentiles to compare a small number of similar organizations? My dataset is small and uncomplicated, just the total operating budget figure for each of the 15 institutions (including my own). It is somewhat useful to produce a bar graph showing the absolute values from lowest to highest. However, the values in this dataset can vary widely, so the bar graph needs some extra context. I can calculate the percentile where my organization’s budget number falls. However, I have never seen anyone use percentiles in this manner. Would it be a generally legitimate way to utilize percentiles? Thank you.
Hi Chuck,
I could see how it would make sense to say which percentile your organization’s budget falls at because it would indicate where your organization falls within the group of organizations.
Good day! I have a question that I can’t seem to reconcile. Here it goes.
Using the sample data you’ve given, if we are to compute P95, it would mean that we are looking for the 11.4th item. Since the given data only goes up to 11 items, how can we solve for P95? Or, in any case, how do we compute a percentile whose rank goes beyond n?
Hi Arvin,
I show multiple ways to calculate percentiles in this post. Each method handles the fact that you have a finite number of observations in a sample using a different method. The percentiles don’t always map perfectly to a specific data point.
For your case, I’m not sure how you’d get a result for the 11.4th item in a list of 11 for the 95th percentile. I’d guess there was a calculation error. I suggest recalculating using one of the methods I show in this post. Personally, I favor the interpolation approach.
Hello Jim,
Is there a difference between percentile and percentile rank? I read on Wikipedia that they are ‘opposite’ concepts, though I have found no real explanation as to why. I searched Google for other sources, but I am still left confused.
Hi D,
This is one I had to look up! As you’ve seen in my article, there are several ways to calculate percentiles. And, I’ve seen varying definitions of both percentiles and percentile ranks. The confusion often occurs around whether percentiles and percentile ranks are “less than” or “less than or equal to” a specific value. For example, suppose you take a standardized test and get a score of 600 and are told that it is the 90th percentile. Does that mean that 90% of scores are less than 600? Or, 90% of scores are less than or equal to 600?
For percentiles, it can go either way and depends on the specific calculation method you use, which I discuss in this article. Statisticians say that percentiles can be either inclusive or exclusive of the score you’re considering, which is 600 in our example.
If I understand correctly, percentile rank refers to a very specific calculation method. The method includes the score in question but only counts half the scores with that value. So, in our example of a test score of 600, a percentile rank would be the percentage of scores that are less than 600 PLUS adding the percentage of scores that corresponds to half of the scores with values of 600. So, it’s inclusive of half the score in question plus all lower scores. In other words, it’s between the exclusive and inclusive percentile values.
I think percentile rank was created to address an issue that occurs particularly in standardized testing. I have not read up on the reasons for this specific calculation method. Usually, for continuous data, you’re unlikely to have many, if any, repeat values. However, imagine, say, the SAT, where the scores for a section can range from 200-800 and they occur only in integers. And you have millions of students taking the SAT each year. Millions of students and 601 possible score values. There’s going to be many repeated values for the scores. And, there are some scores near the mode that will have far more repeated values than other scores further away from the mode. So, they need to account for the discrete nature of their scores, the vast number of repeated values for some of them, and the unequal number of repeated values for the different scores. I’m reading between the lines here, but that is my sense for why they’d come up with a special name for this particular calculation method. They’re literally trying to split the middle for the blocks of repeated scores.
Here’s the Wikipedia article about Percentile Ranks that explains this issue. It’s not an opposite concept. Instead, it’s the same basic concept but with a specific calculation method.
I hope that helps!
good session
Hello Sir. Your post has been really insightful for me. I want to assess the “perceived importance” of X variable in my study based on 5 point Likert scale. I want to categorize the scores into three groups- ‘Very important’, ‘Fairly Important’ and ‘Less Important’. I am planning to use percentile scores for this purpose. Can I use 33rd and 66th percentile as cut off scores for categorization of scores or is there any other way to determine the cutoff range?
If a data chart shows that the score 62 equates to the 10th percentile, how would one find what the score of 55 equates to in percentile without having an entire data set? The information given is the following:
So if a student scores 55, what percentile would that equate to? Is there a way to solve that based on this info.?
90% = 161
75% = 137
50% = 97
25% = 79
10% = 62
Is this the formula used to calculate one’s percentile rank in a March Madness Bracket Challenge game? There were 14.7 million brackets entered in ESPN’s Tournament Challenge for the NCAA tournament this year (2021). If you rank in the 25th percentile, does that mean your rank is
rank = p(n+1),
= .25(14,700,000+1)
= 3,675,000.25.
Does that mean that, when I ranked in the 25th percentile, I ranked about 3,675,000 out of 14.7 million? Any help is appreciated.
How can I interpret the results of 75th, 50th, and 24th percentiles?
Hi Roselyn,
This post contains the answers to your questions. Look through the early paragraphs and you’ll know how to interpret all percentiles.
Dear Jim
In a published manuscript (“Modern Analytical Facilities 2. A Review of Quality Assurance and Quality Control (QA/QC) Procedures for Lithogeochemical Data,” Stephen J. Piercey), I saw something named the percentile factor (PF). Would you please give me the way it is calculated?
Thank you very much. It is most helpful.
Hello Jim,
Thank you very much for your article and explanation.
I work with many instruments measuring behavioural and emotional functioning in people. T-scores and percentiles are common metrics here. I am trying to understand, however, why the percentiles provided by the tool developers differ among some tools when we put them against the T-scores. So, a T-score of, let’s say, 56 would correspond to the 78th percentile on one tool, while on another tool the same T of 56 would correspond to the 73rd percentile. Is this due to the approach used for calculating them?
Hi Alexandra, as I show in this blog post, there are different ways to calculate percentiles that give slightly different answers. And there’s even an additional way that I don’t show in this post that uses probability distributions, such as the t-distribution. The tools must be using different calculation methods.
Hi Jim,
Thank you so much for the excellent advice, guiding me towards a very good analysis of the problem. Can you please suggest any literature where I can read about raw percentiles? I am hearing this term for the first time.
Thank You
Yusuf
Hi Yusuf,
Percentiles are such a basic concept that I doubt you’ll need a reference for them. If you do, most introductory statistics books will cover them. I know I cover percentiles in my intro book. Oh, I noticed that you’re asking about “raw” percentiles. By that, I just mean modeling the percentiles themselves rather than converting them into groupings. When you have continuous data, it’s usually better to analyze it as continuous data rather than converting it to categorical or ordinal data, because converting throws out information.
Another approach occurs to me. You can try modeling the percentiles as I described. You can also perform Poisson regression because you’re dealing with count data. You’d need an exposure variable, which accounts for the different populations in the counties.
Hello Jim,
I have cumulative COVID-19 case data for 3000 counties all over the US up to Dec 31. I have two columns: the first column has the names of the counties, and the second column has the COVID-19 cases in those counties up to Dec 31. I need to divide the cases into low, medium, and high numbers of cases using percentiles and then perform ordinal logistic regression. The natural choice is 0-33rd percentile = low cases, 33rd-66th percentile = medium cases, and greater than the 66th percentile = high cases. When I go for this division and perform ordinal logistic regression in Minitab, I am getting a bad log-likelihood value. The log-likelihood value continuously improves as I increase the starting percentile. For example, when I go for the 0-80th percentile (low cases), 80th-95th percentile (medium cases), and greater than the 95th percentile (high cases), I am getting a relatively good log-likelihood value. The log-likelihood will improve further if I increase the starting percentile, but what could be the justification for this?
Here I am stuck on what the logical percentiles to divide the data would be, and how they can be justified.
Sorry for the long question
Thank You
Yusuf
Hi Yusuf,
Justifying where to make the divisions is a tricky issue. I don’t know the subject area well enough to give you a concrete answer. You could check the literature to see if anyone has devised a scheme and how they justify it. On the one hand, it’s great that the second scheme you devised worked better. However, you don’t want to be cherry-picking schemes based on what gives you better analysis results! From your question, it appears you appreciate that concern.
One solution might be to use the raw percentiles and fit either an OLS model or a nonlinear model, depending on which provides the better fit to the raw percentiles. The models won’t recognize the hard breaks at 0 and 100, but they might give you a better fit than devising artificial categories. It’s probably best to avoid the whole question of devising categories altogether by using the raw percentiles. I’d at least look into that approach.
Also, if you stick with ordinal regression, try the other link functions. Minitab uses Logit by default, but you can change that in the Options dialog and see if another link gives you a better fit. You can also try including interaction and polynomial terms if they make sense. But, again, consider using the raw percentiles and avoid creating ordinal data. Some data are inherently ordinal and you have no recourse. However, here you have better data and should try to use it!
Kindly tell me, sir, how many percentiles I can calculate for this data set: 52 57 62 62 62 62 65 66 67 68 68 68 69 69 69 71 71 72 72 73 74 74 75 75 75 76 77 78 79 79 79 80 80 82 83 85 88 89 91 93 97 97 97 98 99 101 104 105 105 109.
Hi Yashfa,
I’m not 100% sure what you’re asking. If you’re asking how many different percentiles you can calculate for that set of data, you can technically calculate an infinite number of percentiles that will range from 0-100. It’s an infinite number because you can calculate the 82.454th percentile, 92.36456th percentile, etc. There are an infinite number of percentiles that you could calculate.
Thanks. They did have what they called a comparison group of those who did not, but the matching was a bit off, in my humble opinion. Needless to say, those not receiving specialist help also improved, and depending on grade and other factors, there were ‘modest’ gains when a specialist was involved. My objection, though, is that there’s not much difference in actual scores, if you will, between the 10th and 18th percentiles, and the fact that they didn’t report means, SDs, etc., and only percentiles, was for me troublesome. Thanks again for your response.
I was reading an article put out by a school district using aimsweb whereby they attempted to demonstrate the efficacy of using a reading specialist. They used percentile ranks in reporting their data. Data were available for two time periods, Fall and Spring. Fall data showed students’ scores at the 10th percentile, and in the Spring they were at the 18th percentile. However, when describing these results they stated
“K-3 students who received support from a Reading Specialist increased 8 percentage points between fall and spring”. Is this an appropriate way of reporting data?
Hi Wilfredo,
Based on what you write, I’d say that reporting is partially correct and partially not. I would not describe that increase as 8 percentage points because they’re writing about percentiles. It would be more accurate to say that these students increased from the 10th percentile to the 18th percentile on average. It’s not stated that these are average percentiles but I’m assuming they are.
This reporting also doesn’t compare the growth of these students to students who did not have a reading specialist. That’s important comparison information. Did students without reading specialists improve more or less?
Finally, the reporting correctly does not state that the reading specialist caused the increase in ability. Given the little information I have, it doesn’t seem likely that they’d be able to infer a causal relationship. There might be other factors actually causing the increase, either differences between those with and without specialists or just the passage of time itself. The description doesn’t say that the specialists caused the increase (again, that’s correct), but it’s important to note the limitation.
I hope that helps!
Is it possible to calculate percentile if I have mean and median available but not the dataset?
Hi Jahnvi,
If you know the mean and standard deviation and can assume the distribution is roughly normal, you can use the properties of the normal distribution to estimate the percentiles. For more information, read my post about the normal distribution, which covers how to do that. However, you need both the mean and standard deviation. With only the mean, it’s not possible.
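As a quick standard-library sketch of that approach (the mean and SD below are illustrative, chosen to match the IQ example at the top of the post):

```python
from statistics import NormalDist

# Assuming an approximately normal distribution with mean 100 and SD 15,
# find the percentile of a score of 120.
pct = NormalDist(mu=100, sigma=15).cdf(120) * 100
print(round(pct, 1))  # roughly the 91st percentile
```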
Hi
is there a way or program to calculate percentiles from graphs in the case we do not have the dataset but only the graph?
Hi, I don’t know of any direct way to calculate percentiles from a graph alone. Of course, it depends on the type of graph.
If you have a histogram or a probability distribution plot, you might be able to estimate percentiles based on the distribution. If you know the mean and standard deviation (whether from the graph or numeric output) and the distribution is normal or not blatantly nonnormal, you can certainly use that distribution information to estimate percentiles. Read my post about the normal distribution to see how you can use that approach. In one section, I cover how to calculate percentiles using that approach. That’s probably the best way.
If you have an individual value plot, you might be able to estimate the actual data values and then calculate the mean and standard deviation.
Basically, try to determine the properties of the distribution from the graph, and then use that distribution information to estimate percentiles. Unless you know the precise distribution information or have the raw data, you’ll only be able to approximate percentiles based on what you observe in the graph.
Hi Jim,
Many thanks for providing these explanations in a language that is actually easy to understand!
My question is similar to that asked by J R Jenks above. I have a dataset of 500 individuals who have visited a neighbour x times per month. And I want to translate this information into a Likert scale = (1) never, (2) seldom, (3) often, and (4) always. The problem is that I have way too many zeros in my dataset. This means many individuals expressed that they simply don’t visit their neighbours.
When using Excel to convert this data into percentiles, the 0-times responses cover the 10th, 20th, and even the 40th percentile. My question is, does it make sense to group all those zero observations into one category (i.e., effectively ignoring them) and use a quartile ranking for the remaining observations? Or should they be included with all other responses in a quartile or decile ranking calculation? Any help will be much appreciated!
I am working on some Census datasets and am somewhat mystified by the practice of ignoring zeros when computing percentiles (I know how to do it, but intuitively it seems … wrong).
Say you were looking at “households with children”. You have total households for each area (county, tract, whatever), and the number of households containing children. So you calculate the percentage for each area. Then using the range of those percentages, you can calculate the percentile each falls into (and subsequently group them into deciles, etc.).
In following some recommended procedures, they say to first eliminate any “zero” values from the percentile calculation. If for some particular area, no households have children, its percentage would be zero, and that entry in the table should be excluded when calculating percentiles.
It seems to me that skews the result. If some areas have zero of something, and others have various amounts, why would the zero-valued area be excluded? If an area had only one “household with children”, it would be included, but the one with none would not.
Thanks for all your helpful elucidation on stats!
Hi Jim,
Here’s a scenario that doesn’t make sense to me. Take the following set:
2, 13, 33, 33, 51, 99, 100, 100
If the question asks me for the percentile of the value of 51, I would do:
4/8 * 100 = 50th percentile.
If the question asks me to find the value of the 50th percentile in the set, I would do:
Rank = p(n+1) = 0.50*(8+1) = 4.5
The value corresponding to this rank would be 33 + (51-33)*0.5 = 42.
So, I would find 42 to be the 50th percentile.
I’m aware that the second question above uses the third (interpolation) definition of percentile that you described. If I use the first definition, then my answer to both questions will be 51.
Is there a way to use the interpolation definition in both questions above and arrive at the same answer? Or is the interpolation scenario doomed to fail in the above scenario?
Hi Sachin,
That’s a great question! There are several things at play here. The first is that, as you point out, you’re using different methods. And, it’s not surprising that different methods will come up with different answers. Additionally, it’s a very small sample, so the precision of the estimates will be low. And, the 50th percentile is the median. There are large gaps between some of the numbers, which means the precise method you use to calculate the median can produce fairly different answers. If we drew a larger random sample from the same population, we’d start filling in those gaps and get more complete information about the distribution. The differences between the various methods would decrease.
For your dataset, I would say that the interpolation method is not doomed because I think it’s giving the best answer. However, it’s “doomed” in the sense that it is destined to give a different answer for these data.
Again, think of it as the median. The method for calculating the median with an even number of observations is to move inwards until you reach the center two numbers. The middle two numbers are 33 and 51. You then take the average of those two to calculate the median, which comes to 42. That’s a different way of doing the interpolation method. For this dataset, there’s just a relatively large gap between 33 and 51. The 50th percentile is most likely in there somewhere. Given the small dataset, 42 is the best estimate that we have.
My sense is that 51 is a bit on the high side. And, there are in fact only 3 values above it and 4 below it. So, saying it’s the 50th percentile doesn’t feel quite right to me. Indeed, definition 1, greater than, gives you 51 because you need to use the 5th ranked value. The second definition, greater than or equal to, gives you 33 because you can use the 4th ranked value. But, neither of those is in the middle of the dataset. One is a rank too high and the other a rank too low. With a small dataset, that makes a difference.
Both the interpolation method and the median method find a better answer that falls between the actual values in the dataset. I think the underlying problems of the first two methods are twofold; the small dataset and being forced to use an actual value in the dataset. Using the interpolation method, you’re still stuck with the small dataset, but at least you’re not stuck with using an existing value.
Hi Jim,
Not sure if you can help me? I am looking at UK salaries for different roles. I have the overall UK 25th, 50th, 75th, and 100th percentile values. I also have the UK average (mean).
I have also the regional average and median.
Is there a way that I can work out the regional percentiles from the data I have?
Not sure if I have posted on the correct thread, but I’m puzzled as to whether this is possible.
I appreciate the feedback Jim. The math is there to flip it around but I just wanted to see if there was a precedent for such scenarios.
Best regards,
-Steven
Hello Jim,
Thanks for the article on this page.
I am curious if there is a prevailing opinion on whether percentiles should follow the direction of the overall measure?
For example, I have a compliance measure whose graph needs to trend upward to compliance of 100%, so theoretically top performers would be above the 95th percentile. However, on a graph of harm incidents you want to trend downward to 0%. In that case, do you want to be striving for the 5th percentile?
I have clients that want to be in the top 5 percent. Mathematically I see it is feasible to flip the calculation but there is the consideration of trending direction.
Thoughts?
Hi Steven,
It seems like there’s no contradiction for your compliance measure. If clients want to be in the top 5%, they’d have to be at or above the 95th percentile. But, yes, for the harm measure you’d want to be at the 5th percentile at most. Is this just a perception thing among your clients? They want to be in the top 5% versus the bottom 5%. If so, I don’t see any reason not to flip it as you say. As long as the system works given your needs.
If I’ve misunderstood your question, please let me know. But, I don’t see a problem with what you’re proposing. Percentiles are based on ranks. All you’re really doing is changing the ranking criteria from low is bad to low is good. Given your scenario, that sounds completely legitimate.
Hello Jim, you have shared a very nice article full of useful information!
In my opinion, percentiles are vital statistical tools. Percentiles provide an indication of how the data values are spread over the interval from the smallest value to the largest value.
Hi Jim,
Thanks for the wonderful explanation. I am figuring out the results of my data analysis. I have LiDAR point clouds that were collected from terrestrial and UAV-based sensors on the same landscape. Basically, terrestrial LiDAR collects much denser points than UAV LiDAR. When I calculate the 5th, 50th, and 90th percentiles, the UAV data height percentile values are always higher than the terrestrial data height percentiles. I am not sure how to interpret this. Does it mean the terrestrial sensor is collecting more data in the lower layer than the UAV sensor, and the UAV collects more points in the upper layer of the landscape? I would greatly appreciate it if you could help me with the interpretation.
How to assign ranks to 2 or 3 individuals having the same score in ungrouped data?
How do I interpret each result after I get all the final answers?
Hi Calisie,
Do you mean the different ways of calculating percentiles? The interpretation is the same, which I know is confusing! Large differences between the results arise when you have smaller sample sizes. It’s just harder to get good estimates for anything, including percentiles, with small samples. Small samples tend to have more erratic estimates in general. However, once you decide which approach to use, the interpretation is the usual one for percentiles.
How would you calculate percentile rank for something with an underlying exponential distribution?
Hi Nat,
You can use any of the methods I discuss in this post. You can calculate the percentiles based on the values in your dataset using one of those three methods. Or, find the distribution that best fits your data (presumably the exponential distribution in your case) and use that to calculate percentiles. In this post, I use a lognormal distribution to illustrate this method, but you can use the exponential distribution.
Why are there different percentile calculations?
Hi, there are several different reasons. For one thing, you’re starting with slightly different definitions. For whatever reason, there’s not one standard definition. The calculations depend on how you define it (greater than versus greater than or equal to). This problem is exacerbated with smaller datasets where the difference in definition has a larger impact on the end result. There’s also the fact that you can calculate percentiles for values in a dataset or you can use probability distributions to calculate percentiles based on estimates of the population parameters. In short, there are different calculations because of different definitions and different goals (i.e., for values in a dataset vs. for a population).
How do I calculate the percentile ranks for data where a lower score means better performance?
Hi Appy,
What you need to do is start by ranking the scores accordingly. Put the higher data values with lower ranks and the lower data values with higher ranks, the opposite of what I show in this post. Then, where I talk about values being “greater than,” you need to substitute “less than.” I believe with those changes you can proceed as I show in this post.
When you report the results, be sure to clarify how you’re using percentiles in this manner. For example, “70% of the scores are worse than X, where high values indicate worse performance.” Something like that because I think it would be easy to get confused given the normal usage of percentiles.
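A quick sketch of that reversal in Python (the function and the race-time data are illustrative, using the “greater than” definition from the post):

```python
def percentile_rank(scores, x, lower_is_better=False):
    """Percent of scores that x outperforms, under the "greater than" definition."""
    if lower_is_better:
        beaten = sum(1 for s in scores if s > x)   # higher score = worse performance
    else:
        beaten = sum(1 for s in scores if s < x)
    return 100 * beaten / len(scores)

# Illustrative race times in minutes, where lower is better
times = [12.1, 11.8, 13.0, 12.5, 14.2, 11.5, 12.9, 13.6, 12.0, 13.3]
rank = percentile_rank(times, 12.5, lower_is_better=True)   # 50.0: beats half the field
```

The only change from the usual calculation is the direction of the comparison, which is why flipping the ranking criteria is legitimate.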
I hope this helps!
nice and clear. tell us about logistic and Bayesian analysis
Thanks! I’d like to address those in future posts. So many potential topics to cover!