What is Benford’s Law?
Benford’s law describes the relative frequency distribution for leading digits of numbers in datasets. Smaller leading digits occur more frequently than larger ones. This law states that approximately 30% of numbers start with a 1, while less than 5% start with a 9. According to this law, leading 1s appear 6.5 times as often as leading 9s! Benford’s law is also known as the First Digit Law.
If leading digits 1 – 9 had an equal probability, they’d each occur 11.1% of the time. However, that is not true in many datasets. The graph displays the distribution of leading digits according to Benford’s law.
Analysis of datasets shows that many follow Benford’s law. For example, analysts have found that stock prices, population numbers, death rates, sports statistics, TikTok likes, financial and tax information, and billing amounts often have leading digits that follow this distribution. Below is a table that Benford produced for his 1938 study, which shows the different types of data he evaluated.
While Benford popularized the law in 1938, he didn’t actually discover it. Simon Newcomb first found the distribution in 1881. Hence, some analysts refer to it as the Newcomb-Benford Law.
In this post, learn about Benford’s law, its formula, how it works, and the types of datasets it applies to. Additionally, I’ll work through an example where I assess how well Benford’s law applies to a real dataset. And you’ll learn how to use Excel to assess it yourself!
Uses for Benford’s Law
Benford’s law is an intriguing, counterintuitive distribution, but can you use it for practical purposes?
Analysts have used it extensively to look for fraud and manipulation in financial records, tax returns, applications, and decision-making documents. They compare the distribution of leading digits in these datasets to Benford’s law. When the leading digits don’t follow the distribution, it’s a red flag for fraud in some datasets.
The idea behind why this works is straightforward. When people manipulate numbers, they don’t track the frequencies of their fake leading digits, producing an unnatural distribution of leading digits. In some cases, they might systematically keep values just below a particular threshold. For example, if there is a $100,000 limit on a transaction type, fraudsters might record many amounts just under it, such as $99,000, producing an excess of leading 9s.
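To make the screening idea concrete, here is a minimal sketch in Python (rather than any particular audit tool) that compares observed leading-digit shares against Benford’s expected shares and flags large gaps. The counts and the 5-percentage-point threshold are illustrative assumptions, not values from a real investigation.

```python
import math

# A minimal sketch of the screening idea: compare observed leading-digit
# shares in a batch of records against Benford's expected shares and flag
# digits that deviate by more than a chosen threshold. The counts and the
# threshold below are illustrative, not real audit data.

observed_counts = {1: 180, 2: 95, 3: 70, 4: 55, 5: 45, 6: 40, 7: 35, 8: 30, 9: 150}
total = sum(observed_counts.values())

for d in range(1, 10):
    expected = math.log10(1 + 1 / d)          # Benford's predicted share
    observed = observed_counts[d] / total      # share seen in this batch
    flag = "  <-- investigate" if abs(observed - expected) > 0.05 else ""
    print(f"{d}: observed {observed:.1%}, expected {expected:.1%}{flag}")
```

In this made-up batch, the excess of leading 9s, like the $99,000 pattern described above, is exactly what would prompt a closer look.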
Using Benford’s law to find fraud is admissible in local, state, and federal US courts. In the past, it has detected irregularities in Greece’s EU application and investment return data for Ponzi schemes, such as Bernie Madoff’s.
However, there are several important caveats.
When a dataset you expect should follow Benford’s curve does not, it’s only a red flag, not proof of fraud. You’ll still need to send in the auditors and investigators, but at least you can target them more effectively on questionable records.
Furthermore, not all data follow Benford’s law naturally. In those cases, leading digits that follow a different distribution aren’t signs of fraud. Consequently, it’s crucial to know which datasets are appropriate to compare to it—which takes us to the next section.
When Does Benford’s Law Apply and Not Apply?
Benford’s law generally applies to data that fit some of the following guidelines:
- Quantitative data.
- Data that are measured rather than assigned.
- Ranges over multiple orders of magnitude.
- Not artificially restricted by minimums or maximums.
- Mixed populations.
- Larger datasets are better.
Elaborations on Guidelines
Benford’s law often does not apply to assigned numbers, such as ID numbers, phone numbers, and zip codes.
It works best for data that range over multiple orders of magnitude, from very low to very high, covering the 10s, 100s, 1,000s, and so on. For example, populations and incomes can range from very low to very high.
It helps, but isn’t required, when the values reflect a process with exponential growth or a power law.
Conversely, if the range of values is restricted, it affects the leading digits, and Benford’s law is less likely to apply. For example, human characteristics naturally fall into restricted ranges. Consequently, this distribution doesn’t apply to human ages, heights and weights. Similarly, limits imposed on potential values can also invalidate this law. Awards in small claims courts have an upper limit, which can negate Benford’s law.
Interestingly, mathematicians have proven that numbers drawn from mixed populations follow Benford’s law. Mixed populations are collections like all the numbers pulled from a single magazine issue, which represent various topics and types of values. Benford himself did that with Reader’s Digest and newspapers. You can also combine data from different sources to achieve the same effect.
As with any distribution, larger datasets produce observed relative frequencies that more closely approximate the theoretical values of Benford’s law. Smaller datasets can show relatively large deviations due to random error. Some analysts say datasets as small as 100 are acceptable, but most think a minimum size of 500 or even 1,000 is necessary.
Learn more about Probability Distributions.
Proceed Carefully!
These are all guidelines!
Curiously, it will work in some cases where it should not. For example, it applies to house numbers even though those are assigned.
In other cases, Benford’s law does not apply to seemingly appropriate datasets. For instance, it applies to stock prices but not Bitcoin prices. However, it does fit Bitcoin’s trading volume. Go figure!
There has been a lot of interest in using Benford’s law to look for fraud in election results. However, most analysts agree that using the leading digit is invalid for election results. Efforts have moved on to using the Second-Digit Benford’s Law (2BL) test. However, studies raise questions about its validity too. Currently, academics are debating whether they can use it in any form to flag potential election fraud.
In summary, you’ll need to research whether Benford’s law applies to a particular dataset. Don’t just look for deviations and assume they represent fraud.
Benford’s Law Example
Let’s have some fun with an empirical example of Benford’s law!
I downloaded the population for all 3,143 counties in the US. These are the official County Population Totals from the US Census Bureau for 2020. Do the leading digits for these population numbers follow the hypothesized distribution?
I picked a dataset that should satisfy all the assumptions. The data range from the lowest county population of 64 to the highest of 10,014,009. That definitely covers many orders of magnitude! The values are measured populations with no artificial limits. The large sample size (n = 3,143) easily exceeds recommendations. So, these data are theoretically perfect for this test.
Despite that, I’m feeling a bit skeptical. Why should leading 1s occur much more frequently than 9s for naturally occurring data? I honestly didn’t know how this analysis would turn out before attempting it—and I was surprised by the results!
Testing data against Benford’s law is simple in Excel. You just need two Excel functions:
- LEFT: to extract the leading (or any other) digit.
- COUNTIF: to count how many values start with each leading digit.
I created an Excel column for the leading digits and then made a table of counts and percentages. The whole process took less than five minutes! You can download my Excel file with the dataset and formulas to see how I did it: BenfordsLawExample.
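If you’d rather script it than point and click, here is a minimal Python sketch of the same LEFT/COUNTIF idea: extract each value’s leading digit and tally the digits. The short list of populations is a made-up stand-in, not the Census data.

```python
from collections import Counter

# Stand-in values; in practice you'd load the full county population column.
populations = [64, 523, 1375, 7120, 98403, 212000, 10014009]

# Leading digit = first character of the number written out (Excel's LEFT).
leading_digits = [int(str(p)[0]) for p in populations]

# Tally each digit (Excel's COUNTIF) and convert to percentages.
counts = Counter(leading_digits)
for digit in range(1, 10):
    share = counts.get(digit, 0) / len(populations)
    print(f"{digit}: {counts.get(digit, 0)} values ({share:.1%})")
```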
Below are my results in tabular and graph formats. On the chart, the blue line represents the relative frequency of each leading digit observed in the dataset, while the red line indicates the values that Benford’s law predicts.
I was shocked that the leading digits follow Benford’s law almost perfectly!
Benford’s Law Formula
Benford’s law formula is the following:

P(d) = log₁₀(1 + 1/d)

Where d = the value of the leading digit, from 1 to 9.
The formula calculates the probability for each leading digit. The table below displays the probabilities that Benford’s law formula calculates for all digits.
| Digit | Probability |
|-------|-------------|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |
Note that the Excel file I use for the Benford’s law example also displays the calculations for the probabilities in the second tab.
Generalizations of the formula can calculate probabilities for the second, third, and later digits, as well as for number systems other than base 10.
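If you want to reproduce the table above without opening the Excel file, a few lines of Python applying the formula will do it. The second loop shows one commonly cited generalization for the second digit; treat it as a standard extension of the formula, not something specific to this article.

```python
import math

# First-digit probabilities: P(d) = log10(1 + 1/d).
for d in range(1, 10):
    print(f"Leading digit {d}: {math.log10(1 + 1 / d):.1%}")

# A commonly cited generalization: the chance that the *second* digit is d,
# summed over every possible leading digit k.
for d in range(10):
    p = sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))
    print(f"Second digit {d}: {p:.1%}")
```

The first loop reproduces the 30.1% through 4.6% values shown in the table above.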
How Does Benford’s Law Work?
The simple explanation is that Benford’s law works because you start counting with lower values. They just occur more frequently. You can’t get to the higher values until you work your way through the lower values.
A more technical explanation is that when the fractional parts of the base 10 logarithms of the values in a dataset are evenly distributed over the interval [0, 1], the data follow Benford’s law. This condition frequently holds for naturally occurring data. The formula I showed earlier is built on base 10 logarithms, which helps explain why data spanning multiple orders of magnitude tend to fit.
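Here is a quick simulation, not part of the original analysis, that illustrates the point: when you draw the base 10 logarithm of each value uniformly, so its fractional part is evenly spread over [0, 1], the leading digits land close to Benford’s predictions.

```python
import math
import random

random.seed(1)

# Simulate values spanning several orders of magnitude (10^1 to 10^7) so the
# fractional parts of their base-10 logs are evenly distributed on [0, 1].
values = [10 ** random.uniform(1, 7) for _ in range(100_000)]

counts = [0] * 10
for v in values:
    counts[int(str(int(v))[0])] += 1  # leading digit of the integer part

for d in range(1, 10):
    observed = counts[d] / len(values)
    predicted = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed:.2%} vs Benford {predicted:.2%}")
```

Real datasets don’t arrive this tidy, of course, which is why the guidelines earlier in the post matter.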
However, Benford’s law works for various datasets where it isn’t easy to understand why. In some cases, it just seems to fit. Consider it a rule-of-thumb that is useful in some cases.
Reference
Benford, Frank (March 1938). “The Law of Anomalous Numbers.” Proceedings of the American Philosophical Society, 78(4), 551–572.
Claus H Godbersen says
Maybe it is because measurements are – just like counting – also human conventions: We observe some phenomenon in nature, then we work out a scale to measure it. As the scale has to start at some point, we chose something that we perceive as a lower limit or a neutral state and name it “Zero“. And hence, we often arrive at 1 as a leading digit when we use our scale. Could it be something like that?
mark seiden says
i understand why benford’s law would apply for things that are counted. but i have no intuition why it would possibly apply for values that are measured (like atomic weights or physical constants) or numbers which result from conversion from some native scale (e.g. bitcoin price in USD). can anyone help me with this?
Christopher Maxwell says
To me, this law essentially describes the areas of log-linear graph paper (or maybe log-log)… I haven’t sat down and done the math. But the log (1+1/d) is similar to how one would calculate decibels; which are always relative to a base (usually 1 of some unit)… for instance, electric field strength is expressed as dBuV/m… (20 x log (Field Strength/ 1 uVolt/meter), radiated power (say, for a cell phone signal) is expressed as dBmW/m^2… (10 x log (radiated power/ (1 mW/m^2))… now I’m wondering how dB values would be distributed…
This type of thinking also applies to estimated failure rates of parts/components; which are sometimes rounded to 1, 3, 10, 30, 100 etc… multiples of a power of ten; essentially because half of a decade is captured between 1 and 3, the rest between 3 and 10… and these values can vary over many orders of magnitude.
douglascooper says
Thanks for this. Interesting and informative. I have some ideas about why the law works. I find it easier to understand as P(d) = logten[(d+1)/(d)] the fraction of the proportional (logarithmic) space occupied by the next higher digit (d+1) compared to the digit in question (d). This seems relatable to the probability that the larger digit would occur instead. This is akin to when things stop growing, stopping rules.