Benford’s law asserts that, in naturally produced datasets, lower first significant digits (FSDs) occur more frequently than higher ones. Applications of the law range from detecting election, tax, and Covid-19 data fraud to checking for abnormalities in the stock market. Hence, it is vital to know which common probability distributions satisfy Benford’s law, a problem known as Hill’s question. Many studies have addressed this question by using various methods. The purpose of this work is to give a simpler and more intuitive method for addressing the question for some common probability distributions. Moreover, statistical simulation is adopted to test their conformity to Benford’s law.
I. INTRODUCTION
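The basic form of Benford’s law is the following logarithmic distribution of the first significant digit $d$:

$$\mathrm{Prob}(\mathrm{FSD} = d) = \lg\left(1 + \frac{1}{d}\right), \quad d = 1, 2, \ldots, 9.$$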
The law was first discovered by Newcomb in 1881 and later rediscovered by Benford in 1938 (Ref. 2). According to the law, the digit one occurs as the FSD roughly 30% of the time, whereas nine occurs less than 5% of the time. Many studies have sought to explain the existence of the law by using various methods. Hill gave a rigorous proof of the existence of Benford’s law, under some assumptions, in his milestone work (Ref. 3). However, he also raised an open question: which common probability distributions satisfy Benford’s law? This is called Hill’s question.
It is important to answer this question because Benford’s law has been found in various fields. For instance, Knuth, Burke, and Kincanon observed that about 30% of the most frequently used physical constants have first significant digit 1 (Refs. 4 and 5). The law also has many diverse applications, such as evaluating data validity (Ref. 6) and detecting tax, voter, image, and Covid-19 data fraud (Refs. 7–10).
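
TABLE I. Rejection rate (%) of the Benford hypothesis for samples from log-normal populations with parameters (μ, σ), by sample size.

(μ, σ) | 100 | 500 | 1000 | 2000 | 5000 | 10 000 |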
---|---|---|---|---|---|---|
(0, 0.5) | 99.2 | 100 | 100 | 100 | 100 | 100 |
(0, 1) | 4.5 | 7.5 | 9.2 | 14.5 | 30.4 | 63.7 |
(0, 1.2) | 5.2 | 5.6 | 5.2 | 4.5 | 5.3 | 7.1 |
(0, 1.5) | 4.6 | 4.1 | 5.8 | 5.3 | 5.4 | 4.5 |
(1, 0.5) | 97.3 | 100 | 100 | 100 | 100 | 100 |
(1, 1) | 4.2 | 4.8 | 7.8 | 13.4 | 28.1 | 54.4 |
(1, 1.2) | 5.2 | 5.4 | 4 | 4.9 | 4.9 | 6.1 |
(1, 1.5) | 5.1 | 4.6 | 4.8 | 5.9 | 4.8 | 5.4 |
(2, 0.5) | 98.5 | 100 | 100 | 100 | 100 | 100 |
(2, 1) | 5.8 | 7.8 | 8.9 | 13.7 | 28.3 | 58.1 |
(2, 1.2) | 4.5 | 6.7 | 5.6 | 5.2 | 6.7 | 5.7 |
(2, 1.5) | 6 | 4.1 | 4.6 | 6.1 | 5.1 | 5 |
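
TABLE II. Rejection rate (%) of the Benford hypothesis for samples from Weibull populations with parameters (α, γ), by sample size.

(α, γ) | 100 | 500 | 1000 | 2000 | 5000 | 10 000 |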
---|---|---|---|---|---|---|
(1, 0.5) | 5.7 | 5.4 | 4.8 | 4.6 | 5.6 | 5.1 |
(1, 0.6) | 4.3 | 5.1 | 4.6 | 5.3 | 5.6 | 5.2 |
(1, 1) | 7.3 | 17.4 | 35.4 | 69.5 | 99.4 | 100 |
(1, 1.5) | 42.3 | 99.8 | 100 | 100 | 100 | 100 |
(1, 2) | 94.4 | 100 | 100 | 100 | 100 | 100 |
(2, 0.5) | 5.5 | 5.6 | 5.4 | 6.8 | 5 | 4.5 |
(2, 0.6) | 6.4 | 5.7 | 5.2 | 5.1 | 4.9 | 6 |
(2, 1) | 6 | 14.9 | 30.5 | 65.1 | 97.8 | 100 |
(2, 1.5) | 31.1 | 99.5 | 100 | 100 | 100 | 100 |
(2, 2) | 95 | 100 | 100 | 100 | 100 | 100 |
(3, 0.5) | 5.5 | 4.3 | 4.3 | 5.7 | 4.9 | 6.3 |
(3, 0.6) | 5.1 | 3.6 | 4.9 | 4.4 | 6.5 | 6.9 |
(3, 1) | 6.3 | 19.5 | 35.3 | 66.7 | 99.1 | 100 |
(3, 1.5) | 35.7 | 99.7 | 100 | 100 | 100 | 100 |
(3, 2) | 94.9 | 100 | 100 | 100 | 100 | 100 |
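
TABLE III. Rejection rate (%) of the Benford hypothesis for samples from inverse gamma populations with parameters (β, α), by sample size.

(β, α) | 100 | 500 | 1000 | 2000 | 5000 | 10 000 |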
---|---|---|---|---|---|---|
(1, 0.3) | 4.7 | 6 | 5.8 | 6.7 | 8.4 | 10.1 |
(1, 0.5) | 5.3 | 7.5 | 6.4 | 11.8 | 20.1 | 41 |
(1, 1) | 8 | 19.6 | 33.4 | 62.4 | 97.7 | 100 |
(1, 1.5) | 16.5 | 61.2 | 93.2 | 99.9 | 100 | 100 |
(2, 0.3) | 4.2 | 5.4 | 5.2 | 6.1 | 8 | 10 |
(2, 0.5) | 3.9 | 6.7 | 6.1 | 10.3 | 20.8 | 39.4 |
(2, 1) | 4 | 17.5 | 33.5 | 68.6 | 98.8 | 100 |
(2, 1.5) | 13.1 | 61.8 | 94.9 | 100 | 100 | 100 |
(3, 0.3) | 5.1 | 4.9 | 6.6 | 5.2 | 7.2 | 11.6 |
(3, 0.5) | 4.9 | 5.2 | 7.2 | 11.2 | 17.6 | 37.5 |
(3, 1) | 5.4 | 15.8 | 30.4 | 63 | 98 | 100 |
(3, 1.5) | 10.1 | 62.8 | 93.3 | 99.9 | 100 | 100 |
Since Hill’s question was raised, many researchers have investigated the conformity of probability distributions to Benford’s law. Leemis et al. quantified compliance with Benford’s law for some survival distributions (Ref. 11). Engel and Leuenberger proved that an exponentially distributed random variable obeys the law approximately (Ref. 12). Miller et al. proved that both the Weibull distribution and the inverse gamma distribution are almost Benford if their parameters satisfy some conditions (Refs. 13 and 14). Fasli and Scott showed that the log-normal distribution nearly conforms to Benford’s law (Ref. 15). Rodriguez also proved that the log-normal distribution is almost Benford (Ref. 16). Fang and Chen proved that several common probability distributions almost obey Benford’s law (Ref. 17).
These studies raise a question worthy of study: why are these probability distributions almost Benford? There may be some essential connection among them. Therefore, three common probability distributions are selected in this paper: the log-normal, Weibull, and inverse gamma distributions. Plotting their probability density functions (pdfs) reveals similar patterns. The graphs of their pdfs with different parameters are shown in Figs. 1–3.
From the graphs, it can be observed that each pdf f(x) is increasing on (0, a) and decreasing on (a, +∞), with a maximum f(a), so these probability distributions must share some internal properties. Gauvrit and Delahaye presented two concepts: regularity and scatter (Ref. 18). The former corresponds to the function f increasing on (0, a) and decreasing on (a, +∞), and the latter corresponds to its maximum f(a) being small. They argued that both concepts are related to whether a probability distribution conforms to Benford’s law. In this paper, both concepts are used to investigate whether the log-normal, Weibull, and inverse gamma distributions are close to Benford’s law. Moreover, statistical simulation is adopted to test their conformity to the law, as Rodriguez did in his work (Ref. 16).
This paper is organized as follows. Some basic definitions and theorems are listed in Sec. II. In Sec. III, the main theoretical results are presented. In Sec. IV, statistical simulation is used to test whether the three probability distributions conform to Benford’s law. In Sec. V, some final remarks and directions for future research are discussed.
II. BASIC PREPARATIONS
Some basic definitions and theorems about Benford’s law are listed as follows:
Definition (Ref. 19) (Significand). Any positive number $x$ can be expressed in scientific notation, that is, $x = S_B(x) \cdot B^{k(x)}$, where $B$ is the base, $S_B(x) \in [1, B)$ is the significand of $x$, and the integer $k(x)$ is the exponent.
If a number is negative, its significand is the same as that of its absolute value. Usually, the number system is decimal, but this work considers the more general case in which any integer greater than one can serve as the base. If the base is ten, the significand can be easily computed; for example, 3.1415 is the significand of 31 415. In particular, 3 is called the first significant digit (FSD) of 31 415.
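As a minimal sketch of this definition (written in Python, which is not the language used in this work; the function names `significand` and `first_significant_digit` are illustrative), the significand and FSD of a nonzero number can be computed for any integer base as follows:

```python
import math


def significand(x: float, base: int = 10) -> float:
    """Return the significand S_B(x) in [1, B) of a nonzero real number x."""
    if x == 0:
        raise ValueError("the significand of 0 is undefined")
    x = abs(x)  # a negative number shares the significand of its absolute value
    k = math.floor(math.log(x) / math.log(base))  # candidate exponent k(x)
    s = x / base ** k
    # The logarithm can be off by one ulp near exact powers of the base,
    # so pull the result back into [1, B) if needed.
    if s >= base:
        s /= base
    elif s < 1:
        s *= base
    return s


def first_significant_digit(x: float, base: int = 10) -> int:
    """Return the first significant digit of x in the given base."""
    return int(significand(x, base))


print(significand(31415))              # approximately 3.1415
print(first_significant_digit(31415))  # 3
```

The explicit range check is a design choice for robustness: a floating-point logarithm alone can misplace numbers such as 1000 by one power of the base.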
If the base is ten, the probabilities in Benford’s law can be calculated directly: Prob(FSD = 1) = Prob(1 ≤ S(X) < 2) = lg 2 − lg 1 ≈ 0.3010 and Prob(FSD = 9) = Prob(9 ≤ S(X) < 10) = 1 − lg 9 ≈ 0.0458. The probability decreases as the FSD becomes larger.
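The same calculation can be carried out for every digit, since Prob(FSD = d) = Prob(d ≤ S(X) < d + 1). A short Python sketch (the helper name `benford_fsd_probabilities` is hypothetical) tabulates these values and, under the assumption that the base-B law takes the analogous form $\log_B(1 + 1/d)$, extends to other bases:

```python
import math


def benford_fsd_probabilities(base: int = 10) -> dict:
    """Map each first significant digit d = 1, ..., base - 1 to log_base(1 + 1/d)."""
    return {d: math.log(1 + 1 / d, base) for d in range(1, base)}


probs = benford_fsd_probabilities()
for digit, p in probs.items():
    print(digit, round(p, 4))  # digit 1 gives 0.301, digit 9 gives 0.0458
```

The nine decimal probabilities telescope to $\lg 10 = 1$, which is a quick sanity check on the table.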
Theorem II.4 (Ref. 20). Any random variable $X > 0$ obeys Benford’s law if and only if the fractional part of $\log_B(X)$ is uniformly distributed on [0, 1).
Theorem II.4 states that a random variable $X$ is Benford exactly when the fractional part of $\log_B(X)$ is uniformly distributed on [0, 1). Let $F_B(z)$ denote the cumulative distribution function of the fractional part of $\log_B(X)$. Benford’s law is then equivalent to $F_B(z) = z$ or, in terms of the density, $F_B'(z) = 1$ for $z \in [0, 1)$. Investigating how far the distribution of a random variable deviates from Benford’s law therefore amounts to measuring how far its $F_B(z)$ deviates from $z$.
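One way to make this comparison concrete is to estimate $F_B$ from a sample and measure its largest gap from $z$. The following Python sketch does so with a Kolmogorov-type distance; the function name, the seed, the sample size, and the two log-normal parameter choices are illustrative assumptions, not taken from this work:

```python
import math
import random


def max_deviation_from_uniform(sample, base=10):
    """Largest gap sup_z |F_B(z) - z| for the empirical CDF of frac(log_B x)."""
    fracs = sorted(math.log(x, base) % 1.0 for x in sample)
    n = len(fracs)
    # At each sorted point u, the empirical CDF jumps from i/n to (i + 1)/n,
    # so both sides of the jump are compared with the uniform CDF value u.
    return max(max((i + 1) / n - u, u - i / n) for i, u in enumerate(fracs))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
narrow = [rng.lognormvariate(0.0, 0.3) for _ in range(20000)]
wide = [rng.lognormvariate(0.0, 1.5) for _ in range(20000)]

d_narrow = max_deviation_from_uniform(narrow)
d_wide = max_deviation_from_uniform(wide)
print(d_narrow, d_wide)  # the widely spread sample should sit much closer to uniform
```

A small deviation indicates that the sample is close to Benford in the sense of Theorem II.4; the comparison of a narrow and a wide log-normal sample anticipates the role of σ discussed later.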
III. MAIN RESULTS
Gauvrit and Delahaye gave a result stating how regularity and scatter are related to the conformity of a probability distribution to Benford’s law (Ref. 18). The following lemma is their result.
In particular, if $(X_n)$ is a sequence of continuous random variables whose densities $f_n$ satisfy these conditions and $m_n = \max(\mathrm{Id}\cdot f_n) \to 0$, where $\mathrm{Id}\cdot f_n$ denotes the function $x \mapsto x f_n(x)$, then the fractional part of $\log X_n$ converges to uniformity in [0, 1).
Based on Lemma III.1, the following results show that three probability distributions are close to Benford’s law.
It can be shown that a log-normal random variable with large σ is almost Benford.
From the above inequality, we see that a Weibull random variable with small γ is almost Benford.
Likewise, an inverse gamma random variable with small α is almost Benford.
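To make the role of the parameters concrete, the scatter quantity $m = \max_x x f(x)$ from the result of Gauvrit and Delahaye can be evaluated in closed form for the three distributions. The parameterizations below are assumptions (standard forms in which σ, γ, and α are the shape parameters of the log-normal, Weibull, and inverse gamma distributions, respectively) and may differ from those used in the theorems:

$$
\begin{aligned}
\text{log-normal:}\quad & f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-(\ln x-\mu)^2/(2\sigma^2)}, & \max_{x>0} x f(x) &= \frac{1}{\sigma\sqrt{2\pi}} \quad (\text{at } x = e^{\mu}),\\
\text{Weibull:}\quad & f(x) = \frac{\gamma}{\alpha}\left(\frac{x}{\alpha}\right)^{\gamma-1} e^{-(x/\alpha)^{\gamma}}, & \max_{x>0} x f(x) &= \frac{\gamma}{e} \quad (\text{at } x = \alpha),\\
\text{inverse gamma:}\quad & f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{-\alpha-1} e^{-\beta/x}, & \max_{x>0} x f(x) &= \frac{\alpha^{\alpha} e^{-\alpha}}{\Gamma(\alpha)} \quad (\text{at } x = \beta/\alpha).
\end{aligned}
$$

Under these assumptions, $m$ depends only on σ, γ, or α, and it tends to zero as σ → ∞, γ → 0, or α → 0 (in the last case because $\Gamma(\alpha) \approx 1/\alpha$ for small α). The parameters μ, the Weibull scale α, and β drop out of the maximum entirely.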
Compared with previous methods for investigating how far a probability distribution deviates from Benford’s law, such as Fourier analysis, the above method solves the problem easily. If the pdf of a probability distribution satisfies regularity and scatter, which can be observed from the graph of the pdf, the distribution should be close to Benford’s law for suitable parameter values.
IV. STATISTICAL SIMULATION
In the real world, if the population distribution is one of the above probability distributions, a sample from the population is expected to conform approximately to Benford’s law. However, this expectation needs to be confirmed, and the above theoretical results need to be checked, by statistical simulation. Compared with the numerical calculation method in Ref. 17, statistical simulation is more reasonable and practical. The null and alternative hypotheses are as follows:
H0: The distribution of FSD of a population is Benford.
H1: The distribution of FSD of a population is not Benford.
The six steps of the statistical simulation are as follows:
1. Set up a probability distribution with given parameters for a population and fix a sample size n.
2. Use R to generate 1000 samples of size n.
3. Determine the FSD distribution of each sample and calculate the χ² statistic between that distribution and Benford’s law.
4. Carry out the hypothesis test for each of the 1000 samples, count how many times the null hypothesis is rejected, and obtain the rejection rate.
5. Change the sample size, generate another group of 1000 samples of the new size, and repeat the above process.
6. Complete the above process for six different sample sizes, then adjust the parameters of the population distribution and repeat.
The six sample sizes are 100, 500, 1000, 2000, 5000, and 10 000, and the number of samples is 1000 in each case.
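The procedure can be sketched in a few lines. The version below is a hedged illustration in Python rather than the R code used in this work, and it rests on two assumptions that the text does not state explicitly: the test is Pearson's χ² goodness-of-fit test with 8 degrees of freedom, and the significance level is 5% (critical value 15.507). The names `first_digit`, `chi_square_vs_benford`, and `rejection_rate` are hypothetical.

```python
import math
import random

BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]
CHI2_CRIT_8DF_5PCT = 15.507  # assumed test: upper 5% point of chi-square, 8 d.o.f.


def first_digit(x):
    """Leading decimal digit of x > 0, read from 17-digit scientific notation."""
    return int(f"{x:.16e}"[0])


def chi_square_vs_benford(sample):
    """Pearson chi-square statistic of the sample's FSD counts against Benford's law."""
    counts = [0] * 9
    for x in sample:
        counts[first_digit(x) - 1] += 1
    n = len(sample)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, BENFORD))


def rejection_rate(draw, n, n_samples=1000, seed=0):
    """Percentage of n_samples samples of size n for which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_samples):
        sample = [draw(rng) for _ in range(n)]
        if chi_square_vs_benford(sample) > CHI2_CRIT_8DF_5PCT:
            rejections += 1
    return 100.0 * rejections / n_samples


# Example: log-normal population with mu = 0 and sigma = 1.5, sample size 100.
rate = rejection_rate(lambda rng: rng.lognormvariate(0.0, 1.5), n=100)
print(rate)
```

Passing the sampler as a function keeps the loop independent of the population, so the same code can be reused for other distributions and sample sizes. Results obtained this way depend on the random number generator and seed, so they should be expected to resemble, not reproduce, the reported rejection rates.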
A. Log-normal distribution
Based on Theorem III.2, the log-normal distribution approaches Benford’s law as the parameter σ increases, whereas the other parameter μ has no significant effect. Let μ take the values 0, 1, and 2, and σ the values 0.5, 1, 1.2, and 1.5, so that 12 parameter combinations are formed. For each combination, six groups of samples with different sample sizes are generated from the population, and the FSDs and the rejection rate are computed. Table I gives the results.
Table I shows that, for fixed μ, the rejection rate becomes larger as σ becomes smaller and as the sample size becomes larger. However, if σ is 1.2 or larger, the rejection rate stays close to 5% no matter how large the sample is. For fixed σ, the rejection rate is almost the same for every value of μ; that is, it is not affected by changes in μ.
The rejection rate decreases as σ increases, so whether the FSD distribution of the log-normal distribution is close to Benford’s law depends only on σ. If σ is 1.2 or larger, the log-normal distribution is almost Benford. Even then, Table I shows a nonzero rejection rate, but it is so low that a sample from such a log-normal population can be regarded as almost Benford in the real world.
B. Weibull distribution
Based on Theorem III.3, the Weibull distribution also approaches Benford’s law as the parameter γ decreases, whereas the other parameter α has no effect. As above, let α take the values 1, 2, and 3, and γ the values 0.5, 0.6, 1, 1.5, and 2, which gives the 15 parameter combinations listed in Table II. For each combination, six groups of samples with different sample sizes are generated from the population, and the FSDs and the rejection rate are computed. The results are listed in Table II.
Table II shows that, for fixed α, the rejection rate increases with the sample size when γ is 1 or larger. However, if γ is 0.6 or smaller, the rejection rate stays close to 5% no matter how large the sample is. For fixed γ, the rejection rate is almost the same for every value of α; that is, it is not affected by changes in α.
The rejection rate decreases as γ decreases. Whether the FSD distribution of the Weibull distribution is close to Benford’s law depends only on γ; if γ is not greater than 0.6, the distribution is almost Benford. Even then, Table II shows a nonzero rejection rate, but it is so low that a sample from such a Weibull population can be regarded as almost Benford in the real world.
C. Inverse gamma distribution
From Theorem III.4, the inverse gamma distribution almost obeys Benford’s law when the parameter α is small, whereas the other parameter β has no effect. Let β take the values 1, 2, and 3, and α the values 0.3, 0.5, 1, and 1.5, so that 12 parameter combinations are again formed. For each combination, six groups of samples with different sample sizes are generated from the population, and the FSDs and the rejection rate are computed. The results are listed in Table III.
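Not every environment ships an inverse gamma sampler. As a hedged sketch in Python (not the R code used in this work), such samples can be drawn as reciprocals of gamma variates; the helper name `inverse_gamma_variate` is hypothetical, and the parameterization with shape α and scale β, i.e., density proportional to $x^{-\alpha-1} e^{-\beta/x}$, is an assumption about the convention used here:

```python
import random


def inverse_gamma_variate(rng, shape, scale):
    """Draw from an inverse gamma distribution with the given shape and scale.

    If G is gamma distributed with shape `shape` and rate `scale`, then 1/G is
    inverse gamma with shape `shape` and scale `scale`.
    """
    # random.gammavariate(alpha, beta) takes shape alpha and *scale* beta,
    # so a rate of `scale` corresponds to a scale parameter of 1 / scale.
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = [inverse_gamma_variate(rng, shape=0.3, scale=1.0) for _ in range(1000)]
print(min(sample), max(sample))  # small shape values spread over many orders of magnitude
```

The reciprocal construction is chosen because the standard library already provides a gamma sampler; a quick check is that for shape greater than 1 the sample mean should approach scale / (shape − 1).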
Table III shows that, for fixed β, the rejection rate becomes larger as α becomes larger and as the sample size becomes larger. However, if α is 0.3 or smaller, the rejection rate stays low no matter how large the sample is (between about 4% and 12% for α = 0.3 in Table III). For fixed α, the rejection rate is almost the same for every value of β; that is, it is not affected by changes in β.
Clearly, the rejection rate decreases as α decreases. In other words, whether the FSD distribution of the inverse gamma distribution is close to Benford’s law depends only on α. If α is less than or equal to 0.3, the FSD distribution is almost Benford. Even then, Table III shows a nonzero rejection rate, but it is so low that a sample from such an inverse gamma population can be regarded as almost Benford in the real world.
V. CONCLUSIONS
Hill’s question is investigated for some common probability distributions. Compared with previous research methods, such as Fourier analysis, the method adopted here answers the question relatively easily. These distributions obey Benford’s law approximately if their parameters satisfy some conditions.
In addition to these mathematical proofs, statistical simulation is used to test the theoretical results, and it gives similar results. Specifically, the distributions are almost Benford when σ is at least 1.2 for the log-normal distribution, when the shape parameter is at most 0.6 for the Weibull distribution, and when the shape parameter is at most 0.3 for the inverse gamma distribution. The other parameter has almost no effect.
However, it must be pointed out that Hill’s question has only partly been answered. Many other common probability distributions remain to be investigated to determine whether they are close to Benford’s law, and the method used here can be helpful in doing so. Once many probability distributions are confirmed to be close to Benford’s law, we can say that the law occurs widely in the natural world.
ACKNOWLEDGMENTS
The author acknowledges the support from the National Natural Science Foundation of China (Grant No. 71771142).
AUTHOR DECLARATIONS
Conflict of Interest
The author has no conflicts to disclose.
Author Contributions
Guojun Fang: Conceptualization (equal); Formal analysis (equal); Investigation (equal); Methodology (equal); Software (equal); Supervision (equal); Validation (equal); Writing – original draft (equal); Writing – review & editing (equal).
DATA AVAILABILITY
The data that support the findings of this study are available within the article.