diff --git a/references/358407.358414 b/references/358407.358414
new file mode 100644
index 0000000..794f128
Binary files /dev/null and b/references/358407.358414 differ
diff --git a/references/358407.358414.1 b/references/358407.358414.1
new file mode 100644
index 0000000..df303b3
Binary files /dev/null and b/references/358407.358414.1 differ
diff --git a/references/Beta_distribution?lang=en b/references/Beta_distribution?lang=en
new file mode 100644
index 0000000..13430a4
--- /dev/null
+++ b/references/Beta_distribution?lang=en
@@ -0,0 +1,35345 @@
+Beta distribution - Wikipedia
Beta distribution

From Wikipedia, the free encyclopedia
+
+
+ + +
+ +
Beta
Probability density function
Probability density function for the Beta distribution
Cumulative distribution function
Cumulative distribution function for the Beta distribution
Notation: Beta(α, β)
Parameters: α > 0 shape (real); β > 0 shape (real)
Support: x ∈ [0, 1] or x ∈ (0, 1)
PDF: x^(α−1) (1 − x)^(β−1) / B(α, β), where B(α, β) = Γ(α)Γ(β)/Γ(α + β) and Γ is the Gamma function.
CDF: I_x(α, β) (the regularized incomplete beta function)
Mean: E[X] = α/(α + β); E[ln X] = ψ(α) − ψ(α + β); geometric mean e^(ψ(α) − ψ(α + β)) (see section: Geometric mean), where ψ is the digamma function
Median: I^(−1)_(1/2)(α, β) (no general closed form); ≈ (α − 1/3)/(α + β − 2/3) for α, β > 1
Mode: (α − 1)/(α + β − 2) for α, β > 1; any value in (0, 1) for α = β = 1; {0, 1} (bimodal) for α, β < 1; 0 for α ≤ 1, β > 1; 1 for α > 1, β ≤ 1
Variance: αβ / [(α + β)²(α + β + 1)]; var[ln X] = ψ1(α) − ψ1(α + β) (see trigamma function and section: Geometric variance)
Skewness: 2(β − α)√(α + β + 1) / [(α + β + 2)√(αβ)]
Ex. kurtosis: 6[(α − β)²(α + β + 1) − αβ(α + β + 2)] / [αβ(α + β + 2)(α + β + 3)]
Entropy: ln B(α, β) − (α − 1)ψ(α) − (β − 1)ψ(β) + (α + β − 2)ψ(α + β)
MGF: 1 + Σ_(k=1)^∞ [Π_(r=0)^(k−1) (α + r)/(α + β + r)] t^k/k!
CF: 1F1(α; α + β; it) (see Confluent hypergeometric function)
Fisher information: see section: Fisher information matrix
Method of Moments: α = μ(μ(1 − μ)/v − 1), β = (1 − μ)(μ(1 − μ)/v − 1), where μ is the sample mean and v the sample variance

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] or (0, 1) in terms of two positive parameters, denoted by alpha (α) and beta (β), that appear as exponents of the variable and its complement to 1, respectively, and control the shape of the distribution. +

The beta distribution has been applied to model the behavior of random variables limited to intervals of finite length in a wide variety of disciplines. The beta distribution is a suitable model for the random behavior of percentages and proportions. +

In Bayesian inference, the beta distribution is the conjugate prior probability distribution for the Bernoulli, binomial, negative binomial and geometric distributions. +

The formulation of the beta distribution discussed here is also known as the beta distribution of the first kind, whereas beta distribution of the second kind is an alternative name for the beta prime distribution. The generalization to multiple variables is called a Dirichlet distribution. +

+ +

Definitions

Probability density function

+
An animation of the Beta distribution for different values of its parameters.
+

The probability density function (PDF) of the beta distribution, for 0 ≤ x ≤ 1 or 0 < x < 1, and shape parameters α, β > 0, is a power function of the variable x and of its reflection (1 − x) as follows: +

f(x; α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β) = [Γ(α + β) / (Γ(α) Γ(β))] x^(α−1) (1 − x)^(β−1),  for 0 < x < 1

where Γ(z) is the gamma function. The beta function, B(α, β) = Γ(α)Γ(β)/Γ(α + β), is a normalization constant to ensure that the total probability is 1. In the above equations x is a realization—an observed value that actually occurred—of a random variable X. +
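As an illustrative check of this definition (not part of the original article), a minimal Python sketch using only the standard library evaluates the density directly from the gamma function; the helper name beta_pdf is hypothetical:

from math import exp, lgamma, log

def beta_pdf(x, a, b):
    # f(x) = x^(a-1) * (1-x)^(b-1) / B(a, b), with B(a, b) = Gamma(a)Gamma(b)/Gamma(a+b).
    # Computed in log space for numerical stability with large shape parameters.
    if not (0.0 < x < 1.0):
        raise ValueError("x must lie in (0, 1)")
    log_b = lgamma(a) + lgamma(b) - lgamma(a + b)   # ln B(a, b)
    return exp((a - 1.0) * log(x) + (b - 1.0) * log(1.0 - x) - log_b)

# Example: Beta(2, 2) is the parabolic density 6x(1-x), so beta_pdf(0.5, 2, 2) returns 1.5.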

Several authors, including N. L. Johnson and S. Kotz,[1] use the symbols p and q (instead of α and β) for the shape parameters of the beta distribution, reminiscent of the symbols traditionally used for the parameters of the Bernoulli distribution, because the beta distribution approaches the Bernoulli distribution in the limit when both shape parameters α and β approach the value of zero. +

In the following, a random variable X beta-distributed with parameters α and β will be denoted by:[2][3] +

X ~ Beta(α, β)

Other notations for beta-distributed random variables used in the statistical literature are [4] and .[5] +

+

Cumulative distribution function

+
CDF for symmetric beta distribution vs. x and α = β
+
CDF for skewed beta distribution vs. x and β = 5α
+

The cumulative distribution function is +

F(x; α, β) = B(x; α, β) / B(α, β) = I_x(α, β)

where B(x; α, β) is the incomplete beta function and I_x(α, β) is the regularized incomplete beta function. +
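As a small, hedged illustration (assuming SciPy is available; not part of the original article), the regularized incomplete beta function and the beta CDF can be evaluated and compared:

from scipy.special import betainc   # regularized incomplete beta I_x(a, b)
from scipy.stats import beta

a, b, x = 2.0, 5.0, 0.3
print(betainc(a, b, x))   # I_x(a, b)
print(beta.cdf(x, a, b))  # CDF of Beta(a, b); both lines should print the same value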

+

Alternative parameterizations

Two parameters

Mean and sample size
+

The beta distribution may also be reparameterized in terms of its mean μ (0 < μ < 1) and the sum of the two shape parameters ν = α + β > 0 ([3] p. 83). Denoting by αPosterior and βPosterior the shape parameters of the posterior beta distribution resulting from applying Bayes' theorem to a binomial likelihood function and a prior probability, the interpretation of the sum of both shape parameters as sample size = ν = αPosterior + βPosterior is only correct for the Haldane prior probability Beta(0,0). Specifically, for the Bayes (uniform) prior Beta(1,1) the correct interpretation would be sample size = αPosterior + βPosterior − 2, or ν = (sample size) + 2. For sample sizes much larger than 2, the difference between these two priors becomes negligible. (See section Bayesian inference for further details.) ν = α + β is referred to as the "sample size" of a Beta distribution, but one should remember that it is, strictly speaking, the "sample size" of a binomial likelihood function only when using a Haldane Beta(0,0) prior in Bayes' theorem. +

This parametrization may be useful in Bayesian parameter estimation. For example, one may administer a test to a number of individuals. If it is assumed that each person's score (0 ≤ θ ≤ 1) is drawn from a population-level Beta distribution, then an important statistic is the mean of this population-level distribution. The mean and sample size parameters are related to the shape parameters α and β via[3] +

+
α = μν, β = (1 − μ)ν
+

Under this parametrization, one may place an uninformative prior probability over the mean, and a vague prior probability (such as an exponential or gamma distribution) over the positive reals for the sample size, if they are independent, and prior data and/or beliefs justify it. +
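A minimal sketch of this reparametrization (illustrative Python; the helper names are assumptions, not standard API):

def shape_from_mean_size(mu, nu):
    # alpha = mu * nu, beta = (1 - mu) * nu, as in the relations above.
    if not (0.0 < mu < 1.0) or nu <= 0.0:
        raise ValueError("need 0 < mu < 1 and nu > 0")
    return mu * nu, (1.0 - mu) * nu

def mean_size_from_shape(alpha, beta):
    # Inverse mapping: mu = alpha / (alpha + beta), nu = alpha + beta.
    nu = alpha + beta
    return alpha / nu, nu

# Example: mu = 0.3 and nu = 10 correspond to Beta(3, 7).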

+
Mode and concentration
+

Concave beta distributions, which have α, β > 1, can be parametrized in terms of mode and "concentration". The mode, ω = (α − 1)/(α + β − 2), and concentration, κ = α + β, can be used to define the usual shape parameters as follows:[6] +

α = ω(κ − 2) + 1
β = (1 − ω)(κ − 2) + 1

For the mode, 0 < ω < 1, to be well-defined, we need α, β > 1, or equivalently κ > 2. If instead we define the concentration as c = κ − 2, the condition simplifies to c > 0, and the beta density at α = ωc + 1 and β = (1 − ω)c + 1 can be written as: +

f(x; ω, c) = x^(ωc) (1 − x)^((1 − ω)c) / B(ωc + 1, (1 − ω)c + 1)

where c directly scales the sufficient statistics, ln(x) and ln(1 − x). Note also that in the limit c → 0 (equivalently κ → 2), the distribution becomes flat. +
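For illustration only (a sketch under the assumptions above, with hypothetical helper names), the mode/concentration parametrization converts to the usual shape parameters as follows:

def shape_from_mode_concentration(omega, kappa):
    # alpha = omega*(kappa - 2) + 1, beta = (1 - omega)*(kappa - 2) + 1,
    # valid for 0 < omega < 1 and kappa > 2 (concave densities).
    if not (0.0 < omega < 1.0) or kappa <= 2.0:
        raise ValueError("need 0 < omega < 1 and kappa > 2")
    c = kappa - 2.0
    return omega * c + 1.0, (1.0 - omega) * c + 1.0

# Example: mode 0.25 with concentration 10 gives Beta(3, 7),
# whose mode (alpha - 1)/(alpha + beta - 2) is indeed 2/8 = 0.25.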

+
Mean and variance
+

Solving the system of (coupled) equations given in the above sections as the equations for the mean and the variance of the beta distribution in terms of the original parameters α and β, one can express the α and β parameters in terms of the mean (μ) and the variance (var): +

α = μ(μ(1 − μ)/var − 1), provided var < μ(1 − μ)
β = (1 − μ)(μ(1 − μ)/var − 1), provided var < μ(1 − μ)

This parametrization of the beta distribution may lead to a more intuitive understanding than the one based on the original parameters α and β. For example, by expressing the mode, skewness, excess kurtosis and differential entropy in terms of the mean and the variance: +

+ + + +
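As a brief illustration of the mean/variance parametrization (a sketch, not from the original article; the helper name is hypothetical):

def shape_from_mean_variance(mu, var):
    # alpha = mu*(mu*(1 - mu)/var - 1), beta = (1 - mu)*(mu*(1 - mu)/var - 1),
    # which requires 0 < var < mu*(1 - mu); otherwise no beta distribution matches.
    if not (0.0 < mu < 1.0) or not (0.0 < var < mu * (1.0 - mu)):
        raise ValueError("need 0 < mu < 1 and 0 < var < mu*(1 - mu)")
    common = mu * (1.0 - mu) / var - 1.0
    return mu * common, (1.0 - mu) * common

# Example: mu = 0.5 and var = 1/20 recover alpha = beta = 2 (the parabolic case).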

+

Four parameters

+

A beta distribution with the two shape parameters α and β is supported on the range [0,1] or (0,1). It is possible to alter the location and scale of the distribution by introducing two further parameters representing the minimum, a, and maximum c (c > a), values of the distribution,[1] by a linear transformation substituting the non-dimensional variable x in terms of the new variable y (with support [a,c] or (a,c)) and the parameters a and c: +

y = a + (c − a)x,  equivalently  x = (y − a)/(c − a)

The probability density function of the four parameter beta distribution is equal to the two parameter distribution, scaled by the range (c − a), (so that the total area under the density curve equals a probability of one), and with the "y" variable shifted and scaled as follows: +

f(y; α, β, a, c) = f(x; α, β)/(c − a) = (y − a)^(α−1) (c − y)^(β−1) / [(c − a)^(α+β−1) B(α, β)]

That a random variable Y is Beta-distributed with four parameters α, β, a, and c will be denoted by: +

Y ~ Beta(α, β, a, c)

Some measures of central location are scaled (by (c − a)) and shifted (by a), as follows: +

mean(Y) = a + (c − a) α/(α + β)
mode(Y) = a + (c − a) (α − 1)/(α + β − 2), for α, β > 1
median(Y) = a + (c − a) median(X), where X ~ Beta(α, β)

Note: the geometric mean and harmonic mean cannot be transformed by a linear transformation in the way that the mean, median and mode can. A short sketch of the shifted and scaled density follows below. +
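A hedged sketch of the four-parameter density (assuming SciPy; the loc/scale usage mirrors the linear change of variable described above):

from scipy.stats import beta

def four_param_pdf(y, a_shape, b_shape, lo, hi):
    # Substitute x = (y - lo)/(hi - lo) and divide by the range (hi - lo)
    # so the shifted/scaled density still integrates to one over [lo, hi].
    x = (y - lo) / (hi - lo)
    return beta.pdf(x, a_shape, b_shape) / (hi - lo)

# Equivalently, SciPy's built-in location/scale arguments do the same rescaling:
# beta.pdf(y, a_shape, b_shape, loc=lo, scale=hi - lo)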

The shape parameters of Y can be written in terms of its mean and variance as +

+
+

The statistical dispersion measures are scaled (they do not need to be shifted because they are already centered on the mean) by the range (c − a), linearly for the mean deviation and nonlinearly for the variance: +

+
+
+
+

Since the skewness and excess kurtosis are non-dimensional quantities (as moments centered on the mean and normalized by the standard deviation), they are independent of the parameters a and c, and therefore equal to the expressions given above in terms of X (with support [0,1] or (0,1)): +

+
+
+

Properties

Measures of central tendency

Mode

+

The mode of a Beta distributed random variable X with α, β > 1 is the most likely value of the distribution (corresponding to the peak in the PDF), and is given by the following expression:[1] +

mode = (α − 1)/(α + β − 2)

When both parameters are less than one (α, β < 1), this is the anti-mode: the lowest point of the probability density curve.[7] +

Letting α = β, the expression for the mode simplifies to 1/2, showing that for α = β > 1 the mode (resp. anti-mode when α, β < 1), is at the center of the distribution: it is symmetric in those cases. See Shapes section in this article for a full list of mode cases, for arbitrary values of α and β. For several of these cases, the maximum value of the density function occurs at one or both ends. In some cases the (maximum) value of the density function occurring at the end is finite. For example, in the case of α = 2, β = 1 (or α = 1, β = 2), the density function becomes a right-triangle distribution which is finite at both ends. In several other cases there is a singularity at one end, where the value of the density function approaches infinity. For example, in the case α = β = 1/2, the Beta distribution simplifies to become the arcsine distribution. There is debate among mathematicians about some of these cases and whether the ends (x = 0, and x = 1) can be called modes or not.[8][2] +

+
Mode for Beta distribution for 1 ≤ α ≤ 5 and 1 ≤ β ≤ 5
+
  • Whether the ends are part of the domain of the density function
  • +
  • Whether a singularity can ever be called a mode
  • +
  • Whether cases with two maxima should be called bimodal
+

Median

+
Median for Beta distribution for 0 ≤ α ≤ 5 and 0 ≤ β ≤ 5
+
(Mean–median) for Beta distribution versus alpha and beta from 0 to 2
+

The median of the beta distribution is the unique real number x = I^(−1)_(1/2)(α, β) for which the regularized incomplete beta function satisfies I_x(α, β) = 1/2. There is no general closed-form expression for the median of the beta distribution for arbitrary values of α and β. Closed-form expressions for particular values of the parameters α and β follow:[citation needed] +

+
  • For symmetric cases α = β, median = 1/2.
  • +
  • For α = 1 and β > 0, median = 1 − 2^(−1/β) (this case is the mirror-image of the power function [0,1] distribution)
  • +
  • For α > 0 and β = 1, median = 2^(−1/α) (this case is the power function [0,1] distribution[8])
  • +
  • For α = 3 and β = 2, median = 0.6142724318676105..., the real solution to the quartic equation 1 − 8x³ + 6x⁴ = 0, which lies in [0,1].
  • +
  • For α = 2 and β = 3, median = 0.38572756813238945... = 1 − median(Beta(3, 2))
+

The following are the limits with one parameter finite (non-zero) and the other approaching these limits:[citation needed] +

+
+

A reasonable approximation of the value of the median of the beta distribution, for both α and β greater or equal to one, is given by the formula[9] +

median ≈ (α − 1/3)/(α + β − 2/3)  for α, β ≥ 1

When α, β ≥ 1, the relative error (the absolute error divided by the median) in this approximation is less than 4% and for both α ≥ 2 and β ≥ 2 it is less than 1%. The absolute error divided by the difference between the mean and the mode is similarly small: +

Abs[(Median − Appr.)/Median] for Beta distribution for 1 ≤ α ≤ 5 and 1 ≤ β ≤ 5
Abs[(Median − Appr.)/(Mean − Mode)] for Beta distribution for 1 ≤ α ≤ 5 and 1 ≤ β ≤ 5 +
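To see the size of this error numerically, a short sketch (Python with SciPy assumed; illustrative only) compares the approximation with the numerically inverted CDF:

from scipy.stats import beta

def median_approx(a, b):
    # Closed-form approximation quoted above, intended for a, b >= 1.
    return (a - 1.0 / 3.0) / (a + b - 2.0 / 3.0)

for a, b in [(1, 1), (2, 3), (5, 2), (10, 30)]:
    exact = beta.ppf(0.5, a, b)            # median via the inverse CDF
    approx = median_approx(a, b)
    print(a, b, exact, approx, abs(exact - approx) / exact)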

+

Mean

+
Mean for Beta distribution for 0 ≤ α ≤ 5 and 0 ≤ β ≤ 5
+

The expected value (mean) (μ) of a Beta distribution random variable X with two parameters α and β is a function of only the ratio β/α of these parameters:[1] +

μ = E[X] = α/(α + β) = 1/(1 + β/α)

Letting α = β in the above expression one obtains μ = 1/2, showing that for α = β the mean is at the center of the distribution: it is symmetric. Also, the following limits can be obtained from the above expression: +

+
+

Therefore, for β/α → 0, or for α/β → ∞, the mean is located at the right end, x = 1. For these limit ratios, the beta distribution becomes a one-point degenerate distribution with a Dirac delta function spike at the right end, x = 1, with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the right end, x = 1. +

Similarly, for β/α → ∞, or for α/β → 0, the mean is located at the left end, x = 0. The beta distribution becomes a 1-point Degenerate distribution with a Dirac delta function spike at the left end, x = 0, with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the left end, x = 0. Following are the limits with one parameter finite (non-zero) and the other approaching these limits: +

+
+

While for typical unimodal distributions (with centrally located modes, inflexion points at both sides of the mode, and longer tails) (with Beta(αβ) such that α, β > 2) it is known that the sample mean (as an estimate of location) is not as robust as the sample median, the opposite is the case for uniform or "U-shaped" bimodal distributions (with Beta(αβ) such that α, β ≤ 1), with the modes located at the ends of the distribution. As Mosteller and Tukey remark ([10] p. 207) "the average of the two extreme observations uses all the sample information. This illustrates how, for short-tailed distributions, the extreme observations should get more weight." By contrast, it follows that the median of "U-shaped" bimodal distributions with modes at the edge of the distribution (with Beta(αβ) such that α, β ≤ 1) is not robust, as the sample median drops the extreme sample observations from consideration. A practical application of this occurs for example for random walks, since the probability for the time of the last visit to the origin in a random walk is distributed as the arcsine distribution Beta(1/2, 1/2):[5][11] the mean of a number of realizations of a random walk is a much more robust estimator than the median (which is an inappropriate sample measure estimate in this case). +

+

Geometric mean

+
(Mean − GeometricMean) for Beta distribution versus α and β from 0 to 2, showing the asymmetry between α and β for the geometric mean
+
Geometric means for Beta distribution Purple = G(x), Yellow = G(1 − x), smaller values α and β in front
+
Geometric means for Beta distribution. purple = G(x), yellow = G(1 − x), larger values α and β in front
+

The logarithm of the geometric mean GX of a distribution with random variable X is the arithmetic mean of ln(X), or, equivalently, its expected value: +

ln G_X = E[ln X]

For a beta distribution, the expected value integral gives: +

E[ln X] = ψ(α) − ψ(α + β)

where ψ is the digamma function. +

Therefore, the geometric mean of a beta distribution with shape parameters α and β is the exponential of the digamma functions of α and β as follows: +

G_X = e^(E[ln X]) = e^(ψ(α) − ψ(α + β))

While for a beta distribution with equal shape parameters α = β, it follows that skewness = 0 and mode = mean = median = 1/2, the geometric mean is less than 1/2: 0 < GX < 1/2. The reason for this is that the logarithmic transformation strongly weights the values of X close to zero, as ln(X) strongly tends towards negative infinity as X approaches zero, while ln(X) flattens towards zero as X → 1. +

Along a line α = β, the following limits apply: +

+
+

Following are the limits with one parameter finite (non-zero) and the other approaching these limits: +

+
+

The accompanying plot shows the difference between the mean and the geometric mean for shape parameters α and β from zero to 2. Besides the fact that the difference between them approaches zero as α and β approach infinity and that the difference becomes large for values of α and β approaching zero, one can observe an evident asymmetry of the geometric mean with respect to the shape parameters α and β. The difference between the geometric mean and the mean is larger for small values of α in relation to β than when exchanging the magnitudes of β and α. +

N. L.Johnson and S. Kotz[1] suggest the logarithmic approximation to the digamma function ψ(α) ≈ ln(α − 1/2) which results in the following approximation to the geometric mean: +

G_X ≈ (α − 1/2)/(α + β − 1/2)  for α, β > 1

Numerical values for the relative error in this approximation follow: [(α = β = 1): 9.39%]; [(α = β = 2): 1.29%]; [(α = 2, β = 3): 1.51%]; [(α = 3, β = 2): 0.44%]; [(α = β = 3): 0.51%]; [(α = β = 4): 0.26%]; [(α = 3, β = 4): 0.55%]; [(α = 4, β = 3): 0.24%]. +
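These relative errors can be reproduced with a few lines (SciPy assumed; a sketch, not part of the article):

from math import exp
from scipy.special import digamma

def geometric_mean(a, b):
    # G_X = exp(psi(a) - psi(a + b)), from E[ln X] = psi(a) - psi(a + b).
    return exp(digamma(a) - digamma(a + b))

def geometric_mean_approx(a, b):
    # Approximation based on psi(a) ~ ln(a - 1/2) for larger a.
    return (a - 0.5) / (a + b - 0.5)

for a, b in [(1, 1), (2, 2), (2, 3), (3, 2), (4, 4)]:
    g, g_hat = geometric_mean(a, b), geometric_mean_approx(a, b)
    print(a, b, g, g_hat, abs(g - g_hat) / g)   # ~9.4% error at (1, 1), ~0.26% at (4, 4)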

Similarly, one can calculate the value of shape parameters required for the geometric mean to equal 1/2. Given the value of the parameter β, what would be the value of the other parameter, α, required for the geometric mean to equal 1/2? The answer is that (for β > 1) the value of α required tends towards β + 1/2 as β → ∞. For example, all these couples have the same geometric mean of 1/2: [β = 1, α = 1.4427], [β = 2, α = 2.46958], [β = 3, α = 3.47943], [β = 4, α = 4.48449], [β = 5, α = 5.48756], [β = 10, α = 10.4938], [β = 100, α = 100.499]. +

The fundamental property of the geometric mean, which can be proven to be false for any other mean, is +

G(X_i / Y_i) = G(X_i) / G(Y_i)

This makes the geometric mean the only correct mean when averaging normalized results, that is, results that are presented as ratios to reference values.[12] This is relevant because the beta distribution is a suitable model for the random behavior of percentages and it is particularly suitable to the statistical modelling of proportions. The geometric mean plays a central role in maximum likelihood estimation, see section "Parameter estimation, maximum likelihood." Actually, when performing maximum likelihood estimation, besides the geometric mean GX based on the random variable X, another geometric mean appears naturally: the geometric mean based on the linear transformation (1 − X), the mirror-image of X, denoted by G(1−X): +

G_(1−X) = e^(E[ln(1 − X)]) = e^(ψ(β) − ψ(α + β))

Along a line α = β, the following limits apply: +

+
+

Following are the limits with one parameter finite (non-zero) and the other approaching these limits: +

+
+

It has the following approximate value: +

G_(1−X) ≈ (β − 1/2)/(α + β − 1/2)  for α, β > 1

Although both GX and G(1−X) are asymmetric, in the case that both shape parameters are equal α = β, the geometric means are equal: GX = G(1−X). This equality follows from the following symmetry displayed between both geometric means: +

G_X(α, β) = G_(1−X)(β, α)

Harmonic mean

+
Harmonic mean for beta distribution for 0 < α < 5 and 0 < β < 5
+
Harmonic mean for beta distribution versus α and β from 0 to 2
+
Harmonic means for beta distribution Purple = H(X), Yellow = H(1 − X), smaller values α and β in front
+
Harmonic Means for Beta distribution Purple = H(X), Yellow = H(1 − X), larger values α and β in front
+

The inverse of the harmonic mean (HX) of a distribution with random variable X is the arithmetic mean of 1/X, or, equivalently, its expected value. Therefore, the harmonic mean (HX) of a beta distribution with shape parameters α and β is: +

H_X = 1/E[1/X] = (α − 1)/(α + β − 1)  if α > 1 and β > 0

The harmonic mean (HX) of a Beta distribution with α < 1 is undefined, because its defining expression is not bounded in [0, 1] for shape parameter α less than unity. +

Letting α = β in the above expression one obtains +

H_X = (α − 1)/(2α − 1)

showing that for α = β the harmonic mean ranges from 0, for α = β = 1, to 1/2, for α = β → ∞. +

Following are the limits with one parameter finite (non-zero) and the other approaching these limits: +

+
+

The harmonic mean plays a role in maximum likelihood estimation for the four parameter case, in addition to the geometric mean. Actually, when performing maximum likelihood estimation for the four parameter case, besides the harmonic mean HX based on the random variable X, also another harmonic mean appears naturally: the harmonic mean based on the linear transformation (1 − X), the mirror-image of X, denoted by H1 − X: +

H_(1−X) = (β − 1)/(α + β − 1)  if β > 1 and α > 0

The harmonic mean (H(1 − X)) of a Beta distribution with β < 1 is undefined, because its defining expression is not bounded in [0, 1] for shape parameter β less than unity. +

Letting α = β in the above expression one obtains +

+
+

showing that for α = β the harmonic mean ranges from 0, for α = β = 1, to 1/2, for α = β → ∞. +

Following are the limits with one parameter finite (non-zero) and the other approaching these limits: +

+
+

Although both HX and H1−X are asymmetric, in the case that both shape parameters are equal α = β, the harmonic means are equal: HX = H1−X. This equality follows from the following symmetry displayed between both harmonic means: +

+
+

Measures of statistical dispersion

Variance

+

The variance (the second moment centered on the mean) of a Beta distribution random variable X with parameters α and β is:[1][13] +

var(X) = E[(X − μ)²] = αβ / [(α + β)² (α + β + 1)]

Letting α = β in the above expression one obtains +

var(X) = 1/(4(2α + 1))

showing that for α = β the variance decreases monotonically as α = β increases. Setting α = β = 0 in this expression, one finds the maximum variance var(X) = 1/4[1] which only occurs approaching the limit, at α = β = 0. +

The beta distribution may also be parametrized in terms of its mean μ (0 < μ < 1) and sample size ν = α + β (ν > 0) (see subsection Mean and sample size): +

α = μν, β = (1 − μ)ν

Using this parametrization, one can express the variance in terms of the mean μ and the sample size ν as follows: +

var(X) = μ(1 − μ)/(1 + ν)

Since ν = α + β > 0, it follows that var(X) < μ(1 − μ). +

For a symmetric distribution, the mean is at the middle of the distribution, μ = 1/2, and therefore: +

var(X) = 1/(4(1 + ν))

Also, the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions: +

+
+

+

+

Geometric variance and covariance

+
log geometric variances vs. α and β
+
log geometric variances vs. α and β
+

The logarithm of the geometric variance, ln(varGX), of a distribution with random variable X is the second moment of the logarithm of X centered on the geometric mean of X, ln(GX): +

+
+

and therefore, the geometric variance is: +

+
+

In the Fisher information matrix, and the curvature of the log likelihood function, the logarithm of the geometric variance of the reflected variable 1 − X and the logarithm of the geometric covariance between X and 1 − X appear: +

+
+

For a beta distribution, higher order logarithmic moments can be derived by using the representation of a beta distribution as a proportion of two Gamma distributions and differentiating through the integral. They can be expressed in terms of higher order poly-gamma functions. See the section § Moments of logarithmically transformed random variables. The variance of the logarithmic variables and covariance of ln X and ln(1−X) are: +

var[ln X] = E[(ln X)²] − (E[ln X])² = ψ1(α) − ψ1(α + β)
var[ln(1 − X)] = ψ1(β) − ψ1(α + β)
cov[ln X, ln(1 − X)] = −ψ1(α + β)

where the trigamma function, denoted ψ1(α), is the second of the polygamma functions, and is defined as the derivative of the digamma function: +

ψ1(α) = dψ(α)/dα = d² ln Γ(α)/dα²

Therefore, +

+
+
+
+

The accompanying plots show the log geometric variances and log geometric covariance versus the shape parameters α and β. The plots show that the log geometric variances and log geometric covariance are close to zero for shape parameters α and β greater than 2, and that the log geometric variances rapidly rise in value for shape parameter values α and β less than unity. The log geometric variances are positive for all values of the shape parameters. The log geometric covariance is negative for all values of the shape parameters, and it reaches large negative values for α and β less than unity. +
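The quantities in these plots are straightforward to evaluate with the trigamma function (SciPy assumed; a sketch with hypothetical helper names):

from scipy.special import polygamma

def trigamma(x):
    # psi_1(x): the first polygamma function, i.e. the derivative of digamma.
    return polygamma(1, x)

def log_geometric_variance_X(a, b):
    # ln var_GX = var[ln X] = psi_1(a) - psi_1(a + b)
    return trigamma(a) - trigamma(a + b)

def log_geometric_covariance(a, b):
    # ln cov_G(X, 1-X) = cov[ln X, ln(1 - X)] = -psi_1(a + b)
    return -trigamma(a + b)

# Consistent with the plots: both quantities are close to zero for a, b > 2.
print(log_geometric_variance_X(3, 3), log_geometric_covariance(3, 3))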

Following are the limits with one parameter finite (non-zero) and the other approaching these limits: +

+
+

Limits with two parameters varying: +

+
+

Although both ln(varGX) and ln(varG(1 − X)) are asymmetric, when the shape parameters are equal, α = β, one has: ln(varGX) = ln(varG(1−X)). This equality follows from the following symmetry displayed between both log geometric variances: +

+
+

The log geometric covariance is symmetric: +

+
+

Mean absolute deviation around the mean

+
Ratio of Mean Abs.Dev. to Std.Dev. for Beta distribution with α and β ranging from 0 to 5
+
Ratio of Mean Abs.Dev. to Std.Dev. for Beta distribution with mean 0 ≤ μ ≤ 1 and sample size 0 < ν ≤ 10
+

The mean absolute deviation around the mean for the beta distribution with shape parameters α and β is:[8] +

E[|X − E[X]|] = 2 α^α β^β / [B(α, β) (α + β)^(α+β+1)]

The mean absolute deviation around the mean is a more robust estimator of statistical dispersion than the standard deviation for beta distributions with tails and inflection points at each side of the mode, that is, Beta(α, β) distributions with α, β > 2, as it depends on the linear (absolute) deviations rather than the squared deviations from the mean. The effect of very large deviations from the mean is therefore not as heavily weighted. +

Using Stirling's approximation to the Gamma function, N.L.Johnson and S.Kotz[1] derived the following approximation for values of the shape parameters greater than unity (the relative error for this approximation is only −3.5% for α = β = 1, and it decreases to zero as α → ∞, β → ∞): +

+
+

At the limit α → ∞, β → ∞, the ratio of the mean absolute deviation to the standard deviation (for the beta distribution) becomes equal to the ratio of the same measures for the normal distribution: √(2/π). For α = β = 1 this ratio equals √3/2, so that from α = β = 1 to α, β → ∞ the ratio decreases by 8.5%. For α = β = 0 the standard deviation is exactly equal to the mean absolute deviation around the mean. Therefore, this ratio decreases by 15% from α = β = 0 to α = β = 1, and by 25% from α = β = 0 to α, β → ∞. However, for skewed beta distributions such that α → 0 or β → 0, the ratio of the standard deviation to the mean absolute deviation approaches infinity (although each of them, individually, approaches zero) because the mean absolute deviation approaches zero faster than the standard deviation. +

Using the parametrization in terms of mean μ and sample size ν = α + β > 0: +

+
α = μν, β = (1−μ)ν
+

one can express the mean absolute deviation around the mean in terms of the mean μ and the sample size ν as follows: +

+
+

For a symmetric distribution, the mean is at the middle of the distribution, μ = 1/2, and therefore: +

+
+

Also, the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions: +

+
+

Mean absolute difference

+

The mean absolute difference for the Beta distribution is: +

+
+

The Gini coefficient for the Beta distribution is half of the relative mean absolute difference: +

+
+

Skewness

+
Skewness for Beta Distribution as a function of variance and mean
+

The skewness (the third moment centered on the mean, normalized by the 3/2 power of the variance) of the beta distribution is[1] +

γ1 = E[(X − μ)³]/(var(X))^(3/2) = 2(β − α)√(α + β + 1) / [(α + β + 2)√(αβ)]

Letting α = β in the above expression one obtains γ1 = 0, showing once again that for α = β the distribution is symmetric and hence the skewness is zero. Positive skew (right-tailed) for α < β, negative skew (left-tailed) for α > β. +
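The closed form can be cross-checked against a library implementation (SciPy assumed; an illustrative sketch):

from math import sqrt
from scipy.stats import beta

def beta_skewness(a, b):
    # gamma_1 = 2 (b - a) sqrt(a + b + 1) / ((a + b + 2) sqrt(a*b))
    return 2.0 * (b - a) * sqrt(a + b + 1.0) / ((a + b + 2.0) * sqrt(a * b))

a, b = 2.0, 5.0
mean, var, skew, ex_kurt = beta.stats(a, b, moments='mvsk')
print(beta_skewness(a, b), skew)   # the two skewness values should agree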

Using the parametrization in terms of mean μ and sample size ν = α + β: +

α = μν, β = (1 − μ)ν

one can express the skewness in terms of the mean μ and the sample size ν as follows: +

γ1 = 2(1 − 2μ)√(1 + ν) / [(2 + ν)√(μ(1 − μ))]

The skewness can also be expressed just in terms of the variance var and the mean μ as follows: +

γ1 = 2(1 − 2μ)√var / (μ(1 − μ) + var)

The accompanying plot of skewness as a function of variance and mean shows that maximum variance (1/4) is coupled with zero skewness and the symmetry condition (μ = 1/2), and that maximum skewness (positive or negative infinity) occurs when the mean is located at one end or the other, so that the "mass" of the probability distribution is concentrated at the ends (minimum variance). +

The following expression for the square of the skewness, in terms of the sample size ν = α + β and the variance var, is useful for the method of moments estimation of four parameters: +

(skewness)² = 4[1 − 4(1 + ν)var] / [(2 + ν)² var]

This expression correctly gives a skewness of zero for α = β, since in that case (see § Variance) var(X) = 1/(4(1 + ν)). +

For the symmetric case (α = β), skewness = 0 over the whole range, and the following limits apply: +

+
+

For the asymmetric cases (α ≠ β) the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions: +

+
+

+

+

Kurtosis

+
Excess Kurtosis for Beta Distribution as a function of variance and mean
+

The beta distribution has been applied in acoustic analysis to assess damage to gears, as the kurtosis of the beta distribution has been reported to be a good indicator of the condition of a gear.[14] Kurtosis has also been used to distinguish the seismic signal generated by a person's footsteps from other signals. As persons or other targets moving on the ground generate continuous signals in the form of seismic waves, one can separate different targets based on the seismic waves they generate. Kurtosis is sensitive to impulsive signals, so it's much more sensitive to the signal generated by human footsteps than other signals generated by vehicles, winds, noise, etc.[15] Unfortunately, the notation for kurtosis has not been standardized. Kenney and Keeping[16] use the symbol γ2 for the excess kurtosis, but Abramowitz and Stegun[17] use different terminology. To prevent confusion[18] between kurtosis (the fourth moment centered on the mean, normalized by the square of the variance) and excess kurtosis, when using symbols, they will be spelled out as follows:[8][19] +

excess kurtosis = 6[(α − β)²(α + β + 1) − αβ(α + β + 2)] / [αβ(α + β + 2)(α + β + 3)]

Letting α = β in the above expression one obtains +

excess kurtosis = −6/(3 + 2α).
+

Therefore, for symmetric beta distributions, the excess kurtosis is negative, increasing from a minimum value of −2 at the limit as {α = β} → 0, and approaching a maximum value of zero as {α = β} → ∞. The value of −2 is the minimum value of excess kurtosis that any distribution (not just beta distributions, but any distribution of any possible kind) can ever achieve. This minimum value is reached when all the probability density is entirely concentrated at each end x = 0 and x = 1, with nothing in between: a 2-point Bernoulli distribution with equal probability 1/2 at each end (a coin toss: see section below "Kurtosis bounded by the square of the skewness" for further discussion). The description of kurtosis as a measure of the "potential outliers" (or "potential rare, extreme values") of the probability distribution, is correct for all distributions including the beta distribution. When rare, extreme values can occur in the beta distribution, the higher its kurtosis; otherwise, the kurtosis is lower. For α ≠ β, skewed beta distributions, the excess kurtosis can reach unlimited positive values (particularly for α → 0 for finite β, or for β → 0 for finite α) because the side away from the mode will produce occasional extreme values. Minimum kurtosis takes place when the mass density is concentrated equally at each end (and therefore the mean is at the center), and there is no probability mass density in between the ends. +

Using the parametrization in terms of mean μ and sample size ν = α + β: +

α = μν, β = (1 − μ)ν

one can express the excess kurtosis in terms of the mean μ and the sample size ν as follows: +

excess kurtosis = 6[(1 − 2μ)²(1 + ν) − μ(1 − μ)(2 + ν)] / [μ(1 − μ)(2 + ν)(3 + ν)]

The excess kurtosis can also be expressed in terms of just the following two parameters: the variance var, and the sample size ν as follows: +

+
+

and, in terms of the variance var and the mean μ as follows: +

+
+

The plot of excess kurtosis as a function of the variance and the mean shows that the minimum value of the excess kurtosis (−2, which is the minimum possible value for excess kurtosis for any distribution) is intimately coupled with the maximum value of variance (1/4) and the symmetry condition: the mean occurring at the midpoint (μ = 1/2). This occurs for the symmetric case of α = β = 0, with zero skewness. At the limit, this is the 2 point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1 and zero probability everywhere else. (A coin toss: one face of the coin being x = 0 and the other face being x = 1.) Variance is maximum because the distribution is bimodal with nothing in between the two modes (spikes) at each end. Excess kurtosis is minimum: the probability density "mass" is zero at the mean and it is concentrated at the two peaks at each end. Excess kurtosis reaches the minimum possible value (for any distribution) when the probability density function has two spikes at each end: it is bi-"peaky" with nothing in between them. +

On the other hand, the plot shows that for extreme skewed cases, where the mean is located near one or the other end (μ = 0 or μ = 1), the variance is close to zero, and the excess kurtosis rapidly approaches infinity when the mean of the distribution approaches either end. +

Alternatively, the excess kurtosis can also be expressed in terms of just the following two parameters: the square of the skewness, and the sample size ν as follows: +

excess kurtosis = (6/(3 + ν)) [((2 + ν)/4)(skewness)² − 1]

From this last expression, one can obtain the same limits published over a century ago by Karl Pearson[20] for the beta distribution (see section below titled "Kurtosis bounded by the square of the skewness"). Setting α + β= ν = 0 in the above expression, one obtains Pearson's lower boundary (values for the skewness and excess kurtosis below the boundary (excess kurtosis + 2 − skewness2 = 0) cannot occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the "impossible region"). The limit of α + β = ν → ∞ determines Pearson's upper boundary. +

lim (ν → 0) excess kurtosis = (skewness)² − 2
lim (ν → ∞) excess kurtosis = (3/2)(skewness)²

therefore: +

(skewness)² − 2 < excess kurtosis < (3/2)(skewness)²

Values of ν = α + β such that ν ranges from zero to infinity, 0 < ν < ∞, span the whole region of the beta distribution in the plane of excess kurtosis versus squared skewness. +

For the symmetric case (α = β), the following limits apply: +

+
+

For the unsymmetric cases (α ≠ β) the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions: +

+
+

+

+

Characteristic function

+
Re(characteristic function) symmetric case α = β ranging from 25 to 0
Re(characteristic function) symmetric case α = β ranging from 0 to 25
Re(characteristic function) β = α + 1/2; α ranging from 25 to 0
Re(characteristic function) α = β + 1/2; β ranging from 25 to 0
Re(characteristic function) α = β + 1/2; β ranging from 0 to 25
+

The characteristic function is the Fourier transform of the probability density function. The characteristic function of the beta distribution is Kummer's confluent hypergeometric function (of the first kind):[1][17][21] +

φ_X(α; β; t) = E[e^(itX)] = 1F1(α; α + β; it) = Σ_(n=0)^∞ [α^(n) (it)^n] / [(α + β)^(n) n!]

where +

x^(n) = x(x + 1)(x + 2) ⋯ (x + n − 1)

is the rising factorial, also called the "Pochhammer symbol". The value of the characteristic function for t = 0, is one: +

φ_X(α; β; 0) = 1F1(α; α + β; 0) = 1.
+

Also, the real and imaginary parts of the characteristic function enjoy the following symmetries with respect to the origin of variable t: +

Re[1F1(α; α + β; it)] = Re[1F1(α; α + β; −it)]
Im[1F1(α; α + β; it)] = −Im[1F1(α; α + β; −it)]

The symmetric case α = β simplifies the characteristic function of the beta distribution to a Bessel function, since in the special case α + β = 2α the confluent hypergeometric function (of the first kind) reduces to a Bessel function (the modified Bessel function of the first kind ) using Kummer's second transformation as follows: +

+
+

In the accompanying plots, the real part (Re) of the characteristic function of the beta distribution is displayed for symmetric (α = β) and skewed (α ≠ β) cases. +

+

Other moments

Moment generating function

+

It also follows[1][8] that the moment generating function is +

M_X(α; β; t) = E[e^(tX)] = 1 + Σ_(k=1)^∞ [Π_(r=0)^(k−1) (α + r)/(α + β + r)] t^k/k! = 1F1(α; α + β; t)

In particular MX(α; β; 0) = 1. +

+

Higher moments

+

Using the moment generating function, the k-th raw moment is given by[1] the factor +

Π_(r=0)^(k−1) (α + r)/(α + β + r) = α^(k)/(α + β)^(k)

multiplying the (exponential series) term in the series of the moment generating function +

E[X^k] = α^(k)/(α + β)^(k) = Π_(r=0)^(k−1) (α + r)/(α + β + r)

where x^(k) is the Pochhammer symbol representing the rising factorial. It can also be written in a recursive form as +

E[X^k] = [(α + k − 1)/(α + β + k − 1)] E[X^(k−1)]

Since the moment generating function has a positive radius of convergence, the beta distribution is determined by its moments.[22] +
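The recursion gives a compact way to tabulate raw moments; a short sketch (SciPy assumed for the cross-check, helper name hypothetical):

from scipy.stats import beta

def raw_moment(k, a, b):
    # E[X^k] = prod_{r=0}^{k-1} (a + r) / (a + b + r), accumulated term by term.
    m = 1.0
    for r in range(k):
        m *= (a + r) / (a + b + r)
    return m

a, b = 2.0, 5.0
for k in range(1, 5):
    print(k, raw_moment(k, a, b), beta.moment(k, a, b))   # both columns should match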

+

Moments of transformed random variables

Moments of linearly transformed, product and inverted random variables
+

One can also show the following expectations for a transformed random variable,[1] where the random variable X is Beta-distributed with parameters α and β: X ~ Beta(α, β). The expected value of the variable 1 − X is the mirror-symmetry of the expected value based on X: +

E[1 − X] = β/(α + β)
E[X(1 − X)] = αβ/[(α + β)(α + β + 1)]

Due to the mirror-symmetry of the probability density function of the beta distribution, the variances based on variables X and 1 − X are identical, and the covariance of X and (1 − X) is the negative of the variance: +

var[1 − X] = var[X] = αβ/[(α + β)²(α + β + 1)]
cov[X, 1 − X] = −var[X] = −αβ/[(α + β)²(α + β + 1)]

These are the expected values for inverted variables, (these are related to the harmonic means, see § Harmonic mean): +

E[1/X] = (α + β − 1)/(α − 1)  if α > 1
E[1/(1 − X)] = (α + β − 1)/(β − 1)  if β > 1

The following transformation by dividing the variable X by its mirror-image X/(1 − X) results in the expected value of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI):[1] +

E[X/(1 − X)] = α/(β − 1)  if β > 1
E[(1 − X)/X] = β/(α − 1)  if α > 1

Variances of these transformed variables can be obtained by integration, as the expected values of the second moments centered on the corresponding variables: +

+
+
+

The following variance of the variable X divided by its mirror-image, X/(1 − X), results in the variance of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI):[1] +

+
+
+

The covariances are: +

+
+

These expectations and variances appear in the four-parameter Fisher information matrix (§ Fisher information.) +

+
Moments of logarithmically transformed random variables
+
Plot of logit(X) = ln(X/(1−X)) (vertical axis) vs. X in the domain of 0 to 1 (horizontal axis). Logit transformations are interesting, as they usually transform various shapes (including J-shapes) into (usually skewed) bell-shaped densities over the logit variable, and they may remove the end singularities over the original variable
+

Expected values for logarithmic transformations (useful for maximum likelihood estimates, see § Parameter estimation, Maximum likelihood) are discussed in this section. The following logarithmic linear transformations are related to the geometric means GX and G(1−X) (see § Geometric Mean): +

E[ln X] = ψ(α) − ψ(α + β)
E[ln(1 − X)] = ψ(β) − ψ(α + β)

Where the digamma function ψ(α) is defined as the logarithmic derivative of the gamma function:[17] +

ψ(α) = d ln Γ(α)/dα

Logit transformations are interesting,[23] as they usually transform various shapes (including J-shapes) into (usually skewed) bell-shaped densities over the logit variable, and they may remove the end singularities over the original variable: +

E[ln(X/(1 − X))] = ψ(α) − ψ(β)
E[ln((1 − X)/X)] = ψ(β) − ψ(α)

Johnson[24] considered the distribution of the logit-transformed variable ln(X/(1 − X)), including its moment generating function and approximations for large values of the shape parameters. This transformation extends the finite support [0, 1] based on the original variable X to infinite support in both directions of the real line (−∞, +∞). +

Higher order logarithmic moments can be derived by using the representation of a beta distribution as a proportion of two Gamma distributions and differentiating through the integral. They can be expressed in terms of higher order poly-gamma functions as follows: +

+
+

therefore the variance of the logarithmic variables and covariance of ln(X) and ln(1−X) are: +

var[ln X] = ψ1(α) − ψ1(α + β)
var[ln(1 − X)] = ψ1(β) − ψ1(α + β)
cov[ln X, ln(1 − X)] = −ψ1(α + β)

where the trigamma function, denoted ψ1(α), is the second of the polygamma functions, and is defined as the derivative of the digamma function: +

ψ1(α) = dψ(α)/dα = d² ln Γ(α)/dα².
+

The variances and covariance of the logarithmically transformed variables X and (1−X) are different, in general, because the logarithmic transformation destroys the mirror-symmetry of the original variables X and (1−X), as the logarithm approaches negative infinity for the variable approaching zero. +

These logarithmic variances and covariance are the elements of the Fisher information matrix for the beta distribution. They are also a measure of the curvature of the log likelihood function (see section on Maximum likelihood estimation). +

The variances of the log inverse variables are identical to the variances of the log variables: +

var[ln(1/X)] = var[ln X] = ψ1(α) − ψ1(α + β)
var[ln(1/(1 − X))] = var[ln(1 − X)] = ψ1(β) − ψ1(α + β)

It also follows that the variances of the logit transformed variables are: +

var[ln(X/(1 − X))] = var[ln((1 − X)/X)] = ψ1(α) + ψ1(β)

Quantities of information (entropy)

+

Given a beta distributed random variable, X ~ Beta(α, β), the differential entropy of X is (measured in nats),[25] the expected value of the negative of the logarithm of the probability density function: +

h(X) = E[−ln f(X; α, β)] = ln B(α, β) − (α − 1)ψ(α) − (β − 1)ψ(β) + (α + β − 2)ψ(α + β)

where f(x; α, β) is the probability density function of the beta distribution: +

f(x; α, β) = x^(α−1) (1 − x)^(β−1) / B(α, β)

The digamma function ψ appears in the formula for the differential entropy as a consequence of Euler's integral formula for the harmonic numbers which follows from the integral: +

+
+

The differential entropy of the beta distribution is negative for all values of α and β greater than zero, except at α = β = 1 (for which values the beta distribution is the same as the uniform distribution), where the differential entropy reaches its maximum value of zero. It is to be expected that the maximum entropy should take place when the beta distribution becomes equal to the uniform distribution, since uncertainty is maximal when all possible events are equiprobable. +

For α or β approaching zero, the differential entropy approaches its minimum value of negative infinity. For (either or both) α or β approaching zero, there is a maximum amount of order: all the probability density is concentrated at the ends, and there is zero probability density at points located between the ends. Similarly for (either or both) α or β approaching infinity, the differential entropy approaches its minimum value of negative infinity, and a maximum amount of order. If either α or β approaches infinity (and the other is finite) all the probability density is concentrated at an end, and the probability density is zero everywhere else. If both shape parameters are equal (the symmetric case), α = β, and they approach infinity simultaneously, the probability density becomes a spike (Dirac delta function) concentrated at the middle x = 1/2, and hence there is 100% probability at the middle x = 1/2 and zero probability everywhere else. +
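For a numerical check of the differential entropy expression (SciPy assumed; a sketch, with values in nats):

from scipy.special import betaln, digamma
from scipy.stats import beta

def beta_entropy(a, b):
    # h(X) = ln B(a,b) - (a-1) psi(a) - (b-1) psi(b) + (a+b-2) psi(a+b)
    return (betaln(a, b)
            - (a - 1.0) * digamma(a)
            - (b - 1.0) * digamma(b)
            + (a + b - 2.0) * digamma(a + b))

print(beta_entropy(1, 1))                         # 0: the uniform case attains the maximum
print(beta_entropy(3, 3), beta(3, 3).entropy())   # negative, and the two values agree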

+

The (continuous case) differential entropy was introduced by Shannon in his original paper (where he named it the "entropy of a continuous distribution"), as the concluding part of the same paper where he defined the discrete entropy.[26] It is known since then that the differential entropy may differ from the infinitesimal limit of the discrete entropy by an infinite offset, therefore the differential entropy can be negative (as it is for the beta distribution). What really matters is the relative value of entropy. +

Given two beta distributed random variables, X1 ~ Beta(α, β) and X2 ~ Beta(α′, β′), the cross-entropy is (measured in nats)[27] +

H(X1, X2) = −∫_0^1 f(x; α, β) ln f(x; α′, β′) dx = ln B(α′, β′) − (α′ − 1)ψ(α) − (β′ − 1)ψ(β) + (α′ + β′ − 2)ψ(α + β)

The cross entropy has been used as an error metric to measure the distance between two hypotheses.[28][29] Its absolute value is minimum when the two distributions are identical. It is the information measure most closely related to the log maximum likelihood [27](see section on "Parameter estimation. Maximum likelihood estimation")). +

The relative entropy, or Kullback–Leibler divergence DKL(X1 || X2), is a measure of the inefficiency of assuming that the distribution is X2 ~ Beta(α′, β′) when the distribution is really X1 ~ Beta(α, β). It is defined as follows (measured in nats). +

D_KL(X1 || X2) = ∫_0^1 f(x; α, β) ln[f(x; α, β)/f(x; α′, β′)] dx = ln[B(α′, β′)/B(α, β)] + (α − α′)ψ(α) + (β − β′)ψ(β) + (α′ − α + β′ − β)ψ(α + β)

The relative entropy, or Kullback–Leibler divergence, is always non-negative. A few numerical examples follow: +

+
  • X1 ~ Beta(1, 1) and X2 ~ Beta(3, 3); DKL(X1 || X2) = 0.598803; DKL(X2 || X1) = 0.267864; h(X1) = 0; h(X2) = −0.267864
  • +
  • X1 ~ Beta(3, 0.5) and X2 ~ Beta(0.5, 3); DKL(X1 || X2) = 7.21574; DKL(X2 || X1) = 7.21574; h(X1) = −1.10805; h(X2) = −1.10805.
+

The Kullback–Leibler divergence is not symmetric DKL(X1 || X2) ≠ DKL(X2 || X1) for the case in which the individual beta distributions Beta(1, 1) and Beta(3, 3) are symmetric, but have different entropies h(X1) ≠ h(X2). The value of the Kullback divergence depends on the direction traveled: whether going from a higher (differential) entropy to a lower (differential) entropy or the other way around. In the numerical example above, the Kullback divergence measures the inefficiency of assuming that the distribution is (bell-shaped) Beta(3, 3), rather than (uniform) Beta(1, 1). The "h" entropy of Beta(1, 1) is higher than the "h" entropy of Beta(3, 3) because the uniform distribution Beta(1, 1) has a maximum amount of disorder. The Kullback divergence is more than two times higher (0.598803 instead of 0.267864) when measured in the direction of decreasing entropy: the direction that assumes that the (uniform) Beta(1, 1) distribution is (bell-shaped) Beta(3, 3) rather than the other way around. In this restricted sense, the Kullback divergence is consistent with the second law of thermodynamics. +

The Kullback–Leibler divergence is symmetric DKL(X1 || X2) = DKL(X2 || X1) for the skewed cases Beta(3, 0.5) and Beta(0.5, 3) that have equal differential entropy h(X1) = h(X2). +
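The numerical examples above can be reproduced directly from the closed form of the divergence (SciPy assumed; a sketch with a hypothetical helper name):

from scipy.special import betaln, digamma

def kl_beta(a1, b1, a2, b2):
    # D_KL( Beta(a1, b1) || Beta(a2, b2) ) in nats:
    # ln[B(a2, b2)/B(a1, b1)] + (a1 - a2) psi(a1) + (b1 - b2) psi(b1)
    #   + (a2 - a1 + b2 - b1) psi(a1 + b1)
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

print(kl_beta(1, 1, 3, 3), kl_beta(3, 3, 1, 1))           # ~0.5988 and ~0.2679
print(kl_beta(3, 0.5, 0.5, 3), kl_beta(0.5, 3, 3, 0.5))   # equal, ~7.2157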

The symmetry condition: +

+
+

follows from the above definitions and the mirror-symmetry f(x; α, β) = f(1 − x; β, α) enjoyed by the beta distribution. +

+

Relationships between statistical measures

Mean, mode and median relationship

+

If 1 < α < β then mode ≤ median ≤ mean.[9] Expressing the mode (only for α, β > 1), and the mean in terms of α and β: +

mode = (α − 1)/(α + β − 2) ≤ median ≤ α/(α + β) = mean

If 1 < β < α then the order of the inequalities are reversed. For α, β > 1 the absolute distance between the mean and the median is less than 5% of the distance between the maximum and minimum values of x. On the other hand, the absolute distance between the mean and the mode can reach 50% of the distance between the maximum and minimum values of x, for the (pathological) case of α = 1 and β = 1, for which values the beta distribution approaches the uniform distribution and the differential entropy approaches its maximum value, and hence maximum "disorder". +

For example, for α = 1.0001 and β = 1.00000001: +

+
  • mode = 0.9999; PDF(mode) = 1.00010
  • +
  • mean = 0.500025; PDF(mean) = 1.00003
  • +
  • median = 0.500035; PDF(median) = 1.00003
  • +
  • mean − mode = −0.499875
  • +
  • mean − median = −9.65538 × 10−6
+

where PDF stands for the value of the probability density function. +

+ +

+

Mean, geometric mean and harmonic mean relationship

+
:Mean, Median, Geometric Mean and Harmonic Mean for Beta distribution with 0 < α = β < 5
+

It is known from the inequality of arithmetic and geometric means that the geometric mean is lower than the mean. Similarly, the harmonic mean is lower than the geometric mean. The accompanying plot shows that for α = β, both the mean and the median are exactly equal to 1/2, regardless of the value of α = β, and the mode is also equal to 1/2 for α = β > 1, however the geometric and harmonic means are lower than 1/2 and they only approach this value asymptotically as α = β → ∞. +

+

Kurtosis bounded by the square of the skewness

+
Beta distribution α and β parameters vs. excess Kurtosis and squared Skewness
+

As remarked by Feller,[5] in the Pearson system the beta probability density appears as type I (any difference between the beta distribution and Pearson's type I distribution is only superficial and it makes no difference for the following discussion regarding the relationship between kurtosis and skewness). Karl Pearson showed, in Plate 1 of his paper [20] published in 1916, a graph with the kurtosis as the vertical axis (ordinate) and the square of the skewness as the horizontal axis (abscissa), in which a number of distributions were displayed.[30] The region occupied by the beta distribution is bounded by the following two lines in the (skewness2,kurtosis) plane, or the (skewness2,excess kurtosis) plane: +

(skewness)² + 1 ≤ kurtosis ≤ (3/2)(skewness)² + 3

or, equivalently, +

(skewness)² − 2 ≤ excess kurtosis ≤ (3/2)(skewness)²

At a time when there were no powerful digital computers, Karl Pearson accurately computed further boundaries,[31][20] for example, separating the "U-shaped" from the "J-shaped" distributions. The lower boundary line (excess kurtosis + 2 − skewness2 = 0) is produced by skewed "U-shaped" beta distributions with both values of shape parameters α and β close to zero. The upper boundary line (excess kurtosis − (3/2) skewness2 = 0) is produced by extremely skewed distributions with very large values of one of the parameters and very small values of the other parameter. Karl Pearson showed[20] that this upper boundary line (excess kurtosis − (3/2) skewness2 = 0) is also the intersection with Pearson's distribution III, which has unlimited support in one direction (towards positive infinity), and can be bell-shaped or J-shaped. His son, Egon Pearson, showed[30] that the region (in the kurtosis/squared-skewness plane) occupied by the beta distribution (equivalently, Pearson's distribution I) as it approaches this boundary (excess kurtosis − (3/2) skewness2 = 0) is shared with the noncentral chi-squared distribution. Karl Pearson[32] (Pearson 1895, pp. 357, 360, 373–376) also showed that the gamma distribution is a Pearson type III distribution. Hence this boundary line for Pearson's type III distribution is known as the gamma line. (This can be shown from the fact that the excess kurtosis of the gamma distribution is 6/k and the square of the skewness is 4/k, hence (excess kurtosis − (3/2) skewness2 = 0) is identically satisfied by the gamma distribution regardless of the value of the parameter "k"). Pearson later noted that the chi-squared distribution is a special case of Pearson's type III and also shares this boundary line (as it is apparent from the fact that for the chi-squared distribution the excess kurtosis is 12/k and the square of the skewness is 8/k, hence (excess kurtosis − (3/2) skewness2 = 0) is identically satisfied regardless of the value of the parameter "k"). This is to be expected, since the chi-squared distribution X ~ χ2(k) is a special case of the gamma distribution, with parametrization X ~ Γ(k/2, 1/2) where k is a positive integer that specifies the "number of degrees of freedom" of the chi-squared distribution. +

An example of a beta distribution near the upper boundary (excess kurtosis − (3/2) skewness2 = 0) is given by α = 0.1, β = 1000, for which the ratio (excess kurtosis)/(skewness2) = 1.49835 approaches the upper limit of 1.5 from below. An example of a beta distribution near the lower boundary (excess kurtosis + 2 − skewness2 = 0) is given by α= 0.0001, β = 0.1, for which values the expression (excess kurtosis + 2)/(skewness2) = 1.01621 approaches the lower limit of 1 from above. In the infinitesimal limit for both α and β approaching zero symmetrically, the excess kurtosis reaches its minimum value at −2. This minimum value occurs at the point at which the lower boundary line intersects the vertical axis (ordinate). (However, in Pearson's original chart, the ordinate is kurtosis, instead of excess kurtosis, and it increases downwards rather than upwards). +

Values for the skewness and excess kurtosis below the lower boundary (excess kurtosis + 2 − skewness2 = 0) cannot occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the "impossible region". The boundary for this "impossible region" is determined by (symmetric or skewed) bimodal "U"-shaped distributions for which the parameters α and β approach zero and hence all the probability density is concentrated at the ends: x = 0, 1 with practically nothing in between them. Since for α ≈ β ≈ 0 the probability density is concentrated at the two ends x = 0 and x = 1, this "impossible boundary" is determined by a Bernoulli distribution, where the two only possible outcomes occur with respective probabilities p and q = 1−p. For cases approaching this limit boundary with symmetry α = β, skewness ≈ 0, excess kurtosis ≈ −2 (this is the lowest excess kurtosis possible for any distribution), and the probabilities are pq ≈ 1/2. For cases approaching this limit boundary with skewness, excess kurtosis ≈ −2 + skewness2, and the probability density is concentrated more at one end than the other end (with practically nothing in between), with probabilities at the left end x = 0 and at the right end x = 1. +

+

Symmetry

+

All statements are conditional on α, β > 0 +

+ +
+ +
+ +
+ +
+ +
+
  • Geometric Means each is individually asymmetric, the following symmetry applies between the geometric mean based on X and the geometric mean based on its reflection (1-X)
+
+
  • Harmonic means each is individually asymmetric, the following symmetry applies between the harmonic mean based on X and the harmonic mean based on its reflection (1-X)
+
.
+
  • Variance symmetry
+
+
  • Geometric variances each is individually asymmetric, the following symmetry applies between the log geometric variance based on X and the log geometric variance based on its reflection (1-X)
+
+
  • Geometric covariance symmetry
+
+ +
+ +
+
  • Excess kurtosis symmetry
+
+
  • Characteristic function symmetry of Real part (with respect to the origin of variable "t")
+
+ +
+
  • Characteristic function symmetry of Absolute value (with respect to the origin of variable "t")
+
+
  • Differential entropy symmetry
+
+ +
+
  • Fisher information matrix symmetry
+
+

Geometry of the probability density function

Inflection points

+
Inflection point location versus α and β showing regions with one inflection point
+
Inflection point location versus α and β showing region with two inflection points
+

For certain values of the shape parameters α and β, the probability density function has inflection points, at which the curvature changes sign. The position of these inflection points can be useful as a measure of the dispersion or spread of the distribution. +

Defining the following quantity: +

+
+

Points of inflection occur,[1][7][8][19] depending on the value of the shape parameters α and β, as follows: +

+
  • (α > 2, β > 2) The distribution is bell-shaped (symmetric for α = β and skewed otherwise), with two inflection points, equidistant from the mode:
+
+
  • (α = 2, β > 2) The distribution is unimodal, positively skewed, right-tailed, with one inflection point, located to the right of the mode:
+
+
  • (α > 2, β = 2) The distribution is unimodal, negatively skewed, left-tailed, with one inflection point, located to the left of the mode:
+
+
  • (1 < α < 2, β > 2, α+β>2) The distribution is unimodal, positively skewed, right-tailed, with one inflection point, located to the right of the mode:
+
+
  • (0 < α < 1, 1 < β < 2) The distribution has a mode at the left end x = 0 and it is positively skewed, right-tailed. There is one inflection point, located to the right of the mode:
+
+
  • (α > 2, 1 < β < 2) The distribution is unimodal negatively skewed, left-tailed, with one inflection point, located to the left of the mode:
+
+
  • (1 < α < 2, 0 < β < 1) The distribution has a mode at the right end x=1 and it is negatively skewed, left-tailed. There is one inflection point, located to the left of the mode:
+
+

There are no inflection points in the remaining (symmetric and skewed) regions: U-shaped: (α, β < 1) upside-down-U-shaped: (1 < α < 2, 1 < β < 2), reverse-J-shaped (α < 1, β > 2) or J-shaped: (α > 2, β < 1) +

The accompanying plots show the inflection point locations (shown vertically, ranging from 0 to 1) versus α and β (the horizontal axes ranging from 0 to 5). There are large cuts at surfaces intersecting the lines α = 1, β = 1, α = 2, and β = 2 because at these values the beta distribution changes from two modes, to one mode, to no mode. +

+

Shapes

+
PDF for symmetric beta distribution vs. x and α = β from 0 to 30
+
PDF for symmetric beta distribution vs. x and α = β from 0 to 2
+
PDF for skewed beta distribution vs. x and β = 2.5α from 0 to 9
+
PDF for skewed beta distribution vs. x and β = 5.5α from 0 to 9
+
PDF for skewed beta distribution vs. x and β = 8α from 0 to 10
+

The beta density function can take a wide variety of different shapes depending on the values of the two parameters α and β. The ability of the beta distribution to take this great diversity of shapes (using only two parameters) is partly responsible for finding wide application for modeling actual measurements: +

+
Symmetric (α = β)[edit]
+
  • the density function is symmetric about 1/2 (blue & teal plots).
  • +
  • median = mean = 1/2.
  • +
  • skewness = 0.
  • +
  • variance = 1/(4(2α + 1))
  • +
  • α = β < 1 +
    • U-shaped (blue plot).
    • +
    • bimodal: left mode = 0, right mode =1, anti-mode = 1/2
    • +
    • 1/12 < var(X) < 1/4[1]
    • +
    • −2 < excess kurtosis(X) < −6/5
    • +
    • α = β = 1/2 is the arcsine distribution +
      • var(X) = 1/8
      • +
      • excess kurtosis(X) = −3/2
      • +
      • CF = Rinc (t) [33]
    • +
    • α = β → 0 is a 2-point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1 and zero probability everywhere else. A coin toss: one face of the coin being x = 0 and the other face being x = 1. +
      • +
      • a lower value than this is impossible for any distribution to reach.
      • +
      • The differential entropy approaches a minimum value of −∞
  • +
  • α = β = 1 +
  • +
  • α = β > 1 +
    • symmetric unimodal
    • +
    • mode = 1/2.
    • +
    • 0 < var(X) < 1/12[1]
    • +
    • −6/5 < excess kurtosis(X) < 0
    • +
    • α = β = 3/2 is a semi-elliptic [0, 1] distribution, see: Wigner semicircle distribution[34] +
      • var(X) = 1/16.
      • +
      • excess kurtosis(X) = −1
      • +
      • CF = 2 Jinc (t)
    • +
    • α = β = 2 is the parabolic [0, 1] distribution +
      • var(X) = 1/20
      • +
      • excess kurtosis(X) = −6/7
      • +
      • CF = 3 Tinc (t) [35]
    • +
    • α = β > 2 is bell-shaped, with inflection points located to either side of the mode +
      • 0 < var(X) < 1/20
      • +
      • −6/7 < excess kurtosis(X) < 0
    • +
    • α = β → ∞ is a 1-point Degenerate distribution with a Dirac delta function spike at the midpoint x = 1/2 with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the single point x = 1/2. +
      • +
      • +
      • The differential entropy approaches a minimum value of −∞
+
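The variance and excess-kurtosis values quoted for the symmetric special cases above can be checked directly with SciPy; this is only a sanity-check sketch, not part of the derivations.

```python
from scipy.stats import beta

# Symmetric cases alpha = beta: arcsine (1/2), uniform (1),
# Wigner-semicircle-like (3/2) and parabolic (2) distributions.
for a in (0.5, 1.0, 1.5, 2.0):
    var = beta.var(a, a)
    exkurt = float(beta.stats(a, a, moments='k'))   # Fisher (excess) kurtosis
    print(f"alpha = beta = {a}: var = {var:.4f} "
          f"(formula 1/(4(2a+1)) = {1 / (4 * (2 * a + 1)):.4f}), "
          f"excess kurtosis = {exkurt:.4f}")
```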
Skewed (α ≠ β)[edit]
+

The density function is skewed. An interchange of parameter values yields the mirror image (the reverse) of the initial curve. Some more specific cases: +

+
  • α < 1, β < 1 +
    • U-shaped
    • +
    • Positive skew for α < β, negative skew for α > β.
    • +
    • bimodal: left mode = 0, right mode = 1, anti-mode =
    • +
    • 0 < median < 1.
    • +
    • 0 < var(X) < 1/4
  • +
  • α > 1, β > 1 +
    • unimodal (magenta & cyan plots),
    • +
    • Positive skew for α < β, negative skew for α > β.
    • +
    • +
    • 0 < median < 1
    • +
    • 0 < var(X) < 1/12
  • +
  • α < 1, β ≥ 1 +
    • reverse J-shaped with a right tail,
    • +
    • positively skewed,
    • +
    • strictly decreasing, convex
    • +
    • mode = 0
    • +
    • 0 < median < 1/2.
    • +
    • (maximum variance occurs for α = (√5 − 1)/2 and β = 1, i.e. α = Φ, the golden ratio conjugate)
  • +
  • α ≥ 1, β < 1 +
    • J-shaped with a left tail,
    • +
    • negatively skewed,
    • +
    • strictly increasing, convex
    • +
    • mode = 1
    • +
    • 1/2 < median < 1
    • +
    • (maximum variance occurs for β = (√5 − 1)/2 and α = 1, i.e. β = Φ, the golden ratio conjugate)
  • +
  • α = 1, β > 1 +
    • positively skewed,
    • +
    • strictly decreasing (red plot),
    • +
    • a reversed (mirror-image) power function [0,1] distribution
    • +
    • mean = 1 / (β + 1)
    • +
    • median = 1 − 1/2^(1/β)
    • +
    • mode = 0
    • +
    • α = 1, 1 < β < 2 +
      • concave
      • +
      • +
      • 1/18 < var(X) < 1/12.
    • +
    • α = 1, β = 2 +
      • a straight line with slope −2, the right-triangular distribution with right angle at the left end, at x = 0
      • +
      • +
      • var(X) = 1/18
    • +
    • α = 1, β > 2 +
      • reverse J-shaped with a right tail,
      • +
      • convex
      • +
      • +
      • 0 < var(X) < 1/18
  • +
  • α > 1, β = 1 +
    • negatively skewed,
    • +
    • strictly increasing (green plot),
    • +
    • the power function [0, 1] distribution[8]
    • +
    • mean = α / (α + 1)
    • +
    • median = 1/2^(1/α)
    • +
    • mode = 1
    • +
    • 2 > α > 1, β = 1 +
      • concave
      • +
      • +
      • 1/18 < var(X) < 1/12
    • +
    • α = 2, β = 1 +
      • a straight line with slope +2, the right-triangular distribution with right angle at the right end, at x = 1
      • +
      • +
      • var(X) = 1/18
    • +
    • α > 2, β = 1 +
      • J-shaped with a left tail, convex
      • +
      • +
      • 0 < var(X) < 1/18
+

Related distributions[edit]

+

Transformations[edit]

+
  • If X ~ Beta(α, β) then 1 − X ~ Beta(β, α) mirror-image symmetry
  • +
  • If X ~ Beta(α, β) then . The beta prime distribution, also called "beta distribution of the second kind".
  • +
  • If , then has a generalized logistic distribution, with density , where is the logistic sigmoid.
  • +
  • If X ~ Beta(α, β) then .
  • +
  • If X ~ Beta(n/2, m/2) then (assuming n > 0 and m > 0), the Fisher–Snedecor F distribution.
  • +
  • If then min + X(max − min) ~ PERT(min, max, m, λ) where PERT denotes a PERT distribution used in PERT analysis, and m=most likely value.[36] Traditionally[37] λ = 4 in PERT analysis.
  • +
  • If X ~ Beta(1, β) then X ~ Kumaraswamy distribution with parameters (1, β)
  • +
  • If X ~ Beta(α, 1) then X ~ Kumaraswamy distribution with parameters (α, 1)
  • +
  • If X ~ Beta(α, 1) then −ln(X) ~ Exponential(α)
+
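The last transformation in the list above can be verified by simulation; the sketch below (assuming SciPy) compares −ln(X) for X ~ Beta(α, 1) with an Exponential distribution of rate α via a Kolmogorov–Smirnov test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 2.5
x = stats.beta.rvs(alpha, 1, size=100_000, random_state=rng)

# If X ~ Beta(alpha, 1) then -ln(X) should be exponential with rate alpha,
# i.e. scale 1/alpha in SciPy's (loc, scale) parametrization.
result = stats.kstest(-np.log(x), 'expon', args=(0, 1 / alpha))
print(result.statistic, result.pvalue)   # small statistic, large p-value expected
```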

Special and limiting cases[edit]

+
Example of eight realizations of a random walk in one dimension starting at 0: the probability for the time of the last visit to the origin is distributed as Beta(1/2, 1/2)
+
Beta(1/2, 1/2): The arcsine distribution probability density was proposed by Harold Jeffreys to represent uncertainty for a Bernoulli or a binomial distribution in Bayesian inference, and is now commonly referred to as Jeffreys prior: p^(−1/2)(1 − p)^(−1/2). This distribution also appears in several random walk fundamental theorems
+
  • Beta(1, 1) ~ U(0, 1) with density 1 on that interval.
  • +
  • Beta(n, 1) ~ Maximum of n independent rvs. with U(0, 1), sometimes called the standard power function distribution, with density n x^(n−1) on that interval.
  • +
  • Beta(1, n) ~ Minimum of n independent rvs. with U(0, 1), with density n (1 − x)^(n−1) on that interval.
  • +
  • If X ~ Beta(3/2, 3/2) and r > 0 then 2rX − r ~ Wigner semicircle distribution.
  • +
  • Beta(1/2, 1/2) is equivalent to the arcsine distribution. This distribution is also Jeffreys prior probability for the Bernoulli and binomial distributions. The arcsine probability density is a distribution that appears in several random-walk fundamental theorems. In a fair coin toss random walk, the probability for the time of the last visit to the origin is distributed as an (U-shaped) arcsine distribution.[5][11] In a two-player fair-coin-toss game, a player is said to be in the lead if the random walk (that started at the origin) is above the origin. The most probable number of times that a given player will be in the lead, in a game of length 2N, is not N. On the contrary, N is the least likely number of times that the player will be in the lead. The most likely number of times in the lead is 0 or 2N (following the arcsine distribution).
  • +
  • If X ~ Beta(1, n) then nX converges in distribution to Exponential(1), the exponential distribution, as n → ∞.
  • +
  • If X ~ Beta(k, n) then, for fixed k, nX converges in distribution to Gamma(k, 1), the gamma distribution, as n → ∞.
  • +
  • For large α and β, the distribution approaches a normal distribution. More precisely, if X ~ Beta(n, n) then √n(X − 1/2) converges in distribution to a normal distribution with mean 0 and variance 1/8 as n increases (see the sketch after this list).
+
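A quick simulation of the normal limit in the last item above (a sketch assuming SciPy; the sample size and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 400
x = stats.beta.rvs(n, n, size=50_000, random_state=rng)
z = np.sqrt(n) * (x - 0.5)

print(z.var())                                           # close to 1/8 = 0.125
print(stats.kstest(z, 'norm', args=(0, np.sqrt(1 / 8))).pvalue)
```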

Derived from other distributions[edit]

+
  • The kth order statistic of a sample of size n from the uniform distribution is a beta random variable, U(k) ~ Beta(k, n+1−k).[38]
  • +
  • If X ~ Gamma(α, θ) and Y ~ Gamma(β, θ) are independent, then .
  • +
  • If and are independent, then .
  • +
  • If X ~ U(0, 1) and α > 0 then X^(1/α) ~ Beta(α, 1). The power function distribution.
  • +
  • If , then for discrete values of n and k where and .[39]
  • +
  • If X ~ Cauchy(0, 1) then
+
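The gamma-ratio relation above (X/(X + Y) ~ Beta(α, β) for independent X ~ Gamma(α, θ) and Y ~ Gamma(β, θ)) can likewise be checked by simulation; a sketch assuming SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a, b, theta = 2.0, 5.0, 3.0

x = stats.gamma.rvs(a, scale=theta, size=100_000, random_state=rng)
y = stats.gamma.rvs(b, scale=theta, size=100_000, random_state=rng)

# X/(X + Y) should follow Beta(a, b), independently of the common scale theta.
print(stats.kstest(x / (x + y), 'beta', args=(a, b)).pvalue)
```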

Combination with other distributions[edit]

+
  • If X ~ Beta(α, β) and Y ~ F(2β, 2α), then for all x > 0.
+

Compounding with other distributions[edit]

+ +

Generalisations[edit]

+ +

Statistical inference[edit]

+

Parameter estimation[edit]

+

Method of moments[edit]

+
Two unknown parameters[edit]
+

Two unknown parameters ( of a beta distribution supported in the [0,1] interval) can be estimated, using the method of moments, with the first two moments (sample mean and sample variance) as follows. Let: +

+
+

be the sample mean estimate and +

+
+

be the sample variance estimate. The method-of-moments estimates of the parameters are +

+
if
+
if
+

When the distribution is required over a known interval other than [0, 1] with random variable X, say [a, c] with random variable Y, then replace with and with in the above pair of equations for the shape parameters (see the "Alternative parametrizations, four parameters" section below),[40] where: +

+
+
+
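A minimal sketch of these two-parameter method-of-moments estimates, using the standard closed forms α̂ = x̄(x̄(1 − x̄)/v̄ − 1) and β̂ = (1 − x̄)(x̄(1 − x̄)/v̄ − 1), which are valid when v̄ < x̄(1 − x̄); the function name is arbitrary.

```python
import numpy as np

def beta_method_of_moments(x):
    """Method-of-moments estimates of (alpha, beta) from data in [0, 1]."""
    x = np.asarray(x, dtype=float)
    mean, var = x.mean(), x.var()                # 1/N (biased) sample variance
    if not 0 < var < mean * (1 - mean):
        raise ValueError("sample variance is incompatible with a beta fit")
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common

# Example: recover the parameters of simulated Beta(2, 5) data.
rng = np.random.default_rng(3)
print(beta_method_of_moments(rng.beta(2, 5, size=50_000)))   # roughly (2, 5)
```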
Four unknown parameters[edit]
+
Solutions for parameter estimates vs. (sample) excess Kurtosis and (sample) squared Skewness Beta distribution
+

All four parameters ( of a beta distribution supported in the [a, c] interval -see section "Alternative parametrizations, Four parameters"-) can be estimated, using the method of moments developed by Karl Pearson, by equating sample and population values of the first four central moments (mean, variance, skewness and excess kurtosis).[1][41][42] The excess kurtosis was expressed in terms of the square of the skewness, and the sample size ν = α + β, (see previous section "Kurtosis") as follows: +

+
+

One can use this equation to solve for the sample size ν= α + β in terms of the square of the skewness and the excess kurtosis as follows:[41] +

+
+
+

This is the ratio (multiplied by a factor of 3) between the previously derived limit boundaries for the beta distribution in a space (as originally done by Karl Pearson[20]) defined with coordinates of the square of the skewness in one axis and the excess kurtosis in the other axis (see § Kurtosis bounded by the square of the skewness): +

The case of zero skewness can be immediately solved, because for zero skewness α = β and hence ν = 2α = 2β, therefore α = β = ν/2 +

+
+
+

(Excess kurtosis is negative for the beta distribution with zero skewness, ranging from −2 to 0, so that ν (and therefore the sample shape parameters) is positive, ranging from zero when the shape parameters approach zero and the excess kurtosis approaches −2, to infinity when the shape parameters approach infinity and the excess kurtosis approaches zero). +

For non-zero sample skewness one needs to solve a system of two coupled equations. Since the skewness and the excess kurtosis are independent of the parameters a and c, the shape parameters α and β can be uniquely determined from the sample skewness and the sample excess kurtosis, by solving the coupled equations with two known variables (sample skewness and sample excess kurtosis) and two unknowns (the shape parameters): +

+
+
+
+

resulting in the following solution:[41] +

+
+
+

Where one should take the solutions as follows: for (negative) sample skewness < 0, and for (positive) sample skewness > 0. +

The accompanying plot shows these two solutions as surfaces in a space with horizontal axes of (sample excess kurtosis) and (sample squared skewness) and the shape parameters as the vertical axis. The surfaces are constrained by the condition that the sample excess kurtosis must be bounded by the sample squared skewness as stipulated in the above equation. The two surfaces meet at the right edge defined by zero skewness. Along this right edge, both parameters are equal and the distribution is symmetric U-shaped for α = β < 1, uniform for α = β = 1, upside-down-U-shaped for 1 < α = β < 2 and bell-shaped for α = β > 2. The surfaces also meet at the front (lower) edge defined by "the impossible boundary" line (excess kurtosis + 2 - skewness2 = 0). Along this front (lower) boundary both shape parameters approach zero, and the probability density is concentrated more at one end than the other end (with practically nothing in between), with probabilities at the left end x = 0 and at the right end x = 1. The two surfaces become further apart towards the rear edge. At this rear edge the surface parameters are quite different from each other. As remarked, for example, by Bowman and Shenton,[43] sampling in the neighborhood of the line (sample excess kurtosis - (3/2)(sample skewness)2 = 0) (the just-J-shaped portion of the rear edge where blue meets beige), "is dangerously near to chaos", because at that line the denominator of the expression above for the estimate ν = α + β becomes zero and hence ν approaches infinity as that line is approached. Bowman and Shenton [43] write that "the higher moment parameters (kurtosis and skewness) are extremely fragile (near that line). However, the mean and standard deviation are fairly reliable." Therefore, the problem is for the case of four parameter estimation for very skewed distributions such that the excess kurtosis approaches (3/2) times the square of the skewness. This boundary line is produced by extremely skewed distributions with very large values of one of the parameters and very small values of the other parameter. See § Kurtosis bounded by the square of the skewness for a numerical example and further comments about this rear edge boundary line (sample excess kurtosis - (3/2)(sample skewness)2 = 0). As remarked by Karl Pearson himself [44] this issue may not be of much practical importance as this trouble arises only for very skewed J-shaped (or mirror-image J-shaped) distributions with very different values of shape parameters that are unlikely to occur much in practice). The usual skewed-bell-shape distributions that occur in practice do not have this parameter estimation problem. +

The remaining two parameters can be determined using the sample mean and the sample variance using a variety of equations.[1][41] One alternative is to calculate the support interval range based on the sample variance and the sample kurtosis. For this purpose one can solve, in terms of the range , the equation expressing the excess kurtosis in terms of the sample variance, and the sample size ν (see § Kurtosis and § Alternative parametrizations, four parameters): +

+
+

to obtain: +

+
+

Another alternative is to calculate the support interval range based on the sample variance and the sample skewness.[41] For this purpose one can solve, in terms of the range , the equation expressing the squared skewness in terms of the sample variance, and the sample size ν (see section titled "Skewness" and "Alternative parametrizations, four parameters"): +

+
+

to obtain:[41] +

+
+

The remaining parameter can be determined from the sample mean and the previously obtained parameters: +

+
+

and finally, . +

In the above formulas one may take, for example, as estimates of the sample moments: +

+
+

The estimators G1 for sample skewness and G2 for sample kurtosis are used by DAP/SAS, PSPP/SPSS, and Excel. However, they are not used by BMDP and (according to [45]) they were not used by MINITAB in 1998. Actually, Joanes and Gill in their 1998 study[45] concluded that the skewness and kurtosis estimators used in BMDP and in MINITAB (at that time) had smaller variance and mean-squared error in normal samples, but the skewness and kurtosis estimators used in DAP/SAS, PSPP/SPSS, namely G1 and G2, had smaller mean-squared error in samples from a very skewed distribution. It is for this reason that we have spelled out "sample skewness", etc., in the above formulas, to make it explicit that the user should choose the best estimator according to the problem at hand, as the best estimator for skewness and kurtosis depends on the amount of skewness (as shown by Joanes and Gill[45]). +
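To illustrate the choice of estimator discussed above, SciPy exposes both the plain moment-based estimators (g1, g2) and bias-adjusted versions through its bias flag; according to SciPy's documentation the adjusted values correspond to the G1/G2 conventions mentioned here, but that correspondence is assumed rather than taken from this article. A sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.beta(0.5, 3.0, size=2_000)          # a markedly skewed beta sample

g1 = stats.skew(sample, bias=True)
G1 = stats.skew(sample, bias=False)              # bias-adjusted sample skewness
g2 = stats.kurtosis(sample, bias=True)           # excess (Fisher) kurtosis
G2 = stats.kurtosis(sample, bias=False)

print(f"g1 = {g1:.4f}, G1 = {G1:.4f}, g2 = {g2:.4f}, G2 = {G2:.4f}")
```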

+

Maximum likelihood[edit]

+
Two unknown parameters[edit]
+
Max (joint log likelihood/N) for beta distribution maxima at α = β = 2
+
Max (joint log likelihood/N) for Beta distribution maxima at α = β ∈ {0.25,0.5,1,2,4,6,8}
+

As is also the case for maximum likelihood estimates for the gamma distribution, the maximum likelihood estimates for the beta distribution do not have a general closed form solution for arbitrary values of the shape parameters. If X1, ..., XN are independent random variables each having a beta distribution, the joint log likelihood function for N iid observations is: +

+
+

Finding the maximum with respect to a shape parameter involves taking the partial derivative with respect to the shape parameter and setting the expression equal to zero, yielding the maximum likelihood estimator of the shape parameters: +

+
+
+

where: +

+
+
+

since the digamma function denoted ψ(α) is defined as the logarithmic derivative of the gamma function:[17] +

+
+

To ensure that the values with zero tangent slope are indeed a maximum (instead of a saddle-point or a minimum), one also has to satisfy the condition that the curvature is negative. This amounts to requiring that the second partial derivative with respect to the shape parameters be negative +

+
+
+

using the previous equations, this is equivalent to: +

+
+
+

where the trigamma function, denoted ψ1(α), is the second of the polygamma functions, and is defined as the derivative of the digamma function: +

+
+

These conditions are equivalent to stating that the variances of the logarithmically transformed variables are positive, since: +

+
+
+

Therefore, the condition of negative curvature at a maximum is equivalent to the statements: +

+
+
+

Alternatively, the condition of negative curvature at a maximum is also equivalent to stating that the following logarithmic derivatives of the geometric means GX and G(1−X) are positive, since: +

+
+
+

While these slopes are indeed positive, the other slopes are negative: +

+
+

The slopes of the mean and the median with respect to α and β display similar sign behavior. +

From the condition that at a maximum, the partial derivative with respect to the shape parameter equals zero, we obtain the following system of coupled maximum likelihood estimate equations (for the average log-likelihoods) that needs to be inverted to obtain the (unknown) shape parameter estimates in terms of the (known) average of logarithms of the samples X1, ..., XN:[1] +

+
+

where we recognize as the logarithm of the sample geometric mean and as the logarithm of the sample geometric mean based on (1 − X), the mirror-image of X. For , it follows that . +

+
+

These coupled equations containing digamma functions of the shape parameter estimates must be solved by numerical methods as done, for example, by Beckman et al.[46] Gnanadesikan et al. give numerical solutions for a few cases.[47] N. L. Johnson and S. Kotz[1] suggest that for "not too small" shape parameter estimates , the logarithmic approximation to the digamma function may be used to obtain initial values for an iterative solution, since the equations resulting from this approximation can be solved exactly: +

+
+
+

which leads to the following solution for the initial values (of the estimate shape parameters in terms of the sample geometric means) for an iterative solution: +

+
+
+

Alternatively, the estimates provided by the method of moments can instead be used as initial values for an iterative solution of the maximum likelihood coupled equations in terms of the digamma functions. +
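A sketch of this scheme: solve ψ(α̂) − ψ(α̂ + β̂) = mean(ln X) and ψ(β̂) − ψ(α̂ + β̂) = mean(ln(1 − X)) with a generic root finder, seeded by the method-of-moments estimates (this uses SciPy's fsolve rather than any particular algorithm from the cited references; the function name is arbitrary).

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import fsolve

def beta_mle(x):
    """Maximum likelihood estimates of (alpha, beta) for data in (0, 1)."""
    x = np.asarray(x, dtype=float)
    ln_gx = np.log(x).mean()            # log of the sample geometric mean of X
    ln_g1mx = np.log1p(-x).mean()       # log of the geometric mean of (1 - X)

    # Method-of-moments estimates as initial values for the iteration.
    m, v = x.mean(), x.var()
    common = m * (1 - m) / v - 1
    start = (m * common, (1 - m) * common)

    def equations(p):
        a, b = p
        return (digamma(a) - digamma(a + b) - ln_gx,
                digamma(b) - digamma(a + b) - ln_g1mx)

    return fsolve(equations, start)

rng = np.random.default_rng(5)
print(beta_mle(rng.beta(1.7, 4.2, size=20_000)))     # roughly (1.7, 4.2)
```

In practice an equivalent fit is available directly as scipy.stats.beta.fit(data, floc=0, fscale=1), which fixes the location and scale so that only the two shape parameters are estimated.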

When the distribution is required over a known interval other than [0, 1] with random variable X, say [a, c] with random variable Y, then replace ln(Xi) in the first equation with +

+
+

and replace ln(1−Xi) in the second equation with +

+
+

(see "Alternative parametrizations, four parameters" section below). +

If one of the shape parameters is known, the problem is considerably simplified. The following logit transformation can be used to solve for the unknown shape parameter (for skewed cases such that α ≠ β; otherwise, if symmetric, both (equal) parameters are known when one is known): +

+
+

This logit transformation is the logarithm of the transformation that divides the variable X by its mirror-image (X/(1 − X)), resulting in the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI) with support [0, +∞). As previously discussed in the section "Moments of logarithmically transformed random variables," the logit transformation , studied by Johnson,[24] extends the finite support [0, 1] based on the original variable X to infinite support in both directions of the real line (−∞, +∞). +

If, for example, is known, the unknown parameter can be obtained in terms of the inverse[48] digamma function of the right hand side of this equation: +

+
+
+

In particular, if one of the shape parameters has a value of unity, for example for (the power function distribution with bounded support [0,1]), using the identity ψ(x + 1) = ψ(x) + 1/x in the equation , the maximum likelihood estimator for the unknown parameter is exactly:[1] +

+
+

The beta has support [0, 1], therefore , and hence , and therefore +
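For this power-function special case (β = 1) the identity above reduces the estimator to the closed form α̂ = −1/mean(ln X), which can be checked directly (a sketch assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(6)
alpha_true = 3.0
x = rng.beta(alpha_true, 1.0, size=50_000)       # power-function distribution

alpha_hat = -1.0 / np.log(x).mean()              # closed-form MLE when beta = 1
print(alpha_hat)                                  # close to 3.0
```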

In conclusion, the maximum likelihood estimates of the shape parameters of a beta distribution are (in general) a complicated function of the sample geometric mean, and of the sample geometric mean based on (1−X), the mirror-image of X. One may ask, if the variance (in addition to the mean) is necessary to estimate two shape parameters with the method of moments, why is the (logarithmic or geometric) variance not necessary to estimate two shape parameters with the maximum likelihood method, for which only the geometric means suffice? The answer is that the mean does not provide as much information as the geometric mean. For a beta distribution with equal shape parameters α = β, the mean is exactly 1/2, regardless of the value of the shape parameters, and therefore regardless of the value of the statistical dispersion (the variance). On the other hand, the geometric mean of a beta distribution with equal shape parameters α = β depends on the value of the shape parameters, and therefore it contains more information. Also, the geometric mean of a beta distribution does not satisfy the symmetry conditions satisfied by the mean; therefore, by employing both the geometric mean based on X and the geometric mean based on (1 − X), the maximum likelihood method is able to provide the best estimates for both parameters α = β, without the need to employ the variance. +

One can express the joint log likelihood per N iid observations in terms of the sufficient statistics (the sample geometric means) as follows: +

+
+

We can plot the joint log likelihood per N observations for fixed values of the sample geometric means to see the behavior of the likelihood function as a function of the shape parameters α and β. In such a plot, the shape parameter estimators correspond to the maxima of the likelihood function. See the accompanying graph that shows that all the likelihood functions intersect at α = β = 1, which corresponds to the values of the shape parameters that give the maximum entropy (the maximum entropy occurs for shape parameters equal to unity: the uniform distribution). It is evident from the plot that the likelihood function gives sharp peaks for values of the shape parameter estimators close to zero, but that for values of the shape parameters estimators greater than one, the likelihood function becomes quite flat, with less defined peaks. Obviously, the maximum likelihood parameter estimation method for the beta distribution becomes less acceptable for larger values of the shape parameter estimators, as the uncertainty in the peak definition increases with the value of the shape parameter estimators. One can arrive at the same conclusion by noticing that the expression for the curvature of the likelihood function is in terms of the geometric variances +

+
+
+

These variances (and therefore the curvatures) are much larger for small values of the shape parameter α and β. However, for shape parameter values α, β > 1, the variances (and therefore the curvatures) flatten out. Equivalently, this result follows from the Cramér–Rao bound, since the Fisher information matrix components for the beta distribution are these logarithmic variances. The Cramér–Rao bound states that the variance of any unbiased estimator of α is bounded by the reciprocal of the Fisher information: +

+
+
+

so the variance of the estimators increases with increasing α and β, as the logarithmic variances decrease. +

Also one can express the joint log likelihood per N iid observations in terms of the digamma function expressions for the logarithms of the sample geometric means as follows: +

+
+

this expression is identical to the negative of the cross-entropy (see section on "Quantities of information (entropy)"). Therefore, finding the maximum of the joint log likelihood of the shape parameters, per N iid observations, is identical to finding the minimum of the cross-entropy for the beta distribution, as a function of the shape parameters. +

+
+

with the cross-entropy defined as follows: +

+
+
Four unknown parameters[edit]
+

The procedure is similar to the one followed in the two unknown parameter case. If Y1, ..., YN are independent random variables each having a beta distribution with four parameters, the joint log likelihood function for N iid observations is: +

+
+

Finding the maximum with respect to a shape parameter involves taking the partial derivative with respect to the shape parameter and setting the expression equal to zero, yielding the maximum likelihood estimator of the shape parameters: +

+
+
+
+
+

these equations can be re-arranged as the following system of four coupled equations (the first two equations are geometric means and the second two equations are the harmonic means) in terms of the maximum likelihood estimates for the four parameters : +

+
+
+
+
+

with sample geometric means: +

+
+
+

The parameters are embedded inside the geometric mean expressions in a nonlinear way (to the power 1/N). This precludes, in general, a closed form solution, even for an initial value approximation for iteration purposes. One alternative is to use as initial values for iteration the values obtained from the method of moments solution for the four parameter case. Furthermore, the expressions for the harmonic means are well-defined only for , which precludes a maximum likelihood solution for shape parameters less than unity in the four-parameter case. Fisher's information matrix for the four parameter case is positive-definite only for α, β > 2 (for further discussion, see section on Fisher information matrix, four parameter case), for bell-shaped (symmetric or unsymmetric) beta distributions, with inflection points located to either side of the mode. The following Fisher information components (that represent the expectations of the curvature of the log likelihood function) have singularities at the following values: +

+
+
+
+
+

(for further discussion see section on Fisher information matrix). Thus, it is not possible to strictly carry out the maximum likelihood estimation for some well-known distributions belonging to the four-parameter beta distribution family, like the uniform distribution (Beta(1, 1, a, c)), and the arcsine distribution (Beta(1/2, 1/2, a, c)). N. L. Johnson and S. Kotz[1] ignore the equations for the harmonic means and instead suggest "If a and c are unknown, and maximum likelihood estimators of a, c, α and β are required, the above procedure (for the two unknown parameter case, with X transformed as X = (Y − a)/(c − a)) can be repeated using a succession of trial values of a and c, until the pair (a, c) for which maximum likelihood (given a and c) is as great as possible, is attained" (where, for the purpose of clarity, their notation for the parameters has been translated into the present notation). +

+

Fisher information matrix[edit]

+

Let a random variable X have a probability density f(x;α). The partial derivative with respect to the (unknown, and to be estimated) parameter α of the log likelihood function is called the score. The second moment of the score is called the Fisher information: +

+
+

The expectation of the score is zero, therefore the Fisher information is also the second moment centered on the mean of the score: the variance of the score. +

If the log likelihood function is twice differentiable with respect to the parameter α, and under certain regularity conditions,[49] then the Fisher information may also be written as follows (which is often a more convenient form for calculation purposes): +

+
+

Thus, the Fisher information is the negative of the expectation of the second derivative with respect to the parameter α of the log likelihood function. Therefore, Fisher information is a measure of the curvature of the log likelihood function of α. A low-curvature (and therefore high radius of curvature), flatter log likelihood function curve has low Fisher information, while a log likelihood function curve with large curvature (and therefore low radius of curvature) has high Fisher information. When the Fisher information matrix is evaluated at the estimates of the parameters ("the observed Fisher information matrix") it is equivalent to the replacement of the true log likelihood surface by a Taylor's series approximation, taken as far as the quadratic terms.[50] The word information, in the context of Fisher information, refers to information about the parameters: information relevant to estimation, sufficiency and the properties of variances of estimators. The Cramér–Rao bound states that the inverse of the Fisher information is a lower bound on the variance of any estimator of a parameter α: +

+
+

The precision to which one can estimate a parameter α is limited by the Fisher information of the log likelihood function. The Fisher information is a measure of the minimum error involved in estimating a parameter of a distribution, and it can be viewed as a measure of the resolving power of an experiment needed to discriminate between two alternative hypotheses about a parameter.[51] +

When there are N parameters +

+
+

then the Fisher information takes the form of an N×N positive semidefinite symmetric matrix, the Fisher Information Matrix, with typical element: +

+
+

Under certain regularity conditions,[49] the Fisher Information Matrix may also be written in the following form, which is often more convenient for computation: +

+
+

With X1, ..., XN iid random variables, an N-dimensional "box" can be constructed with sides X1, ..., XN. Costa and Cover[52] show that the (Shannon) differential entropy h(X) is related to the volume of the typical set (having the sample entropy close to the true entropy), while the Fisher information is related to the surface of this typical set. +

+
Two parameters[edit]
+

For X1, ..., XN independent random variables each having a beta distribution parametrized with shape parameters α and β, the joint log likelihood function for N iid observations is: +

+
+

therefore the joint log likelihood function per N iid observations is: +

+
+

For the two parameter case, the Fisher information has 4 components: 2 diagonal and 2 off-diagonal. Since the Fisher information matrix is symmetric, one of these off diagonal components is independent. Therefore, the Fisher information matrix has 3 independent components (2 diagonal and 1 off diagonal). +

Aryal and Nadarajah[53] calculated Fisher's information matrix for the four-parameter case, from which the two parameter case can be obtained as follows: +

+
+
+
+

Since the Fisher information matrix is symmetric +

+
+

The Fisher information components are equal to the log geometric variances and log geometric covariance. Therefore, they can be expressed as trigamma functions, denoted ψ1(α), the second of the polygamma functions, defined as the derivative of the digamma function: +

+
+

These derivatives are also derived in the § Two unknown parameters and plots of the log likelihood function are also shown in that section. § Geometric variance and covariance contains plots and further discussion of the Fisher information matrix components: the log geometric variances and log geometric covariance as a function of the shape parameters α and β. § Moments of logarithmically transformed random variables contains formulas for moments of logarithmically transformed random variables. Images for the Fisher information components and are shown in § Geometric variance. +

The determinant of Fisher's information matrix is of interest (for example for the calculation of Jeffreys prior probability). From the expressions for the individual components of the Fisher information matrix, it follows that the determinant of Fisher's (symmetric) information matrix for the beta distribution is: +

+
+

From Sylvester's criterion (checking whether the leading principal minors are all positive), it follows that the Fisher information matrix for the two parameter case is positive-definite (under the standard condition that the shape parameters are positive α > 0 and β > 0). +
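Using the identifications above (the diagonal entries are the log geometric variances ψ1(α) − ψ1(α + β) and ψ1(β) − ψ1(α + β), and the off-diagonal entry is the log geometric covariance −ψ1(α + β)), the two-parameter Fisher information matrix per observation can be written out and tested numerically; a sketch assuming SciPy:

```python
import numpy as np
from scipy.special import polygamma

def beta_fisher_information(a, b):
    """Per-observation Fisher information matrix of Beta(a, b),
    written with trigamma functions psi_1 = polygamma(1, .)."""
    t_a, t_b, t_ab = polygamma(1, a), polygamma(1, b), polygamma(1, a + b)
    return np.array([[t_a - t_ab, -t_ab],
                     [-t_ab,      t_b - t_ab]])

info = beta_fisher_information(2.0, 3.0)
print(info)
print(np.linalg.det(info) > 0, np.all(np.linalg.eigvalsh(info) > 0))   # positive definite
```

The square root of this determinant is, up to normalization, the Jeffreys prior for the beta distribution discussed further below.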

+
Four parameters[edit]
+
Fisher Information I(a,a) for α = β vs range (c − a) and exponent α = β
+
Fisher Information I(α,a) for α = β, vs. range (c − a) and exponent α = β
+

If Y1, ..., YN are independent random variables each having a beta distribution with four parameters: the exponents α and β, and also a (the minimum of the distribution range), and c (the maximum of the distribution range) (section titled "Alternative parametrizations", "Four parameters"), with probability density function: +

+
+

the joint log likelihood function per N iid observations is: +

+
+

For the four parameter case, the Fisher information has 4×4 = 16 components. It has 12 off-diagonal components (16 total − 4 diagonal). Since the Fisher information matrix is symmetric, half of these off-diagonal components (12/2 = 6) are independent. Therefore, the Fisher information matrix has 6 independent off-diagonal + 4 diagonal = 10 independent components. Aryal and Nadarajah[53] calculated Fisher's information matrix for the four parameter case as follows: +

+
+
+
+

In the above expressions, the use of X instead of Y in the expressions var[ln(X)] = ln(varGX) is not an error. The expressions in terms of the log geometric variances and log geometric covariance occur as functions of the two parameter X ~ Beta(α, β) parametrization because when taking the partial derivatives with respect to the exponents (α, β) in the four parameter case, one obtains the identical expressions as for the two parameter case: these terms of the four parameter Fisher information matrix are independent of the minimum a and maximum c of the distribution's range. The only non-zero term upon double differentiation of the log likelihood function with respect to the exponents α and β is the second derivative of the log of the beta function: ln(B(α, β)). This term is independent of the minimum a and maximum c of the distribution's range. Double differentiation of this term results in trigamma functions. The sections titled "Maximum likelihood", "Two unknown parameters" and "Four unknown parameters" also show this fact. +

The Fisher information for N i.i.d. samples is N times the individual Fisher information (eq. 11.279, page 394 of Cover and Thomas[27]). (Aryal and Nadarajah[53] take a single observation, N = 1, to calculate the following components of the Fisher information, which leads to the same result as considering the derivatives of the log likelihood per N observations. Moreover, below the erroneous expression for in Aryal and Nadarajah has been corrected.) +

+
+

The lower two diagonal entries of the Fisher information matrix, with respect to the parameter a (the minimum of the distribution's range): , and with respect to the parameter c (the maximum of the distribution's range): are only defined for exponents α > 2 and β > 2 respectively. The Fisher information matrix component for the minimum a approaches infinity for exponent α approaching 2 from above, and the Fisher information matrix component for the maximum c approaches infinity for exponent β approaching 2 from above. +

The Fisher information matrix for the four parameter case does not depend on the individual values of the minimum a and the maximum c, but only on the total range (c − a). Moreover, the components of the Fisher information matrix that depend on the range (c − a), depend only through its inverse (or the square of the inverse), such that the Fisher information decreases for increasing range (c − a). +

The accompanying images show the Fisher information components and . Images for the Fisher information components and are shown in § Geometric variance. All these Fisher information components look like a basin, with the "walls" of the basin being located at low values of the parameters. +

The following four-parameter-beta-distribution Fisher information components can be expressed in terms of the two-parameter: X ~ Beta(α, β) expectations of the transformed ratio ((1 − X)/X) and of its mirror image (X/(1 − X)), scaled by the range (c − a), which may be helpful for interpretation: +

+
+
+

These are also the expected values of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI) [1] and its mirror image, scaled by the range (c − a). +

Also, the following Fisher information components can be expressed in terms of the harmonic (1/X) variances or of variances based on the ratio transformed variables ((1-X)/X) as follows: +

+
+

See section "Moments of linearly transformed, product and inverted random variables" for these expectations. +

The determinant of Fisher's information matrix is of interest (for example for the calculation of Jeffreys prior probability). From the expressions for the individual components, it follows that the determinant of Fisher's (symmetric) information matrix for the beta distribution with four parameters is: +

+
+

Using Sylvester's criterion (checking whether the leading principal minors are all positive), and since the diagonal components and have singularities at α = 2 and β = 2, it follows that the Fisher information matrix for the four parameter case is positive-definite for α > 2 and β > 2. Since for α > 2 and β > 2 the beta distribution is (symmetric or unsymmetric) bell-shaped, it follows that the Fisher information matrix is positive-definite only for bell-shaped (symmetric or unsymmetric) beta distributions, with inflection points located to either side of the mode. Thus, important well-known distributions belonging to the four-parameter beta distribution family, like the parabolic distribution (Beta(2,2,a,c)) and the uniform distribution (Beta(1,1,a,c)), have Fisher information components () that blow up (approach infinity) in the four-parameter case (although their Fisher information components are all defined for the two parameter case). The four-parameter Wigner semicircle distribution (Beta(3/2,3/2,a,c)) and arcsine distribution (Beta(1/2,1/2,a,c)) have negative Fisher information determinants for the four-parameter case. +

+

Bayesian inference[edit]

+ +
Beta(1, 1): The uniform distribution probability density was proposed by Thomas Bayes to represent ignorance of prior probabilities in Bayesian inference.
+

The use of Beta distributions in Bayesian inference is due to the fact that they provide a family of conjugate prior probability distributions for binomial (including Bernoulli) and geometric distributions. The domain of the beta distribution can be viewed as a probability, and in fact the beta distribution is often used to describe the distribution of a probability value p:[23] +

+
+

Examples of beta distributions used as prior probabilities to represent ignorance of prior parameter values in Bayesian inference are Beta(1,1), Beta(0,0) and Beta(1/2,1/2). +

+

Rule of succession[edit]

+ +

A classic application of the beta distribution is the rule of succession, introduced in the 18th century by Pierre-Simon Laplace[54] in the course of treating the sunrise problem. It states that, given s successes in n conditionally independent Bernoulli trials with probability p, the estimate of the expected value in the next trial is (s + 1)/(n + 2). This estimate is the expected value of the posterior distribution over p, namely Beta(s+1, n−s+1), which is given by Bayes' rule if one assumes a uniform prior probability over p (i.e., Beta(1, 1)) and then observes that p generated s successes in n trials. Laplace's rule of succession has been criticized by prominent scientists. R. T. Cox described Laplace's application of the rule of succession to the sunrise problem ([55] p. 89) as "a travesty of the proper use of the principle". Keynes remarks ([56] Ch.XXX, p. 382) "indeed this is so foolish a theorem that to entertain it is discreditable". Karl Pearson[57] showed that the probability that the next (n + 1) trials will be successes, after n successes in n trials, is only 50%, which has been considered too low by scientists like Jeffreys and unacceptable as a representation of the scientific process of experimentation to test a proposed scientific law. As pointed out by Jeffreys ([58] p. 128) (crediting C. D. Broad[59]) Laplace's rule of succession establishes a high probability of success ((n+1)/(n+2)) in the next trial, but only a moderate probability (50%) that a further sample (n+1) comparable in size will be equally successful. As pointed out by Perks,[60] "The rule of succession itself is hard to accept. It assigns a probability to the next trial which implies the assumption that the actual run observed is an average run and that we are always at the end of an average run. It would, one would think, be more reasonable to assume that we were in the middle of an average run. Clearly a higher value for both probabilities is necessary if they are to accord with reasonable belief." These problems with Laplace's rule of succession motivated Haldane, Perks, Jeffreys and others to search for other forms of prior probability (see the next § Bayesian inference). According to Jaynes,[51] the main problem with the rule of succession is that it is not valid when s = 0 or s = n (see rule of succession, for an analysis of its validity). +
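A brief numerical illustration of the statements above (a sketch assuming SciPy; the helper name is arbitrary): with a Beta prior, the probability that the next m trials all succeed is a ratio of beta functions, and for the uniform Beta(1, 1) prior with s = n and m = n + 1 it is exactly 1/2, as Pearson noted.

```python
import numpy as np
from scipy.special import betaln

def prob_next_m_successes(s, n, m, a_prior=1.0, b_prior=1.0):
    """P(next m trials all succeed | s successes in n trials, Beta prior)."""
    a_post, b_post = s + a_prior, n - s + b_prior
    return np.exp(betaln(a_post + m, b_post) - betaln(a_post, b_post))

n = 10
print((n + 1) / (n + 2))                       # Laplace's rule of succession
print(prob_next_m_successes(n, n, n + 1))      # Pearson's observation: exactly 0.5
```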

+

Bayes-Laplace prior probability (Beta(1,1))[edit]

+

The beta distribution achieves maximum differential entropy for Beta(1,1): the uniform probability density, for which all values in the domain of the distribution have equal density. This uniform distribution Beta(1,1) was suggested ("with a great deal of doubt") by Thomas Bayes[61] as the prior probability distribution to express ignorance about the correct prior distribution. This prior distribution was adopted (apparently, from his writings, with little sign of doubt[54]) by Pierre-Simon Laplace, and hence it was also known as the "Bayes-Laplace rule" or the "Laplace rule" of "inverse probability" in publications of the first half of the 20th century. In the later part of the 19th century and early part of the 20th century, scientists realized that the assumption of uniform "equal" probability density depended on the actual functions (for example whether a linear or a logarithmic scale was most appropriate) and parametrizations used. In particular, the behavior near the ends of distributions with finite support (for example near x = 0, for a distribution with initial support at x = 0) required particular attention. Keynes ([56] Ch.XXX, p. 381) criticized the use of Bayes's uniform prior probability (Beta(1,1)) that all values between zero and one are equiprobable, as follows: "Thus experience, if it shows anything, shows that there is a very marked clustering of statistical ratios in the neighborhoods of zero and unity, of those for positive theories and for correlations between positive qualities in the neighborhood of zero, and of those for negative theories and for correlations between negative qualities in the neighborhood of unity. " +

+

Haldane's prior probability (Beta(0,0))[edit]

+
Beta(0, 0): The Haldane prior probability expressing total ignorance about prior information, where we are not even sure whether it is physically possible for an experiment to yield either a success or a failure. As α, β → 0, the beta distribution approaches a two-point Bernoulli distribution with all probability density concentrated at each end, at 0 and 1, and nothing in between. A coin-toss: one face of the coin being at 0 and the other face being at 1.
+

The Beta(0,0) distribution was proposed by J.B.S. Haldane,[62] who suggested that the prior probability representing complete uncertainty should be proportional to p^(−1)(1−p)^(−1). The function p^(−1)(1−p)^(−1) can be viewed as the limit of the numerator of the beta distribution as both shape parameters approach zero: α, β → 0. The Beta function (in the denominator of the beta distribution) approaches infinity for both parameters approaching zero, α, β → 0. Therefore, p^(−1)(1−p)^(−1) divided by the Beta function approaches a 2-point Bernoulli distribution with equal probability 1/2 at each end, at 0 and 1, and nothing in between, as α, β → 0. A coin-toss: one face of the coin being at 0 and the other face being at 1. The Haldane prior probability distribution Beta(0,0) is an "improper prior" because its integration (from 0 to 1) fails to strictly converge to 1 due to the singularities at each end. However, this is not an issue for computing posterior probabilities unless the sample size is very small. Furthermore, Zellner[63] points out that on the log-odds scale (the logit transformation ln(p/(1−p))), the Haldane prior is the uniformly flat prior. The fact that a uniform prior probability on the logit transformed variable ln(p/(1−p)) (with domain (−∞, ∞)) is equivalent to the Haldane prior on the domain [0, 1] was pointed out by Harold Jeffreys in the first edition (1939) of his book Theory of Probability ([58] p. 123). Jeffreys writes "Certainly if we take the Bayes-Laplace rule right up to the extremes we are led to results that do not correspond to anybody's way of thinking. The (Haldane) rule dx/(x(1−x)) goes too far the other way. It would lead to the conclusion that if a sample is of one type with respect to some property there is a probability 1 that the whole population is of that type." The fact that "uniform" depends on the parametrization led Jeffreys to seek a form of prior that would be invariant under different parametrizations. +

+

Jeffreys' prior probability (Beta(1/2,1/2) for a Bernoulli or for a binomial distribution)[edit]

+ +
Jeffreys prior probability for the beta distribution: the square root of the determinant of Fisher's information matrix: is a function of the trigamma function ψ1 of shape parameters α, β
+
Posterior Beta densities with samples having success = "s", failure = "f" of s/(s + f) = 1/2, and s + f = {3,10,50}, based on 3 different prior probability functions: Haldane (Beta(0,0), Jeffreys (Beta(1/2,1/2)) and Bayes (Beta(1,1)). The image shows that there is little difference between the priors for the posterior with sample size of 50 (with more pronounced peak near p = 1/2). Significant differences appear for very small sample sizes (the flatter distribution for sample size of 3)
+
Posterior Beta densities with samples having success = "s", failure = "f" of s/(s + f) = 1/4, and s + f ∈ {3,10,50}, based on three different prior probability functions: Haldane (Beta(0,0), Jeffreys (Beta(1/2,1/2)) and Bayes (Beta(1,1)). The image shows that there is little difference between the priors for the posterior with sample size of 50 (with more pronounced peak near p = 1/4). Significant differences appear for very small sample sizes (the very skewed distribution for the degenerate case of sample size = 3, in this degenerate and unlikely case the Haldane prior results in a reverse "J" shape with mode at p = 0 instead of p = 1/4. If there is sufficient sampling data, the three priors of Bayes (Beta(1,1)), Jeffreys (Beta(1/2,1/2)) and Haldane (Beta(0,0)) should yield similar posterior probability densities.
+
Posterior Beta densities with samples having success = s, failure = f of s/(s + f) = 1/4, and s + f ∈ {4,12,40}, based on three different prior probability functions: Haldane (Beta(0,0), Jeffreys (Beta(1/2,1/2)) and Bayes (Beta(1,1)). The image shows that there is little difference between the priors for the posterior with sample size of 40 (with more pronounced peak near p = 1/4). Significant differences appear for very small sample sizes
+

Harold Jeffreys[58][64] proposed to use an uninformative prior probability measure that should be invariant under reparameterization: proportional to the square root of the determinant of Fisher's information matrix. For the Bernoulli distribution, this can be shown as follows: for a coin that is "heads" with probability p ∈ [0, 1] and is "tails" with probability 1 − p, for a given (H,T) ∈ {(0,1), (1,0)} the probability is p^H(1 − p)^T. Since T = 1 − H, the Bernoulli distribution is p^H(1 − p)^(1 − H). Considering p as the only parameter, it follows that the log likelihood for the Bernoulli distribution is +

+
+

The Fisher information matrix has only one component (it is a scalar, because there is only one parameter: p), therefore: +

+
+

Similarly, for the Binomial distribution with n Bernoulli trials, it can be shown that +

+
+

Thus, for the Bernoulli and binomial distributions, Jeffreys prior is proportional to 1/√(p(1 − p)), which happens to be proportional to a beta distribution with domain variable x = p, and shape parameters α = β = 1/2, the arcsine distribution: +

+
+

It will be shown in the next section that the normalizing constant for Jeffreys prior is immaterial to the final result because the normalizing constant cancels out in Bayes theorem for the posterior probability. Hence Beta(1/2,1/2) is used as the Jeffreys prior for both Bernoulli and binomial distributions. As shown in the next section, when using this expression as a prior probability times the likelihood in Bayes theorem, the posterior probability turns out to be a beta distribution. It is important to realize, however, that Jeffreys prior is proportional to 1/√(p(1 − p)) for the Bernoulli and binomial distributions, but not for the beta distribution. Jeffreys prior for the beta distribution is given by the square root of the determinant of Fisher's information matrix for the beta distribution, which, as shown in the § Fisher information matrix, is a function of the trigamma function ψ1 of shape parameters α and β as follows: +

+
+

As previously discussed, Jeffreys prior for the Bernoulli and binomial distributions is proportional to the arcsine distribution Beta(1/2,1/2), a one-dimensional curve that looks like a basin as a function of the parameter p of the Bernoulli and binomial distributions. The walls of the basin are formed by p approaching the singularities at the ends p → 0 and p → 1, where Beta(1/2,1/2) approaches infinity. Jeffreys prior for the beta distribution is a 2-dimensional surface (embedded in a three-dimensional space) that looks like a basin with only two of its walls meeting at the corner α = β = 0 (and missing the other two walls) as a function of the shape parameters α and β of the beta distribution. The two adjoining walls of this 2-dimensional surface are formed by the shape parameters α and β approaching the singularities (of the trigamma function) at α, β → 0. It has no walls for α, β → ∞ because in this case the determinant of Fisher's information matrix for the beta distribution approaches zero. +

It will be shown in the next section that Jeffreys prior probability results in posterior probabilities (when multiplied by the binomial likelihood function) that are intermediate between the posterior probability results of the Haldane and Bayes prior probabilities. +

Jeffreys prior may be difficult to obtain analytically, and for some cases it just doesn't exist (even for simple distribution functions like the asymmetric triangular distribution). Berger, Bernardo and Sun, in a 2009 paper[65] defined a reference prior probability distribution that (unlike Jeffreys prior) exists for the asymmetric triangular distribution. They cannot obtain a closed-form expression for their reference prior, but numerical calculations show it to be nearly perfectly fitted by the (proper) prior +

+
+

where θ is the vertex variable for the asymmetric triangular distribution with support [0, 1] (corresponding to the following parameter values in Wikipedia's article on the triangular distribution: vertex c = θ, left end a = 0, and right end b = 1). Berger et al. also give a heuristic argument that Beta(1/2,1/2) could indeed be the exact Berger–Bernardo–Sun reference prior for the asymmetric triangular distribution. Therefore, Beta(1/2,1/2) not only is Jeffreys prior for the Bernoulli and binomial distributions, but also seems to be the Berger–Bernardo–Sun reference prior for the asymmetric triangular distribution (for which the Jeffreys prior does not exist), a distribution used in project management and PERT analysis to describe the cost and duration of project tasks. +

Clarke and Barron[66] prove that, among continuous positive priors, Jeffreys prior (when it exists) asymptotically maximizes Shannon's mutual information between a sample of size n and the parameter, and therefore Jeffreys prior is the most uninformative prior (measuring information as Shannon information). The proof rests on an examination of the Kullback–Leibler divergence between probability density functions for iid random variables. +

+

Effect of different prior probability choices on the posterior beta distribution[edit]

+

If samples are drawn from the population of a random variable X that result in s successes and f failures in "n" Bernoulli trials n = s + f, then the likelihood function for parameters s and f given x = p (the notation x = p in the expressions below will emphasize that the domain x stands for the value of the parameter p in the binomial distribution), is the following binomial distribution: +

+
+

If beliefs about prior probability information are reasonably well approximated by a beta distribution with parameters α Prior and β Prior, then: +

+
+

According to Bayes' theorem for a continuous event space, the posterior probability is given by the product of the prior probability and the likelihood function (given the evidence s and f = n − s), normalized so that the area under the curve equals one, as follows: +

+
+

The binomial coefficient +

+
+

appears both in the numerator and the denominator of the posterior probability, and it does not depend on the integration variable x, hence it cancels out, and it is irrelevant to the final result. Similarly the normalizing factor for the prior probability, the beta function B(αPrior,βPrior) cancels out and it is immaterial to the final result. The same posterior probability result can be obtained if one uses an un-normalized prior +

+
+

because the normalizing factors all cancel out. Several authors (including Jeffreys himself) thus use an un-normalized prior formula since the normalization constant cancels out. The numerator of the posterior probability ends up being just the (un-normalized) product of the prior probability and the likelihood function, and the denominator is its integral from zero to one. The beta function in the denominator, B(s + α Prior, n − s + β Prior), appears as a normalization constant to ensure that the total posterior probability integrates to unity. +

The ratio s/n of the number of successes to the total number of trials is a sufficient statistic in the binomial case, which is relevant for the following results. +

For the Bayes' prior probability (Beta(1,1)), the posterior probability is: +

+
+

For the Jeffreys' prior probability (Beta(1/2,1/2)), the posterior probability is: +

+
+

and for the Haldane prior probability (Beta(0,0)), the posterior probability is: +

+
+

From the above expressions it follows that for s/n = 1/2 all the above three prior probabilities result in the identical location for the posterior probability mean = mode = 1/2. For s/n < 1/2, the means of the posterior probabilities, using these priors, are such that: mean for Bayes prior > mean for Jeffreys prior > mean for Haldane prior. For s/n > 1/2 the order of these inequalities is reversed, such that the Haldane prior probability results in the largest posterior mean. The Haldane prior probability Beta(0,0) results in a posterior probability density with mean (the expected value for the probability of success in the "next" trial) identical to the ratio s/n of the number of successes to the total number of trials. Therefore, the Haldane prior results in a posterior probability with expected value in the next trial equal to the maximum likelihood. The Bayes prior probability Beta(1,1) results in a posterior probability density with mode identical to the ratio s/n (the maximum likelihood). +
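A sketch of the conjugate update and of the ordering of posterior means described above, for a sample with s/n = 1/4 (the Haldane posterior Beta(s, n − s) is proper here because 0 < s < n):

```python
from scipy.stats import beta

s, n = 3, 12                                   # 3 successes in 12 trials, s/n = 1/4
f = n - s

priors = {"Haldane Beta(0,0)":      (0.0, 0.0),
          "Jeffreys Beta(1/2,1/2)": (0.5, 0.5),
          "Bayes Beta(1,1)":        (1.0, 1.0)}

for name, (a0, b0) in priors.items():
    posterior = beta(a0 + s, b0 + f)           # conjugate update
    print(f"{name:24s} posterior mean = {posterior.mean():.4f}")
# For s/n < 1/2 the means are ordered: Bayes > Jeffreys > Haldane (= s/n).
```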

In the case that 100% of the trials have been successful s = n, the Bayes prior probability Beta(1,1) results in a posterior expected value equal to the rule of succession (n + 1)/(n + 2), while the Haldane prior Beta(0,0) results in a posterior expected value of 1 (absolute certainty of success in the next trial). Jeffreys prior probability results in a posterior expected value equal to (n + 1/2)/(n + 1). Perks[60] (p. 303) points out: "This provides a new rule of succession and expresses a 'reasonable' position to take up, namely, that after an unbroken run of n successes we assume a probability for the next trial equivalent to the assumption that we are about half-way through an average run, i.e. that we expect a failure once in (2n + 2) trials. The Bayes–Laplace rule implies that we are about at the end of an average run or that we expect a failure once in (n + 2) trials. The comparison clearly favours the new result (what is now called Jeffreys prior) from the point of view of 'reasonableness'." +

Conversely, in the case that 100% of the trials have resulted in failure (s = 0), the Bayes prior probability Beta(1,1) results in a posterior expected value for success in the next trial equal to 1/(n + 2), while the Haldane prior Beta(0,0) results in a posterior expected value of success in the next trial of 0 (absolute certainty of failure in the next trial). Jeffreys prior probability results in a posterior expected value for success in the next trial equal to (1/2)/(n + 1), which Perks[60] (p. 303) points out: "is a much more reasonably remote result than the Bayes-Laplace result 1/(n + 2)". +

Jaynes[51] questions the use of these formulas for the cases s = 0 or s = n, because the corresponding integrals do not converge (the Haldane prior Beta(0,0) yields an improper posterior, Beta(s, n − s), when s = 0 or s = n). In practice, the conditions 0 < s < n necessary for a mode to exist between both ends for the Bayes prior are usually met, and therefore the Bayes prior (as long as 0 < s < n) results in a posterior mode located between both ends of the domain.

As remarked in the section on the rule of succession, K. Pearson showed that after n successes in n trials the posterior probability (based on the Bayes Beta(1,1) distribution as the prior probability) that the next (n + 1) trials will all be successes is exactly 1/2, whatever the value of n. Based on the Haldane Beta(0,0) distribution as the prior probability, this posterior probability is 1 (absolute certainty that after n successes in n trials the next (n + 1) trials will all be successes). Perks[60] (p. 303) shows that, for what is now known as the Jeffreys prior, this probability is ((n + 1/2)/(n + 1))((n + 3/2)/(n + 2))...(2n + 1/2)/(2n + 1), which for n = 1, 2, 3 gives 15/24, 315/480, 9009/13440; rapidly approaching a limiting value of 1/√2 ≈ 0.7071 as n tends to infinity. Perks remarks that what is now known as the Jeffreys prior: "is clearly more 'reasonable' than either the Bayes-Laplace result or the result on the (Haldane) alternative rule rejected by Jeffreys which gives certainty as the probability. It clearly provides a very much better correspondence with the process of induction. Whether it is 'absolutely' reasonable for the purpose, i.e. whether it is yet large enough, without the absurdity of reaching unity, is a matter for others to decide. But it must be realized that the result depends on the assumption of complete indifference and absence of knowledge prior to the sampling experiment."

Following are the variances of the posterior distribution obtained with these three prior probability distributions: +

for the Bayes' prior probability (Beta(1,1)), the posterior variance is: +

variance = (s + 1)(n − s + 1) / ((n + 2)^2 (n + 3)),  which for s = n/2 simplifies to 1/(4n + 12)

for the Jeffreys' prior probability (Beta(1/2,1/2)), the posterior variance is: +

variance = (s + 1/2)(n − s + 1/2) / ((n + 1)^2 (n + 2)),  which for s = n/2 simplifies to 1/(4n + 8)

and for the Haldane prior probability (Beta(0,0)), the posterior variance is: +

variance = s(n − s) / (n^2 (n + 1)),  which for s = n/2 simplifies to 1/(4n + 4)

So, as remarked by Silvey,[49] for large n, the variance is small and hence the posterior distribution is highly concentrated, whereas the assumed prior distribution was very diffuse. This is in accord with what one would hope for, as vague prior knowledge is transformed (through Bayes theorem) into a more precise posterior knowledge by an informative experiment. For small n the Haldane Beta(0,0) prior results in the largest posterior variance while the Bayes Beta(1,1) prior results in the more concentrated posterior. Jeffreys prior Beta(1/2,1/2) results in a posterior variance in between the other two. As n increases, the variance rapidly decreases so that the posterior variance for all three priors converges to approximately the same value (approaching zero variance as n → ∞). Recalling the previous result that the Haldane prior probability Beta(0,0) results in a posterior probability density with mean (the expected value for the probability of success in the "next" trial) identical to the ratio s/n of the number of successes to the total number of trials, it follows from the above expression that also the Haldane prior Beta(0,0) results in a posterior with variance identical to the variance expressed in terms of the max. likelihood estimate s/n and sample size (in § Variance): +

variance = μ(1 − μ) / (1 + ν) = (s/n)(1 − s/n) / (1 + n)

with the mean μ = s/n and the sample size ν = n. +
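The orderings of the posterior means and variances described above are easy to check numerically. A small sketch (s and n are chosen arbitrarily) that also verifies the Haldane-prior identity variance = μ(1 − μ)/(1 + n):

#include <iostream>
#include <string>

// Posterior mean and variance of Beta(s + a, n - s + b) for the three
// reference priors discussed above. The values of s and n are illustrative.
int main()
{
    const double n = 10.0, s = 3.0;
    const struct { std::string name; double a, b; } priors[] = {
        {"Bayes    Beta(1,1)",     1.0, 1.0},
        {"Jeffreys Beta(1/2,1/2)", 0.5, 0.5},
        {"Haldane  Beta(0,0)",     0.0, 0.0},
    };
    for (const auto& p : priors) {
        const double a = p.a + s, b = p.b + (n - s);
        const double mean = a / (a + b);
        const double var  = a * b / ((a + b) * (a + b) * (a + b + 1.0));
        std::cout << p.name << ": mean = " << mean << ", variance = " << var << '\n';
    }
    // For the Haldane prior the variance equals mu(1 - mu)/(1 + n) with mu = s/n:
    const double mu = s / n;
    std::cout << "mu(1-mu)/(1+n) = " << mu * (1.0 - mu) / (1.0 + n) << '\n';
}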

In Bayesian inference, using a prior distribution Beta(αPrior,βPrior) prior to a binomial distribution is equivalent to adding (αPrior − 1) pseudo-observations of "success" and (βPrior − 1) pseudo-observations of "failure" to the actual number of successes and failures observed, then estimating the parameter p of the binomial distribution by the proportion of successes over both real- and pseudo-observations. A uniform prior Beta(1,1) does not add (or subtract) any pseudo-observations since for Beta(1,1) it follows that (αPrior − 1) = 0 and (βPrior − 1) = 0. The Haldane prior Beta(0,0) subtracts one pseudo observation from each and Jeffreys prior Beta(1/2,1/2) subtracts 1/2 pseudo-observation of success and an equal number of failure. This subtraction has the effect of smoothing out the posterior distribution. If the proportion of successes is not 50% (s/n ≠ 1/2) values of αPrior and βPrior less than 1 (and therefore negative (αPrior − 1) and (βPrior − 1)) favor sparsity, i.e. distributions where the parameter p is closer to either 0 or 1. In effect, values of αPrior and βPrior between 0 and 1, when operating together, function as a concentration parameter. +

The accompanying plots show the posterior probability density functions for sample sizes n ∈ {3,10,50}, successes s ∈ {n/2,n/4} and Beta(αPrior,βPrior) ∈ {Beta(0,0),Beta(1/2,1/2),Beta(1,1)}. Also shown are the cases for n = {4,12,40}, success s = {n/4} and Beta(αPrior,βPrior) ∈ {Beta(0,0),Beta(1/2,1/2),Beta(1,1)}. The first plot shows the symmetric cases, for successes s ∈ {n/2}, with mean = mode = 1/2 and the second plot shows the skewed cases s ∈ {n/4}. The images show that there is little difference between the priors for the posterior with sample size of 50 (characterized by a more pronounced peak near p = 1/2). Significant differences appear for very small sample sizes (in particular for the flatter distribution for the degenerate case of sample size = 3). Therefore, the skewed cases, with successes s = {n/4}, show a larger effect from the choice of prior, at small sample size, than the symmetric cases. For symmetric distributions, the Bayes prior Beta(1,1) results in the most "peaky" and highest posterior distributions and the Haldane prior Beta(0,0) results in the flattest and lowest peak distribution. The Jeffreys prior Beta(1/2,1/2) lies in between them. For nearly symmetric, not too skewed distributions the effect of the priors is similar. For very small sample size (in this case for a sample size of 3) and skewed distribution (in this example for s ∈ {n/4}) the Haldane prior can result in a reverse-J-shaped distribution with a singularity at the left end. However, this happens only in degenerate cases (in this example n = 3 and hence s = 3/4 < 1, a degenerate value because s should be greater than unity in order for the posterior of the Haldane prior to have a mode located between the ends, and because s = 3/4 is not an integer number, hence it violates the initial assumption of a binomial distribution for the likelihood) and it is not an issue in generic cases of reasonable sample size (such that the condition 1 < s < n − 1, necessary for a mode to exist between both ends, is fulfilled). +

In Chapter 12 (p. 385) of his book, Jaynes[51] asserts that the Haldane prior Beta(0,0) describes a prior state of knowledge of complete ignorance, where we are not even sure whether it is physically possible for an experiment to yield either a success or a failure, while the Bayes (uniform) prior Beta(1,1) applies if one knows that both binary outcomes are possible. Jaynes states: "interpret the Bayes-Laplace (Beta(1,1)) prior as describing not a state of complete ignorance, but the state of knowledge in which we have observed one success and one failure...once we have seen at least one success and one failure, then we know that the experiment is a true binary one, in the sense of physical possibility." Jaynes [51] does not specifically discuss Jeffreys prior Beta(1/2,1/2) (Jaynes discussion of "Jeffreys prior" on pp. 181, 423 and on chapter 12 of Jaynes book[51] refers instead to the improper, un-normalized, prior "1/p dp" introduced by Jeffreys in the 1939 edition of his book,[58] seven years before he introduced what is now known as Jeffreys' invariant prior: the square root of the determinant of Fisher's information matrix. "1/p" is Jeffreys' (1946) invariant prior for the exponential distribution, not for the Bernoulli or binomial distributions). However, it follows from the above discussion that Jeffreys Beta(1/2,1/2) prior represents a state of knowledge in between the Haldane Beta(0,0) and Bayes Beta (1,1) prior. +

Similarly, Karl Pearson in his 1892 book The Grammar of Science[67][68] (p. 144 of 1900 edition) maintained that the Bayes (Beta(1,1) uniform prior was not a complete ignorance prior, and that it should be used when prior information justified to "distribute our ignorance equally"". K. Pearson wrote: "Yet the only supposition that we appear to have made is this: that, knowing nothing of nature, routine and anomy (from the Greek ανομία, namely: a- "without", and nomos "law") are to be considered as equally likely to occur. Now we were not really justified in making even this assumption, for it involves a knowledge that we do not possess regarding nature. We use our experience of the constitution and action of coins in general to assert that heads and tails are equally probable, but we have no right to assert before experience that, as we know nothing of nature, routine and breach are equally probable. In our ignorance we ought to consider before experience that nature may consist of all routines, all anomies (normlessness), or a mixture of the two in any proportion whatever, and that all such are equally probable. Which of these constitutions after experience is the most probable must clearly depend on what that experience has been like." +

If there is sufficient sampling data, and the posterior probability mode is not located at one of the extremes of the domain (x=0 or x=1), the three priors of Bayes (Beta(1,1)), Jeffreys (Beta(1/2,1/2)) and Haldane (Beta(0,0)) should yield similar posterior probability densities. Otherwise, as Gelman et al.[69] (p. 65) point out, "if so few data are available that the choice of noninformative prior distribution makes a difference, one should put relevant information into the prior distribution", or as Berger[4] (p. 125) points out "when different reasonable priors yield substantially different answers, can it be right to state that there is a single answer? Would it not be better to admit that there is scientific uncertainty, with the conclusion depending on prior beliefs?." +

+

Occurrence and applications

+

Order statistics

+ +

The beta distribution has an important application in the theory of order statistics. A basic result is that the distribution of the kth smallest of a sample of size n from a continuous uniform distribution has a beta distribution.[38] This result is summarized as: +

U_(k) ~ Beta(k, n + 1 − k)

From this, and application of the theory related to the probability integral transform, the distribution of any individual order statistic from any continuous distribution can be derived.[38] +
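As a quick illustration of this result, the following sketch (sample size, k, seed and trial count are arbitrary) compares the empirical mean of the kth smallest of n uniform variates with the Beta(k, n + 1 − k) mean k/(n + 1):

#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

// Monte Carlo check that the k-th smallest of n Uniform(0,1) draws behaves
// like a Beta(k, n+1-k) variable; here we only compare the empirical mean
// with the theoretical Beta mean k/(n+1). Constants are arbitrary.
int main()
{
    const int n = 10, k = 3, trials = 200000;
    std::mt19937 rng(12345);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    double sum = 0.0;
    std::vector<double> sample(n);
    for (int t = 0; t < trials; ++t) {
        for (double& x : sample) x = unif(rng);
        std::nth_element(sample.begin(), sample.begin() + (k - 1), sample.end());
        sum += sample[k - 1];                 // k-th smallest value
    }
    std::cout << "empirical mean of k-th order statistic: " << sum / trials << '\n';
    std::cout << "Beta(k, n+1-k) mean k/(n+1):            " << double(k) / (n + 1) << '\n';
}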

+

Subjective logic

+ +

In standard logic, propositions are considered to be either true or false. In contradistinction, subjective logic assumes that humans cannot determine with absolute certainty whether a proposition about the real world is absolutely true or false. In subjective logic the a posteriori probability estimates of binary events can be represented by beta distributions.[70]

+

Wavelet analysis

+ +

A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to zero. It can typically be visualized as a "brief oscillation" that promptly decays. Wavelets can be used to extract information from many different kinds of data, including – but certainly not limited to – audio signals and images. Thus, wavelets are purposefully crafted to have specific properties that make them useful for signal processing. Wavelets are localized in both time and frequency whereas the standard Fourier transform is only localized in frequency. Therefore, standard Fourier Transforms are only applicable to stationary processes, while wavelets are applicable to non-stationary processes. Continuous wavelets can be constructed based on the beta distribution. Beta wavelets[71] can be viewed as a soft variety of Haar wavelets whose shape is fine-tuned by two shape parameters α and β. +

+

Population genetics

+ + +

The Balding–Nichols model is a two-parameter parametrization of the beta distribution used in population genetics.[72] It is a statistical description of the allele frequencies in the components of a sub-divided population: +

allele frequency ~ Beta(μν, (1 − μ)ν)

where ν = α + β = (1 − F)/F and μ = α/(α + β) is the ancestral (mean) allele frequency; here F is (Wright's) genetic distance between two populations.
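Assuming the (μ, F) parametrization given above, the corresponding beta shape parameters can be computed as in the following sketch (the numeric values are illustrative only, not from the cited reference):

#include <iostream>

// Map the Balding-Nichols parameters (p = ancestral allele frequency,
// F = Wright's genetic distance) to beta shape parameters, assuming
// alpha = p(1-F)/F and beta = (1-p)(1-F)/F as stated above.
int main()
{
    const double p = 0.3, F = 0.05;            // illustrative values
    const double nu = (1.0 - F) / F;           // nu = alpha + beta
    const double alpha = p * nu;
    const double beta  = (1.0 - p) * nu;

    std::cout << "alpha = " << alpha << ", beta = " << beta << '\n';
    std::cout << "mean  = " << alpha / (alpha + beta)           // equals p
              << ", variance = " << F * p * (1.0 - p) << '\n';  // beta variance simplifies to F p (1 - p)
}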

+

Project management: task cost and schedule modeling

+

The beta distribution can be used to model events which are constrained to take place within an interval defined by a minimum and maximum value. For this reason, the beta distribution — along with the triangular distribution — is used extensively in PERT, critical path method (CPM), Joint Cost Schedule Modeling (JCSM) and other project management/control systems to describe the time to completion and the cost of a task. In project management, shorthand computations are widely used to estimate the mean and standard deviation of the beta distribution:[37] +

μ(X) = (a + 4b + c) / 6
σ(X) = (c − a) / 6

where a is the minimum, c is the maximum, and b is the most likely value (the mode for α > 1 and β > 1). +

The above estimate for the mean is known as the PERT three-point estimation and it is exact for either of the following values of β (for arbitrary α within these ranges): +

+
β = α > 1 (symmetric case) with standard deviation σ(X) = (c − a)/(2√(2α + 1)), skewness = 0, and excess kurtosis = −6/(2α + 3)
+

+

or +

+
β = 6 − α for 5 > α > 1 (skewed case) with standard deviation

σ(X) = (c − a)√(α(6 − α)) / (6√7),

skewness = (3 − α)√7 / (2√(α(6 − α))), and excess kurtosis = (7(α − 3)^2 − 2α(6 − α)) / (3α(6 − α))

+

The above estimate for the standard deviation σ(X) = (ca)/6 is exact for either of the following values of α and β: +

+
α = β = 4 (symmetric) with skewness = 0, and excess kurtosis = −6/11.
+
β = 6 − α and α = 3 − √2 (right-tailed, positive skew) with skewness = 1/√2, and excess kurtosis = 0
+
β = 6 − α and α = 3 + √2 (left-tailed, negative skew) with skewness = −1/√2, and excess kurtosis = 0
+

+

Otherwise, these can be poor approximations for beta distributions with other values of α and β, exhibiting average errors of 40% in the mean and 549% in the variance.[73][74][75] +
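For reference, the shorthand PERT computations described above amount to the following sketch (the inputs a, b, c are placeholders):

#include <iostream>

// PERT three-point shorthand: mean ~ (a + 4b + c)/6, sd ~ (c - a)/6,
// where a = minimum, b = most likely (mode), c = maximum. Inputs are placeholders.
int main()
{
    const double a = 2.0, b = 5.0, c = 14.0;   // optimistic, most likely, pessimistic estimates
    const double mean  = (a + 4.0 * b + c) / 6.0;
    const double sigma = (c - a) / 6.0;
    std::cout << "PERT mean estimate: " << mean << "\n"
              << "PERT sd estimate:   " << sigma << '\n';
}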

+

Random variate generation

+ +

If X and Y are independent, with X ~ Gamma(α, θ) and Y ~ Gamma(β, θ), then

X / (X + Y) ~ Beta(α, β)

So one algorithm for generating beta variates is to generate X/(X + Y), where X is a gamma variate with parameters (α, 1) and Y is an independent gamma variate with parameters (β, 1).[76] In fact, here X/(X + Y) and X + Y are independent, and X + Y ~ Gamma(α + β, 1). If Z ~ Gamma(γ, 1) and Z is independent of X and Y, then (X + Y)/(X + Y + Z) ~ Beta(α + β, γ) and (X + Y)/(X + Y + Z) is independent of X/(X + Y). This shows that the product of independent Beta(α + β, γ) and Beta(α, β) random variables is a Beta(α, β + γ) random variable.
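This gamma-ratio recipe maps directly onto std::gamma_distribution from the C++ standard library; a minimal sketch (the parameter values and the number of draws are arbitrary):

#include <iostream>
#include <random>

// Generate a Beta(alpha, beta) variate as X/(X+Y) with X ~ Gamma(alpha,1)
// and Y ~ Gamma(beta,1), using std::gamma_distribution. Values are arbitrary.
int main()
{
    const double alpha = 2.0, beta = 5.0;
    std::mt19937 rng(std::random_device{}());
    std::gamma_distribution<double> gx(alpha, 1.0), gy(beta, 1.0);

    double sum = 0.0;
    const int draws = 100000;
    for (int i = 0; i < draws; ++i) {
        const double x = gx(rng), y = gy(rng);
        sum += x / (x + y);                       // one Beta(alpha, beta) variate
    }
    std::cout << "empirical mean: " << sum / draws
              << "  (theoretical alpha/(alpha+beta) = " << alpha / (alpha + beta) << ")\n";
}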

Also, the kth order statistic of n uniformly distributed variates is Beta(k, n + 1 − k), so an alternative if α and β are small integers is to generate α + β − 1 uniform variates and choose the α-th smallest.[38]

Another way to generate the Beta distribution is by the Pólya urn model. According to this method, one starts with an "urn" containing α "black" balls and β "white" balls and draws uniformly with replacement. After every trial an additional ball is added according to the color of the last ball that was drawn. Asymptotically, the proportion of black to white balls will be distributed according to the Beta distribution, where each repetition of the experiment will produce a different value.
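A small simulation sketch of the Pólya urn scheme just described (integer α and β are assumed, and the number of steps is arbitrary); each run of the loop produces one approximate draw from Beta(α, β):

#include <iostream>
#include <random>

// Polya urn: start with alpha black and beta white balls; repeatedly draw a
// ball at random and put it back together with one extra ball of the same
// colour. The long-run fraction of black balls is one Beta(alpha, beta) draw.
int main()
{
    const int alpha = 2, beta = 5, steps = 100000;   // arbitrary illustration
    std::mt19937 rng(std::random_device{}());
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    double black = alpha, white = beta;
    for (int i = 0; i < steps; ++i) {
        if (unif(rng) < black / (black + white)) black += 1.0;
        else                                     white += 1.0;
    }
    std::cout << "limiting fraction of black balls (one approximate Beta(2,5) sample): "
              << black / (black + white) << '\n';
}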

It is also possible to use the inverse transform sampling. +

+

Normal approximation to the Beta distribution

+

A beta distribution with α ≈ β and α, β ≫ 1 is approximately normal with mean 1/2 and variance 1/(4(2α + 1)). If α ≥ β, the normal approximation can be improved by taking the cube-root of the logarithm of the reciprocal of X.[77]

+

History

+

Thomas Bayes, in a posthumous paper [61] published in 1763 by Richard Price, obtained a beta distribution as the density of the probability of success in Bernoulli trials (see § Applications, Bayesian inference), but the paper does not analyze any of the moments of the beta distribution or discuss any of its properties. +

+
Karl Pearson analyzed the beta distribution as the solution Type I of Pearson distributions
+

The first systematic modern discussion of the beta distribution is probably due to Karl Pearson.[78][79] In Pearson's papers[20][32] the beta distribution is couched as a solution of a differential equation: Pearson's Type I distribution which it is essentially identical to except for arbitrary shifting and re-scaling (the beta and Pearson Type I distributions can always be equalized by proper choice of parameters). In fact, in several English books and journal articles in the few decades prior to World War II, it was common to refer to the beta distribution as Pearson's Type I distribution. William P. Elderton in his 1906 monograph "Frequency curves and correlation"[41] further analyzes the beta distribution as Pearson's Type I distribution, including a full discussion of the method of moments for the four parameter case, and diagrams of (what Elderton describes as) U-shaped, J-shaped, twisted J-shaped, "cocked-hat" shapes, horizontal and angled straight-line cases. Elderton wrote "I am chiefly indebted to Professor Pearson, but the indebtedness is of a kind for which it is impossible to offer formal thanks." Elderton in his 1906 monograph [41] provides an impressive amount of information on the beta distribution, including equations for the origin of the distribution chosen to be the mode, as well as for other Pearson distributions: types I through VII. Elderton also included a number of appendixes, including one appendix ("II") on the beta and gamma functions. In later editions, Elderton added equations for the origin of the distribution chosen to be the mean, and analysis of Pearson distributions VIII through XII. +

As remarked by Bowman and Shenton[43] "Fisher and Pearson had a difference of opinion in the approach to (parameter) estimation, in particular relating to (Pearson's method of) moments and (Fisher's method of) maximum likelihood in the case of the Beta distribution." Also according to Bowman and Shenton, "the case of a Type I (beta distribution) model being the center of the controversy was pure serendipity. A more difficult model of 4 parameters would have been hard to find." The long running public conflict of Fisher with Karl Pearson can be followed in a number of articles in prestigious journals. For example, concerning the estimation of the four parameters for the beta distribution, and Fisher's criticism of Pearson's method of moments as being arbitrary, see Pearson's article "Method of moments and method of maximum likelihood" [44] (published three years after his retirement from University College, London, where his position had been divided between Fisher and Pearson's son Egon) in which Pearson writes "I read (Koshai's paper in the Journal of the Royal Statistical Society, 1933) which as far as I am aware is the only case at present published of the application of Professor Fisher's method. To my astonishment that method depends on first working out the constants of the frequency curve by the (Pearson) Method of Moments and then superposing on it, by what Fisher terms "the Method of Maximum Likelihood" a further approximation to obtain, what he holds, he will thus get, "more efficient values" of the curve constants". +

David and Edwards's treatise on the history of statistics[80] cites the first modern treatment of the beta distribution, in 1911,[81] using the beta designation that has become standard, due to Corrado Gini, an Italian statistician, demographer, and sociologist, who developed the Gini coefficient. N.L.Johnson and S.Kotz, in their comprehensive and very informative monograph[82] on leading historical personalities in statistical sciences credit Corrado Gini[83] as "an early Bayesian...who dealt with the problem of eliciting the parameters of an initial Beta distribution, by singling out techniques which anticipated the advent of the so-called empirical Bayes approach." +

+

References

+
+
    +
  1. ^ a b c d e f g h i j k l m n o p q r s t u v w x y Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1995). "Chapter 25:Beta Distributions". Continuous Univariate Distributions Vol. 2 (2nd ed.). Wiley. ISBN 978-0-471-58494-0. +
  2. +
  3. ^ a b Rose, Colin; Smith, Murray D. (2002). Mathematical Statistics with MATHEMATICA. Springer. ISBN 978-0387952345. +
  4. +
  5. ^ a b c Kruschke, John K. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS. Academic Press / Elsevier. p. 83. ISBN 978-0123814852. +
  6. +
  7. ^ a b Berger, James O. (2010). Statistical Decision Theory and Bayesian Analysis (2nd ed.). Springer. ISBN 978-1441930743. +
  8. +
  9. ^ a b c d Feller, William (1971). An Introduction to Probability Theory and Its Applications, Vol. 2. Wiley. ISBN 978-0471257097. +
  10. +
  11. ^ Kruschke, John K. (2015). Doing Bayesian Data Analysis: A Tutorial with R, JAGS and Stan. Academic Press / Elsevier. ISBN 978-0-12-405888-0. +
  12. +
  13. ^ a b Wadsworth, George P. and Joseph Bryan (1960). Introduction to Probability and Random Variables. McGraw-Hill. +
  14. +
  15. ^ a b c d e f g Gupta, Arjun K., ed. (2004). Handbook of Beta Distribution and Its Applications. CRC Press. ISBN 978-0824753962. +
  16. +
  17. ^ a b Kerman J (2011) "A closed-form approximation for the median of the beta distribution". arXiv:1111.0433v1 +
  18. +
  19. ^ Mosteller, Frederick and John Tukey (1977). Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley Pub. Co. Bibcode:1977dars.book.....M. ISBN 978-0201048544. +
  20. +
  21. ^ a b Feller, William (1968). An Introduction to Probability Theory and Its Applications. Vol. 1 (3rd ed.). ISBN 978-0471257080. +
  22. +
  23. ^ Philip J. Fleming and John J. Wallace. How not to lie with statistics: the correct way to summarize benchmark results. Communications of the ACM, 29(3):218–221, March 1986. +
  24. +
  25. ^ "NIST/SEMATECH e-Handbook of Statistical Methods 1.3.6.6.17. Beta Distribution". National Institute of Standards and Technology Information Technology Laboratory. April 2012. Retrieved May 31, 2016. +
  26. +
  27. ^ Oguamanam, D.C.D.; Martin, H. R.; Huissoon, J. P. (1995). "On the application of the beta distribution to gear damage analysis". Applied Acoustics. 45 (3): 247–261. doi:10.1016/0003-682X(95)00001-P. +
  28. +
  29. ^ Zhiqiang Liang; Jianming Wei; Junyu Zhao; Haitao Liu; Baoqing Li; Jie Shen; Chunlei Zheng (27 August 2008). "The Statistical Meaning of Kurtosis and Its New Application to Identification of Persons Based on Seismic Signals". Sensors. 8 (8): 5106–5119. Bibcode:2008Senso...8.5106L. doi:10.3390/s8085106. PMC 3705491. PMID 27873804. +
  30. +
  31. ^ Kenney, J. F., and E. S. Keeping (1951). Mathematics of Statistics Part Two, 2nd edition. D. Van Nostrand Company Inc.{{cite book}}: CS1 maint: multiple names: authors list (link) +
  32. +
  33. ^ a b c d Abramowitz, Milton and Irene A. Stegun (1965). Handbook Of Mathematical Functions With Formulas, Graphs, And Mathematical Tables. Dover. ISBN 978-0-486-61272-0. +
  34. +
  35. ^ Weisstein., Eric W. "Kurtosis". MathWorld--A Wolfram Web Resource. Retrieved 13 August 2012. +
  36. +
  37. ^ a b Panik, Michael J (2005). Advanced Statistics from an Elementary Point of View. Academic Press. ISBN 978-0120884940. +
  38. +
  39. ^ a b c d e f Pearson, Karl (1916). "Mathematical contributions to the theory of evolution, XIX: Second supplement to a memoir on skew variation". Philosophical Transactions of the Royal Society A. 216 (538–548): 429–457. Bibcode:1916RSPTA.216..429P. doi:10.1098/rsta.1916.0009. JSTOR 91092. +
  40. +
  41. ^ Gradshteyn, Izrail Solomonovich; Ryzhik, Iosif Moiseevich; Geronimus, Yuri Veniaminovich; Tseytlin, Michail Yulyevich; Jeffrey, Alan (2015) [October 2014]. Zwillinger, Daniel; Moll, Victor Hugo (eds.). Table of Integrals, Series, and Products. Translated by Scripta Technica, Inc. (8 ed.). Academic Press, Inc. ISBN 978-0-12-384933-5. LCCN 2014010276. +
  42. +
  43. ^ Billingsley, Patrick (1995). "30". Probability and measure (3rd ed.). Wiley-Interscience. ISBN 978-0-471-00710-4. +
  44. +
  45. ^ a b MacKay, David (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press; First Edition. Bibcode:2003itil.book.....M. ISBN 978-0521642989. +
  46. +
  47. ^ a b Johnson, N.L. (1949). "Systems of frequency curves generated by methods of translation" (PDF). Biometrika. 36 (1–2): 149–176. doi:10.1093/biomet/36.1-2.149. hdl:10338.dmlcz/135506. PMID 18132090. +
  48. +
  49. ^ Verdugo Lazo, A. C. G.; Rathie, P. N. (1978). "On the entropy of continuous probability distributions". IEEE Trans. Inf. Theory. 24 (1): 120–122. doi:10.1109/TIT.1978.1055832. +
  50. +
  51. ^ Shannon, Claude E. (1948). "A Mathematical Theory of Communication". Bell System Technical Journal. 27 (4): 623–656. doi:10.1002/j.1538-7305.1948.tb01338.x. +
  52. +
  53. ^ a b c Cover, Thomas M. and Joy A. Thomas (2006). Elements of Information Theory 2nd Edition (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience; 2 edition. ISBN 978-0471241959. +
  54. +
  55. ^ Plunkett, Kim, and Jeffrey Elman (1997). Exercises in Rethinking Innateness: A Handbook for Connectionist Simulations (Neural Network Modeling and Connectionism). A Bradford Book. p. 166. ISBN 978-0262661058.{{cite book}}: CS1 maint: multiple names: authors list (link) +
  56. +
  57. ^ Nallapati, Ramesh (2006). The smoothed dirichlet distribution: understanding cross-entropy ranking in information retrieval (Thesis). Computer Science Dept., University of Massachusetts Amherst. +
  58. +
  59. ^ a b Pearson, Egon S. (July 1969). "Some historical reflections traced through the development of the use of frequency curves". THEMIS Statistical Analysis Research Program, Technical Report 38. Office of Naval Research, Contract N000014-68-A-0515 (Project NR 042–260). +
  60. +
  61. ^ Hahn, Gerald J.; Shapiro, S. (1994). Statistical Models in Engineering (Wiley Classics Library). Wiley-Interscience. ISBN 978-0471040651. +
  62. +
  63. ^ a b Pearson, Karl (1895). "Contributions to the mathematical theory of evolution, II: Skew variation in homogeneous material". Philosophical Transactions of the Royal Society. 186: 343–414. Bibcode:1895RSPTA.186..343P. doi:10.1098/rsta.1895.0010. JSTOR 90649. +
  64. +
  65. ^ Buchanan, K.; Rockway, J.; Sternberg, O.; Mai, N. N. (May 2016). "Sum-difference beamforming for radar applications using circularly tapered random arrays". 2016 IEEE Radar Conference (RadarConf). pp. 1–5. doi:10.1109/RADAR.2016.7485289. ISBN 978-1-5090-0863-6. S2CID 32525626. +
  66. +
  67. ^ Buchanan, K.; Flores, C.; Wheeland, S.; Jensen, J.; Grayson, D.; Huff, G. (May 2017). "Transmit beamforming for radar applications using circularly tapered random arrays". 2017 IEEE Radar Conference (RadarConf). pp. 0112–0117. doi:10.1109/RADAR.2017.7944181. ISBN 978-1-4673-8823-8. S2CID 38429370. +
  68. +
  69. ^ Ryan, Buchanan, Kristopher (2014-05-29). "Theory and Applications of Aperiodic (Random) Phased Arrays". {{cite journal}}: Cite journal requires |journal= (help)CS1 maint: multiple names: authors list (link) +
  70. +
  71. ^ Herrerías-Velasco, José Manuel and Herrerías-Pleguezuelo, Rafael and René van Dorp, Johan. (2011). Revisiting the PERT mean and Variance. European Journal of Operational Research (210), p. 448–451. +
  72. +
  73. ^ a b Malcolm, D. G.; Roseboom, J. H.; Clark, C. E.; Fazar, W. (September–October 1958). "Application of a Technique for Research and Development Program Evaluation". Operations Research. 7 (5): 646–669. doi:10.1287/opre.7.5.646. ISSN 0030-364X. +
  74. +
  75. ^ a b c d David, H. A., Nagaraja, H. N. (2003) Order Statistics (3rd Edition). Wiley, New Jersey pp 458. ISBN 0-471-38926-9 +
  76. +
  77. ^ "Beta distribution". www.statlect.com. +
  78. +
  79. ^ "1.3.6.6.17. Beta Distribution". www.itl.nist.gov. +
  80. +
  81. ^ a b c d e f g h Elderton, William Palin (1906). Frequency-Curves and Correlation. Charles and Edwin Layton (London). +
  82. +
  83. ^ Elderton, William Palin and Norman Lloyd Johnson (2009). Systems of Frequency Curves. Cambridge University Press. ISBN 978-0521093361. +
  84. +
  85. ^ a b c Bowman, K. O.; Shenton, L. R. (2007). "The beta distribution, moment method, Karl Pearson and R.A. Fisher" (PDF). Far East J. Theo. Stat. 23 (2): 133–164. +
  86. +
  87. ^ a b Pearson, Karl (June 1936). "Method of moments and method of maximum likelihood". Biometrika. 28 (1/2): 34–59. doi:10.2307/2334123. JSTOR 2334123. +
  88. +
  89. ^ a b c Joanes, D. N.; C. A. Gill (1998). "Comparing measures of sample skewness and kurtosis". The Statistician. 47 (Part 1): 183–189. doi:10.1111/1467-9884.00122. +
  90. +
  91. ^ Beckman, R. J.; G. L. Tietjen (1978). "Maximum likelihood estimation for the beta distribution". Journal of Statistical Computation and Simulation. 7 (3–4): 253–258. doi:10.1080/00949657808810232. +
  92. +
  93. ^ Gnanadesikan, R.,Pinkham and Hughes (1967). "Maximum likelihood estimation of the parameters of the beta distribution from smallest order statistics". Technometrics. 9 (4): 607–620. doi:10.2307/1266199. JSTOR 1266199.{{cite journal}}: CS1 maint: multiple names: authors list (link) +
  94. +
  95. ^ Fackler, Paul. "Inverse Digamma Function (Matlab)". Harvard University School of Engineering and Applied Sciences. Retrieved 2012-08-18. +
  96. +
  97. ^ a b c Silvey, S.D. (1975). Statistical Inference. Chapman and Hal. p. 40. ISBN 978-0412138201. +
  98. +
  99. ^ Edwards, A. W. F. (1992). Likelihood. The Johns Hopkins University Press. ISBN 978-0801844430. +
  100. +
  101. ^ a b c d e f Jaynes, E.T. (2003). Probability theory, the logic of science. Cambridge University Press. ISBN 978-0521592710. +
  102. +
  103. ^ Costa, Max, and Cover, Thomas (September 1983). On the similarity of the entropy power inequality and the Brunn Minkowski inequality (PDF). Tech.Report 48, Dept. Statistics, Stanford University.{{cite book}}: CS1 maint: multiple names: authors list (link) +
  104. +
  105. ^ a b c Aryal, Gokarna; Saralees Nadarajah (2004). "Information matrix for beta distributions" (PDF). Serdica Mathematical Journal (Bulgarian Academy of Science). 30: 513–526. +
  106. +
  107. ^ a b Laplace, Pierre Simon, marquis de (1902). A philosophical essay on probabilities. New York : J. Wiley ; London : Chapman & Hall. ISBN 978-1-60206-328-0.{{cite book}}: CS1 maint: multiple names: authors list (link) +
  108. +
  109. ^ Cox, Richard T. (1961). Algebra of Probable Inference. The Johns Hopkins University Press. ISBN 978-0801869822. +
  110. +
  111. ^ a b Keynes, John Maynard (2010) [1921]. A Treatise on Probability: The Connection Between Philosophy and the History of Science. Wildside Press. ISBN 978-1434406965. +
  112. +
  113. ^ Pearson, Karl (1907). "On the Influence of Past Experience on Future Expectation". Philosophical Magazine. 6 (13): 365–378. +
  114. +
  115. ^ a b c d Jeffreys, Harold (1998). Theory of Probability. Oxford University Press, 3rd edition. ISBN 978-0198503682. +
  116. +
  117. ^ Broad, C. D. (October 1918). "On the relation between induction and probability". MIND, A Quarterly Review of Psychology and Philosophy. 27 (New Series) (108): 389–404. doi:10.1093/mind/XXVII.4.389. JSTOR 2249035. +
  118. +
  119. ^ a b c d Perks, Wilfred (January 1947). "Some observations on inverse probability including a new indifference rule". Journal of the Institute of Actuaries. 73 (2): 285–334. doi:10.1017/S0020268100012270. +
  120. +
  121. ^ a b Bayes, Thomas; communicated by Richard Price (1763). "An Essay towards solving a Problem in the Doctrine of Chances". Philosophical Transactions of the Royal Society. 53: 370–418. doi:10.1098/rstl.1763.0053. JSTOR 105741. +
  122. +
  123. ^ Haldane, J.B.S. (1932). "A note on inverse probability". Mathematical Proceedings of the Cambridge Philosophical Society. 28 (1): 55–61. Bibcode:1932PCPS...28...55H. doi:10.1017/s0305004100010495. S2CID 122773707. +
  124. +
  125. ^ Zellner, Arnold (1971). An Introduction to Bayesian Inference in Econometrics. Wiley-Interscience. ISBN 978-0471169376. +
  126. +
  127. ^ Jeffreys, Harold (September 1946). "An Invariant Form for the Prior Probability in Estimation Problems". Proceedings of the Royal Society. A 24. 186 (1007): 453–461. Bibcode:1946RSPSA.186..453J. doi:10.1098/rspa.1946.0056. PMID 20998741. +
  128. +
  129. ^ Berger, James; Bernardo, Jose; Sun, Dongchu (2009). "The formal definition of reference priors". The Annals of Statistics. 37 (2): 905–938. arXiv:0904.0156. Bibcode:2009arXiv0904.0156B. doi:10.1214/07-AOS587. S2CID 3221355. +
  130. +
  131. ^ Clarke, Bertrand S.; Andrew R. Barron (1994). "Jeffreys' prior is asymptotically least favorable under entropy risk" (PDF). Journal of Statistical Planning and Inference. 41: 37–60. doi:10.1016/0378-3758(94)90153-8. +
  132. +
  133. ^ Pearson, Karl (1892). The Grammar of Science. Walter Scott, London. +
  134. +
  135. ^ Pearson, Karl (2009). The Grammar of Science. BiblioLife. ISBN 978-1110356119. +
  136. +
  137. ^ Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian Data Analysis. Chapman and Hall/CRC. ISBN 978-1584883883.{{cite book}}: CS1 maint: multiple names: authors list (link) +
  138. +
  139. ^ A. Jøsang. A Logic for Uncertain Probabilities. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 9(3), pp.279-311, June 2001. PDF[permanent dead link] +
  140. +
  141. ^ H.M. de Oliveira and G.A.A. Araújo,. Compactly Supported One-cyclic Wavelets Derived from Beta Distributions. Journal of Communication and Information Systems. vol.20, n.3, pp.27-33, 2005. +
  142. +
  143. ^ Balding, David J.; Nichols, Richard A. (1995). "A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity". Genetica. Springer. 96 (1–2): 3–12. doi:10.1007/BF01441146. PMID 7607457. S2CID 30680826. +
  144. +
  145. ^ Keefer, Donald L. and Verdini, William A. (1993). Better Estimation of PERT Activity Time Parameters. Management Science 39(9), p. 1086–1091. +
  146. +
  147. ^ Keefer, Donald L. and Bodily, Samuel E. (1983). Three-point Approximations for Continuous Random variables. Management Science 29(5), p. 595–609. +
  148. +
  149. ^ "Defense Resource Management Institute - Naval Postgraduate School". www.nps.edu. +
  150. +
  151. ^ van der Waerden, B. L., "Mathematical Statistics", Springer, ISBN 978-3-540-04507-6. +
  152. +
  153. ^ On normalizing the incomplete beta-function for fitting to dose-response curves M.E. Wise Biometrika vol 47, No. 1/2, June 1960, pp. 173-175 +
  154. +
  155. ^ Yule, G. U.; Filon, L. N. G. (1936). "Karl Pearson. 1857-1936". Obituary Notices of Fellows of the Royal Society. 2 (5): 72. doi:10.1098/rsbm.1936.0007. JSTOR 769130. +
  156. +
  157. ^ "Library and Archive catalogue". Sackler Digital Archive. Royal Society. Archived from the original on 2011-10-25. Retrieved 2011-07-01. +
  158. +
  159. ^ David, H. A. and A.W.F. Edwards (2001). Annotated Readings in the History of Statistics. Springer; 1 edition. ISBN 978-0387988443. +
  160. +
  161. ^ Gini, Corrado (1911). "Considerazioni Sulle Probabilità Posteriori e Applicazioni al Rapporto dei Sessi Nelle Nascite Umane". Studi Economico-Giuridici della Università de Cagliari. Anno III (reproduced in Metron 15, 133, 171, 1949): 5–41. +
  162. +
  163. ^ Johnson, Norman L. and Samuel Kotz, ed. (1997). Leading Personalities in Statistical Sciences: From the Seventeenth Century to the Present (Wiley Series in Probability and Statistics. Wiley. ISBN 978-0471163817. +
  164. +
  165. ^ Metron journal. "Biography of Corrado Gini". Metron Journal. Archived from the original on 2012-07-16. Retrieved 2012-08-18. +
  166. +
+

\ No newline at end of file diff --git a/references/Box–Muller_transform b/references/Box–Muller_transform new file mode 100644 index 0000000..6a54da9 --- /dev/null +++ b/references/Box–Muller_transform @@ -0,0 +1,1653 @@ Box–Muller transform - Wikipedia

Box–Muller transform

From Wikipedia, the free encyclopedia
Visualisation of the Box–Muller transform — the coloured points in the unit square (u1, u2), drawn as circles, are mapped to a 2D Gaussian (z0, z1), drawn as crosses. The plots at the margins are the probability distribution functions of z0 and z1. z0 and z1 are unbounded; they appear to be in [-2.5,2.5] due to the choice of the illustrated points.
+

The Box–Muller transform, by George Edward Pelham Box and Mervin Edgar Muller,[1] is a random number sampling method for generating pairs of independent, standard, normally distributed (zero expectation, unit variance) random numbers, given a source of uniformly distributed random numbers. The method was in fact first mentioned explicitly by Raymond E. A. C. Paley and Norbert Wiener in 1934.[2] +

The Box–Muller transform is commonly expressed in two forms. The basic form as given by Box and Muller takes two samples from the uniform distribution on the interval [0, 1] and maps them to two standard, normally distributed samples. The polar form takes two samples from a different interval, [−1, +1], and maps them to two normally distributed samples without the use of sine or cosine functions. +

The Box–Muller transform was developed as a more computationally efficient alternative to the inverse transform sampling method.[3] The ziggurat algorithm gives a more efficient method for scalar processors (e.g. old CPUs), while the Box–Muller transform is superior for processors with vector units (e.g. GPUs or modern CPUs).[4] +

+ +

Basic form

+

Suppose U1 and U2 are independent samples chosen from the uniform distribution on the unit interval (0, 1). Let +

Z0 = R cos(Θ) = √(−2 ln U1) cos(2π U2)

and

Z1 = R sin(Θ) = √(−2 ln U1) sin(2π U2)

Then Z0 and Z1 are independent random variables with a standard normal distribution. +

The derivation[5] is based on a property of a two-dimensional Cartesian system: if the X and Y coordinates are described by two independent and normally distributed random variables, then the random variables for R2 and Θ (shown above) in the corresponding polar coordinates are also independent and can be expressed as

R2 = −2 · ln U1

and

Θ = 2π U2.

Because R2 is the square of the norm of the standard bivariate normal variable (XY), it has the chi-squared distribution with two degrees of freedom. In the special case of two degrees of freedom, the chi-squared distribution coincides with the exponential distribution, and the equation for R2 above is a simple way of generating the required exponential variate. +

+

Polar form

+ +
Two uniformly distributed values, u and v are used to produce the value s = R2, which is likewise uniformly distributed. The definitions of the sine and cosine are then applied to the basic form of the Box–Muller transform to avoid using trigonometric functions.

The polar form was first proposed by J. Bell[6] and then modified by R. Knop.[7] While several different versions of the polar method have been described, the version of R. Knop will be described here because it is the most widely used, in part due to its inclusion in Numerical Recipes. A slightly different form is described as "Algorithm P" by D. Knuth in The Art of Computer Programming.[8] +

Given u and v, independent and uniformly distributed in the closed interval [−1, +1], set s = R2 = u2 + v2. If s = 0 or s ≥ 1, discard u and v, and try another pair (u, v). Because u and v are uniformly distributed and because only points within the unit circle have been admitted, the values of s will be uniformly distributed in the open interval (0, 1), too. The latter can be seen by calculating the cumulative distribution function for s in the interval (0, 1): this is the area of a circle with radius √s, divided by π, which equals s. From this we find the probability density function to have the constant value 1 on the interval (0, 1). Equally so, the angle θ divided by 2π is uniformly distributed in the interval [0, 1) and independent of s.

We now identify the value of s with that of U1 and θ/(2π) with that of U2 in the basic form. As shown in the figure, the values of cos θ = cos 2πU2 and sin θ = sin 2πU2 in the basic form can be replaced with the ratios cos θ = u/√s and sin θ = v/√s, respectively. The advantage is that calculating the trigonometric functions directly can be avoided. This is helpful when trigonometric functions are more expensive to compute than the single division that replaces each one.

Just as the basic form produces two standard normal deviates, so does this alternate calculation. +

z0 = √(−2 ln s) (u / √s) = u · √(−2 ln s / s)

and

z1 = √(−2 ln s) (v / √s) = v · √(−2 ln s / s)

Contrasting the two forms

+

The polar method differs from the basic method in that it is a type of rejection sampling. It discards some generated random numbers, but can be faster than the basic method because it is simpler to compute (provided that the random number generator is relatively fast) and is more numerically robust.[9] Avoiding the use of expensive trigonometric functions improves speed over the basic form.[6] It discards 1 − π/4 ≈ 21.46% of the total input uniformly distributed random number pairs generated, i.e. discards 4/π − 1 ≈ 27.32% uniformly distributed random number pairs per Gaussian random number pair generated, requiring 4/π ≈ 1.2732 input random numbers per output random number. +

The basic form requires two multiplications, 1/2 logarithm, 1/2 square root, and one trigonometric function for each normal variate.[10] On some processors, the cosine and sine of the same argument can be calculated in parallel using a single instruction. Notably for Intel-based machines, one can use the fsincos assembler instruction or the expi instruction (usually available from C as an intrinsic function), to calculate complex +

e^(i 2π U2) = cos(2π U2) + i sin(2π U2)

and just separate the real and imaginary parts. +

Note: To explicitly calculate the complex-polar form use the following substitutions in the general form,

Let r = √(−2 ln(u1)) and z = 2π u2. Then

r e^(iz) = √(−2 ln u1) e^(i 2π u2) = z0 + i z1.

The polar form requires 3/2 multiplications, 1/2 logarithm, 1/2 square root, and 1/2 division for each normal variate. The effect is to replace one multiplication and one trigonometric function with a single division and a conditional loop. +
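The polar variant can be sketched in C++ as follows; this is not the article's listing (the article's implementation of the basic form appears below), only an illustration of the rejection step and the trigonometry-free update described above (the function name is this sketch's own):

#include <cmath>
#include <iostream>
#include <random>
#include <utility>

// Polar (rejection) variant of the Box-Muller transform: draw (u, v)
// uniformly in the square [-1,1]^2, accept only points inside the unit disc,
// and reuse u/sqrt(s) and v/sqrt(s) in place of cosine and sine.
std::pair<double, double> generateGaussianPolar()
{
    static std::mt19937 rng(std::random_device{}());
    static std::uniform_real_distribution<double> runif(-1.0, 1.0);

    double u, v, s;
    do {
        u = runif(rng);
        v = runif(rng);
        s = u * u + v * v;
    } while (s >= 1.0 || s == 0.0);        // rejection step (about 21.5% of pairs discarded)

    const double factor = std::sqrt(-2.0 * std::log(s) / s);
    return { u * factor, v * factor };     // two independent standard normal deviates
}

int main()
{
    auto [z0, z1] = generateGaussianPolar();
    std::cout << z0 << ' ' << z1 << '\n';
}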

+

Tails truncation

+

When a computer is used to produce a uniform random variable it will inevitably have some inaccuracies because there is a lower bound on how close numbers can be to 0. If the generator uses 32 bits per output value, the smallest non-zero number that can be generated is 2^−32. When U1 and U2 are equal to this the Box–Muller transform produces a normal random deviate equal to √(−2 ln(2^−32)) cos(2π · 2^−32) ≈ 6.660. This means that the algorithm will not produce random variables more than 6.660 standard deviations from the mean. This corresponds to a proportion of about 2Φ(−6.660) ≈ 2.7 × 10^−11 lost due to the truncation, where Φ is the standard cumulative normal distribution. With 64 bits the limit is pushed to √(−2 ln(2^−64)) ≈ 9.419 standard deviations, for which 2Φ(−9.419) < 5 × 10^−21.
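The quoted limits follow from the smallest representable uniform value 2^−bits; a short check (assuming the bit counts stated above):

#include <cmath>
#include <iostream>

// Largest deviate the transform can produce when the smallest non-zero
// uniform value is 2^-bits: sqrt(-2 * ln(2^-bits)) = sqrt(2 * bits * ln 2).
int main()
{
    const int bitCounts[] = {32, 64};
    for (int bits : bitCounts)
        std::cout << bits << " bits -> " << std::sqrt(2.0 * bits * std::log(2.0))
                  << " standard deviations\n";
}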

+

Implementation

+

The standard Box–Muller transform generates values from the standard normal distribution (i.e. standard normal deviates) with mean 0 and standard deviation 1. The implementation below in standard C++ generates values from any normal distribution with mean μ and variance σ^2. If Z is a standard normal deviate, then X = Zσ + μ will have a normal distribution with mean μ and standard deviation σ. The random number generator has been seeded to ensure that new, pseudo-random values will be returned from sequential calls to the generateGaussianNoise function.

+
#include <cmath>
+#include <limits>
+#include <random>
+#include <utility>
+
+//"mu" is the mean of the distribution, and "sigma" is the standard deviation.
+std::pair<double, double> generateGaussianNoise(double mu, double sigma)
+{
+    constexpr double epsilon = std::numeric_limits<double>::epsilon();
+    constexpr double two_pi = 2.0 * M_PI;
+
+    //initialize the random uniform number generator (runif) in a range 0 to 1
+    static std::mt19937 rng(std::random_device{}()); // Standard mersenne_twister_engine seeded with rd()
+    static std::uniform_real_distribution<> runif(0.0, 1.0);
+
+    //create two random numbers, make sure u1 is greater than epsilon
+    double u1, u2;
+    do
+    {
+        u1 = runif(rng);
+    }
+    while (u1 <= epsilon);
+    u2 = runif(rng);
+
+    //compute z0 and z1
+    auto mag = sigma * sqrt(-2.0 * log(u1));
+    auto z0  = mag * cos(two_pi * u2) + mu;
+    auto z1  = mag * sin(two_pi * u2) + mu;
+
+    return std::make_pair(z0, z1);
+}
+
+
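A short usage sketch for the function above, assuming it is compiled together with the listing (note that M_PI is not guaranteed by the C++ standard; some toolchains require a feature macro such as _USE_MATH_DEFINES or a hand-written constant):

#include <iostream>
#include <utility>

// Declaration of the function defined in the listing above.
std::pair<double, double> generateGaussianNoise(double mu, double sigma);

int main()
{
    // Draw a few pairs of values from a normal distribution with mean 5 and sd 2.
    for (int i = 0; i < 3; ++i) {
        auto [z0, z1] = generateGaussianNoise(5.0, 2.0);
        std::cout << z0 << ' ' << z1 << '\n';
    }
}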

See also

+ +

References

+
  • Howes, Lee; Thomas, David (2008). GPU Gems 3 - Efficient Random Number Generation and Application Using CUDA. Pearson Education, Inc. ISBN 978-0-321-51526-1.
+
    +
  1. ^ Box, G. E. P.; Muller, Mervin E. (1958). "A Note on the Generation of Random Normal Deviates". The Annals of Mathematical Statistics. 29 (2): 610–611. doi:10.1214/aoms/1177706645. JSTOR 2237361. +
  2. +
  3. ^ Raymond E. A. C. Paley and Norbert Wiener Fourier Transforms in the Complex Domain, New York: American Mathematical Society (1934) §37. +
  4. +
  5. ^ Kloeden and Platen, Numerical Solutions of Stochastic Differential Equations, pp. 11–12 +
  6. +
  7. ^ Howes & Thomas 2008. +
  8. +
  9. ^ Sheldon Ross, A First Course in Probability, (2002), pp. 279–281 +
  10. +
  11. ^ a b Bell, James R. (1968). "Algorithm 334: Normal random deviates". Communications of the ACM. 11 (7): 498. doi:10.1145/363397.363547. +
  12. +
  13. ^ Knop, R. (1969). "Remark on algorithm 334 [G5]: Normal random deviates". Communications of the ACM. 12 (5): 281. doi:10.1145/362946.362996. +
  14. +
  15. ^ Knuth, Donald (1998). The Art of Computer Programming: Volume 2: Seminumerical Algorithms. p. 122. ISBN 0-201-89684-2. +
  16. +
  17. ^ Everett F. Carter, Jr., The Generation and Application of Random Numbers, Forth Dimensions (1994), Vol. 16, No. 1 & 2. +
  18. +
  19. ^ The evaluation of 2πU1 is counted as one multiplication because the value of 2π can be computed in advance and used repeatedly. +
  20. +
+

\ No newline at end of file diff --git a/references/Gamma_distribution b/references/Gamma_distribution new file mode 100644 index 0000000..a582253 --- /dev/null +++ b/references/Gamma_distribution @@ -0,0 +1,9034 @@ Gamma distribution - Wikipedia

Gamma distribution

From Wikipedia, the free encyclopedia
Gamma
+
Probability density function
Probability density plots of gamma distributions
+
Cumulative distribution function
Cumulative distribution plots of gamma distributions
Parameters +k > 0 shape, θ > 0 scale +α > 0 shape, β > 0 rate
Support +x ∈ (0, ∞) +x ∈ (0, ∞)
PDF +x^(k−1) e^(−x/θ) / (Γ(k) θ^k) +β^α x^(α−1) e^(−βx) / Γ(α)
CDF +γ(k, x/θ) / Γ(k) +γ(α, βx) / Γ(α)
Mean +kθ +α/β
Median +No simple closed form +No simple closed form
Mode +(k − 1)θ for k ≥ 1, 0 for k < 1 +(α − 1)/β for α ≥ 1, 0 for α < 1
Variance +kθ^2 +α/β^2
Skewness +2/√k +2/√α
Ex. kurtosis +6/k +6/α
Entropy +k + ln θ + ln Γ(k) + (1 − k)ψ(k) +α − ln β + ln Γ(α) + (1 − α)ψ(α)
MGF +(1 − θt)^(−k) for t < 1/θ +(1 − t/β)^(−α) for t < β
CF +(1 − θit)^(−k) +(1 − it/β)^(−α)
Fisher information +[[ψ1(k), 1/θ], [1/θ, k/θ^2]] +[[ψ1(α), −1/β], [−1/β, α/β^2]]
Method of Moments +k = E[X]^2 / V[X], θ = V[X] / E[X] +α = E[X]^2 / V[X], β = E[X] / V[X]
+

In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-squared distribution are special cases of the gamma distribution. There are two equivalent parameterizations in common use: +

+
  1. With a shape parameter k and a scale parameter θ.
  2. With a shape parameter α = k and an inverse scale parameter β = 1/θ, called a rate parameter.
+

In each of these forms, both parameters are positive real numbers. +

The gamma distribution is the maximum entropy probability distribution (both with respect to a uniform base measure and a 1/x base measure) for a random variable X for which E[X] = kθ = α/β is fixed and greater than zero, and E[ln(X)] = ψ(k) + ln(θ) = ψ(α) − ln(β) is fixed (ψ is the digamma function).[1]

+ +

Definitions

+

The parameterization with k and θ appears to be more common in econometrics and other applied fields, where the gamma distribution is frequently used to model waiting times. For instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution. See Hogg and Craig[2] for an explicit motivation. +

The parameterization with α and β is more common in Bayesian statistics, where the gamma distribution is used as a conjugate prior distribution for various types of inverse scale (rate) parameters, such as the λ of an exponential distribution or a Poisson distribution[3] – or for that matter, the β of the gamma distribution itself. The closely related inverse-gamma distribution is used as a conjugate prior for scale parameters, such as the variance of a normal distribution.

If k is a positive integer, then the distribution represents an Erlang distribution; i.e., the sum of k independent exponentially distributed random variables, each of which has a mean of θ. +

+

Characterization using shape α and rate β

+

The gamma distribution can be parameterized in terms of a shape parameter α = k and an inverse scale parameter β = 1/θ, called a rate parameter. A random variable X that is gamma-distributed with shape α and rate β is denoted +

X ~ Γ(α, β) ≡ Gamma(α, β)

The corresponding probability density function in the shape-rate parameterization is +

f(x; α, β) = (β^α x^(α−1) e^(−βx)) / Γ(α)    for x > 0 and α, β > 0,

where Γ(α) is the gamma function. For all positive integers n, Γ(n) = (n − 1)!.

The cumulative distribution function is the regularized gamma function: +

F(x; α, β) = ∫_0^x f(u; α, β) du = γ(α, βx) / Γ(α),

where γ(α, βx) is the lower incomplete gamma function.

If α is a positive integer (i.e., the distribution is an Erlang distribution), the cumulative distribution function has the following series expansion:[4] +

F(x; α, β) = 1 − Σ_{i=0}^{α−1} ((βx)^i / i!) e^(−βx)

Characterization using shape k and scale θ

+

A random variable X that is gamma-distributed with shape k and scale θ is denoted by +

X ~ Gamma(k, θ) ≡ Γ(k, θ)
Illustration of the gamma PDF for parameter values over k and x with θ set to 1, 2, 3, 4, 5 and 6.
+

The probability density function using the shape-scale parametrization is +

f(x; k, θ) = (x^(k−1) e^(−x/θ)) / (Γ(k) θ^k)    for x > 0 and k, θ > 0.

Here Γ(k) is the gamma function evaluated at k. +

The cumulative distribution function is the regularized gamma function: +

F(x; k, θ) = ∫_0^x f(u; k, θ) du = γ(k, x/θ) / Γ(k),

where γ(k, x/θ) is the lower incomplete gamma function.

It can also be expressed as follows, if k is a positive integer (i.e., the distribution is an Erlang distribution):[4] +

F(x; k, θ) = 1 − Σ_{i=0}^{k−1} ((x/θ)^i / i!) e^(−x/θ)

Both parametrizations are common because either can be more convenient depending on the situation. +
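The equivalence of the two parametrizations (with β = 1/θ) can also be checked numerically; a sketch using std::lgamma for the normalizing constant:

#include <cmath>
#include <iostream>

// Gamma density in the shape-rate form f(x; alpha, beta) and the
// shape-scale form f(x; k, theta); with beta = 1/theta they coincide.
double pdfShapeRate(double x, double alpha, double beta)
{
    return std::exp(alpha * std::log(beta) + (alpha - 1.0) * std::log(x)
                    - beta * x - std::lgamma(alpha));
}

double pdfShapeScale(double x, double k, double theta)
{
    return std::exp((k - 1.0) * std::log(x) - x / theta
                    - std::lgamma(k) - k * std::log(theta));
}

int main()
{
    const double x = 2.5, k = 3.0, theta = 1.5;      // arbitrary test point
    std::cout << pdfShapeRate(x, k, 1.0 / theta) << ' '
              << pdfShapeScale(x, k, theta) << '\n'; // identical values
}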

+

Properties

+

Mean and variance

+

The mean of gamma distribution is given by the product of its shape and scale parameters: +

E[X] = kθ = α/β

The variance is: +

Var(X) = kθ^2 = α/β^2

The square root of the inverse shape parameter gives the coefficient of variation: +

CV = √(kθ^2) / (kθ) = 1/√k
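Inverting the mean and variance formulas gives the moment-matching estimates k ≈ mean^2/variance and θ ≈ variance/mean (the "method of moments" entries in the summary box above); a sketch with made-up data:

#include <iostream>
#include <numeric>
#include <vector>

// Moment-matching ("method of moments") estimates for the gamma parameters:
// k = mean^2 / variance, theta = variance / mean. The data below is illustrative.
int main()
{
    const std::vector<double> x = {1.2, 0.7, 3.1, 2.4, 0.9, 1.8, 2.2, 1.1};
    const double n = static_cast<double>(x.size());

    const double mean = std::accumulate(x.begin(), x.end(), 0.0) / n;
    double ss = 0.0;
    for (double xi : x) ss += (xi - mean) * (xi - mean);
    const double var = ss / n;                      // population variance

    std::cout << "k_hat = " << mean * mean / var
              << ", theta_hat = " << var / mean << '\n';
}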

Skewness

+

The skewness of the gamma distribution only depends on its shape parameter, k, and it is equal to +

2/√k.

Higher moments

+

The nth raw moment is given by: +

E[X^n] = θ^n Γ(k + n) / Γ(k) = θ^n k(k + 1)⋯(k + n − 1)

Median approximations and bounds

+
Bounds and asymptotic approximations to the median of the gamma distribution. The cyan-colored region indicates the large gap between published lower and upper bounds.
+

Unlike the mode and the mean, which have readily calculable formulas based on the parameters, the median does not have a closed-form equation. The median for this distribution is the value ν such that

(1 / (Γ(k) θ^k)) ∫_0^ν x^(k−1) e^(−x/θ) dx = 1/2.

A rigorous treatment of the problem of determining an asymptotic expansion and bounds for the median of the gamma distribution was handled first by Chen and Rubin, who proved that (for θ = 1)

k − 1/3 < ν(k) < k,

where μ(k) = k is the mean and ν(k) is the median of the Gamma(k, 1) distribution.[5] For other values of the scale parameter, the mean scales to μ = kθ, and the median bounds and approximations would be similarly scaled by θ.

K. P. Choi found the first five terms in a Laurent series asymptotic approximation of the median by comparing the median to Ramanujan's function.[6] Berg and Pedersen found more terms:[7] +

+
+
Two gamma distribution median asymptotes which were proved in 2023 to be bounds (upper solid red and lower dashed red), of the from , and an interpolation between them that makes an approximation (dotted red) that is exact at k = 1 and has maximum relative error of about 0.6%. The cyan shaded region is the remaining gap between upper and lower bounds (or conjectured bounds), including these new bounds and the bounds in the previous figure.
+
Log–log plot of upper (solid) and lower (dashed) bounds to the median of a gamma distribution and the gaps between them. The green, yellow, and cyan regions represent the gap before the Lyon 2021 paper. The green and yellow narrow that gap with the lower bounds that Lyon proved. Lyon's conjectured bounds further narrow the yellow. Mostly within the yellow, closed-form rational-function-interpolated bounds are plotted along with the numerically calculated median (dotted) value. Tighter interpolated bounds exist but are not plotted, as they would not be resolved at this scale.
+

Partial sums of these series are good approximations for high enough ; they are not plotted in the figure, which is focused on the low- region that is less well approximated. +

Berg and Pedersen also proved many properties of the median, showing that it is a convex function of ,[8] and that the asymptotic behavior near is (where is the Euler–Mascheroni constant), and that for all the median is bounded by .[7] +

A closer linear upper bound, for only, was provided in 2021 by Gaunt and Merkle,[9] relying on the Berg and Pedersen result that the slope of is everywhere less than 1: +

+
for (with equality at )
+

which can be extended to a bound for all by taking the max with the chord shown in the figure, since the median was proved convex.[8] +

An approximation to the median that is asymptotically accurate at high and reasonable down to or a bit lower follows from the Wilson–Hilferty transformation: +

ν(k) ≈ k (1 − 1/(9k))^3,

which goes negative for k < 1/9.
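A quick numerical comparison of this approximation with a simulated median (the shape parameter, sample size and seed are arbitrary):

#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

// Compare the Wilson-Hilferty median approximation k*(1 - 1/(9k))^3 (theta = 1)
// with a Monte Carlo estimate of the median of Gamma(k, 1).
int main()
{
    const double k = 2.0;
    const int n = 400001;                      // odd, so the median is a single element
    std::mt19937 rng(7);
    std::gamma_distribution<double> gamma(k, 1.0);

    std::vector<double> sample(n);
    for (double& x : sample) x = gamma(rng);
    std::nth_element(sample.begin(), sample.begin() + n / 2, sample.end());

    const double approx = k * std::pow(1.0 - 1.0 / (9.0 * k), 3.0);
    std::cout << "simulated median: " << sample[n / 2]
              << ", Wilson-Hilferty approximation: " << approx << '\n';
}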

In 2021, Lyon proposed several approximations of the form . He conjectured values of and for which this approximation is an asymptotically tight upper or lower bound for all .[10] In particular, he proposed these closed-form bounds, which he proved in 2023:[11] +

+
is a lower bound, asymptotically tight as
+
is an upper bound, asymptotically tight as
+

Lyon also showed (informally in 2021, rigorously in 2023) two other lower bounds that are not closed-form expressions, including this one involving the gamma function, based on solving the integral expression substituting 1 for : +

+
(approaching equality as )
+

and the tangent line at where the derivative was found to be : +

+
(with equality at )
+
+

where Ei is the exponential integral.[10][11] +

Additionally, he showed that interpolations between bounds could provide excellent approximations or tighter bounds to the median, including an approximation that is exact at (where ) and has a maximum relative error less than 0.6%. Interpolated approximations and bounds are all of the form +

$$\nu(k) \approx \tilde{g}(k)\,\nu_\infty(k) + \big(1 - \tilde{g}(k)\big)\,\nu_0(k),$$

where $\nu_\infty$ and $\nu_0$ denote the bounds above that are asymptotically tight as $k \to \infty$ and as $k \to 0$, respectively, and $\tilde{g}(k)$ is an interpolating function running monotonically from 0 at low $k$ to 1 at high $k$, approximating an ideal, or exact, interpolator $g(k)$:

$$g(k) = \frac{\nu(k) - \nu_0(k)}{\nu_\infty(k) - \nu_0(k)}.$$

For the simplest interpolating function considered, a first-order rational function

$$\tilde{g}(k) = \frac{k}{b_0 + k},$$

the tightest lower bound has +

+
+

and the tightest upper bound has +

+
+

The interpolated bounds are plotted (mostly inside the yellow region) in the log–log plot shown. Even tighter bounds are available using different interpolating functions, but not usually with closed-form parameters like these.[10] +

+

Summation[edit]

+

If Xi has a Gamma(ki, θ) distribution for i = 1, 2, ..., N (i.e., all distributions have the same scale parameter θ), then +

$$\sum_{i=1}^N X_i \sim \operatorname{Gamma}\!\left(\sum_{i=1}^N k_i,\ \theta\right),$$

provided all Xi are independent. +

For the cases where the Xi are independent but have different scale parameters, see Mathai [12] or Moschopoulos.[13] +

The gamma distribution exhibits infinite divisibility. +
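A quick Monte Carlo illustration of the summation property (a sketch with NumPy/SciPy; the shapes, scale, seed, and sample size are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta = 2.0
shapes = [0.5, 1.5, 3.0]                   # k_i, all sharing the same scale theta

# Sum of independent Gamma(k_i, theta) samples ...
total = sum(rng.gamma(k, theta, size=100_000) for k in shapes)

# ... should be indistinguishable from Gamma(k_1 + k_2 + k_3, theta).
ks_stat, p_value = stats.kstest(total, "gamma", args=(sum(shapes), 0, theta))
print(f"KS statistic = {ks_stat:.4f}, p-value = {p_value:.3f}")
```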

+

Scaling[edit]

+

If

$$X \sim \operatorname{Gamma}(k, \theta),$$

then, for any c > 0, +

$$cX \sim \operatorname{Gamma}(k, c\,\theta)$$

by moment generating functions,

or equivalently, if +

$$X \sim \operatorname{Gamma}(\alpha, \beta) \quad \text{(shape–rate parameterization)},$$

$$\text{then} \quad cX \sim \operatorname{Gamma}\!\left(\alpha, \frac{\beta}{c}\right).$$

Indeed, we know that if X is an exponential r.v. with rate λ, then cX is an exponential r.v. with rate λ/c; the same is valid for gamma variates (and this can be checked using the moment-generating function; see, e.g., these notes, 10.4-(ii)): multiplication by a positive constant c divides the rate (or, equivalently, multiplies the scale).
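A minimal numerical check of the scaling property (a sketch; the parameter values and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, theta, c = 3.0, 2.0, 5.0

scaled = c * rng.gamma(k, theta, size=100_000)        # c * Gamma(k, theta)

# The scaled sample should follow Gamma(k, c * theta).
ks_stat, p_value = stats.kstest(scaled, "gamma", args=(k, 0, c * theta))
print(f"KS statistic = {ks_stat:.4f}, p-value = {p_value:.3f}")
```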

+

Exponential family[edit]

+

The gamma distribution is a two-parameter exponential family with natural parameters k − 1 and −1/θ (equivalently, α − 1 and −β), and natural statistics X and ln(X). +

If the shape parameter k is held fixed, the resulting one-parameter family of distributions is a natural exponential family. +

+

Logarithmic expectation and variance[edit]

+

One can show that +

$$\operatorname{E}[\ln X] = \psi(k) + \ln\theta,$$

or equivalently, +

$$\operatorname{E}[\ln X] = \psi(\alpha) - \ln\beta,$$

where $\psi$ is the digamma function. Likewise,

$$\operatorname{Var}[\ln X] = \psi_1(k) = \psi_1(\alpha),$$

where $\psi_1$ is the trigamma function.

This can be derived using the exponential family formula for the moment generating function of the sufficient statistic, because one of the sufficient statistics of the gamma distribution is ln(x). +
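These identities are easy to check empirically; a small sketch using SciPy's digamma and trigamma (parameter values arbitrary):

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(2)
k, theta = 2.5, 3.0
log_x = np.log(rng.gamma(k, theta, size=1_000_000))

print(log_x.mean(), digamma(k) + np.log(theta))   # E[ln X] = psi(k) + ln(theta)
print(log_x.var(), polygamma(1, k))               # Var[ln X] = trigamma(k)
```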

+

Information entropy[edit]

+

The information entropy is +

$$\operatorname{H}(X) = \alpha - \ln\beta + \ln\Gamma(\alpha) + (1 - \alpha)\,\psi(\alpha).$$

In the k, θ parameterization, the information entropy is given by +

$$\operatorname{H}(X) = k + \ln\theta + \ln\Gamma(k) + (1 - k)\,\psi(k).$$
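As a sanity check, the closed form can be compared against SciPy's numerically evaluated differential entropy (a sketch; parameter values arbitrary):

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.stats import gamma

k, theta = 2.5, 3.0
closed_form = k + np.log(theta) + gammaln(k) + (1 - k) * digamma(k)
print(closed_form, gamma(a=k, scale=theta).entropy())   # the two values should agree
```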

Kullback–Leibler divergence[edit]

+
Illustration of the Kullback–Leibler (KL) divergence for two gamma PDFs. Here β = β0 + 1 which are set to 1, 2, 3, 4, 5 and 6. The typical asymmetry for the KL divergence is clearly visible.
+

The Kullback–Leibler divergence (KL-divergence), of Gamma(αp, βp) ("true" distribution) from Gamma(αq, βq) ("approximating" distribution) is given by[14] +

$$D_{\mathrm{KL}}(\alpha_p, \beta_p \,\|\, \alpha_q, \beta_q) = (\alpha_p - \alpha_q)\,\psi(\alpha_p) - \ln\Gamma(\alpha_p) + \ln\Gamma(\alpha_q) + \alpha_q\left(\ln\beta_p - \ln\beta_q\right) + \alpha_p\,\frac{\beta_q - \beta_p}{\beta_p}.$$

Written using the k, θ parameterization, the KL-divergence of Gamma(kp, θp) from Gamma(kq, θq) is given by +

$$D_{\mathrm{KL}}(k_p, \theta_p \,\|\, k_q, \theta_q) = (k_p - k_q)\,\psi(k_p) - \ln\Gamma(k_p) + \ln\Gamma(k_q) + k_q\left(\ln\theta_q - \ln\theta_p\right) + k_p\,\frac{\theta_p - \theta_q}{\theta_q}.$$
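The closed form can be cross-checked against a Monte Carlo estimate of the divergence (a sketch in the shape–rate parameterization; the parameter values, seed, and function name are illustrative):

```python
import numpy as np
from scipy.special import digamma, gammaln
from scipy.stats import gamma

def kl_gamma(ap, bp, aq, bq):
    # KL divergence of Gamma(ap, rate=bp) ("true") from Gamma(aq, rate=bq) ("approximating").
    return ((ap - aq) * digamma(ap) - gammaln(ap) + gammaln(aq)
            + aq * (np.log(bp) - np.log(bq)) + ap * (bq - bp) / bp)

ap, bp, aq, bq = 3.0, 2.0, 5.0, 1.0
rng = np.random.default_rng(3)
x = rng.gamma(ap, 1 / bp, size=1_000_000)
monte_carlo = np.mean(gamma.logpdf(x, ap, scale=1 / bp) - gamma.logpdf(x, aq, scale=1 / bq))
print(kl_gamma(ap, bp, aq, bq), monte_carlo)   # closed form vs Monte Carlo estimate
```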

Laplace transform[edit]

+

The Laplace transform of the gamma PDF is +

$$F(s) = \operatorname{E}\!\left[e^{-sX}\right] = (1 + \theta s)^{-k} = \frac{\beta^\alpha}{(s + \beta)^\alpha}, \qquad s > -\beta.$$

Related distributions[edit]

+

General[edit]

+
  • Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables following an exponential distribution with rate parameter λ, then $\sum_i X_i \sim \operatorname{Gamma}(n, \lambda)$ where n is the shape parameter and λ is the rate, and $\bar{X} = \tfrac{1}{n}\sum_i X_i \sim \operatorname{Gamma}(n, n\lambda)$ (also in the rate parameterization).
  • +
  • If X ~ Gamma(1, λ) (in the shape–rate parametrization), then X has an exponential distribution with rate parameter λ. In the shape-scale parametrization, X ~ Gamma(1, λ) has an exponential distribution with rate parameter 1/λ.
  • +
  • If X ~ Gamma(ν/2, 2) (in the shape–scale parametrization), then X is identical to χ2(ν), the chi-squared distribution with ν degrees of freedom. Conversely, if Q ~ χ2(ν) and c is a positive constant, then cQ ~ Gamma(ν/2, 2c).
  • +
  • If θ=1/k, one obtains the Schulz-Zimm distribution, which is most prominently used to model polymer chain lengths.
  • +
  • If k is an integer, the gamma distribution is an Erlang distribution and is the probability distribution of the waiting time until the kth "arrival" in a one-dimensional Poisson process with intensity 1/θ. If
$$X \sim \operatorname{Gamma}(k, \theta) \quad \text{with integer } k,$$

then

$$P(X \le x) = P(N \ge k), \qquad \text{where } N \sim \operatorname{Poisson}(x/\theta).$$
+
  • If X ~ Gamma(k, θ), then follows an exponential-gamma (abbreviated exp-gamma) distribution.[15] It is sometimes referred to as the log-gamma distribution.[16] Formulas for its mean and variance are in the section #Logarithmic expectation and variance.
  • +
  • If X ~ Gamma(k, θ), then follows a generalized gamma distribution with parameters p = 2, d = 2k, and [citation needed].
  • +
  • More generally, if X ~ Gamma(k,θ), then for follows a generalized gamma distribution with parameters p = 1/q, d = k/q, and .
  • +
  • If X ~ Gamma(k, θ) with shape k and scale θ, then 1/X ~ Inv-Gamma(k, θ−1) (see Inverse-gamma distribution for derivation).
  • +
  • Parametrization 1: If are independent, then , or equivalently,
  • +
  • Parametrization 2: If are independent, then , or equivalently,
  • +
  • If X ~ Gamma(α, θ) and Y ~ Gamma(β, θ) are independently distributed, then X/(X + Y) has a beta distribution with parameters α and β, and X/(X + Y) is independent of X + Y, which is Gamma(α + β, θ)-distributed.
  • +
  • If Xi ~ Gamma(αi, 1) are independently distributed, then the vector (X1/S, ..., Xn/S), where S = X1 + ... + Xn, follows a Dirichlet distribution with parameters α1, ..., αn.
  • +
  • For large k the gamma distribution converges to a normal distribution with mean μ = kθ and variance σ² = kθ².
  • +
  • The gamma distribution is the conjugate prior for the precision of the normal distribution with known mean.
  • +
  • The matrix gamma distribution and the Wishart distribution are multivariate generalizations of the gamma distribution (samples are positive-definite matrices rather than positive real numbers).
  • +
  • The gamma distribution is a special case of the generalized gamma distribution, the generalized integer gamma distribution, and the generalized inverse Gaussian distribution.
  • +
  • Among the discrete distributions, the negative binomial distribution is sometimes considered the discrete analog of the gamma distribution.
  • +
  • Tweedie distributions – the gamma distribution is a member of the family of Tweedie exponential dispersion models.
  • +
  • Modified Half-normal distribution – the Gamma distribution is a member of the family of Modified half-normal distribution.[17] The corresponding density is , where denotes the Fox–Wright Psi function.
  • +
  • For the shape-scale parameterization , if the scale parameter where denotes the Inverse-gamma distribution, then the marginal distribution where denotes the Beta prime distribution.
+

Compound gamma[edit]

+

If the shape parameter of the gamma distribution is known, but the inverse-scale parameter is unknown, then a gamma distribution for the inverse scale forms a conjugate prior. The compound distribution, which results from integrating out the inverse scale, has a closed-form solution known as the compound gamma distribution.[18] +

If, instead, the shape parameter is known but the mean is unknown, with the prior of the mean being given by another gamma distribution, then it results in K-distribution. +

+

Weibull and stable count[edit]

+

The gamma distribution can be expressed as the product distribution of a Weibull distribution and a variant form of the stable count distribution. +Its shape parameter can be regarded as the inverse of Lévy's stability parameter in the stable count distribution: +

+where is a standard stable count distribution of shape , and is a standard Weibull distribution of shape . +

+

Statistical inference[edit]

+

Parameter estimation[edit]

+

Maximum likelihood estimation[edit]

+

The likelihood function for N iid observations (x1, ..., xN) is +

$$L(k, \theta) = \prod_{i=1}^N f(x_i; k, \theta),$$

from which we calculate the log-likelihood function +

$$\ell(k, \theta) = (k - 1)\sum_{i=1}^N \ln x_i - \frac{1}{\theta}\sum_{i=1}^N x_i - Nk\ln\theta - N\ln\Gamma(k).$$

Finding the maximum with respect to θ by taking the derivative and setting it equal to zero yields the maximum likelihood estimator of the θ parameter, which equals the sample mean divided by the shape parameter k: +

$$\hat\theta = \frac{1}{kN}\sum_{i=1}^N x_i = \frac{\bar{x}}{k}.$$

Substituting this into the log-likelihood function gives +

$$\ell(k) = (k - 1)\sum_{i=1}^N \ln x_i - Nk - Nk\ln\!\left(\frac{\sum_i x_i}{kN}\right) - N\ln\Gamma(k).$$

We need at least two samples, $N \ge 2$, because for $N = 1$ the function $\ell(k)$ increases without bound as $k \to \infty$. For $N \ge 2$, it can be verified that $\ell(k)$ is strictly concave, by using inequality properties of the polygamma function. Finding the maximum with respect to k by taking the derivative and setting it equal to zero yields

$$\ln k - \psi(k) = \ln\bar{x} - \overline{\ln x},$$

where $\psi$ is the digamma function and $\overline{\ln x}$ is the sample mean of ln(x). There is no closed-form solution for k. The function is numerically very well behaved, so if a numerical solution is desired, it can be found using, for example, Newton's method. An initial value of k can be found either using the method of moments, or using the approximation

$$\ln k - \psi(k) \approx \frac{1}{2k}\left(1 + \frac{1}{6k + 1}\right).$$

If we let +

$$s = \ln\bar{x} - \overline{\ln x} = \ln\!\left(\frac{1}{N}\sum_{i=1}^N x_i\right) - \frac{1}{N}\sum_{i=1}^N \ln x_i,$$

then k is approximately +

$$k \approx \frac{3 - s + \sqrt{(s - 3)^2 + 24s}}{12s},$$

which is within 1.5% of the correct value.[19] An explicit form for the Newton–Raphson update of this initial guess is:[20] +

$$k \leftarrow k - \frac{\ln k - \psi(k) - s}{\frac{1}{k} - \psi_1(k)}.$$

At the maximum-likelihood estimate $(\hat{k}, \hat\theta)$, the expected values for $x$ and $\ln x$ agree with the empirical averages:

$$\hat{k}\,\hat\theta = \bar{x}, \qquad \psi(\hat{k}) + \ln\hat\theta = \overline{\ln x}.$$
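Putting the pieces together, a minimal sketch of the whole fitting procedure (initial approximation followed by Newton–Raphson; the function name, seed, and parameter values are illustrative, not part of the cited references):

```python
import numpy as np
from scipy.special import digamma, polygamma

def fit_gamma_mle(x, n_iter=8):
    """Shape-scale maximum-likelihood fit via Newton's method on k."""
    s = np.log(np.mean(x)) - np.mean(np.log(x))                 # s = ln(mean) - mean(ln x)
    k = (3 - s + np.sqrt((s - 3) ** 2 + 24 * s)) / (12 * s)     # initial approximation
    for _ in range(n_iter):
        # Newton-Raphson step for ln(k) - digamma(k) = s.
        k -= (np.log(k) - digamma(k) - s) / (1 / k - polygamma(1, k))
    return k, np.mean(x) / k                                    # (k_hat, theta_hat)

rng = np.random.default_rng(5)
data = rng.gamma(2.5, 3.0, size=100_000)
print(fit_gamma_mle(data))   # should be close to (2.5, 3.0)
```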
Caveat for small shape parameter[edit]
+

For data $x_1, \ldots, x_N$ that are represented in a floating-point format that underflows to 0 for values smaller than $\varepsilon$, the logarithms that are needed for the maximum-likelihood estimate will cause failure if there are any underflows. If we assume the data were generated by a gamma distribution with cdf $F(x; k, \theta)$, then the probability that there is at least one underflow is

$$P(\text{at least one } x_i < \varepsilon) = 1 - \big(1 - F(\varepsilon; k, \theta)\big)^N.$$

This probability will approach 1 for small $k$ and large $N$. A workaround is to instead have the data in logarithmic format.

In order to test an implementation of a maximum-likelihood estimator that takes logarithmic data as input, it is useful to be able to generate non-underflowing logarithms of random gamma variates when $k$ is small. Following the implementation in scipy.stats.loggamma, this can be done as follows:[21] sample $Y \sim \operatorname{Gamma}(k + 1, \theta)$ and $U \sim \operatorname{Uniform}(0, 1)$ independently. Then the required logarithmic sample is $Z = \ln Y + \frac{\ln U}{k}$, so that $e^Z \sim \operatorname{Gamma}(k, \theta)$.
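A direct transcription of that recipe (a sketch with NumPy; the shape value is chosen small enough that exponentiating the result would frequently underflow in double precision):

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(6)
k, theta = 1e-3, 1.0

# ln of a Gamma(k, theta) variate computed without underflow, using
# X = Y * U**(1/k) with Y ~ Gamma(k + 1, theta) and U ~ Uniform(0, 1).
y = rng.gamma(k + 1.0, theta, size=100_000)
u = rng.uniform(size=100_000)
log_x = np.log(y) + np.log(u) / k          # exp(log_x) would often underflow to 0

print(log_x.mean(), digamma(k) + np.log(theta))   # sample mean vs exact E[ln X]
```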

+

Closed-form estimators[edit]

+

Consistent closed-form estimators of k and θ exist that are derived from the likelihood of the generalized gamma distribution.[22]

The estimate for the shape k is +

$$\hat{k} = \frac{N\sum_{i=1}^N x_i}{N\sum_{i=1}^N x_i\ln x_i - \sum_{i=1}^N \ln x_i\,\sum_{i=1}^N x_i},$$

and the estimate for the scale θ is +

$$\hat\theta = \frac{1}{N^2}\left(N\sum_{i=1}^N x_i\ln x_i - \sum_{i=1}^N \ln x_i\,\sum_{i=1}^N x_i\right).$$

Using the sample mean of x, the sample mean of ln(x), and the sample mean of the product x·ln(x) simplifies the expressions to: +

$$\hat\theta = \overline{x\ln x} - \bar{x}\,\overline{\ln x}, \qquad \hat{k} = \frac{\bar{x}}{\overline{x\ln x} - \bar{x}\,\overline{\ln x}}.$$

If the rate parameterization is used, the estimate of the rate is $\hat\beta = 1/\hat\theta$.

These estimators are not strictly maximum likelihood estimators, but are instead referred to as mixed type log-moment estimators. They have, however, efficiency similar to that of the maximum likelihood estimators.

Although these estimators are consistent, they have a small bias. A bias-corrected variant of the estimator for the scale θ is +

$$\tilde\theta = \frac{N}{N - 1}\,\hat\theta.$$

A bias correction for the shape parameter k is given as[23] +

+
+
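A minimal sketch of the basic log-moment estimators, with the N/(N − 1) scale correction noted above applied (the shape bias correction is omitted; function name, seed, and parameter values are illustrative):

```python
import numpy as np

def fit_gamma_closed_form(x):
    # Mixed type log-moment estimators, with the N/(N-1)
    # bias correction applied to the scale estimate.
    n = len(x)
    theta_hat = np.mean(x * np.log(x)) - np.mean(x) * np.mean(np.log(x))
    k_hat = np.mean(x) / theta_hat
    return k_hat, theta_hat * n / (n - 1)

rng = np.random.default_rng(7)
data = rng.gamma(2.5, 3.0, size=100_000)
print(fit_gamma_closed_form(data))   # should be close to (2.5, 3.0)
```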

Bayesian minimum mean squared error[edit]

+

With known k and unknown θ, the posterior density function for theta (using the standard scale-invariant prior for θ) is +

$$P(\theta \mid k, x_1, \ldots, x_N) \propto \frac{1}{\theta}\prod_{i=1}^N f(x_i; k, \theta).$$

Denoting +

$$y \equiv \sum_{i=1}^N x_i,$$

Integration with respect to θ can be carried out using a change of variables, revealing that 1/θ is gamma-distributed with parameters α = Nk, β = y. +

$$\int_0^\infty \theta^{-Nk - 1 + m}\, e^{-y/\theta}\,d\theta = \int_0^\infty u^{Nk - 1 - m}\, e^{-uy}\,du = y^{-(Nk - m)}\,\Gamma(Nk - m).$$

The moments can be computed by taking the ratio (m by m = 0) +

$$\operatorname{E}[\theta^m] = \frac{\Gamma(Nk - m)}{\Gamma(Nk)}\,y^m,$$

which shows that the mean ± standard deviation estimate of the posterior distribution for θ is +

+
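A small sketch of this posterior calculation, cross-checked by sampling (assumes the inverse-gamma form of the posterior stated above; data, seed, and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
k, theta_true, N = 2.0, 3.0, 50
x = rng.gamma(k, theta_true, size=N)
y = x.sum()

# Posterior of theta given k (scale-invariant prior): 1/theta ~ Gamma(N*k, rate=y).
post_mean = y / (N * k - 1)
post_sd = post_mean / np.sqrt(N * k - 2)
print(f"theta | data : {post_mean:.3f} +/- {post_sd:.3f}   (true value {theta_true})")

# Cross-check by sampling the posterior directly.
theta_samples = 1.0 / rng.gamma(N * k, 1.0 / y, size=200_000)
print(theta_samples.mean(), theta_samples.std())
```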
+

Bayesian inference[edit]

+

Conjugate prior[edit]

+

In Bayesian inference, the gamma distribution is the conjugate prior to many likelihood distributions: the Poisson, exponential, normal (with known mean), Pareto, gamma with known shape σ, inverse gamma with known shape parameter, and Gompertz with known scale parameter. +

The gamma distribution's conjugate prior is:[24] +

+
+

where Z is the normalizing constant with no closed-form solution. +The posterior distribution can be found by updating the parameters as follows: +

+
+

where n is the number of observations, and xi is the ith observation. +

+

Occurrence and applications[edit]

+

Consider a sequence of events, with the waiting time for each event being an exponential distribution with rate $\beta$. Then the waiting time for the $n$-th event to occur is the gamma distribution with integer shape $\alpha = n$. This construction of the gamma distribution allows it to model a wide variety of phenomena where several sub-events, each taking time with exponential distribution, must happen in sequence for a major event to occur.[25] Examples include the waiting time of cell-division events,[26] number of compensatory mutations for a given mutation,[27] waiting time until a repair is necessary for a hydraulic system,[28] and so on.

In biophysics, the dwell time between steps of a molecular motor like ATP synthase is nearly exponential at constant ATP concentration, revealing that each step of the motor takes a single ATP hydrolysis. If there were n ATP hydrolysis events, then it would be a gamma distribution with degree n.[29] +

The gamma distribution has been used to model the size of insurance claims[30] and rainfalls.[31] This means that aggregate insurance claims and the amount of rainfall accumulated in a reservoir are modelled by a gamma process – much like the exponential distribution generates a Poisson process. +

The gamma distribution is also used to model errors in multi-level Poisson regression models because a mixture of Poisson distributions with gamma-distributed rates has a known closed form distribution, called negative binomial. +

In wireless communication, the gamma distribution is used to model the multi-path fading of signal power;[citation needed] see also Rayleigh distribution and Rician distribution. +

In oncology, the age distribution of cancer incidence often follows the gamma distribution, wherein the shape and scale parameters predict, respectively, the number of driver events and the time interval between them.[32][33] +

In neuroscience, the gamma distribution is often used to describe the distribution of inter-spike intervals.[34][35] +

In bacterial gene expression, the copy number of a constitutively expressed protein often follows the gamma distribution, where the scale and shape parameter are, respectively, the mean number of bursts per cell cycle and the mean number of protein molecules produced by a single mRNA during its lifetime.[36] +

In genomics, the gamma distribution was applied in peak calling step (i.e., in recognition of signal) in ChIP-chip[37] and ChIP-seq[38] data analysis. +

In Bayesian statistics, the gamma distribution is widely used as a conjugate prior. It is the conjugate prior for the precision (i.e. inverse of the variance) of a normal distribution. It is also the conjugate prior for the exponential distribution. +

In phylogenetics, the gamma distribution is the most commonly used approach to model among-sites rate variation[39] when maximum likelihood, Bayesian, or distance matrix methods are used to estimate phylogenetic trees. Phylogenetic analyses that use the gamma distribution to model rate variation estimate a single parameter from the data because they limit consideration to distributions where α=β. This parameterization means that the mean of this distribution is 1 and the variance is 1/α. Maximum likelihood and Bayesian methods typically use a discrete approximation to the continuous gamma distribution.[40][41]

+

Random variate generation[edit]

+

Given the scaling property above, it is enough to generate gamma variables with θ = 1, as we can later convert to any value of β with a simple division. +

Suppose we wish to generate random variables from Gamma(n + δ, 1), where n is a non-negative integer and 0 < δ < 1. Using the fact that a Gamma(1, 1) distribution is the same as an Exp(1) distribution, and noting the method of generating exponential variables, we conclude that if U is uniformly distributed on (0, 1], then −ln(U) is distributed Gamma(1, 1) (i.e. inverse transform sampling). Now, using the "α-addition" property of gamma distribution, we expand this result: +

$$-\sum_{i=1}^n \ln U_i \sim \operatorname{Gamma}(n, 1),$$

where Uk are all uniformly distributed on (0, 1] and independent. All that is left now is to generate a variable distributed as Gamma(δ, 1) for 0 < δ < 1 and apply the "α-addition" property once more. This is the most difficult part. +

Random generation of gamma variates is discussed in detail by Devroye,[42]: 401–428  noting that none are uniformly fast for all shape parameters. For small values of the shape parameter, the algorithms are often not valid.[42]: 406  For arbitrary values of the shape parameter, one can apply the Ahrens and Dieter[43] modified acceptance-rejection method Algorithm GD (shape k ≥ 1), or transformation method[44] when 0 < k < 1. Also see Cheng and Feast Algorithm GKM 3[45] or Marsaglia's squeeze method.[46] +

The following is a version of the Ahrens-Dieter acceptance–rejection method:[43] +

+
  1. Generate U, V and W as iid uniform (0, 1] variates.
  2. If $U \le \frac{e}{e + \delta}$ then $\xi = V^{1/\delta}$ and $\eta = W\xi^{\delta - 1}$. Otherwise, $\xi = 1 - \ln V$ and $\eta = W e^{-\xi}$.
  3. If $\eta > \xi^{\delta - 1} e^{-\xi}$ then go to step 1.
  4. ξ is distributed as Γ(δ, 1).
+

A summary of this is +

$$\theta\left(\xi - \sum_{i=1}^{\lfloor k \rfloor} \ln U_i\right) \sim \operatorname{Gamma}(k, \theta),$$

where $\lfloor k \rfloor$ is the integer part of k, ξ is generated via the algorithm above with δ = {k} (the fractional part of k), and the $U_i$ are all independent and uniformly distributed on (0, 1]. A sketch of the complete procedure follows.
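A minimal implementation of this combination (rejection step for the fractional part plus a sum of exponentials), under the step formulas as reconstructed above; function name, seed, and parameter values are illustrative:

```python
import numpy as np

def gamma_ahrens_dieter(k, theta, rng):
    """Gamma(k, theta) variate: rejection for the fractional part, plus floor(k) exponentials."""
    delta = k % 1.0
    xi = 0.0
    if delta > 0.0:
        while True:
            u, v, w = 1.0 - rng.uniform(size=3)     # iid uniforms on (0, 1]
            if u <= np.e / (np.e + delta):
                xi = v ** (1.0 / delta)
                eta = w * xi ** (delta - 1.0)
            else:
                xi = 1.0 - np.log(v)
                eta = w * np.exp(-xi)
            if eta <= xi ** (delta - 1.0) * np.exp(-xi):
                break
    n = int(np.floor(k))
    if n > 0:
        xi += -np.log(1.0 - rng.uniform(size=n)).sum()   # sum of Exp(1) variates
    return theta * xi

rng = np.random.default_rng(9)
samples = np.array([gamma_ahrens_dieter(2.7, 1.5, rng) for _ in range(50_000)])
print(samples.mean(), 2.7 * 1.5)        # sample mean vs k * theta
print(samples.var(), 2.7 * 1.5 ** 2)    # sample variance vs k * theta**2
```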

While the above approach is technically correct, Devroye notes that it is linear in the value of k and generally is not a good choice. Instead, he recommends using either rejection-based or table-based methods, depending on context.[42]: 401–428  +

For example, Marsaglia's simple transformation-rejection method relying on one normal variate X and one uniform variate U:[21] +

+
  1. Set $d = k - \tfrac{1}{3}$ and $c = \tfrac{1}{\sqrt{9d}}$.
  2. Set $v = (1 + cX)^3$, where $X$ is a standard normal variate.
  3. If $v > 0$ and $\ln U < \tfrac{X^2}{2} + d - dv + d\ln v$, return $dv$; else go back to step 2.
+

With $k \ge 1$, this generates a gamma-distributed random number in time that is approximately constant with k. The acceptance rate does depend on k, with an acceptance rate of 0.95, 0.98, and 0.99 for k = 1, 2, and 4. For k < 1, one can use $\gamma_k = \gamma_{1+k}\,U^{1/k}$ to boost k to be usable with this method.
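A compact sketch of the transformation-rejection method, including the boosting trick for k < 1 (function name, seed, and test values are illustrative; the squeeze step of the published algorithm is omitted):

```python
import numpy as np

def gamma_marsaglia_tsang(k, rng):
    """Gamma(k, 1) variate; uses the k < 1 boosting trick gamma_k = gamma_{k+1} * U**(1/k)."""
    if k < 1.0:
        return gamma_marsaglia_tsang(k + 1.0, rng) * rng.uniform() ** (1.0 / k)
    d = k - 1.0 / 3.0
    c = 1.0 / np.sqrt(9.0 * d)
    while True:
        x = rng.standard_normal()
        v = (1.0 + c * x) ** 3
        if v <= 0.0:
            continue
        u = rng.uniform()
        if np.log(u) < 0.5 * x * x + d - d * v + d * np.log(v):
            return d * v

rng = np.random.default_rng(10)
samples = np.array([gamma_marsaglia_tsang(0.5, rng) for _ in range(50_000)])
print(samples.mean(), samples.var())   # Gamma(0.5, 1) has mean 0.5 and variance 0.5
```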

+

References[edit]

+
+
    +
  1. ^ Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (PDF). Journal of Econometrics. 150 (2): 219–230. CiteSeerX 10.1.1.511.9750. doi:10.1016/j.jeconom.2008.12.014. Archived from the original (PDF) on 2016-03-07. Retrieved 2011-06-02. +
  2. +
  3. ^ Hogg, R. V.; Craig, A. T. (1978). Introduction to Mathematical Statistics (4th ed.). New York: Macmillan. pp. Remark 3.3.1. ISBN 0023557109. +
  4. +
  5. ^ Gopalan, Prem; Hofman, Jake M.; Blei, David M. (2013). "Scalable Recommendation with Poisson Factorization". arXiv:1311.1704 [cs.IR]. +
  6. +
  7. ^ a b Papoulis, Pillai, Probability, Random Variables, and Stochastic Processes, Fourth Edition +
  8. +
  9. ^ Jeesen Chen, Herman Rubin, Bounds for the difference between median and mean of gamma and Poisson distributions, Statistics & Probability Letters, Volume 4, Issue 6, October 1986, Pages 281–283, ISSN 0167-7152, [1]. +
  10. +
  11. ^ Choi, K. P. "On the Medians of the Gamma Distributions and an Equation of Ramanujan", Proceedings of the American Mathematical Society, Vol. 121, No. 1 (May, 1994), pp. 245–251. +
  12. +
  13. ^ a b Berg, Christian & Pedersen, Henrik L. (March 2006). "The Chen–Rubin conjecture in a continuous setting" (PDF). Methods and Applications of Analysis. 13 (1): 63–88. doi:10.4310/MAA.2006.v13.n1.a4. S2CID 6704865. Retrieved 1 April 2020. +
  14. +
  15. ^ a b Berg, Christian and Pedersen, Henrik L. "Convexity of the median in the gamma distribution". +
  16. +
  17. ^ Gaunt, Robert E., and Milan Merkle (2021). "On bounds for the mode and median of the generalized hyperbolic and related distributions". Journal of Mathematical Analysis and Applications. 493 (1): 124508. arXiv:2002.01884. doi:10.1016/j.jmaa.2020.124508. S2CID 221103640.{{cite journal}}: CS1 maint: multiple names: authors list (link) +
  18. +
  19. ^ a b c Lyon, Richard F. (13 May 2021). "On closed-form tight bounds and approximations for the median of a gamma distribution". PLOS One. 16 (5): e0251626. arXiv:2011.04060. Bibcode:2021PLoSO..1651626L. doi:10.1371/journal.pone.0251626. PMC 8118309. PMID 33984053. +
  20. +
  21. ^ a b Lyon, Richard F. (13 May 2021). "Tight bounds for the median of a gamma distribution". PLOS One. 18 (9): e0288601. doi:10.1371/journal.pone.0288601. +
  22. +
  23. ^ Mathai, A. M. (1982). "Storage capacity of a dam with gamma type inputs". Annals of the Institute of Statistical Mathematics. 34 (3): 591–597. doi:10.1007/BF02481056. ISSN 0020-3157. S2CID 122537756. +
  24. +
  25. ^ Moschopoulos, P. G. (1985). "The distribution of the sum of independent gamma random variables". Annals of the Institute of Statistical Mathematics. 37 (3): 541–544. doi:10.1007/BF02481123. S2CID 120066454. +
  26. +
  27. ^ W.D. Penny, [www.fil.ion.ucl.ac.uk/~wpenny/publications/densities.ps KL-Divergences of Normal, Gamma, Dirichlet, and Wishart densities][full citation needed] +
  28. +
  29. ^ "ExpGammaDistribution—Wolfram Language Documentation". +
  30. +
  31. ^ "scipy.stats.loggamma — SciPy v1.8.0 Manual". docs.scipy.org. +
  32. +
  33. ^ Sun, Jingchao; Kong, Maiying; Pal, Subhadip (22 June 2021). "The Modified-Half-Normal distribution: Properties and an efficient sampling scheme". Communications in Statistics - Theory and Methods. 52 (5): 1591–1613. doi:10.1080/03610926.2021.1934700. ISSN 0361-0926. S2CID 237919587. +
  34. +
  35. ^ Dubey, Satya D. (December 1970). "Compound gamma, beta and F distributions". Metrika. 16: 27–31. doi:10.1007/BF02613934. S2CID 123366328. +
  36. +
  37. ^ Minka, Thomas P. (2002). "Estimating a Gamma distribution" (PDF). {{cite journal}}: Cite journal requires |journal= (help) +
  38. +
  39. ^ Choi, S. C.; Wette, R. (1969). "Maximum Likelihood Estimation of the Parameters of the Gamma Distribution and Their Bias". Technometrics. 11 (4): 683–690. doi:10.1080/00401706.1969.10490731. +
  40. +
  41. ^ a b Marsaglia, G.; Tsang, W. W. (2000). "A simple method for generating gamma variables". ACM Transactions on Mathematical Software. 26 (3): 363–372. doi:10.1145/358407.358414. S2CID 2634158. +
  42. +
  43. ^ Ye, Zhi-Sheng; Chen, Nan (2017). "Closed-Form Estimators for the Gamma Distribution Derived from Likelihood Equations". The American Statistician. 71 (2): 177–181. doi:10.1080/00031305.2016.1209129. S2CID 124682698. +
  44. +
  45. ^ Louzada, Francisco; Ramos, Pedro L.; Ramos, Eduardo (2019). "A Note on Bias of Closed-Form Estimators for the Gamma Distribution Derived from Likelihood Equations". The American Statistician. 73 (2): 195–199. doi:10.1080/00031305.2018.1513376. S2CID 126086375. +
  46. +
  47. ^ Fink, D. 1995 A Compendium of Conjugate Priors. In progress report: Extension and enhancement of methods for setting data quality objectives. (DOE contract 95‑831). +
  48. +
  49. ^ Jessica., Scheiner, Samuel M., 1956- Gurevitch (2001). "13. Failure-time analysis". Design and analysis of ecological experiments. Oxford University Press. ISBN 0-19-513187-8. OCLC 43694448.{{cite book}}: CS1 maint: multiple names: authors list (link) CS1 maint: numeric names: authors list (link) +
  50. +
  51. ^ Golubev, A. (March 2016). "Applications and implications of the exponentially modified gamma distribution as a model for time variabilities related to cell proliferation and gene expression". Journal of Theoretical Biology. 393: 203–217. Bibcode:2016JThBi.393..203G. doi:10.1016/j.jtbi.2015.12.027. ISSN 0022-5193. PMID 26780652. +
  52. +
  53. ^ Poon, Art; Davis, Bradley H; Chao, Lin (2005-07-01). "The Coupon Collector and the Suppressor Mutation". Genetics. 170 (3): 1323–1332. doi:10.1534/genetics.104.037259. ISSN 1943-2631. PMC 1451182. PMID 15879511. +
  54. +
  55. ^ Vineyard, Michael; Amoako-Gyampah, Kwasi; Meredith, Jack R (July 1999). "Failure rate distributions for flexible manufacturing systems: An empirical study". European Journal of Operational Research. 116 (1): 139–155. doi:10.1016/s0377-2217(98)00096-4. ISSN 0377-2217. +
  56. +
  57. ^ Rief, Matthias; Rock, Ronald S.; Mehta, Amit D.; Mooseker, Mark S.; Cheney, Richard E.; Spudich, James A. (2000-08-15). "Myosin-V stepping kinetics: A molecular model for processivity". Proceedings of the National Academy of Sciences. 97 (17): 9482–9486. doi:10.1073/pnas.97.17.9482. ISSN 0027-8424. PMC 16890. PMID 10944217. +
  58. +
  59. ^ p. 43, Philip J. Boland, Statistical and Probabilistic Methods in Actuarial Science, Chapman & Hall CRC 2007 +
  60. +
  61. ^ Wilks, Daniel S. (1990). "Maximum Likelihood Estimation for the Gamma Distribution Using Data Containing Zeros". Journal of Climate. 3 (12): 1495–1501. Bibcode:1990JCli....3.1495W. doi:10.1175/1520-0442(1990)003<1495:MLEFTG>2.0.CO;2. ISSN 0894-8755. JSTOR 26196366. +
  62. +
  63. ^ Belikov, Aleksey V. (22 September 2017). "The number of key carcinogenic events can be predicted from cancer incidence". Scientific Reports. 7 (1): 12170. Bibcode:2017NatSR...712170B. doi:10.1038/s41598-017-12448-7. PMC 5610194. PMID 28939880. +
  64. +
  65. ^ Belikov, Aleksey V.; Vyatkin, Alexey; Leonov, Sergey V. (2021-08-06). "The Erlang distribution approximates the age distribution of incidence of childhood and young adulthood cancers". PeerJ. 9: e11976. doi:10.7717/peerj.11976. ISSN 2167-8359. PMC 8351573. PMID 34434669. +
  66. +
  67. ^ J. G. Robson and J. B. Troy, "Nature of the maintained discharge of Q, X, and Y retinal ganglion cells of the cat", J. Opt. Soc. Am. A 4, 2301–2307 (1987) +
  68. +
  69. ^ M.C.M. Wright, I.M. Winter, J.J. Forster, S. Bleeck "Response to best-frequency tone bursts in the ventral cochlear nucleus is governed by ordered inter-spike interval statistics", Hearing Research 317 (2014) +
  70. +
  71. ^ N. Friedman, L. Cai and X. S. Xie (2006) "Linking stochastic dynamics to population distribution: An analytical framework of gene expression", Phys. Rev. Lett. 97, 168302. +
  72. +
  73. ^ DJ Reiss, MT Facciotti and NS Baliga (2008) "Model-based deconvolution of genome-wide DNA binding", Bioinformatics, 24, 396–403 +
  74. +
  75. ^ MA Mendoza-Parra, M Nowicka, W Van Gool, H Gronemeyer (2013) "Characterising ChIP-seq binding patterns by model-based peak shape deconvolution", BMC Genomics, 14:834 +
  76. +
  77. ^ Yang, Ziheng (September 1996). "Among-site rate variation and its impact on phylogenetic analyses". Trends in Ecology & Evolution. 11 (9): 367–372. doi:10.1016/0169-5347(96)10041-0. PMID 21237881. +
  78. +
  79. ^ Yang, Ziheng (September 1994). "Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods". Journal of Molecular Evolution. 39 (3): 306–314. Bibcode:1994JMolE..39..306Y. doi:10.1007/BF00160154. ISSN 0022-2844. PMID 7932792. S2CID 17911050. +
  80. +
  81. ^ Felsenstein, Joseph (2001-10-01). "Taking Variation of Evolutionary Rates Between Sites into Account in Inferring Phylogenies". Journal of Molecular Evolution. 53 (4–5): 447–455. Bibcode:2001JMolE..53..447F. doi:10.1007/s002390010234. ISSN 0022-2844. PMID 11675604. S2CID 9791493. +
  82. +
  83. ^ a b c Devroye, Luc (1986). Non-Uniform Random Variate Generation. New York: Springer-Verlag. ISBN 978-0-387-96305-1. See Chapter 9, Section 3. +
  84. +
  85. ^ a b Ahrens, J. H.; Dieter, U (January 1982). "Generating gamma variates by a modified rejection technique". Communications of the ACM. 25 (1): 47–54. doi:10.1145/358315.358390. S2CID 15128188.. See Algorithm GD, p. 53. +
  86. +
  87. ^ Ahrens, J. H.; Dieter, U. (1974). "Computer methods for sampling from gamma, beta, Poisson and binomial distributions". Computing. 12 (3): 223–246. CiteSeerX 10.1.1.93.3828. doi:10.1007/BF02293108. S2CID 37484126. +
  88. +
  89. ^ Cheng, R. C. H.; Feast, G. M. (1979). "Some Simple Gamma Variate Generators". Journal of the Royal Statistical Society. Series C (Applied Statistics). 28 (3): 290–295. doi:10.2307/2347200. JSTOR 2347200. +
  90. +
  91. ^ Marsaglia, G. The squeeze method for generating gamma variates. Comput, Math. Appl. 3 (1977), 321–325. +
  92. +
+

External links[edit]

\ No newline at end of file diff --git a/references/Normal_distribution?lang=en b/references/Normal_distribution?lang=en new file mode 100644 index 0000000..dda7b92 --- /dev/null +++ b/references/Normal_distribution?lang=en @@ -0,0 +1,22485 @@ Normal distribution - Wikipedia
Normal distribution

From Wikipedia, the free encyclopedia
Normal distribution
+
Probability density function
The red curve is the standard normal distribution
+
Cumulative distribution function
Notation +
Parameters + = mean (location)
= variance (squared scale)
Support +
PDF +
CDF +
Quantile +
Mean +
Median +
Mode +
Variance +
MAD +
Skewness +
Ex. kurtosis +
Entropy +
MGF +
CF +
Fisher information +

+

+
Kullback–Leibler divergence +
+ + + + + + +

In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is +

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}.$$

The parameter $\mu$ is the mean or expectation of the distribution (and also its median and mode), while the parameter $\sigma$ is its standard deviation. The variance of the distribution is $\sigma^2$. A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate.

Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.[2][3] Their importance is partly due to the central limit theorem. It states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable—whose distribution converges to a normal distribution as the number of samples increases. Therefore, physical quantities that are expected to be the sum of many independent processes, such as measurement errors, often have distributions that are nearly normal.[4] +

Moreover, Gaussian distributions have some unique properties that are valuable in analytic studies. For instance, any linear combination of a fixed collection of independent normal deviates is a normal deviate. Many results and methods, such as propagation of uncertainty and least squares[5] parameter fitting, can be derived analytically in explicit form when the relevant variables are normally distributed. +

A normal distribution is sometimes informally called a bell curve.[6] However, many other distributions are bell-shaped (such as the Cauchy, Student's t, and logistic distributions). For other names, see Naming. +

The univariate probability distribution is generalized for vectors in the multivariate normal distribution and for matrices in the matrix normal distribution. +

+ +

Definitions[edit]

+

Standard normal distribution[edit]

+

The simplest case of a normal distribution is known as the standard normal distribution or unit normal distribution. This is a special case when $\mu = 0$ and $\sigma = 1$, and it is described by this probability density function (or density):

$$\varphi(z) = \frac{e^{-z^2/2}}{\sqrt{2\pi}}.$$

The variable $z$ has a mean of 0 and a variance and standard deviation of 1. The density has its peak at $z = 0$ and inflection points at $z = +1$ and $z = -1$.

Although the density above is most commonly known as the standard normal, a few authors have used that term to describe other versions of the normal distribution. Carl Friedrich Gauss, for example, once defined the standard normal as +

+
+

which has a variance of 1/2, and Stephen Stigler[7] once defined the standard normal as +

+
+

which has a simple functional form and a variance of +

+

General normal distribution[edit]

+

Every normal distribution is a version of the standard normal distribution, whose domain has been stretched by a factor $\sigma$ (the standard deviation) and then translated by $\mu$ (the mean value):

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma}\,\varphi\!\left(\frac{x - \mu}{\sigma}\right).$$

The probability density must be scaled by $1/\sigma$ so that the integral is still 1.

If $Z$ is a standard normal deviate, then $X = \sigma Z + \mu$ will have a normal distribution with expected value $\mu$ and standard deviation $\sigma$. This is equivalent to saying that the standard normal distribution $Z$ can be scaled/stretched by a factor of $\sigma$ and shifted by $\mu$ to yield a different normal distribution, called $X$. Conversely, if $X$ is a normal deviate with parameters $\mu$ and $\sigma^2$, then this $X$ distribution can be re-scaled and shifted via the formula $Z = (X - \mu)/\sigma$ to convert it to the standard normal distribution. This variate is also called the standardized form of $X$.

+

Notation[edit]

+

The probability density of the standard Gaussian distribution (standard normal distribution, with zero mean and unit variance) is often denoted with the Greek letter (phi).[8] The alternative form of the Greek letter phi, , is also used quite often. +

The normal distribution is often referred to as or .[9] Thus when a random variable is normally distributed with mean and standard deviation , one may write +

+
+

Alternative parameterizations[edit]

+

Some authors advocate using the precision as the parameter defining the width of the distribution, instead of the deviation or the variance . The precision is normally defined as the reciprocal of the variance, .[10] The formula for the distribution then becomes +

+
+

This choice is claimed to have advantages in numerical computations when is very close to zero, and simplifies formulas in some contexts, such as in the Bayesian inference of variables with multivariate normal distribution. +

Alternatively, the reciprocal of the standard deviation might be defined as the precision, in which case the expression of the normal distribution becomes +

+
+

According to Stigler, this formulation is advantageous because of a much simpler and easier-to-remember formula, and simple approximate formulas for the quantiles of the distribution. +

Normal distributions form an exponential family with natural parameters and , and natural statistics x and x2. The dual expectation parameters for normal distribution are η1 = μ and η2 = μ2 + σ2. +

+

Cumulative distribution function[edit]

+

The cumulative distribution function (CDF) of the standard normal distribution, usually denoted with the capital Greek letter $\Phi$ (phi), is the integral

$$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^x e^{-t^2/2}\,dt.$$

Error Function[edit]

+

The related error function $\operatorname{erf}(x)$ gives the probability of a random variable, with normal distribution of mean 0 and variance 1/2, falling in the range $[-x, x]$. That is:

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt.$$

These integrals cannot be expressed in terms of elementary functions, and are often said to be special functions. However, many numerical approximations are known; see below for more. +

The two functions are closely related, namely +

$$\Phi(x) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right].$$

For a generic normal distribution with density $f$, mean $\mu$ and deviation $\sigma$, the cumulative distribution function is

$$F(x) = \Phi\!\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right].$$

The complement of the standard normal cumulative distribution function, $Q(x) = 1 - \Phi(x)$, is often called the Q-function, especially in engineering texts.[11][12] It gives the probability that the value of a standard normal random variable $X$ will exceed $x$: $P(X > x)$. Other definitions of the $Q$-function, all of which are simple transformations of $\Phi$, are also used occasionally.[13]
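A small sketch of the erf relation and the Q-function using only the standard library (the function names are illustrative):

```python
from math import erf, sqrt

def Phi(x):
    # Standard normal CDF via the error function: Phi(x) = (1 + erf(x / sqrt(2))) / 2.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def Q(x):
    # Q-function: upper-tail probability P(X > x) for a standard normal X.
    return 1.0 - Phi(x)

print(Phi(1.96), Q(1.96))   # approximately 0.975 and 0.025
```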

The graph of the standard normal cumulative distribution function has 2-fold rotational symmetry around the point (0,1/2); that is, . Its antiderivative (indefinite integral) can be expressed as follows: +

+
+

The cumulative distribution function of the standard normal distribution can be expanded by Integration by parts into a series: +

+
+

where denotes the double factorial. +

An asymptotic expansion of the cumulative distribution function for large x can also be derived using integration by parts. For more, see Error function#Asymptotic expansion.[14] +

A quick approximation to the standard normal distribution's cumulative distribution function can be found by using a Taylor series approximation: +

+
+

Recursive computation with Taylor series expansion[edit]

+

The recursive nature of the family of derivatives may be used to easily construct a rapidly converging Taylor series expansion using recursive entries about any point of known value of the distribution,: +

+
+

where: +

+
+
+
, for all n ≥ 2.
+

Using the Taylor series and Newton's method for the inverse function[edit]

+

An application for the above Taylor series expansion is to use Newton's method to reverse the computation. That is, if we have a value for the cumulative distribution function, , but do not know the x needed to obtain the , we can use Newton's method to find x, and use the Taylor series expansion above to minimize the number of computations. Newton's method is ideal to solve this problem because the first derivative of , which is an integral of the normal standard distribution, is the normal standard distribution, and is readily available to use in the Newton's method solution. +

To solve, select a known approximate solution, , to the desired . may be a value from a distribution table, or an intelligent estimate followed by a computation of using any desired means to compute. Use this value of and the Taylor series expansion above to minimize computations. +

Repeat the following process until the difference between the computed and the desired , which we will call , is below a chosen acceptably small error, such as 10−5, 10−15, etc.: +

+

where +

+
is the from a Taylor series solution using and
+
+

When the repeated computations converge to an error below the chosen acceptably small value, x will be the value needed to obtain a of the desired value, . If is a good beginning estimate, convergence should be rapid with only a small number of iterations needed.[citation needed] +
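A minimal sketch of this Newton iteration; for brevity it evaluates Φ with SciPy's norm.cdf rather than the Taylor-series expansion described above (the function name and tolerance are illustrative):

```python
from scipy.stats import norm

def probit_newton(p, x0=0.0, tol=1e-12, max_iter=50):
    # Newton's method for Phi(x) = p; the derivative of Phi is the standard normal pdf.
    x = x0
    for _ in range(max_iter):
        err = norm.cdf(x) - p
        if abs(err) < tol:
            break
        x -= err / norm.pdf(x)
    return x

print(probit_newton(0.975), norm.ppf(0.975))   # both approximately 1.959964
```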

+

Standard deviation and coverage[edit]

+ +
For the normal distribution, the values less than one standard deviation away from the mean account for 68.27% of the set; while two standard deviations from the mean account for 95.45%; and three standard deviations account for 99.73%.
+

About 68% of values drawn from a normal distribution are within one standard deviation σ away from the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations.[6] This fact is known as the 68-95-99.7 (empirical) rule, or the 3-sigma rule. +

More precisely, the probability that a normal deviate lies in the range between $\mu - n\sigma$ and $\mu + n\sigma$ is given by

$$F(\mu + n\sigma) - F(\mu - n\sigma) = \Phi(n) - \Phi(-n) = \operatorname{erf}\!\left(\frac{n}{\sqrt{2}}\right).$$

To 12 significant digits, the values for $p = F(\mu + n\sigma) - F(\mu - n\sigma)$ are:[citation needed]

n   p                 1 − p             1 in (1 − p)      OEIS
1   0.682689492137    0.317310507863    3.15148718753     A178647
2   0.954499736104    0.045500263896    21.9778945080     A110894
3   0.997300203937    0.002699796063    370.398347345     A270712
4   0.999936657516    0.000063342484    15787.1927673
5   0.999999426697    0.000000573303    1744277.89362
6   0.999999998027    0.000000001973    506797345.897

For large $n$, one can use the approximation $1 - p \approx \dfrac{e^{-n^2/2}}{n\sqrt{\pi/2}}$.
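The table entries and the large-$n$ tail approximation can be reproduced with a few lines (a sketch using the standard library; the approximation is crude for n = 1 and improves as n grows):

```python
from math import erf, exp, sqrt, pi

for n in range(1, 7):
    p = erf(n / sqrt(2))                                # P(|X - mu| < n * sigma)
    approx = exp(-n * n / 2) / (n * sqrt(pi / 2))       # large-n approximation to 1 - p
    print(f"n = {n}   p = {p:.12f}   1 - p = {1 - p:.3e}   approx = {approx:.3e}")
```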

+

Quantile function[edit]

+ +

The quantile function of a distribution is the inverse of the cumulative distribution function. The quantile function of the standard normal distribution is called the probit function, and can be expressed in terms of the inverse error function: +

$$\Phi^{-1}(p) = \sqrt{2}\,\operatorname{erf}^{-1}(2p - 1), \qquad p \in (0, 1).$$

For a normal random variable with mean $\mu$ and variance $\sigma^2$, the quantile function is

$$F^{-1}(p) = \mu + \sigma\,\Phi^{-1}(p) = \mu + \sigma\sqrt{2}\,\operatorname{erf}^{-1}(2p - 1), \qquad p \in (0, 1).$$

The quantile $\Phi^{-1}(p)$ of the standard normal distribution is commonly denoted as $z_p$. These values are used in hypothesis testing, construction of confidence intervals and Q–Q plots. A normal random variable $X$ will exceed $\mu + z_p\sigma$ with probability $1 - p$, and will lie outside the interval $\mu \pm z_p\sigma$ with probability $2(1 - p)$. In particular, the quantile $z_{0.975}$ is 1.96; therefore a normal random variable will lie outside the interval $\mu \pm 1.96\sigma$ in only 5% of cases.

The following table gives the quantile $z_p$ such that $X$ will lie in the range $\mu \pm z_p\sigma$ with a specified probability $p$. These values are useful to determine tolerance intervals for sample averages and other statistical estimators with normal (or asymptotically normal) distributions.[citation needed] The following table shows $\sqrt{2}\,\operatorname{erf}^{-1}(p) = \Phi^{-1}\!\left(\frac{p + 1}{2}\right)$, not $\Phi^{-1}(p)$ as defined above.

p       z_p                 p             z_p
0.80    1.281551565545      0.999         3.290526731492
0.90    1.644853626951      0.9999        3.890591886413
0.95    1.959963984540      0.99999       4.417173413469
0.98    2.326347874041      0.999999      4.891638475699
0.99    2.575829303549      0.9999999     5.326723886384
0.995   2.807033768344      0.99999999    5.730728868236
0.998   3.090232306168      0.999999999   6.109410204869

For small $p$, the quantile function has the useful asymptotic expansion $\Phi^{-1}(p) = -\sqrt{\ln\frac{1}{p^2} - \ln\ln\frac{1}{p^2} - \ln(2\pi)} + o(1)$.[citation needed]
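The two conventions (two-sided table value versus the one-sided quantile) are easy to compare numerically (a sketch with SciPy; the chosen probability is arbitrary):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import erfinv

p = 0.95
print(np.sqrt(2) * erfinv(p))   # two-sided value from the table above, ~1.959964
print(norm.ppf(p))              # the one-sided quantile Phi^{-1}(0.95), ~1.644854
```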

+

Properties[edit]

+

The normal distribution is the only distribution whose cumulants beyond the first two (i.e., other than the mean and variance) are zero. It is also the continuous distribution with the maximum entropy for a specified mean and variance.[15][16] Geary has shown, assuming that the mean and variance are finite, that the normal distribution is the only distribution where the mean and variance calculated from a set of independent draws are independent of each other.[17][18] +

The normal distribution is a subclass of the elliptical distributions. The normal distribution is symmetric about its mean, and is non-zero over the entire real line. As such it may not be a suitable model for variables that are inherently positive or strongly skewed, such as the weight of a person or the price of a share. Such variables may be better described by other distributions, such as the log-normal distribution or the Pareto distribution. +

The value of the normal distribution is practically zero when the value lies more than a few standard deviations away from the mean (e.g., a spread of three standard deviations covers all but 0.27% of the total distribution). Therefore, it may not be an appropriate model when one expects a significant fraction of outliers—values that lie many standard deviations away from the mean—and least squares and other statistical inference methods that are optimal for normally distributed variables often become highly unreliable when applied to such data. In those cases, a more heavy-tailed distribution should be assumed and the appropriate robust statistical inference methods applied. +

The Gaussian distribution belongs to the family of stable distributions which are the attractors of sums of independent, identically distributed distributions whether or not the mean or variance is finite. Except for the Gaussian which is a limiting case, all stable distributions have heavy tails and infinite variance. It is one of the few distributions that are stable and that have probability density functions that can be expressed analytically, the others being the Cauchy distribution and the Lévy distribution. +

+

Symmetries and derivatives[edit]

+

The normal distribution with density $f(x)$ (mean $\mu$ and standard deviation $\sigma > 0$) has the following properties:

+
  • It is symmetric around the point which is at the same time the mode, the median and the mean of the distribution.[19]
  • +
  • It is unimodal: its first derivative is positive for negative for and zero only at
  • +
  • The area bounded by the curve and the -axis is unity (i.e. equal to one).
  • +
  • Its first derivative is
  • +
  • Its second derivative is
  • +
  • Its density has two inflection points (where the second derivative of is zero and changes sign), located one standard deviation away from the mean, namely at and [19]
  • +
  • Its density is log-concave.[19]
  • +
  • Its density is infinitely differentiable, indeed supersmooth of order 2.[20]
+

Furthermore, the density of the standard normal distribution (i.e. and ) also has the following properties: +

+
  • Its first derivative is
  • +
  • Its second derivative is
  • +
  • More generally, its nth derivative is where is the nth (probabilist) Hermite polynomial.[21]
  • +
  • The probability that a normally distributed variable with known and is in a particular set, can be calculated by using the fact that the fraction has a standard normal distribution.
+

Moments[edit]

+ +

The plain and absolute moments of a variable are the expected values of and , respectively. If the expected value of is zero, these parameters are called central moments; otherwise, these parameters are called non-central moments. Usually we are interested only in moments with integer order . +

If $X$ has a normal distribution, the non-central moments exist and are finite for any $p$ whose real part is greater than −1. For any non-negative integer $p$, the plain central moments are:[22]

$$\operatorname{E}\left[(X - \mu)^p\right] = \begin{cases} 0 & \text{if } p \text{ is odd}, \\ \sigma^p\,(p - 1)!! & \text{if } p \text{ is even}. \end{cases}$$

Here $n!!$ denotes the double factorial, that is, the product of all numbers from $n$ to 1 that have the same parity as $n$.

The central absolute moments coincide with plain moments for all even orders, but are nonzero for odd orders. For any non-negative integer $p$,

$$\operatorname{E}\left[|X - \mu|^p\right] = \sigma^p\,(p - 1)!!\cdot\begin{cases}\sqrt{\dfrac{2}{\pi}} & \text{if } p \text{ is odd}, \\ 1 & \text{if } p \text{ is even},\end{cases} \;=\; \sigma^p\cdot\frac{2^{p/2}\,\Gamma\!\left(\frac{p + 1}{2}\right)}{\sqrt{\pi}}.$$

The last formula is valid also for any non-integer When the mean the plain and absolute moments can be expressed in terms of confluent hypergeometric functions and [citation needed] +

+
+

These expressions remain valid even if is not an integer. See also generalized Hermite polynomials. +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
OrderNon-central momentCentral moment +
1 + + +
2 + + +
3 + + +
4 + + +
5 + + +
6 + + +
7 + + +
8 + + +
+

The expectation of conditioned on the event that lies in an interval is given by +

+
+

where and respectively are the density and the cumulative distribution function of . For this is known as the inverse Mills ratio. Note that above, density of is used instead of standard normal density as in inverse Mills ratio, so here we have instead of . +

+

Fourier transform and characteristic function[edit]

+

The Fourier transform of a normal density with mean and standard deviation is[23] +

+
+

where is the imaginary unit. If the mean , the first factor is 1, and the Fourier transform is, apart from a constant factor, a normal density on the frequency domain, with mean 0 and standard deviation . In particular, the standard normal distribution is an eigenfunction of the Fourier transform. +

In probability theory, the Fourier transform of the probability distribution of a real-valued random variable is closely connected to the characteristic function of that variable, which is defined as the expected value of , as a function of the real variable (the frequency parameter of the Fourier transform). This definition can be analytically extended to a complex-value variable .[24] The relation between both is: +

+
+

Moment- and cumulant-generating functions[edit]

+

The moment generating function of a real random variable is the expected value of , as a function of the real parameter . For a normal distribution with density , mean and deviation , the moment generating function exists and is equal to +

$$M(t) = \operatorname{E}\!\left[e^{tX}\right] = e^{\mu t + \frac{1}{2}\sigma^2 t^2}.$$

The cumulant generating function is the logarithm of the moment generating function, namely +

$$g(t) = \ln M(t) = \mu t + \tfrac{1}{2}\sigma^2 t^2.$$

Since this is a quadratic polynomial in , only the first two cumulants are nonzero, namely the mean  and the variance . +

+

Stein operator and class[edit]

+

Within Stein's method the Stein operator and class of a random variable are and the class of all absolutely continuous functions . +

+

Zero-variance limit[edit]

+

In the limit when tends to zero, the probability density eventually tends to zero at any , but grows without limit if , while its integral remains equal to 1. Therefore, the normal distribution cannot be defined as an ordinary function when . +

However, one can define the normal distribution with zero variance as a generalized function; specifically, as a Dirac delta function translated by the mean , that is +Its cumulative distribution function is then the Heaviside step function translated by the mean , namely +

+
+

Maximum entropy[edit]

+

Of all probability distributions over the reals with a specified finite mean and finite variance , the normal distribution is the one with maximum entropy.[25] To see this, let be a continuous random variable with probability density . The entropy of is defined as[26][27][28] +

+
+

where is understood to be zero whenever . This functional can be maximized, subject to the constraints that the distribution is properly normalized and has a specified mean and variance, by using variational calculus. A function with three Lagrange multipliers is defined: +

+
+

At maximum entropy, a small variation about will produce a variation about which is equal to 0: +

+
+

Since this must hold for any small , the factor multiplying must be zero, and solving for yields: +

+
+

The Lagrange constraints that is properly normalized and has the specified mean and variance are satisfied if and only if , , and are chosen so that +

+
+

The entropy of a normal distribution is equal to +

$$H(X) = \tfrac{1}{2}\ln\!\left(2\pi e\sigma^2\right),$$

which is independent of the mean . +

+

Other properties[edit]

+
  1. If the characteristic function of some random variable is of the form , where is a polynomial, then the Marcinkiewicz theorem (named after Józef Marcinkiewicz) asserts that can be at most a quadratic polynomial, and therefore is a normal random variable.[29] The consequence of this result is that the normal distribution is the only distribution with a finite number (two) of non-zero cumulants.
  2. If and are jointly normal and uncorrelated, then they are independent. The requirement that and should be jointly normal is essential; without it the property does not hold.[30][31][proof] For non-normal random variables uncorrelatedness does not imply independence.
  3. The Kullback–Leibler divergence of one normal distribution from another is given by:[32]
    +The Hellinger distance between the same distributions is equal to
  4. The Fisher information matrix for a normal distribution w.r.t. and is diagonal and takes the form
  5. The conjugate prior of the mean of a normal distribution is another normal distribution.[33] Specifically, if are iid and the prior is , then the posterior distribution for the estimator of will be
  6. The family of normal distributions not only forms an exponential family (EF), but in fact forms a natural exponential family (NEF) with quadratic variance function (NEF-QVF). Many properties of normal distributions generalize to properties of NEF-QVF distributions, NEF distributions, or EF distributions generally. NEF-QVF distributions comprises 6 families, including Poisson, Gamma, binomial, and negative binomial distributions, while many of the common families studied in probability and statistics are NEF or EF.
  7. In information geometry, the family of normal distributions forms a statistical manifold with constant curvature . The same family is flat with respect to the (±1)-connections and .[34]
+

Related distributions[edit]

+

Central limit theorem[edit]

+
As the number of discrete events increases, the function begins to resemble a normal distribution
+
Comparison of probability density functions, for the sum of fair 6-sided dice to show their convergence to a normal distribution with increasing , in accordance to the central limit theorem. In the bottom-right graph, smoothed profiles of the previous graphs are rescaled, superimposed and compared with a normal distribution (black curve).
+ +

The central limit theorem states that under certain (fairly common) conditions, the sum of many random variables will have an approximately normal distribution. More specifically, where $X_1, \ldots, X_n$ are independent and identically distributed random variables with the same arbitrary distribution, zero mean, and variance $\sigma^2$, and $Z$ is their mean scaled by $\sqrt{n}$:

$$Z = \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n X_i\right).$$

Then, as $n$ increases, the probability distribution of $Z$ will tend to the normal distribution with zero mean and variance $\sigma^2$.

The theorem can be extended to variables that are not independent and/or not identically distributed if certain constraints are placed on the degree of dependence and the moments of the distributions. +

Many test statistics, scores, and estimators encountered in practice contain sums of certain random variables in them, and even more estimators can be represented as sums of random variables through the use of influence functions. The central limit theorem implies that those statistical parameters will have asymptotically normal distributions. +

The central limit theorem also implies that certain distributions can be approximated by the normal distribution, for example: +

+
  • The binomial distribution $B(n, p)$ is approximately normal with mean $np$ and variance $np(1 - p)$ for large $n$ and for $p$ not too close to 0 or 1.
  • The Poisson distribution with parameter $\lambda$ is approximately normal with mean $\lambda$ and variance $\lambda$, for large values of $\lambda$.[35]
  • The chi-squared distribution $\chi^2(k)$ is approximately normal with mean $k$ and variance $2k$, for large $k$.
  • The Student's t-distribution is approximately normal with mean 0 and variance 1 when $\nu$ is large.
+

Whether these approximations are sufficiently accurate depends on the purpose for which they are needed, and the rate of convergence to the normal distribution. It is typically the case that such approximations are less accurate in the tails of the distribution. +

A general upper bound for the approximation error in the central limit theorem is given by the Berry–Esseen theorem, improvements of the approximation are given by the Edgeworth expansions. +

This theorem can also be used to justify modeling the sum of many uniform noise sources as Gaussian noise. See AWGN. +
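A quick simulation of the dice example from the figure above (a sketch with NumPy/SciPy; the number of dice, trial count, and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n, trials = 50, 100_000

# Standardized sum of n fair six-sided dice; by the CLT it is approximately N(0, 1).
rolls = rng.integers(1, 7, size=(trials, n))
z = (rolls.sum(axis=1) - n * 3.5) / np.sqrt(n * 35 / 12)

print(z.mean(), z.std())                      # close to 0 and 1
print(stats.skew(z), stats.kurtosis(z))       # both close to 0
```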

+

Operations and functions of normal variables[edit]

+
a: Probability density of a function of a normal variable with and . b: Probability density of a function of two normal variables and , where , , , , and . c: Heat map of the joint probability density of two functions of two correlated normal variables and , where , , , , and . d: Probability density of a function of 4 iid standard normal variables. These are computed by the numerical method of ray-tracing.[36]
+

The probability density, cumulative distribution, and inverse cumulative distribution of any function of one or more independent or correlated normal variables can be computed with the numerical method of ray-tracing[36] (Matlab code). In the following sections we look at some special cases. +

+

Operations on a single normal variable[edit]

+

If is distributed normally with mean and variance , then +

+
  • , for any real numbers and , is also normally distributed, with mean and standard deviation . That is, the family of normal distributions is closed under linear transformations.
  • +
  • The exponential of is distributed log-normally: .
  • +
  • The absolute value of has folded normal distribution: . If this is known as the half-normal distribution.
  • +
  • The absolute value of normalized residuals, , has chi distribution with one degree of freedom: .
  • +
  • The square of has the noncentral chi-squared distribution with one degree of freedom: . If , the distribution is called simply chi-squared.
  • +
  • The log-likelihood of a normal variable is simply the log of its probability density function:
    Since this is a scaled and shifted square of a standard normal variable, it is distributed as a scaled and shifted chi-squared variable.
  • +
  • The distribution of the variable restricted to an interval is called the truncated normal distribution.
  • +
  • has a Lévy distribution with location 0 and scale .
+
Operations on two independent normal variables[edit]
+
  • If and are two independent normal random variables, with means , and standard deviations , , then their sum will also be normally distributed,[proof] with mean and variance .
  • +
  • In particular, if and are independent normal deviates with zero mean and variance , then and are also independent and normally distributed, with zero mean and variance . This is a special case of the polarization identity.[37]
  • +
  • If , are two independent normal deviates with mean and deviation , and , are arbitrary real numbers, then the variable
    is also normally distributed with mean and deviation . It follows that the normal distribution is stable (with exponent ).
  • +
  • If , are normal distributions, then their normalized geometric mean is a normal distribution with and (see here for a visualization).
+
Operations on two independent standard normal variables[edit]
+

If and are two independent standard normal random variables with mean 0 and variance 1, then +

+
  • Their sum and difference is distributed normally with mean zero and variance two: .
  • +
  • Their product follows the product distribution[38] with density function where is the modified Bessel function of the second kind. This distribution is symmetric around zero, unbounded at , and has the characteristic function .
  • +
  • Their ratio follows the standard Cauchy distribution: .
  • +
  • Their Euclidean norm has the Rayleigh distribution.
+

Operations on multiple independent normal variables[edit]

+
  • Any linear combination of independent normal deviates is a normal deviate.
  • If X₁, X₂, ..., Xₙ are independent standard normal random variables, then the sum of their squares has the chi-squared distribution with n degrees of freedom: X₁² + ⋯ + Xₙ² ~ χ²ₙ.
  • If X₁, X₂, ..., Xₙ are independent normally distributed random variables with means μ and variances σ², then their sample mean is independent from the sample standard deviation,[39] which can be demonstrated using Basu's theorem or Cochran's theorem.[40] The ratio of these two quantities will have the Student's t-distribution with n − 1 degrees of freedom: t = (X̄ − μ)√n/S ~ t_{n−1}.
  • If X₁, ..., Xₙ and Y₁, ..., Yₘ are independent standard normal random variables, then the ratio of their normalized sums of squares will have the F-distribution with (n, m) degrees of freedom:[41] F = [(X₁² + ⋯ + Xₙ²)/n] / [(Y₁² + ⋯ + Yₘ²)/m] ~ F_{n,m}.
+

Operations on multiple correlated normal variables[edit]

+
  • A quadratic form of a normal vector, i.e. a quadratic function of multiple independent or correlated normal variables, is a generalized chi-square variable.
+

Operations on the density function[edit]

+

The split normal distribution is most directly defined in terms of joining scaled sections of the density functions of different normal distributions and rescaling the density to integrate to one. The truncated normal distribution results from rescaling a section of a single density function. +

+

Infinite divisibility and Cramér's theorem[edit]

+

For any positive integer , any normal distribution with mean and variance is the distribution of the sum of independent normal deviates, each with mean and variance . This property is called infinite divisibility.[42] +

Conversely, if and are independent random variables and their sum has a normal distribution, then both and must be normal deviates.[43] +

This result is known as Cramér's decomposition theorem, and is equivalent to saying that the convolution of two distributions is normal if and only if both are normal. Cramér's theorem implies that a linear combination of independent non-Gaussian variables will never have an exactly normal distribution, although it may approach it arbitrarily closely.[29] +

+

Bernstein's theorem[edit]

+

Bernstein's theorem states that if and are independent and and are also independent, then both X and Y must necessarily have normal distributions.[44][45] +

More generally, if are independent random variables, then two distinct linear combinations and will be independent if and only if all are normal and , where denotes the variance of .[44] +

+

Extensions[edit]

+

The notion of normal distribution, being one of the most important distributions in probability theory, has been extended far beyond the standard framework of the univariate (that is one-dimensional) case (Case 1). All these extensions are also called normal or Gaussian laws, so a certain ambiguity in names exists. +

+ +

A random variable X has a two-piece normal distribution if it has a distribution +

+
+
+

where μ is the mean and σ1 and σ2 are the standard deviations of the distribution to the left and right of the mean respectively. +
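In code, this density amounts to using a different scale on each side of the mode; the following is an illustrative sketch (not from the article) under the usual parameterization in which the two half-densities share the mode μ and are jointly rescaled to integrate to one, assuming NumPy:

import numpy as np

def two_piece_normal_pdf(x, mu, sigma1, sigma2):
    """Density with scale sigma1 to the left of mu and sigma2 to the right."""
    a = np.sqrt(2.0 / np.pi) / (sigma1 + sigma2)      # common normalizing constant
    sigma = np.where(x < mu, sigma1, sigma2)
    return a * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Numerical check that the density integrates to one:
xs = np.linspace(-20.0, 20.0, 200_001)
pdf = two_piece_normal_pdf(xs, 0.0, 1.0, 3.0)
print(np.sum(pdf) * (xs[1] - xs[0]))                  # ~ 1.0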

The mean, variance and third central moment of this distribution have been determined[46] +

+
+
+
+

where E(X), V(X) and T(X) are the mean, variance, and third central moment respectively. +

One of the main practical uses of the Gaussian law is to model the empirical distributions of many different random variables encountered in practice. In such case a possible extension would be a richer family of distributions, having more than two parameters and therefore being able to fit the empirical distribution more accurately. The examples of such extensions are: +

+
  • Pearson distribution — a four-parameter family of probability distributions that extend the normal law to include different skewness and kurtosis values.
  • +
  • The generalized normal distribution, also known as the exponential power distribution, allows for distribution tails with thicker or thinner asymptotic behaviors.
+

Statistical inference[edit]

+

Estimation of parameters[edit]

+ +

It is often the case that we do not know the parameters of the normal distribution, but instead want to estimate them. That is, having a sample from a normal population we would like to learn the approximate values of parameters and . The standard approach to this problem is the maximum likelihood method, which requires maximization of the log-likelihood function: +

    ln L(μ, σ²) = Σᵢ ln f(xᵢ; μ, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) Σᵢ (xᵢ − μ)²

Taking derivatives with respect to μ and σ² and solving the resulting system of first order conditions yields the maximum likelihood estimates:

    μ̂ = x̄ = (1/n) Σᵢ xᵢ,    σ̂² = (1/n) Σᵢ (xᵢ − x̄)²

The maximized value of the log-likelihood is then:

    ln L̂(μ̂, σ̂²) = −(n/2)(ln(2πσ̂²) + 1)
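In code the two estimates are just the sample mean and the uncorrected mean of squared deviations; a minimal sketch assuming NumPy:

import numpy as np

def normal_mle(x):
    """Maximum likelihood estimates (mu_hat, sigma2_hat) for i.i.d. normal data."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    sigma2_hat = np.mean((x - mu_hat) ** 2)    # divides by n, not n - 1
    return mu_hat, sigma2_hat

rng = np.random.default_rng(2)
print(normal_mle(rng.normal(10.0, 2.0, size=5_000)))   # close to (10.0, 4.0)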

Sample mean[edit]

+ +

Estimator is called the sample mean, since it is the arithmetic mean of all observations. The statistic is complete and sufficient for , and therefore by the Lehmann–Scheffé theorem, is the uniformly minimum variance unbiased (UMVU) estimator.[47] In finite samples it is distributed normally: +

+
+

The variance of this estimator is equal to the μμ-element of the inverse Fisher information matrix. This implies that the estimator is finite-sample efficient. Of practical importance is the fact that the standard error of μ̂ is proportional to 1/√n, that is, if one wishes to decrease the standard error by a factor of 10, one must increase the number of points in the sample by a factor of 100. This fact is widely used in determining sample sizes for opinion polls and the number of trials in Monte Carlo simulations.
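The 1/√n scaling is easy to see in simulation; the following sketch (assuming NumPy) estimates the standard error of the sample mean at two sample sizes differing by a factor of 100:

import numpy as np

rng = np.random.default_rng(3)
sigma, reps = 2.0, 1_000

for n in (100, 10_000):                                   # sample size grows by a factor of 100 ...
    means = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)
    print(n, means.std())                                 # ... so the standard error shrinks by about 10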

From the standpoint of the asymptotic theory, is consistent, that is, it converges in probability to as . The estimator is also asymptotically normal, which is a simple corollary of the fact that it is normal in finite samples: +

+
+

Sample variance[edit]

+ +

The estimator is called the sample variance, since it is the variance of the sample (). In practice, another estimator is often used instead of the . This other estimator is denoted , and is also called the sample variance, which represents a certain ambiguity in terminology; its square root is called the sample standard deviation. The estimator differs from by having (n − 1) instead of n in the denominator (the so-called Bessel's correction): +

+
+

The difference between and becomes negligibly small for large n's. In finite samples however, the motivation behind the use of is that it is an unbiased estimator of the underlying parameter , whereas is biased. Also, by the Lehmann–Scheffé theorem the estimator is uniformly minimum variance unbiased (UMVU),[47] which makes it the "best" estimator among all unbiased ones. However it can be shown that the biased estimator is better than the in terms of the mean squared error (MSE) criterion. In finite samples both and have scaled chi-squared distribution with (n − 1) degrees of freedom: +

    s² ∼ (σ²/(n − 1)) · χ²_{n−1},    σ̂² ∼ (σ²/n) · χ²_{n−1}

The first of these expressions shows that the variance of is equal to , which is slightly greater than the σσ-element of the inverse Fisher information matrix . Thus, is not an efficient estimator for , and moreover, since is UMVU, we can conclude that the finite-sample efficient estimator for does not exist. +
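The bias and mean-squared-error trade-off between σ̂² and s² is easy to see by simulation; a small sketch assuming NumPy (true σ² = 4, n = 10):

import numpy as np

rng = np.random.default_rng(4)
sigma2, n, reps = 4.0, 10, 100_000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

var_mle = samples.var(axis=1, ddof=0)    # divides by n      (biased)
var_unb = samples.var(axis=1, ddof=1)    # divides by n - 1  (unbiased, Bessel's correction)

print(var_mle.mean(), var_unb.mean())    # ~ 3.6 (biased low) vs ~ 4.0
print(((var_mle - sigma2) ** 2).mean(),  # yet the biased estimator has the smaller
      ((var_unb - sigma2) ** 2).mean())  # mean squared error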

Applying the asymptotic theory, both estimators and are consistent, that is they converge in probability to as the sample size . The two estimators are also both asymptotically normal: +

+
+

In particular, both estimators are asymptotically efficient for . +

+

Confidence intervals[edit]

+ +

By Cochran's theorem, for normal distributions the sample mean and the sample variance s2 are independent, which means there can be no gain in considering their joint distribution. There is also a converse theorem: if in a sample the sample mean and sample variance are independent, then the sample must have come from the normal distribution. The independence between and s can be employed to construct the so-called t-statistic: +

+
+

This quantity t has the Student's t-distribution with (n − 1) degrees of freedom, and it is an ancillary statistic (independent of the value of the parameters). Inverting the distribution of this t-statistics will allow us to construct the confidence interval for μ;[48] similarly, inverting the χ2 distribution of the statistic s2 will give us the confidence interval for σ2:[49] +

+
+
+

where t_{k,p} and χ²_{k,p} are the pth quantiles of the t- and χ²-distributions respectively. These confidence intervals are of the confidence level 1 − α, meaning that the true values μ and σ² fall outside of these intervals with probability (or significance level) α. In practice people usually take α = 5%, resulting in 95% confidence intervals.
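The exact intervals are straightforward to compute from the t- and χ² quantiles; a minimal sketch assuming SciPy (95% level, i.e. α = 0.05):

import numpy as np
from scipy import stats

def normal_confidence_intervals(x, alpha=0.05):
    """Exact confidence intervals for mu and sigma^2 of i.i.d. normal data."""
    x = np.asarray(x, dtype=float)
    n, mean, s2 = len(x), x.mean(), x.var(ddof=1)
    t = stats.t.ppf(1 - alpha / 2, n - 1)
    mu_ci = (mean - t * np.sqrt(s2 / n), mean + t * np.sqrt(s2 / n))
    chi2_lo = stats.chi2.ppf(alpha / 2, n - 1)
    chi2_hi = stats.chi2.ppf(1 - alpha / 2, n - 1)
    var_ci = ((n - 1) * s2 / chi2_hi, (n - 1) * s2 / chi2_lo)
    return mu_ci, var_ci

rng = np.random.default_rng(5)
print(normal_confidence_intervals(rng.normal(3.0, 1.5, size=200)))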

Approximate formulas can be derived from the asymptotic distributions of and s2: +

+
+
+

The approximate formulas become valid for large values of n, and are more convenient for the manual calculation since the standard normal quantiles zα/2 do not depend on n. In particular, the most popular value of α = 5%, results in |z0.025| = 1.96. +

+

Normality tests[edit]

+ +

Normality tests assess the likelihood that the given data set {x1, ..., xn} comes from a normal distribution. Typically the null hypothesis H0 is that the observations are distributed normally with unspecified mean μ and variance σ2, versus the alternative Ha that the distribution is arbitrary. Many tests (over 40) have been devised for this problem. The more prominent of them are outlined below: +

Diagnostic plots are more intuitively appealing but subjective at the same time, as they rely on informal human judgement to accept or reject the null hypothesis. +

+
  • Q–Q plot, also known as normal probability plot or rankit plot—is a plot of the sorted values from the data set against the expected values of the corresponding quantiles from the standard normal distribution. That is, it's a plot of point of the form (Φ−1(pk), x(k)), where plotting points pk are equal to pk = (k − α)/(n + 1 − 2α) and α is an adjustment constant, which can be anything between 0 and 1. If the null hypothesis is true, the plotted points should approximately lie on a straight line.
  • +
  • P–P plot – similar to the Q–Q plot, but used much less frequently. This method consists of plotting the points (Φ(z(k)), pk), where . For normally distributed data this plot should lie on a 45° line between (0, 0) and (1, 1).
+

Goodness-of-fit tests: +

Moment-based tests: +

+
  • D'Agostino's K-squared test
  • +
  • Jarque–Bera test
  • +
  • Shapiro–Wilk test: This is based on the fact that the line in the Q–Q plot has the slope of σ. The test compares the least squares estimate of that slope with the value of the sample variance, and rejects the null hypothesis if these two quantities differ significantly.
+

Tests based on the empirical distribution function: +

+ +
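Several of the tests listed in this section are implemented in standard statistical software; as a small illustration (assuming SciPy and Matplotlib, and not tied to the article), the Shapiro–Wilk and Jarque–Bera tests and a Q–Q plot for a sample:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
data = rng.normal(0.0, 1.0, size=500)

print(stats.shapiro(data))         # Shapiro-Wilk statistic and p-value
print(stats.jarque_bera(data))     # moment-based Jarque-Bera test

stats.probplot(data, dist="norm", plot=plt)   # Q-Q plot against normal quantiles
plt.show()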

Bayesian analysis of the normal distribution[edit]

+

Bayesian analysis of normally distributed data is complicated by the many different possibilities that may be considered: +

+
  • Either the mean, or the variance, or neither, may be considered a fixed quantity.
  • +
  • When the variance is unknown, analysis may be done directly in terms of the variance, or in terms of the precision, the reciprocal of the variance. The reason for expressing the formulas in terms of precision is that the analysis of most cases is simplified.
  • +
  • Both univariate and multivariate cases need to be considered.
  • +
  • Either conjugate or improper prior distributions may be placed on the unknown variables.
  • +
  • An additional set of cases occurs in Bayesian linear regression, where in the basic model the data is assumed to be normally distributed, and normal priors are placed on the regression coefficients. The resulting analysis is similar to the basic cases of independent identically distributed data.
+

The formulas for the non-linear-regression cases are summarized in the conjugate prior article. +

+

Sum of two quadratics[edit]

+
Scalar form[edit]
+

The following auxiliary formula is useful for simplifying the posterior update equations, which otherwise become fairly tedious. +

+
+

This equation rewrites the sum of two quadratics in x by expanding the squares, grouping the terms in x, and completing the square. Note the following about the complex constant factors attached to some of the terms: +

+
  1. The factor has the form of a weighted average of y and z.
  2. +
  3. This shows that this factor can be thought of as resulting from a situation where the reciprocals of quantities a and b add directly, so to combine a and b themselves, it's necessary to reciprocate, add, and reciprocate the result again to get back into the original units. This is exactly the sort of operation performed by the harmonic mean, so it is not surprising that is one-half the harmonic mean of a and b.
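The identity used here is, in one common form, a(x − y)² + b(x − z)² = (a + b)(x − (ay + bz)/(a + b))² + (ab/(a + b))(y − z)². A quick numerical check of it (a sketch assuming NumPy):

import numpy as np

rng = np.random.default_rng(7)
a, b, x, y, z = rng.uniform(0.1, 5.0, size=5)

lhs = a * (x - y) ** 2 + b * (x - z) ** 2
rhs = ((a + b) * (x - (a * y + b * z) / (a + b)) ** 2
       + (a * b / (a + b)) * (y - z) ** 2)
print(np.isclose(lhs, rhs))    # True: the completed-square form matches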
+
Vector form[edit]
+

A similar formula can be written for the sum of two vector quadratics: If x, y, z are vectors of length k, and A and B are symmetric, invertible matrices of size , then +

+
+

where +

+
+

The form xA x is called a quadratic form and is a scalar: +

+
+

In other words, it sums up all possible combinations of products of pairs of elements from x, with a separate coefficient for each. In addition, since xᵢxⱼ = xⱼxᵢ, only the sum Aᵢⱼ + Aⱼᵢ matters for any off-diagonal elements of A, and there is no loss of generality in assuming that A is symmetric. Furthermore, if A is symmetric, then the form x′Ay = y′Ax.

+

Sum of differences from the mean[edit]

+

Another useful formula is as follows: +

+where +

+

With known variance[edit]

+

For a set of i.i.d. normally distributed data points X of size n where each individual point x follows with known variance σ2, the conjugate prior distribution is also normally distributed. +

This can be shown more easily by rewriting the variance as the precision, i.e. using τ = 1/σ². Then if x ~ N(μ, 1/τ) and the prior on the mean is μ ~ N(μ₀, 1/τ₀), we proceed as follows.

First, the likelihood function is (using the formula above for the sum of differences from the mean): +

+
+

Then, we proceed as follows: +

+
+

In the above derivation, we used the formula above for the sum of two quadratics and eliminated all constant factors not involving μ. The result is the kernel of a normal distribution, with mean (τ₀μ₀ + nτx̄)/(τ₀ + nτ) and precision τ₀ + nτ, i.e.

+
+

This can be written as a set of Bayesian update equations for the posterior parameters in terms of the prior parameters: +

+
+

That is, to combine n data points with total precision of nτ (or equivalently, total variance of σ²/n) and mean of values x̄, derive a new total precision simply by adding the total precision of the data to the prior total precision, and form a new mean through a precision-weighted average, i.e. a weighted average of the data mean and the prior mean, each weighted by the associated total precision. This makes logical sense if the precision is thought of as indicating the certainty of the observations: In the distribution of the posterior mean, each of the input components is weighted by its certainty, and the certainty of this distribution is the sum of the individual certainties. (For the intuition of this, compare the expression "the whole is (or is not) greater than the sum of its parts". In addition, consider that the knowledge of the posterior comes from a combination of the knowledge of the prior and likelihood, so it makes sense that we are more certain of it than of either of its components.)

The above formula reveals why it is more convenient to do Bayesian analysis of conjugate priors for the normal distribution in terms of the precision. The posterior precision is simply the sum of the prior and likelihood precisions, and the posterior mean is computed through a precision-weighted average, as described above. The same formulas can be written in terms of variance by reciprocating all the precisions, yielding the more ugly formulas +

+
+
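These update equations take only a few lines of code; a minimal sketch (assuming NumPy; mu0 and tau0 denote the prior mean and precision, tau the known data precision):

import numpy as np

def posterior_known_variance(x, mu0, tau0, tau):
    """Conjugate update: prior mu ~ N(mu0, 1/tau0), data x_i ~ N(mu, 1/tau) with tau known."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    tau_post = tau0 + n * tau                            # precisions add
    mu_post = (tau0 * mu0 + tau * x.sum()) / tau_post    # precision-weighted average
    return mu_post, tau_post

rng = np.random.default_rng(8)
data = rng.normal(2.0, 1.0, size=50)      # true mu = 2, known sigma = 1 (tau = 1)
print(posterior_known_variance(data, mu0=0.0, tau0=1.0, tau=1.0))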

With known mean[edit]

+

For a set of i.i.d. normally distributed data points X of size n where each individual point x follows with known mean μ, the conjugate prior of the variance has an inverse gamma distribution or a scaled inverse chi-squared distribution. The two are equivalent except for having different parameterizations. Although the inverse gamma is more commonly used, we use the scaled inverse chi-squared for the sake of convenience. The prior for σ2 is as follows: +

+
+

The likelihood function from above, written in terms of the variance, is: +

+
+

where +

+
+

Then: +

+
+

The above is also a scaled inverse chi-squared distribution where +

+
+

or equivalently +

+
+

Reparameterizing in terms of an inverse gamma distribution, the result is: +

+
+
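In the inverse gamma parameterization the update is particularly compact; a minimal sketch (assuming NumPy; alpha0 and beta0 denote the prior shape and scale hyperparameters, mu the known mean):

import numpy as np

def posterior_known_mean(x, mu, alpha0, beta0):
    """Conjugate update: prior sigma^2 ~ Inv-Gamma(alpha0, beta0), data x_i ~ N(mu, sigma^2)."""
    x = np.asarray(x, dtype=float)
    alpha_post = alpha0 + len(x) / 2.0
    beta_post = beta0 + 0.5 * np.sum((x - mu) ** 2)
    return alpha_post, beta_post           # posterior is Inv-Gamma(alpha_post, beta_post)

rng = np.random.default_rng(9)
data = rng.normal(0.0, 2.0, size=100)      # known mean 0, true sigma^2 = 4
a, b = posterior_known_mean(data, mu=0.0, alpha0=2.0, beta0=2.0)
print(b / (a - 1))                          # posterior mean of sigma^2, roughly 4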

With unknown mean and unknown variance[edit]

+

For a set of i.i.d. normally distributed data points X of size n where each individual point x follows with unknown mean μ and unknown variance σ2, a combined (multivariate) conjugate prior is placed over the mean and variance, consisting of a normal-inverse-gamma distribution. +Logically, this originates as follows: +

+
  1. From the analysis of the case with unknown mean but known variance, we see that the update equations involve sufficient statistics computed from the data consisting of the mean of the data points and the total variance of the data points, computed in turn from the known variance divided by the number of data points.
  2. +
  3. From the analysis of the case with unknown variance but known mean, we see that the update equations involve sufficient statistics over the data consisting of the number of data points and sum of squared deviations.
  4. +
  5. Keep in mind that the posterior update values serve as the prior distribution when further data is handled. Thus, we should logically think of our priors in terms of the sufficient statistics just described, with the same semantics kept in mind as much as possible.
  6. +
  7. To handle the case where both mean and variance are unknown, we could place independent priors over the mean and variance, with fixed estimates of the average mean, total variance, number of data points used to compute the variance prior, and sum of squared deviations. Note however that in reality, the total variance of the mean depends on the unknown variance, and the sum of squared deviations that goes into the variance prior (appears to) depend on the unknown mean. In practice, the latter dependence is relatively unimportant: Shifting the actual mean shifts the generated points by an equal amount, and on average the squared deviations will remain the same. This is not the case, however, with the total variance of the mean: As the unknown variance increases, the total variance of the mean will increase proportionately, and we would like to capture this dependence.
  8. +
  9. This suggests that we create a conditional prior of the mean on the unknown variance, with a hyperparameter specifying the mean of the pseudo-observations associated with the prior, and another parameter specifying the number of pseudo-observations. This number serves as a scaling parameter on the variance, making it possible to control the overall variance of the mean relative to the actual variance parameter. The prior for the variance also has two hyperparameters, one specifying the sum of squared deviations of the pseudo-observations associated with the prior, and another specifying once again the number of pseudo-observations. Each of the priors has a hyperparameter specifying the number of pseudo-observations, and in each case this controls the relative variance of that prior. These are given as two separate hyperparameters so that the variance (aka the confidence) of the two priors can be controlled separately.
  10. +
  11. This leads immediately to the normal-inverse-gamma distribution, which is the product of the two distributions just defined, with conjugate priors used (an inverse gamma distribution over the variance, and a normal distribution over the mean, conditional on the variance) and with the same four parameters just defined.
+

The priors are normally defined as follows: +

+
+

The update equations can be derived, and look as follows: +

+
+

The respective numbers of pseudo-observations add the number of actual observations to them. The new mean hyperparameter is once again a weighted average, this time weighted by the relative numbers of observations. Finally, the update for is similar to the case with known mean, but in this case the sum of squared deviations is taken with respect to the observed data mean rather than the true mean, and as a result a new interaction term needs to be added to take care of the additional error source stemming from the deviation between prior and data mean. +
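Written out in the common (μ₀, κ₀, α₀, β₀) hyperparameterization (an assumption here, since the article's own symbols are not reproduced above), the update is short; a sketch assuming NumPy:

import numpy as np

def posterior_normal_inverse_gamma(x, mu0, kappa0, alpha0, beta0):
    """Normal-inverse-gamma conjugate update for unknown mean and variance."""
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n       # weighted by pseudo- and real observation counts
    alpha_n = alpha0 + n / 2.0
    beta_n = (beta0
              + 0.5 * np.sum((x - xbar) ** 2)                       # scatter about the data mean
              + 0.5 * kappa0 * n * (xbar - mu0) ** 2 / kappa_n)     # prior/data mean interaction term
    return mu_n, kappa_n, alpha_n, beta_n

rng = np.random.default_rng(10)
data = rng.normal(5.0, 3.0, size=200)
print(posterior_normal_inverse_gamma(data, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0))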

+
Proof +

The prior distributions are +

+
+

Therefore, the joint prior is +

+
+

The likelihood function from the section above with known variance is: +

+
+

Writing it in terms of variance rather than precision, we get: +

+
+

where +

Therefore, the posterior is (dropping the hyperparameters as conditioning factors): +

+
+

In other words, the posterior distribution has the form of a product of a normal distribution over times an inverse gamma distribution over , with parameters that are the same as the update equations above. +

+
+


+

+

Occurrence and applications[edit]

+

The occurrence of normal distribution in practical problems can be loosely classified into four categories: +

+
  1. Exactly normal distributions;
  2. +
  3. Approximately normal laws, for example when such approximation is justified by the central limit theorem; and
  4. +
  5. Distributions modeled as normal – the normal distribution being the distribution with maximum entropy for a given mean and variance.
  6. +
  7. Regression problems – the normal distribution being found after systematic effects have been modeled sufficiently well.
+

Exact normality[edit]

+
The ground state of a quantum harmonic oscillator has the Gaussian distribution.
+

Certain quantities in physics are distributed normally, as was first demonstrated by James Clerk Maxwell. Examples of such quantities are: +

+
  • Probability density function of a ground state in a quantum harmonic oscillator.
  • +
  • The position of a particle that experiences diffusion. If initially the particle is located at a specific point (that is its probability distribution is the Dirac delta function), then after time t its location is described by a normal distribution with variance t, which satisfies the diffusion equation . If the initial location is given by a certain density function , then the density at time t is the convolution of g and the normal probability density function.
+

Approximate normality[edit]

+

Approximately normal distributions occur in many situations, as explained by the central limit theorem. When the outcome is produced by many small effects acting additively and independently, its distribution will be close to normal. The normal approximation will not be valid if the effects act multiplicatively (instead of additively), or if there is a single external influence that has a considerably larger magnitude than the rest of the effects. +

+ +

Assumed normality[edit]

+
Histogram of sepal widths for Iris versicolor from Fisher's Iris flower data set, with superimposed best-fitting normal distribution.
+

I can only recognize the occurrence of the normal curve – the Laplacian curve of errors – as a very abnormal phenomenon. It is roughly approximated to in certain distributions; for this reason, and on account for its beautiful simplicity, we may, perhaps, use it as a first approximation, particularly in theoretical investigations.

+

There are statistical methods to empirically test that assumption; see the above Normality tests section. +

+
  • In biology, the logarithm of various variables tend to have a normal distribution, that is, they tend to have a log-normal distribution (after separation on male/female subpopulations), with examples including: +
    • Measures of size of living tissue (length, height, skin area, weight);[50]
    • +
    • The length of inert appendages (hair, claws, nails, teeth) of biological specimens, in the direction of growth; presumably the thickness of tree bark also falls under this category;
    • +
    • Certain physiological measurements, such as blood pressure of adult humans.
  • +
  • In finance, in particular the Black–Scholes model, changes in the logarithm of exchange rates, price indices, and stock market indices are assumed normal (these variables behave like compound interest, not like simple interest, and so are multiplicative). Some mathematicians such as Benoit Mandelbrot have argued that log-Levy distributions, which possesses heavy tails would be a more appropriate model, in particular for the analysis for stock market crashes. The use of the assumption of normal distribution occurring in financial models has also been criticized by Nassim Nicholas Taleb in his works.
  • +
  • Measurement errors in physical experiments are often modeled by a normal distribution. This use of a normal distribution does not imply that one is assuming the measurement errors are normally distributed, rather using the normal distribution produces the most conservative predictions possible given only knowledge about the mean and variance of the errors.[51]
  • +
  • In standardized testing, results can be made to have a normal distribution by either selecting the number and difficulty of questions (as in the IQ test) or transforming the raw test scores into output scores by fitting them to the normal distribution. For example, the SAT's traditional range of 200–800 is based on a normal distribution with a mean of 500 and a standard deviation of 100.
+
Fitted cumulative normal distribution to October rainfalls, see distribution fitting
+ +

Methodological problems and peer review[edit]

+

John Ioannidis argues that using normally distributed standard deviations as standards for validating research findings leaves falsifiable predictions about phenomena that are not normally distributed untested. This includes, for example, phenomena that only appear when all necessary conditions are present and one condition cannot substitute for another in an addition-like way, and phenomena that are not randomly distributed. Ioannidis argues that validation centered on standard deviations gives a false appearance of validity to hypotheses and theories in which some but not all falsifiable predictions are normally distributed, since the portion of falsifiable predictions that there is evidence against may lie, and in some cases does lie, in the non-normally distributed parts of the range of falsifiable predictions. It can also lead to baselessly dismissing hypotheses for which none of the falsifiable predictions are normally distributed, as if they were unfalsifiable, when in fact they do make falsifiable predictions. Ioannidis argues that many cases of mutually exclusive theories being accepted as validated by research journals are caused by the journals' failure to take in empirical falsifications of non-normally distributed predictions, and not because the mutually exclusive theories are true, which they cannot be, although two mutually exclusive theories can both be wrong and a third one correct.[53]

+

Computational methods[edit]

+

Generating values from normal distribution[edit]

+
The bean machine, a device invented by Francis Galton, can be called the first generator of normal random variables. This machine consists of a vertical board with interleaved rows of pins. Small balls are dropped from the top and then bounce randomly left or right as they hit the pins. The balls are collected into bins at the bottom and settle down into a pattern resembling the Gaussian curve.
+

In computer simulations, especially in applications of the Monte-Carlo method, it is often desirable to generate values that are normally distributed. The algorithms listed below all generate the standard normal deviates, since a N(μ, σ2) can be generated as X = μ + σZ, where Z is standard normal. All these algorithms rely on the availability of a random number generator U capable of producing uniform random variates. +

+
  • The most straightforward method is based on the probability integral transform property: if U is distributed uniformly on (0,1), then Φ−1(U) will have the standard normal distribution. The drawback of this method is that it relies on calculation of the probit function Φ−1, which cannot be done analytically. Some approximate methods are described in Hart (1968) and in the erf article. Wichura gives a fast algorithm for computing this function to 16 decimal places,[54] which is used by R to compute random variates of the normal distribution.
  • +
  • An easy-to-program approximate approach that relies on the central limit theorem is as follows: generate 12 uniform U(0,1) deviates, add them all up, and subtract 6 – the resulting random variable will have approximately standard normal distribution. In truth, the distribution will be Irwin–Hall, which is a 12-section eleventh-order polynomial approximation to the normal distribution. This random deviate will have a limited range of (−6, 6).[55] Note that in a true normal distribution, only 0.00034% of all samples will fall outside ±6σ.
  • +
  • The Box–Muller method uses two independent random numbers U and V distributed uniformly on (0,1). Then the two random variables X = √(−2 ln U) cos(2πV) and Y = √(−2 ln U) sin(2πV)
    will both have the standard normal distribution, and will be independent (see the sketch after this list). This formulation arises because for a bivariate normal random vector (X, Y) the squared norm X² + Y² will have the chi-squared distribution with two degrees of freedom, which is an easily generated exponential random variable corresponding to the quantity −2 ln(U) in these equations; and the angle is distributed uniformly around the circle, chosen by the random variable V.
  • +
  • The Marsaglia polar method is a modification of the Box–Muller method which does not require computation of the sine and cosine functions. In this method, U and V are drawn from the uniform (−1,1) distribution, and then S = U2 + V2 is computed. If S is greater or equal to 1, then the method starts over, otherwise the two quantities
    are returned. Again, X and Y are independent, standard normal random variables.
  • +
  • The Ratio method[56] is a rejection method. The algorithm proceeds as follows: +
    • Generate two independent uniform deviates U and V;
    • +
    • Compute X = 8/e (V − 0.5)/U;
    • +
    • Optional: if X2 ≤ 5 − 4e1/4U then accept X and terminate algorithm;
    • +
    • Optional: if X2 ≥ 4e−1.35/U + 1.4 then reject X and start over from step 1;
    • +
    • If X2 ≤ −4 lnU then accept X, otherwise start over the algorithm.
    +
    The two optional steps allow the evaluation of the logarithm in the last step to be avoided in most cases. These steps can be greatly improved[57] so that the logarithm is rarely evaluated.
  • +
  • The ziggurat algorithm[58] is faster than the Box–Muller transform and still exact. In about 97% of all cases it uses only two random numbers, one random integer and one random uniform, one multiplication and an if-test. Only in 3% of the cases, where the combination of those two falls outside the "core of the ziggurat" (a kind of rejection sampling using logarithms), do exponentials and more uniform random numbers have to be employed.
  • +
  • Integer arithmetic can be used to sample from the standard normal distribution.[59] This method is exact in the sense that it satisfies the conditions of ideal approximation;[60] i.e., it is equivalent to sampling a real number from the standard normal distribution and rounding this to the nearest representable floating point number.
  • +
  • There is also some investigation[61] into the connection between the fast Hadamard transform and the normal distribution, since the transform employs just addition and subtraction and by the central limit theorem random numbers from almost any distribution will be transformed into the normal distribution. In this regard a series of Hadamard transforms can be combined with random permutations to turn arbitrary data sets into a normally distributed data.
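For reference, the Box–Muller transform mentioned above takes only a few lines; the following is an illustrative sketch assuming NumPy (it is not tied to any particular library's internal generator):

import numpy as np

def box_muller(n, rng=None):
    """Return n standard normal deviates generated by the Box-Muller transform."""
    rng = rng or np.random.default_rng()
    m = (n + 1) // 2
    u = 1.0 - rng.uniform(size=m)          # shift to (0, 1] so the logarithm is finite
    v = rng.uniform(size=m)
    r = np.sqrt(-2.0 * np.log(u))          # radius: square root of a chi-squared(2) deviate
    x = r * np.cos(2.0 * np.pi * v)        # uniform angle chosen by v
    y = r * np.sin(2.0 * np.pi * v)
    return np.concatenate([x, y])[:n]

print(box_muller(5, np.random.default_rng(11)))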
+

Numerical approximations for the normal cumulative distribution function and normal quantile function[edit]

+

The standard normal cumulative distribution function is widely used in scientific and statistical computing. +

The values Φ(x) may be approximated very accurately by a variety of methods, such as numerical integration, Taylor series, asymptotic series and continued fractions. Different approximations are used depending on the desired level of accuracy. +

+
  • Zelen & Severo (1964) give the approximation for Φ(x) for x > 0 with the absolute error |ε(x)| < 7.5·10⁻⁸ (algorithm 26.2.17):
    Φ(x) ≈ 1 − ϕ(x)(b₁t + b₂t² + b₃t³ + b₄t⁴ + b₅t⁵),   t = 1/(1 + b₀x),
    where ϕ(x) is the standard normal probability density function, and b₀ = 0.2316419, b₁ = 0.319381530, b₂ = −0.356563782, b₃ = 1.781477937, b₄ = −1.821255978, b₅ = 1.330274429. (This approximation is implemented in the sketch after this list.)
  • +
  • Hart (1968) lists some dozens of approximations – by means of rational functions, with or without exponentials – for the erfc() function. His algorithms vary in the degree of complexity and the resulting precision, with maximum absolute precision of 24 digits. An algorithm by West (2009) combines Hart's algorithm 5666 with a continued fraction approximation in the tail to provide a fast computation algorithm with a 16-digit precision.
  • +
  • Cody (1969) after recalling Hart68 solution is not suited for erf, gives a solution for both erf and erfc, with maximal relative error bound, via Rational Chebyshev Approximation.
  • +
  • Marsaglia (2004) suggested a simple algorithm[note 1] based on the Taylor series expansion
    for calculating Φ(x) with arbitrary precision. The drawback of this algorithm is comparatively slow calculation time (for example it takes over 300 iterations to calculate the function with 16 digits of precision when x = 10).
  • +
  • The GNU Scientific Library calculates values of the standard normal cumulative distribution function using Hart's algorithms and approximations with Chebyshev polynomials.
  • +
  • Dia (2023) proposes the following approximation of with a maximum relative error less than in absolute value: for and for ,
+
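As a concrete example, the Zelen & Severo approximation quoted above (algorithm 26.2.17, using the coefficients b₀, ..., b₅ listed there) can be coded directly; a sketch assuming NumPy, with SciPy used only for the reference value:

import numpy as np
from scipy.stats import norm

B = (0.2316419, 0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)

def phi_zelen_severo(x):
    """Approximate standard normal CDF for x > 0; absolute error below 7.5e-8."""
    t = 1.0 / (1.0 + B[0] * x)
    poly = t * (B[1] + t * (B[2] + t * (B[3] + t * (B[4] + t * B[5]))))
    return 1.0 - norm.pdf(x) * poly

x = 1.96
print(phi_zelen_severo(x), norm.cdf(x))    # both ~ 0.9750021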

+

Shore (1982) introduced simple approximations that may be incorporated in stochastic optimization models of engineering and operations research, like reliability engineering and inventory analysis. Denoting p = Φ(z), the simplest approximation for the quantile function is: +

+

This approximation delivers for z a maximum absolute error of 0.026 (for 0.5 ≤ p ≤ 0.9999, corresponding to 0 ≤ z ≤ 3.719). For p < 1/2 replace p by 1 − p and change sign. Another approximation, somewhat less accurate, is the single-parameter approximation: +

+

The latter had served to derive a simple approximation for the loss integral of the normal distribution, defined by +

+

This approximation is particularly accurate for the right far-tail (maximum error of 10−3 for z≥1.4). Highly accurate approximations for the cumulative distribution function, based on Response Modeling Methodology (RMM, Shore, 2011, 2012), are shown in Shore (2005). +

Some more approximations can be found at: Error function#Approximation with elementary functions. In particular, small relative error on the whole domain for the cumulative distribution function and the quantile function as well, is achieved via an explicitly invertible formula by Sergei Winitzki in 2008. +

+

History[edit]

+

Development[edit]

+

Some authors[62][63] attribute the credit for the discovery of the normal distribution to de Moivre, who in 1738[note 2] published in the second edition of his The Doctrine of Chances the study of the coefficients in the binomial expansion of (a + b)ⁿ. De Moivre proved that the middle term in this expansion has the approximate magnitude of 2/√(2πn), and that "If m or 1/2n be a Quantity infinitely great, then the Logarithm of the Ratio, which a Term distant from the middle by the Interval ℓ, has to the middle Term, is −2ℓℓ/n."[64] Although this theorem can be interpreted as the first obscure expression for the normal probability law, Stigler points out that de Moivre himself did not interpret his results as anything more than the approximate rule for the binomial coefficients, and in particular de Moivre lacked the concept of the probability density function.[65]

+
Carl Friedrich Gauss discovered the normal distribution in 1809 as a way to rationalize the method of least squares.
+

In 1809 Gauss published his monograph "Theoria motus corporum coelestium in sectionibus conicis solem ambientium" where among other things he introduces several important statistical concepts, such as the method of least squares, the method of maximum likelihood, and the normal distribution. Gauss used M, M′, M′′, ... to denote the measurements of some unknown quantity V, and sought the most probable estimator of that quantity: the one that maximizes the probability φ(M − V) · φ(M′ − V) · φ(M′′ − V) · ... of obtaining the observed experimental results. In his notation φΔ is the probability density function of the measurement errors of magnitude Δ. Not knowing what the function φ is, Gauss requires that his method should reduce to the well-known answer: the arithmetic mean of the measured values.[note 3] Starting from these principles, Gauss demonstrates that the only law that rationalizes the choice of arithmetic mean as an estimator of the location parameter, is the normal law of errors:[66]

+where h is "the measure of the precision of the observations". Using this normal law as a generic model for errors in the experiments, Gauss formulates what is now known as the non-linear weighted least squares method.[67] +

+
Pierre-Simon Laplace proved the central limit theorem in 1810, consolidating the importance of the normal distribution in statistics.
+

Although Gauss was the first to suggest the normal distribution law, Laplace made significant contributions.[note 4] It was Laplace who first posed the problem of aggregating several observations in 1774,[68] although his own solution led to the Laplacian distribution. It was Laplace who first calculated the value of the integral ∫ e^(−t²) dt = √π in 1782, providing the normalization constant for the normal distribution.[69] Finally, it was Laplace who in 1810 proved and presented to the Academy the fundamental central limit theorem, which emphasized the theoretical importance of the normal distribution.[70]

It is of interest to note that in 1809 an Irish-American mathematician Robert Adrain published two insightful but flawed derivations of the normal probability law, simultaneously and independently from Gauss.[71] His works remained largely unnoticed by the scientific community, until in 1871 they were exhumed by Abbe.[72] +

In the middle of the 19th century Maxwell demonstrated that the normal distribution is not just a convenient mathematical tool, but may also occur in natural phenomena:[73] The number of particles whose velocity, resolved in a certain direction, lies between x and x + dx is +

+

+

Naming[edit]

+

Today, the concept is usually known in English as the normal distribution or Gaussian distribution. Other less common names include Gauss distribution, Laplace-Gauss distribution, the law of error, the law of facility of errors, Laplace's second law, and Gaussian law. +

Gauss himself apparently coined the term with reference to the "normal equations" involved in its applications, with normal having its technical meaning of orthogonal rather than usual.[74] However, by the end of the 19th century some authors[note 5] had started using the name normal distribution, where the word "normal" was used as an adjective – the term now being seen as a reflection of the fact that this distribution was seen as typical, common – and thus normal. Peirce (one of those authors) once defined "normal" thus: "...the 'normal' is not the average (or any other kind of mean) of what actually occurs, but of what would, in the long run, occur under certain circumstances."[75] Around the turn of the 20th century Pearson popularized the term normal as a designation for this distribution.[76] +

+

Many years ago I called the Laplace–Gaussian curve the normal curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another 'abnormal'.

+

Also, it was Pearson who first wrote the distribution in terms of the standard deviation σ as in modern notation. Soon after this, in year 1915, Fisher added the location parameter to the formula for normal distribution, expressing it in the way it is written nowadays: +

+

The term "standard normal", which denotes the normal distribution with zero mean and unit variance came into general use around the 1950s, appearing in the popular textbooks by P. G. Hoel (1947) Introduction to Mathematical Statistics and A. M. Mood (1950) Introduction to the Theory of Statistics.[77] +

+

See also[edit]

+ + +

Notes[edit]

+
+
    +
  1. ^ For example, this algorithm is given in the article Bc programming language. +
  2. +
  3. ^ De Moivre first published his findings in 1733, in a pamphlet Approximatio ad Summam Terminorum Binomii (a + b)n in Seriem Expansi that was designated for private circulation only. But it was not until the year 1738 that he made his results publicly available. The original pamphlet was reprinted several times, see for example Walker (1985). +
  4. +
  5. ^ "It has been customary certainly to regard as an axiom the hypothesis that if any quantity has been determined by several direct observations, made under the same circumstances and with equal care, the arithmetical mean of the observed values affords the most probable value, if not rigorously, yet very nearly at least, so that it is always most safe to adhere to it." — Gauss (1809, section 177) +
  6. +
  7. ^ "My custom of terming the curve the Gauss–Laplacian or normal curve saves us from proportioning the merit of discovery between the two great astronomer mathematicians." quote from Pearson (1905, p. 189) +
  8. +
  9. ^ Besides those specifically referenced here, such use is encountered in the works of Peirce, Galton (Galton (1889, chapter V)) and Lexis (Lexis (1878), Rohrbasser & Véron (2003)) c. 1875.[citation needed] +
  10. +
+

References[edit]

+

Citations[edit]

+
+
    +
  1. ^ Norton, Matthew; Khokhlov, Valentyn; Uryasev, Stan (2019). "Calculating CVaR and bPOE for common probability distributions with application to portfolio optimization and density estimation" (PDF). Annals of Operations Research. Springer. 299 (1–2): 1281–1315. arXiv:1811.11301. doi:10.1007/s10479-019-03373-1. S2CID 254231768. Retrieved February 27, 2023. +
  2. +
  3. ^ Normal Distribution, Gale Encyclopedia of Psychology +
  4. +
  5. ^ Casella & Berger (2001, p. 102) +
  6. +
  7. ^ Lyon, A. (2014). Why are Normal Distributions Normal?, The British Journal for the Philosophy of Science. +
  8. +
  9. ^ Jorge, Nocedal; Stephan, J. Wright (2006). Numerical Optimization (2nd ed.). Springer. p. 249. ISBN 978-0387-30303-1. +
  10. +
  11. ^ a b "Normal Distribution". www.mathsisfun.com. Retrieved August 15, 2020. +
  12. +
  13. ^ Stigler (1982) +
  14. +
  15. ^ Halperin, Hartley & Hoel (1965, item 7) +
  16. +
  17. ^ McPherson (1990, p. 110) +
  18. +
  19. ^ Bernardo & Smith (2000, p. 121) +
  20. +
  21. ^ Scott, Clayton; Nowak, Robert (August 7, 2003). "The Q-function". Connexions. +
  22. +
  23. ^ Barak, Ohad (April 6, 2006). "Q Function and Error Function" (PDF). Tel Aviv University. Archived from the original (PDF) on March 25, 2009. +
  24. +
  25. ^ Weisstein, Eric W. "Normal Distribution Function". MathWorld. +
  26. +
  27. ^ Abramowitz, Milton; Stegun, Irene Ann, eds. (1983) [June 1964]. "Chapter 26, eqn 26.2.12". Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Applied Mathematics Series. Vol. 55 (Ninth reprint with additional corrections of tenth original printing with corrections (December 1972); first ed.). Washington D.C.; New York: United States Department of Commerce, National Bureau of Standards; Dover Publications. p. 932. ISBN 978-0-486-61272-0. LCCN 64-60036. MR 0167642. LCCN 65-12253. +
  28. +
  29. ^ Cover, Thomas M.; Thomas, Joy A. (2006). Elements of Information Theory. John Wiley and Sons. p. 254. ISBN 9780471748816. +
  30. +
  31. ^ Park, Sung Y.; Bera, Anil K. (2009). "Maximum Entropy Autoregressive Conditional Heteroskedasticity Model" (PDF). Journal of Econometrics. 150 (2): 219–230. CiteSeerX 10.1.1.511.9750. doi:10.1016/j.jeconom.2008.12.014. Archived from the original (PDF) on March 7, 2016. Retrieved June 2, 2011. +
  32. +
  33. ^ Geary RC(1936) The distribution of the "Student's ratio for the non-normal samples". Supplement to the Journal of the Royal Statistical Society 3 (2): 178–184 +
  34. +
  35. ^ Lukacs, Eugene (March 1942). "A Characterization of the Normal Distribution". Annals of Mathematical Statistics. 13 (1): 91–93. doi:10.1214/AOMS/1177731647. ISSN 0003-4851. JSTOR 2236166. MR 0006626. Zbl 0060.28509. Wikidata Q55897617. +
  36. +
  37. ^ a b c Patel & Read (1996, [2.1.4]) +
  38. +
  39. ^ Fan (1991, p. 1258) +
  40. +
  41. ^ Patel & Read (1996, [2.1.8]) +
  42. +
  43. ^ Papoulis, Athanasios. Probability, Random Variables and Stochastic Processes (4th ed.). p. 148. +
  44. +
  45. ^ Bryc (1995, p. 23) +
  46. +
  47. ^ Bryc (1995, p. 24) +
  48. +
  49. ^ Cover & Thomas (2006, p. 254) +
  50. +
  51. ^ Williams, David (2001). Weighing the odds : a course in probability and statistics (Reprinted. ed.). Cambridge [u.a.]: Cambridge Univ. Press. pp. 197–199. ISBN 978-0-521-00618-7. +
  52. +
  53. ^ Smith, José M. Bernardo; Adrian F. M. (2000). Bayesian theory (Reprint ed.). Chichester [u.a.]: Wiley. pp. 209, 366. ISBN 978-0-471-49464-5.{{cite book}}: CS1 maint: multiple names: authors list (link) +
  54. +
  55. ^ O'Hagan, A. (1994) Kendall's Advanced Theory of statistics, Vol 2B, Bayesian Inference, Edward Arnold. ISBN 0-340-52922-9 (Section 5.40) +
  56. +
  57. ^ a b Bryc (1995, p. 35) +
  58. +
  59. ^ UIUC, Lecture 21. The Multivariate Normal Distribution, 21.6:"Individually Gaussian Versus Jointly Gaussian". +
  60. +
  61. ^ Edward L. Melnick and Aaron Tenenbein, "Misspecifications of the Normal Distribution", The American Statistician, volume 36, number 4 November 1982, pages 372–373 +
  62. +
  63. ^ "Kullback Leibler (KL) Distance of Two Normal (Gaussian) Probability Distributions". Allisons.org. December 5, 2007. Retrieved March 3, 2017. +
  64. +
  65. ^ Jordan, Michael I. (February 8, 2010). "Stat260: Bayesian Modeling and Inference: The Conjugate Prior for the Normal Distribution" (PDF). +
  66. +
  67. ^ Amari & Nagaoka (2000) +
  68. +
  69. ^ "Normal Approximation to Poisson Distribution". Stat.ucla.edu. Retrieved March 3, 2017. +
  70. +
  71. ^ a b Das, Abhranil (2021). "A method to integrate and classify normal distributions". Journal of Vision. 21 (10): 1. arXiv:2012.14331. doi:10.1167/jov.21.10.1. PMC 8419883. PMID 34468706. +
  72. +
  73. ^ Bryc (1995, p. 27) +
  74. +
  75. ^ Weisstein, Eric W. "Normal Product Distribution". MathWorld. wolfram.com. +
  76. +
  77. ^ Lukacs, Eugene (1942). "A Characterization of the Normal Distribution". The Annals of Mathematical Statistics. 13 (1): 91–3. doi:10.1214/aoms/1177731647. ISSN 0003-4851. JSTOR 2236166. +
  78. +
  79. ^ Basu, D.; Laha, R. G. (1954). "On Some Characterizations of the Normal Distribution". Sankhyā. 13 (4): 359–62. ISSN 0036-4452. JSTOR 25048183. +
  80. +
  81. ^ Lehmann, E. L. (1997). Testing Statistical Hypotheses (2nd ed.). Springer. p. 199. ISBN 978-0-387-94919-2. +
  82. +
  83. ^ Patel & Read (1996, [2.3.6]) +
  84. +
  85. ^ Galambos & Simonelli (2004, Theorem 3.5) +
  86. +
  87. ^ a b Lukacs & King (1954) +
  88. +
  89. ^ Quine, M.P. (1993). "On three characterisations of the normal distribution". Probability and Mathematical Statistics. 14 (2): 257–263. +
  90. +
  91. ^ John, S (1982). "The three parameter two-piece normal family of distributions and its fitting". Communications in Statistics - Theory and Methods. 11 (8): 879–885. doi:10.1080/03610928208828279. +
  92. +
  93. ^ a b Krishnamoorthy (2006, p. 127) +
  94. +
  95. ^ Krishnamoorthy (2006, p. 130) +
  96. +
  97. ^ Krishnamoorthy (2006, p. 133) +
  98. +
  99. ^ Huxley (1932) +
  100. +
  101. ^ Jaynes, Edwin T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. pp. 592–593. ISBN 9780521592710. +
  102. +
  103. ^ Oosterbaan, Roland J. (1994). "Chapter 6: Frequency and Regression Analysis of Hydrologic Data" (PDF). In Ritzema, Henk P. (ed.). Drainage Principles and Applications, Publication 16 (second revised ed.). Wageningen, The Netherlands: International Institute for Land Reclamation and Improvement (ILRI). pp. 175–224. ISBN 978-90-70754-33-4. +
  104. +
  105. ^ Why Most Published Research Findings Are False, John P. A. Ioannidis, 2005 +
  106. +
  107. ^ Wichura, Michael J. (1988). "Algorithm AS241: The Percentage Points of the Normal Distribution". Applied Statistics. 37 (3): 477–84. doi:10.2307/2347330. JSTOR 2347330. +
  108. +
  109. ^ Johnson, Kotz & Balakrishnan (1995, Equation (26.48)) +
  110. +
  111. ^ Kinderman & Monahan (1977) +
  112. +
  113. ^ Leva (1992) +
  114. +
  115. ^ Marsaglia & Tsang (2000) +
  116. +
  117. ^ Karney (2016) +
  118. +
  119. ^ Monahan (1985, section 2) +
  120. +
  121. ^ Wallace (1996) +
  122. +
  123. ^ Johnson, Kotz & Balakrishnan (1994, p. 85) +
  124. +
  125. ^ Le Cam & Lo Yang (2000, p. 74) +
  126. +
  127. ^ De Moivre, Abraham (1733), Corollary I – see Walker (1985, p. 77) +
  128. +
  129. ^ Stigler (1986, p. 76) +
  130. +
  131. ^ Gauss (1809, section 177) +
  132. +
  133. ^ Gauss (1809, section 179) +
  134. +
  135. ^ Laplace (1774, Problem III) +
  136. +
  137. ^ Pearson (1905, p. 189) +
  138. +
  139. ^ Stigler (1986, p. 144) +
  140. +
  141. ^ Stigler (1978, p. 243) +
  142. +
  143. ^ Stigler (1978, p. 244) +
  144. +
  145. ^ Maxwell (1860, p. 23) +
  146. +
  147. ^ Jaynes, Edwin J.; Probability Theory: The Logic of Science, Ch. 7. +
  148. +
  149. ^ Peirce, Charles S. (c. 1909 MS), Collected Papers v. 6, paragraph 327. +
  150. +
  151. ^ Kruskal & Stigler (1997). +
  152. +
  153. ^ "Earliest Uses... (Entry Standard Normal Curve)". +
  154. +
  155. ^ Sun, Jingchao; Kong, Maiying; Pal, Subhadip (June 22, 2021). "The Modified-Half-Normal distribution: Properties and an efficient sampling scheme". Communications in Statistics - Theory and Methods. 52 (5): 1591–1613. doi:10.1080/03610926.2021.1934700. ISSN 0361-0926. S2CID 237919587. +
  156. +
+

Sources[edit]

+
+
+ +
+
+

External links[edit]

+ + + + + + + + +
+
+ +
+
+ +
+ +
+
+
+ +
\ No newline at end of file
diff --git a/references/Quickselect b/references/Quickselect
new file mode 100644
index 0000000..a9cd76e
--- /dev/null
+++ b/references/Quickselect
@@ -0,0 +1,940 @@
Quickselect - Wikipedia
+
+
+ + + + +
+
+ + + + + +
+
+
+
+
+
+
+
+ +
+
+ +
+
+ + +
+
+
+ +

Quickselect

+ + +
+
+
+
+ +
+
+ + + +
+
+
+
+ +
+
+
+
+
+ +
From Wikipedia, the free encyclopedia
+
+
+ + +
+ +
Quickselect
Animated visualization of the quickselect algorithm. Selecting the 22nd smallest value.
Class: Selection algorithm
Data structure: Array
Worst-case performance: O(n²)
Best-case performance: O(n)
Average performance: O(n)
Optimal: Yes
+

In computer science, quickselect is a selection algorithm to find the kth smallest element in an unordered list, also known as the kth order statistic. Like the related quicksort sorting algorithm, it was developed by Tony Hoare, and thus is also known as Hoare's selection algorithm.[1] Like quicksort, it is efficient in practice and has good average-case performance, but has poor worst-case performance. Quickselect and its variants are the selection algorithms most often used in efficient real-world implementations. +

Quickselect uses the same overall approach as quicksort, choosing one element as a pivot and partitioning the data in two based on the pivot, accordingly as less than or greater than the pivot. However, instead of recursing into both sides, as in quicksort, quickselect only recurses into one side – the side with the element it is searching for. This reduces the average complexity from O(n log n) to O(n), with a worst case of O(n²).

As with quicksort, quickselect is generally implemented as an in-place algorithm, and beyond selecting the kth element, it also partially sorts the data. See selection algorithm for further discussion of the connection with sorting. +

+ +

Algorithm[edit]

+

In quicksort, there is a subprocedure called partition that can, in linear time, group a list (ranging from indices left to right) into two parts: those less than a certain element, and those greater than or equal to the element. Here is pseudocode that performs a partition about the element list[pivotIndex]: +

+
function partition(list, left, right, pivotIndex) is
    pivotValue := list[pivotIndex]
    swap list[pivotIndex] and list[right]  // Move pivot to end
    storeIndex := left
    for i from left to right − 1 do
        if list[i] < pivotValue then
            swap list[storeIndex] and list[i]
            increment storeIndex
    swap list[right] and list[storeIndex]  // Move pivot to its final place
    return storeIndex
+
+

This is known as the Lomuto partition scheme, which is simpler but less efficient than Hoare's original partition scheme. +

In quicksort, we recursively sort both branches, leading to best-case time. However, when doing selection, we already know which partition our desired element lies in, since the pivot is in its final sorted position, with all those preceding it in an unsorted order and all those following it in an unsorted order. Therefore, a single recursive call locates the desired element in the correct partition, and we build upon this for quickselect: +

+
// Returns the k-th smallest element of list within left..right inclusive
// (i.e. left <= k <= right).
function select(list, left, right, k) is
    if left = right then   // If the list contains only one element,
        return list[left]  // return that element
    pivotIndex  := ...     // select a pivotIndex between left and right,
                           // e.g., left + floor(rand() % (right − left + 1))
    pivotIndex  := partition(list, left, right, pivotIndex)
    // The pivot is in its final sorted position
    if k = pivotIndex then
        return list[k]
    else if k < pivotIndex then
        return select(list, left, pivotIndex − 1, k)
    else
        return select(list, pivotIndex + 1, right, k)
+
+

Note the resemblance to quicksort: just as the minimum-based selection algorithm is a partial selection sort, this is a partial quicksort, generating and partitioning only O(log n) of its O(n) partitions. This simple procedure has expected linear performance and, like quicksort, quite good performance in practice. It is also an in-place algorithm, requiring only constant memory overhead if tail call optimization is available, or if the tail recursion is eliminated with a loop:

function select(list, left, right, k) is
    loop
        if left = right then
            return list[left]
        pivotIndex := ...     // select pivotIndex between left and right
        pivotIndex := partition(list, left, right, pivotIndex)
        if k = pivotIndex then
            return list[k]
        else if k < pivotIndex then
            right := pivotIndex − 1
        else
            left := pivotIndex + 1
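For concreteness, the iterative pseudocode above translates almost line for line into Python; the following sketch (not part of the original article) uses a random pivot:

import random

def quickselect(a, k):
    """Return the k-th smallest element (0-based) of list a, partially sorting it in place."""
    left, right = 0, len(a) - 1
    while True:
        if left == right:
            return a[left]
        pivot_index = random.randint(left, right)
        # Lomuto partition, as in the pseudocode above
        a[pivot_index], a[right] = a[right], a[pivot_index]
        store = left
        for i in range(left, right):
            if a[i] < a[right]:
                a[i], a[store] = a[store], a[i]
                store += 1
        a[store], a[right] = a[right], a[store]
        if k == store:
            return a[k]
        elif k < store:
            right = store - 1
        else:
            left = store + 1

print(quickselect([9, 1, 8, 2, 7, 3], 2))   # 3, the third-smallest element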
+
+

Time complexity[edit]

+

Like quicksort, quickselect has good average performance, but is sensitive to the pivot that is chosen. If good pivots are chosen, meaning ones that consistently decrease the search set by a given fraction, then the search set decreases in size exponentially and by induction (or summing the geometric series) one sees that performance is linear, as each step is linear and the overall time is a constant times this (depending on how quickly the search set reduces). However, if bad pivots are consistently chosen, such as decreasing by only a single element each time, then worst-case performance is quadratic, O(n²): this occurs for example in searching for the maximum element of a set, using the first element as the pivot, and having sorted data. However, for randomly chosen pivots, this worst case is very unlikely: the probability of using more than Cn comparisons, for any sufficiently large constant C, is superexponentially small as a function of C.[2]

+

Variants[edit]

+

The easiest solution is to choose a random pivot, which yields almost certain linear time. Deterministically, one can use median-of-3 pivot strategy (as in the quicksort), which yields linear performance on partially sorted data, as is common in the real world. However, contrived sequences can still cause worst-case complexity; David Musser describes a "median-of-3 killer" sequence that allows an attack against that strategy, which was one motivation for his introselect algorithm. +

One can assure linear performance even in the worst case by using a more sophisticated pivot strategy; this is done in the median of medians algorithm. However, the overhead of computing the pivot is high, and thus this is generally not used in practice. One can combine basic quickselect with median of medians as fallback to get both fast average case performance and linear worst-case performance; this is done in introselect. +
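As a rough sketch of that combination (a simplification under our own assumptions, not Musser's exact introselect, whose fallback is median of medians rather than a sort), one can cap the partitioning work and switch strategy when the cap is exceeded; partition and rand_range are the helpers from the quickselect sketch above:

#include <stdlib.h>

static int cmp_int(const void *p, const void *q)
{
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

/* Quickselect with a work budget: if too much partitioning work has been
   done, give up, sort the remaining range, and read off the answer. */
static int intro_select(int *a, int left, int right, int k)
{
    long budget = 4L * (right - left + 1);
    while (left < right) {
        if (budget < 0) {
            qsort(a + left, (size_t)(right - left + 1), sizeof a[0], cmp_int);
            return a[k];
        }
        budget -= right - left;
        int p = partition(a, left, right, rand_range(left, right));
        if (k == p)
            return a[k];
        else if (k < p)
            right = p - 1;
        else
            left = p + 1;
    }
    return a[left];
}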

Finer computations of the average time complexity yield a worst case of n(2 + 2 ln 2 + o(1)) ≤ 3.4n + o(n) for random pivots (in the case of the median; other k are faster).[3] The constant can be improved to 3/2 by a more complicated pivot strategy, yielding the Floyd–Rivest algorithm, which has average complexity of 1.5n + O(n^(1/2)) for median, with other k being faster.

+


References

+
+
  1. Hoare, C. A. R. (1961). "Algorithm 65: Find". Comm. ACM. 4 (7): 321–322. doi:10.1145/366622.366647.
  2. Devroye, Luc (1984). "Exponential bounds for the running time of a selection algorithm" (PDF). Journal of Computer and System Sciences. 29 (1): 1–7. doi:10.1016/0022-0000(84)90009-6. MR 0761047. Devroye, Luc (2001). "On the probabilistic worst-case time of 'find'" (PDF). Algorithmica. 31 (3): 291–303. doi:10.1007/s00453-001-0046-2. MR 1855252.
  3. Blum-style analysis of Quickselect, David Eppstein, October 9, 2007.
+

External links

+
  • "qselect", Quickselect algorithm in Matlab, Manolis Lourakis
\ No newline at end of file diff --git a/references/Xorshift b/references/Xorshift new file mode 100644 index 0000000..44ed916 --- /dev/null +++ b/references/Xorshift @@ -0,0 +1,1045 @@
Xorshift - Wikipedia
Xorshift

From Wikipedia, the free encyclopedia

[Figure: Example random distribution of Xorshift128]
Xorshift random number generators, also called shift-register generators, are a class of pseudorandom number generators that were invented by George Marsaglia.[1] They are a subset of linear-feedback shift registers (LFSRs) which allow a particularly efficient implementation in software without the excessive use of sparse polynomials.[2] They generate the next number in their sequence by repeatedly taking the exclusive or of a number with a bit-shifted version of itself. This makes execution extremely efficient on modern computer architectures, but it does not benefit efficiency in a hardware implementation. Like all LFSRs, the parameters have to be chosen very carefully in order to achieve a long period.[3] +

For execution in software, xorshift generators are among the fastest PRNGs, requiring very small code and state. However, they do not pass every statistical test without further refinement. This weakness is amended by combining them with a non-linear function, as described in the original paper. Because plain xorshift generators (without a non-linear step) fail some statistical tests, they have been accused of being unreliable.[3]: 360  +

+ +

Example implementation

+

A C version[a] of three xorshift algorithms[1]: 4,5  is given here. The first has one 32-bit word of state, and period 2^32 − 1. The second has one 64-bit word of state and period 2^64 − 1. The last one has four 32-bit words of state, and period 2^128 − 1. The 128-bit algorithm passes the diehard tests. However, it fails the MatrixRank and LinearComp tests of the BigCrush test suite from the TestU01 framework.

All use three shifts and three or four exclusive-or operations: +

+
#include <stdint.h>
+
+struct xorshift32_state {
+    uint32_t a;
+};
+
+/* The state must be initialized to non-zero */
+uint32_t xorshift32(struct xorshift32_state *state)
+{
+	/* Algorithm "xor" from p. 4 of Marsaglia, "Xorshift RNGs" */
+	uint32_t x = state->a;
+	x ^= x << 13;
+	x ^= x >> 17;
+	x ^= x << 5;
+	return state->a = x;
+}
+
+struct xorshift64_state {
+    uint64_t a;
+};
+
+uint64_t xorshift64(struct xorshift64_state *state)
+{
+	uint64_t x = state->a;
+	x ^= x << 13;
+	x ^= x >> 7;
+	x ^= x << 17;
+	return state->a = x;
+}
+
+/* struct xorshift128_state can alternatively be defined as a pair
+   of uint64_t or a uint128_t where supported */
+struct xorshift128_state {
+    uint32_t x[4];
+};
+
+/* The state must be initialized to non-zero */
+uint32_t xorshift128(struct xorshift128_state *state)
+{
+	/* Algorithm "xor128" from p. 5 of Marsaglia, "Xorshift RNGs" */
+	uint32_t t  = state->x[3];
+    
+    uint32_t s  = state->x[0];  /* Perform a contrived 32-bit shift. */
+	state->x[3] = state->x[2];
+	state->x[2] = state->x[1];
+	state->x[1] = s;
+
+	t ^= t << 11;
+	t ^= t >> 8;
+	return state->x[0] = t ^ s ^ (s >> 19);
+}
+
+

Non-linear variations

+

All xorshift generators fail some tests in the BigCrush test suite. This is true for all generators based on linear recurrences, such as the Mersenne Twister or WELL. However, it is easy to scramble the output of such generators to improve their quality. +

The scramblers known as + and * still leave weakness in the low bits,[4] so they are intended for floating-point use: a double-precision floating-point number holds only 53 significant bits, so the lower 11 bits of the output are not used. For general purpose, the scrambler ** (pronounced starstar) makes the LFSR generators pass in all bits.

+

xorwow

+

Marsaglia suggested scrambling the output by combining it with a simple additive counter modulo 2^32 (which he calls a "Weyl sequence" after Weyl's equidistribution theorem). This also increases the period by a factor of 2^32, to 2^192 − 2^32:

+
#include <stdint.h>
+
+struct xorwow_state {
+    uint32_t x[5];
+    uint32_t counter;
+};
+
+/* The state array must be initialized to not be all zero in the first four words */
+uint32_t xorwow(struct xorwow_state *state)
+{
+    /* Algorithm "xorwow" from p. 5 of Marsaglia, "Xorshift RNGs" */
+    uint32_t t  = state->x[4];
+ 
+    uint32_t s  = state->x[0];  /* Perform a contrived 32-bit shift. */
+    state->x[4] = state->x[3];
+    state->x[3] = state->x[2];
+    state->x[2] = state->x[1];
+    state->x[1] = s;
+ 
+    t ^= t >> 2;
+    t ^= t << 1;
+    t ^= s ^ (s << 4);
+    state->x[0] = t;
+    state->counter += 362437;
+    return t + state->counter;
+}
+
+

This performs well, but fails a few tests in BigCrush.[5] This generator is the default in Nvidia's CUDA toolkit.[6] +

+

xorshift*

+

An xorshift* generator applies an invertible multiplication (modulo the word size) as a non-linear transformation to the output of an xorshift generator, as suggested by Marsaglia.[1] All xorshift* generators emit a sequence of values that is equidistributed in the maximum possible dimension (except that they will never output zero for 16 calls, i.e. 128 bytes, in a row).[7] +

The following 64-bit generator has a maximal period of 2^64 − 1.[7]

+
#include <stdint.h>
+
+/* xorshift64s, variant A_1(12,25,27) with multiplier M_32 from line 3 of table 5 */
+uint64_t xorshift64star(void) {
+    /* initial seed must be nonzero, don't use a static variable for the state if multithreaded */
+    static uint64_t x = 1;
+    x ^= x >> 12;
+    x ^= x << 25;
+    x ^= x >> 27;
+    return x * 0x2545F4914F6CDD1DULL;
+}
+
+

The generator fails only the MatrixRank test of BigCrush; however, if the generator is modified to return only the high 32 bits, then it passes BigCrush with zero failures.[8]: 7  In fact, a reduced version with only 40 bits of internal state passes the suite, suggesting a large safety margin.[8]: 19  A similar generator suggested in Numerical Recipes[9] as RanQ1 also fails the BirthdaySpacings test.
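For instance, the modification alluded to above amounts to a one-line wrapper (our sketch, reusing xorshift64star from the listing above):

#include <stdint.h>

/* Return only the high 32 bits of xorshift64star(), the variant reported
   above to pass BigCrush with zero failures. */
static inline uint32_t xorshift64star_high32(void)
{
    return (uint32_t)(xorshift64star() >> 32);
}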

Vigna[7] suggests the following xorshift1024* generator with 1024 bits of state and a maximal period of 2^1024 − 1; however, it does not always pass BigCrush.[4] xoshiro256** is therefore a much better option.

+
#include <stdint.h>
+
+/* The state must be seeded so that there is at least one non-zero element in array */
+struct xorshift1024s_state {
+	uint64_t x[16];
+	int index;
+};
+
+uint64_t xorshift1024s(struct xorshift1024s_state *state)
+{
+	int index = state->index;
+	uint64_t const s = state->x[index++];
+	uint64_t t = state->x[index &= 15];
+	t ^= t << 31;		// a
+	t ^= t >> 11;		// b  -- Again, the shifts and the multipliers are tunable
+	t ^= s ^ (s >> 30);	// c
+	state->x[index] = t;
+	state->index = index;
+	return t * 1181783497276652981ULL;
+}
+
+

xorshift+

+

An xorshift+ generator can achieve an order of magnitude fewer failures than Mersenne Twister or WELL. A native C implementation of an xorshift+ generator that passes all tests from the BigCrush suite can typically generate a random number in fewer than 10 clock cycles on x86, thanks to instruction pipelining.[10] +

Rather than using multiplication, it is possible to use addition as a faster non-linear transformation. The idea was first proposed by Saito and Matsumoto (also responsible for the Mersenne Twister) in the XSadd generator, which adds two consecutive outputs of an underlying xorshift generator based on 32-bit shifts.[11] However, one disadvantage of adding consecutive outputs is that, while the underlying xorshift128 generator is 2-dimensionally equidistributed, the xorshift128+ generator is only 1-dimensionally equidistributed.[12] +

XSadd has some weakness in the low-order bits of its output; it fails several BigCrush tests when the output words are bit-reversed. To correct this problem, Vigna introduced the xorshift+ family,[12] based on 64-bit shifts. xorshift+ generators, even as large as xorshift1024+, exhibit some detectable linearity in the low-order bits of their output:[4] they pass BigCrush, but fail when the 32 lowest-order bits are used in reverse order from each 64-bit word.[4] This generator is one of the fastest generators passing BigCrush.[10]

The following xorshift128+ generator uses 128 bits of state and has a maximal period of 2^128 − 1.

+
#include <stdint.h>
+
+struct xorshift128p_state {
+    uint64_t x[2];
+};
+
+/* The state must be seeded so that it is not all zero */
+uint64_t xorshift128p(struct xorshift128p_state *state)
+{
+	uint64_t t = state->x[0];
+	uint64_t const s = state->x[1];
+	state->x[0] = s;
+	t ^= t << 23;		// a
+	t ^= t >> 18;		// b -- Again, the shifts and the multipliers are tunable
+	t ^= s ^ (s >> 5);	// c
+	state->x[1] = t;
+	return t + s;
+}
+
+

xoshiro

+

xoshiro and xoroshiro use rotations in addition to shifts. According to Vigna, they are faster and produce better quality output than xorshift.[13][14] +

This class of generator has variants for 32-bit and 64-bit integer and floating point output; for floating point, one takes the upper 53 bits (for binary64) or the upper 23 bits (for binary32), since the upper bits are of better quality than the lower bits in the floating point generators. The algorithms also include a jump function, which sets the state forward by some number of steps – usually a power of two that allows many threads of execution to start at distinct initial states. +

For 32-bit output, xoshiro128** and xoshiro128+ are exactly equivalent to xoshiro256** and xoshiro256+, with uint32_t in place of uint64_t, and with different shift/rotate constants. +

More recently, the xoshiro++ generators have been introduced as an alternative to the xoshiro** generators; they are used, for example, in Java and Julia.[15]

+

xoshiro256**

+

xoshiro256** is the family's general-purpose random 64-bit number generator. It is used in the GNU Fortran compiler, in Lua (as of Lua 5.4), and in the .NET framework (as of .NET 6.0).[15]

+
/*  Adapted from the code included on Sebastiano Vigna's website */
+
+#include <stdint.h>
+
+uint64_t rol64(uint64_t x, int k)
+{
+	return (x << k) | (x >> (64 - k));
+}
+
+struct xoshiro256ss_state {
+	uint64_t s[4];
+};
+
+uint64_t xoshiro256ss(struct xoshiro256ss_state *state)
+{
+	uint64_t *s = state->s;
+	uint64_t const result = rol64(s[1] * 5, 7) * 9;
+	uint64_t const t = s[1] << 17;
+
+	s[2] ^= s[0];
+	s[3] ^= s[1];
+	s[1] ^= s[2];
+	s[0] ^= s[3];
+
+	s[2] ^= t;
+	s[3] = rol64(s[3], 45);
+
+	return result;
+}
+
+

xoshiro256+

+

xoshiro256+ is approximately 15% faster than xoshiro256**, but the lowest three bits have low linear complexity; therefore, it should be used only for floating point results by extracting the upper 53 bits. +

+
#include <stdint.h>
+
+uint64_t rol64(uint64_t x, int k)
+{
+	return (x << k) | (x >> (64 - k));
+}
+
+struct xoshiro256p_state {
+	uint64_t s[4];
+};
+
+uint64_t xoshiro256p(struct xoshiro256p_state *state)
+{
+	uint64_t* s = state->s;
+	uint64_t const result = s[0] + s[3];
+	uint64_t const t = s[1] << 17;
+
+	s[2] ^= s[0];
+	s[3] ^= s[1];
+	s[1] ^= s[2];
+	s[0] ^= s[3];
+
+	s[2] ^= t;
+	s[3] = rol64(s[3], 45);
+
+	return result;
+}
+
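Accordingly, a typical way to consume xoshiro256+ (our sketch, not part of the article) is to map each output to a double in [0, 1) using only the upper 53 bits, so the weak low bits are never used:

#include <stdint.h>

/* Assumes xoshiro256p() and struct xoshiro256p_state from the listing above. */
static inline double xoshiro256p_double(struct xoshiro256p_state *state)
{
    return (xoshiro256p(state) >> 11) * 0x1.0p-53;
}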
+

xoroshiro

+

If space is at a premium, xoroshiro128** and xoroshiro128+ are equivalent to xoshiro256** and xoshiro256+. These have smaller state spaces, and thus are less useful for massively parallel programs. xoroshiro128+ also exhibits a mild dependency in the population count, generating a failure after 5 TB of output. The authors do not believe that this can be detected in real world programs.

xoroshiro64** and xoroshiro64* are equivalent to xoroshiro128** and xoroshiro128+. Unlike the xoshiro generators, they are not straightforward ports of their higher-precision counterparts. +

+

Initialization

+

In the xoshiro paper, it is recommended to initialize the state of the generators using a generator which is radically different from the initialized generators, as well as one which will never give the "all-zero" state; for shift-register generators, this state is impossible to escape from.[14][16] The authors specifically recommend using the SplitMix64 generator, from a 64-bit seed, as follows: +

+
#include <stdint.h>
+
+struct splitmix64_state {
+	uint64_t s;
+};
+
+uint64_t splitmix64(struct splitmix64_state *state) {
+	uint64_t result = (state->s += 0x9E3779B97f4A7C15);
+	result = (result ^ (result >> 30)) * 0xBF58476D1CE4E5B9;
+	result = (result ^ (result >> 27)) * 0x94D049BB133111EB;
+	return result ^ (result >> 31);
+}
+
+struct xorshift128_state {
+    uint32_t x[4];
+};
+
+// one could do the same for any of the other generators
+void xorshift128_init(struct xorshift128_state *state, uint64_t seed) {
+	struct splitmix64_state smstate = {seed};
+
+	uint64_t tmp = splitmix64(&smstate);
+	state->x[0] = (uint32_t)tmp;
+	state->x[1] = (uint32_t)(tmp >> 32);
+
+	tmp = splitmix64(&smstate);
+	state->x[2] = (uint32_t)tmp;
+	state->x[3] = (uint32_t)(tmp >> 32);
+}
+
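As a usage sketch (ours, not from the article), the initializer can then feed the xorshift128 generator defined earlier:

#include <inttypes.h>
#include <stdio.h>

/* Assumes xorshift128(), struct xorshift128_state and xorshift128_init()
   from the listings above. */
int main(void)
{
    struct xorshift128_state state;
    xorshift128_init(&state, 42);           /* any 64-bit seed works */
    for (int i = 0; i < 5; i++)
        printf("%" PRIu32 "\n", xorshift128(&state));
    return 0;
}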
+

Notes

+
+
  1. In C and most other C-based languages, ^ represents bitwise XOR, and << and >> represent bitwise shifts.
+

References

+
+
  1. Marsaglia, George (July 2003). "Xorshift RNGs". Journal of Statistical Software. 8 (14). doi:10.18637/jss.v008.i14.
  2. Brent, Richard P. (August 2004). "Note on Marsaglia's Xorshift Random Number Generators". Journal of Statistical Software. 11 (5). doi:10.18637/jss.v011.i05. hdl:1885/34049.
  3. Panneton, François; L'Ecuyer, Pierre (October 2005). "On the xorshift random number generators" (PDF). ACM Transactions on Modeling and Computer Simulation. 15 (4): 346–361. doi:10.1145/1113316.1113319. S2CID 11136098.
  4. Lemire, Daniel; O'Neill, Melissa E. (April 2019). "Xorshift1024*, Xorshift1024+, Xorshift128+ and Xoroshiro128+ Fail Statistical Tests for Linearity". Computational and Applied Mathematics. 350: 139–142. arXiv:1810.05313. doi:10.1016/j.cam.2018.10.019. S2CID 52983294.
  5. Le Floc'h, Fabien (12 January 2011). "XORWOW L'ecuyer TestU01 Results". Chase The Devil (blog). Retrieved 2017-11-02.
  6. "cuRAND Testing". Nvidia. Retrieved 2017-11-02.
  7. Vigna, Sebastiano (July 2016). "An experimental exploration of Marsaglia's xorshift generators, scrambled" (PDF). ACM Transactions on Mathematical Software. 42 (4): 30. arXiv:1402.6246. doi:10.1145/2845077. S2CID 13936073. Proposes xorshift* generators, adding a final multiplication by a constant.
  8. O'Neill, Melissa E. (5 September 2014). PCG: A Family of Simple Fast Space-Efficient Statistically Good Algorithms for Random Number Generation (PDF) (Technical report). Harvey Mudd College. pp. 6–8. HMC-CS-2014-0905.
  9. Press, WH; Teukolsky, SA; Vetterling, WT; Flannery, BP (2007). "Section 7.1.2.A. 64-bit Xorshift Method". Numerical Recipes: The Art of Scientific Computing (3rd ed.). New York: Cambridge University Press. ISBN 978-0-521-88068-8.
  10. Vigna, Sebastiano. "xorshift*/xorshift+ generators and the PRNG shootout". Retrieved 2014-10-25.
  11. Saito, Mutsuo; Matsumoto, Makoto (2014). "XORSHIFT-ADD (XSadd): A variant of XORSHIFT". Retrieved 2014-10-25.
  12. Vigna, Sebastiano (May 2017). "Further scramblings of Marsaglia's xorshift generators" (PDF). Journal of Computational and Applied Mathematics. 315 (C): 175–181. arXiv:1404.0390. doi:10.1016/j.cam.2016.11.006. S2CID 6876444. Describes xorshift+ generators, a generalization of XSadd.
  13. Vigna, Sebastiano. "xoshiro/xoroshiro generators and the PRNG shootout". Retrieved 2019-07-07.
  14. Blackman, David; Vigna, Sebastiano (2018). "Scrambled Linear Pseudorandom Number Generators". arXiv:1805.01407.
  15. "xoshiro / xoroshiro generators and the PRNG shootout". Retrieved 2023-09-07.
  16. Matsumoto, Makoto; Wada, Isaku; Kuramoto, Ai; Ashihara, Hyo (September 2007). "Common defects in initialization of pseudorandom number generators". ACM Transactions on Modeling and Computer Simulation. 17 (4): 15–es. doi:10.1145/1276927.1276928. S2CID 1721554.
+

Further reading

\ No newline at end of file diff --git a/references/how-does-xorshift32-works b/references/how-does-xorshift32-works new file mode 100644 index 0000000..d38791d --- /dev/null +++ b/references/how-does-xorshift32-works @@ -0,0 +1,2380 @@
c - How does XorShift32 works? - Stack Overflow

I have this homework where I need to implement xorshift32 (I can't use anything else) so I can generate some numbers, but I don't understand how the algorithm works or how to implement it.

+ +

I am trying to print the generated number, but I don't know how to call the xorshift32 function because of the state[static 1] argument.

+ +
uint32_t xorshift32(uint32_t state[static 1])
+{
+    uint32_t x = state[0];
+    x ^= x << 13;
+    x ^= x >> 17;
+    x ^= x << 5;
+    state[0] = x;
+    return x;
+}
+
+ +

I do not have much information on xorshift32 other than what is on Wikipedia (en.wikipedia.org/wiki/Xorshift).

+
  • Can you even compile that function? It does not conform to standard C (with respect to the function parameter), so if your compiler accepts it then some language extension is in play. You'll need to check your implementation's documentation (or maybe your class notes) to find out what it means. On the other hand, maybe there's simply a typo there. It would make more sense if the static keyword were removed, or perhaps moved to the beginning of the function declaration. – John Bollinger, Dec 21, 2018 at 14:16
  • You need to tell us what the xorshift32 is supposed to do. – Jabberwocky, Dec 21, 2018 at 14:21
  • @Jabberwocky here is the Wikipedia: en.wikipedia.org/wiki/Xorshift. I don't even know how to explain it to you. It's a number generator using xor and shift made by a guy. The teacher didn't tell us much either. – Predescu Eduard, Dec 21, 2018 at 14:30
  • @PredescuEduard that information belongs into the question. You can edit your question. – Jabberwocky, Dec 21, 2018 at 14:31
2 Answers

This is an extended comment to the good answer by Jabberwocky.

+ +

The Xorshift variants, rand(), and basically all random number generator functions, are actually pseudorandom number generators. They are not "real random", because the sequence of numbers they generate depends on their internal state; but they are "pseudorandom", because if you do not know the generator internal state, the sequence of numbers they generate is random in the statistical sense.

+ +

George Marsaglia, the author of the Xorshift family of pseudorandom number generators, also developed a set of statistical tools called Diehard tests that can be used to analyse the "randomness" of the sequences generated. Currently, the TestU01 tests are probably the most widely used and trusted; in particular, the 160-test BigCrush set.

+ +

The sequence generated by ordinary pseudorandom number generators often allows one to determine the internal state of the generator. This means that observing a long enough generated sequence allows one to fairly reliably predict the future sequence. Cryptographically secure pseudorandom number generators avoid that, usually by applying a cryptographically secure hash function to the output; one would need a catalog of the entire sequence to be able to follow it. When the periods are longer than 2^256 or so, there is not enough baryonic matter in the entire observable universe to store the sequence.

+ +

My own favourite PRNG is Xorshift64*, which has a period of 2^64 − 1, and passes all but the MatrixRank test in BigCrush. In C99 and later, you can implement it using

+ +
#include <inttypes.h>
+
+typedef struct {
+    uint64_t  state;
+} prng_state;
+
+static inline uint64_t prng_u64(prng_state *const p)
+{
+    uint64_t  state = p->state;
+    state ^= state >> 12;
+    state ^= state << 25;
+    state ^= state >> 27;
+    p->state = state;
+    return state * UINT64_C(2685821657736338717);
+}
+
+ +

The state can be initialized to any nonzero uint64_t. (A zero state will lead the generator to generate all zeros till infinity. The period is 2^64 − 1, because the generator will have each 64-bit state (excluding zero) exactly once during each period.)

+ +

It is good enough for most use cases, and extremely fast. It belongs to the class of linear-feedback shift register pseudorandom number generators.

+ +
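For example, a minimal driver (our sketch) seeds the state with an arbitrary nonzero value and draws a few numbers:

#include <inttypes.h>
#include <stdio.h>

/* Assumes prng_state and prng_u64() as defined above. */
int main(void)
{
    prng_state rng = { .state = UINT64_C(88172645463325252) }; /* any nonzero seed */
    for (int i = 0; i < 3; i++)
        printf("%" PRIu64 "\n", prng_u64(&rng));
    return 0;
}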

Note that the variant which returns a uniform distribution between 0 and 1,

+ +
static inline double prng_one(prng_state *p)
+{
+    return prng_u64(p) / 18446744073709551616.0;
+}
+
+ +

uses the high bits; the high 32 bits of the sequence do pass all BigCrush tests in the TestU01 suite, so this is a surprisingly good (randomness and efficiency) generator for double-precision uniform random numbers -- my typical use case.

+ +

The format above allows multiple independent generators in a single process, by specifying the generator state as a parameter. If the basic generator is implemented in a header file (thus the static inline; it is a preprocessor macro-like function), you can switch between generators by switching between header files, and recompiling the binary.

+ +

(You are usually better off by using a single generator, unless you use multiple threads in a pseudorandom number heavy simulator, in which case using a separate generator for each thread will help a lot; avoids cacheline ping-pong between threads competing for the generator state, in particular.)

+ +

The rand() function in most C standard library implementations is a linear-congruential generator. They often suffer from poor choices of the coefficients, and nowadays, also from the relative slowness of the modulo operator (when the modulus is not a power of two).

+ +
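To make the contrast concrete, here is a textbook-form LCG (our sketch; the constants are Knuth's MMIX values, used purely as an example, not what any particular libc ships):

#include <stdint.h>

/* Linear-congruential generator: state = a*state + c (mod 2^64).
   Return the high 32 bits, which are the less predictable ones. */
static uint64_t lcg_state = 1;

static uint32_t lcg_next(void)
{
    lcg_state = lcg_state * UINT64_C(6364136223846793005)
              + UINT64_C(1442695040888963407);
    return (uint32_t)(lcg_state >> 32);
}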

The most widely used pseudorandom number generator is the Mersenne Twister, by Makoto Matsumoto (松本 眞) and Takuji Nishimura (西村 拓士). It is a twisted generalized linear feedback shift register, and has quite a large state (about 2500 bytes) and a very long period (2^19937 − 1).

+ +
+ +

When we talk of true random number generators, we usually mean a combination of a pseudorandom number generator (usually a cryptographically secure one), and a source of entropy; random bits with at least some degree of true physical randomness.

+ +

In Linux, Mac OS, and BSDs at least, the operating system kernel exposes a source of pseudorandom numbers (getentropy() in Linux and OpenBSD, getrandom() in Linux, /dev/urandom, /dev/arandom, /dev/random in many Unixes, and so on). Entropy is gathered from physical electronic sources, like internal processor latencies, physical interrupt line timings, (spinning disk) hard drive timings, possibly even keyboard and mice. Many motherboards and some processors even have hardware random number sources that can be used as sources for entropy (or even directly as "trusted randomness sources").
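A minimal way to tap such a source from C (our sketch, assuming a POSIX-like system that exposes /dev/urandom) is:

#include <stdint.h>
#include <stdio.h>

/* Read a 64-bit seed from the OS entropy pool; fall back to a fixed
   nonzero constant if the device cannot be read. */
static uint64_t os_seed64(void)
{
    uint64_t seed = 0;
    FILE *f = fopen("/dev/urandom", "rb");
    if (f) {
        if (fread(&seed, sizeof seed, 1, f) != 1)
            seed = 0;
        fclose(f);
    }
    return seed ? seed : UINT64_C(0x9E3779B97F4A7C15);
}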

+ +

The exclusive-or operation (^ in C) is used to mix in randomness to the generator state. This works, because exclusive-or between a known bit and a random bit results in a random bit; XOR preserves randomness. When mixing entropy pools (with some degree of randomness in the bit states) using XOR, the result will have at least as much entropy as the sources had.

+ +

Note that that does not mean that you get "better" random numbers by mixing the output of two or more generators. The statistics of true randomness is hard for humans to grok (just look at how poor the common early rand() implementations were! HORRIBLE!). It is better to pick a generator (or a set of generators to switch between at compile time, or at run time) that passes the BigCrush tests, and ensure it has a good random initial state on every run. That way you leverage the work of many mathematicians and others who have worked on these things for decades, and can concentrate on the other stuff, what you yourself are good at.


The C code in the wikipedia article is somewhat misleading:

+ +

Here is a working example that uses both the 32 bit and the 64 bit versions:

+ +
#include <stdio.h>
+#include <stdint.h>
+
+/* The state word must be initialized to non-zero */
+uint32_t xorshift32(uint32_t state[])
+{
+  /* Algorithm "xor" from p. 4 of Marsaglia, "Xorshift RNGs" */
+  uint32_t x = state[0];
+  x ^= x << 13;
+  x ^= x >> 17;
+  x ^= x << 5;
+  state[0] = x;
+  return x;
+}
+
+uint64_t xorshift64(uint64_t state[])
+{
+  uint64_t x = state[0];
+  x ^= x << 13;
+  x ^= x >> 7;
+  x ^= x << 17;
+  state[0] = x;
+  return x;
+}
+
+int main()
+{
+  uint32_t state[1] = {1234};  // "seed" (can be anything but 0)
+
+  for (int i = 0; i < 50; i++)
+  {
+    printf("%u\n", xorshift32(state));
+  }
+
+  uint64_t state64[1] = { 1234 };  // "seed" (can be anything but 0)
+
+  for (int i = 0; i < 50; i++)
+  {
+    printf("%llu\n", xorshift64(state64));
+  }
+}
+
+ +

The mathematical aspects are explained in the Wikipedia article and in its footnotes.

+ +

The rest is basic C language knowledge; ^ is the C bitwise XOR operator.

diff --git a/references/how-to-calculate-the-inverse-of-the-normal-cumulative-distribution-function-in-p b/references/how-to-calculate-the-inverse-of-the-normal-cumulative-distribution-function-in-p new file mode 100644 index 0000000..5087b72 --- /dev/null +++ b/references/how-to-calculate-the-inverse-of-the-normal-cumulative-distribution-function-in-p @@ -0,0 +1,2498 @@
scipy - How to calculate the inverse of the normal cumulative distribution function in python? - Stack Overflow

How do I calculate the inverse of the cumulative distribution function (CDF) of the normal distribution in Python?

+ +

Which library should I use? Possibly scipy?

  • Do you mean the inverse Gaussian distribution (en.wikipedia.org/wiki/Inverse_Gaussian_distribution), or the inverse of the cumulative distribution function of the normal distribution (en.wikipedia.org/wiki/Normal_distribution), or something else? – Dec 17, 2013 at 6:30
  • @WarrenWeckesser the second one: inverse of the cumulative distribution function of the normal distribution – Yueyoum, Dec 17, 2013 at 6:32
  • @WarrenWeckesser I mean the python version of the "normsinv" function in Excel. – Yueyoum, Dec 17, 2013 at 6:39
3 Answers

NORMSINV (mentioned in a comment) is the inverse of the CDF of the standard normal distribution. Using scipy, you can compute this with the ppf method of the scipy.stats.norm object. The acronym ppf stands for percent point function, which is another name for the quantile function.

+
In [20]: from scipy.stats import norm
+
+In [21]: norm.ppf(0.95)
+Out[21]: 1.6448536269514722
+
+

Check that it is the inverse of the CDF:

+
In [34]: norm.cdf(norm.ppf(0.95))
+Out[34]: 0.94999999999999996
+
+

By default, norm.ppf uses mean=0 and stddev=1, which is the "standard" normal distribution. You can use a different mean and standard deviation by specifying the loc and scale arguments, respectively.

+
In [35]: norm.ppf(0.95, loc=10, scale=2)
+Out[35]: 13.289707253902945
+
+

If you look at the source code for scipy.stats.norm, you'll find that the ppf method ultimately calls scipy.special.ndtri. So to compute the inverse of the CDF of the standard normal distribution, you could use that function directly:

+
In [43]: from scipy.special import ndtri
+
+In [44]: ndtri(0.95)
+Out[44]: 1.6448536269514722
+
+

ndtri is much faster than norm.ppf:

+
In [46]: %timeit norm.ppf(0.95)
+240 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
+
+In [47]: %timeit ndtri(0.95)
+1.47 µs ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
+
  • I always think "percent point function" (ppf) is a terrible name. Most people in statistics just use "quantile function". – William Zhang, Oct 4, 2014 at 0:44
  • Don't you need to specify the mean and the std on both ppf and cdf? – bones.felipe, Jan 29, 2021 at 19:23
  • @bones.felipe, the "standard" normal distribution has mean 0 and standard deviation 1. These are the default values for the location and scale of the scipy.stats.norm methods. – Jan 29, 2021 at 19:55
  • Right, I thought I saw this norm.cdf(norm.ppf(0.95, loc=10, scale=2)) and I thought it was weird norm.cdf did not have loc=10 and scale=2 too, I guess it should. – bones.felipe, Jan 30, 2021 at 5:33

Starting Python 3.8, the standard library provides the NormalDist object as part of the statistics module.

+ +

It can be used to get the inverse cumulative distribution function (inv_cdf - inverse of the cdf), also known as the quantile function or the percent-point function for a given mean (mu) and standard deviation (sigma):

+ +
from statistics import NormalDist
+
+NormalDist(mu=10, sigma=2).inv_cdf(0.95)
+# 13.289707253902943
+
+ +

Which can be simplified for the standard normal distribution (mu = 0 and sigma = 1):

+ +
NormalDist().inv_cdf(0.95)
+# 1.6448536269514715
+
  • Great tip! This allows me to drop the dependency on scipy, which I needed just for the single stats.norm.ppf method – Jethro Cao, Feb 21, 2020 at 16:56
  • can you use that to transform data with uniform distribution to normal? – vanetoj, Mar 31, 2022 at 20:51
+
# given random variable X (house price) with population mu = 60, sigma = 40
+import scipy as sc
+import scipy.stats as sct
+sc.version.full_version # 0.15.1
+
+#a. Find P(X<50)
+sct.norm.cdf(x=50,loc=60,scale=40) # 0.4012936743170763
+
+#b. Find P(X>=50)
+sct.norm.sf(x=50,loc=60,scale=40) # 0.5987063256829237
+
+#c. Find P(60<=X<=80)
+sct.norm.cdf(x=80,loc=60,scale=40) - sct.norm.cdf(x=60,loc=60,scale=40)
+
+#d. how much top most 5% expensive house cost at least? or find x where P(X>=x) = 0.05
+sct.norm.isf(q=0.05,loc=60,scale=40)
+
+#e. how much top most 5% cheapest house cost at least? or find x where P(X<=x) = 0.05
+sct.norm.ppf(q=0.05,loc=60,scale=40)
+
+
  • PS: You can assume 'loc' as 'mean' and 'scale' as 'standard deviation' – Suresh2692, Jul 5, 2017 at 11:11
diff --git a/references/index.html b/references/index.html new file mode 100644 index 0000000..f87ed9d --- /dev/null +++ b/references/index.html @@ -0,0 +1,597 @@
xoshiro/xoroshiro generators and the PRNG shootout

Introduction

+ +

This page describes some new pseudorandom number generators (PRNGs) we (David Blackman and I) have been working on recently, and + a shootout comparing them with other generators. Details about the generators can + be found in our paper. Information about my previous xorshift-based + generators can be found here, but they have been entirely superseded by the new ones, which + are faster and better. As part of our study, we developed a very strong test for Hamming-weight dependencies + that gave a number of surprising results. + +

64-bit Generators

+ +

xoshiro256++/xoshiro256** + (XOR/shift/rotate) are our all-purpose + generators (not cryptographically secure generators, though, + like all PRNGs in these pages). They have excellent (sub-ns) speed, a state + space (256 bits) that is large enough for any parallel application, and + they pass all tests we are aware of. See the paper + for a discussion of their differences. + +

If, however, one has to generate only 64-bit floating-point numbers + (by extracting the upper 53 bits) xoshiro256+ is a slightly (≈15%) + faster generator with analogous statistical properties. For general + usage, one has to consider that its lowest bits have low linear + complexity and will fail linearity tests; however, low linear + complexity of the lowest bits can have hardly any impact in practice, and certainly has no + impact at all if you generate floating-point numbers using the upper bits (we computed a precise + estimate of the linear complexity of the lowest bits). + +

If you are tight on space, xoroshiro128++/xoroshiro128** + (XOR/rotate/shift/rotate) and xoroshiro128+ have the same + speed and use half of the space; the same comments apply. They are suitable only for + low-scale parallel applications; moreover, xoroshiro128+ + exhibits a mild dependency in Hamming weights that generates a failure + after 5 TB of output in our test. We believe + this slight bias cannot affect any application. + +

Finally, if for any reason (which reason?) you need more + state, we provide in the same + vein xoshiro512++ / xoshiro512** / xoshiro512+ and + xoroshiro1024++ / xoroshiro1024** / xoroshiro1024* (see the paper). + +

All generators, being based on linear recurrences, provide jump + functions that make it possible to simulate any number of calls to + the next-state function in constant time, once a suitable jump + polynomial has been computed. We provide ready-made jump functions for + a number of calls equal to the square root of the period, to make it easy + generating non-overlapping sequences for parallel computations, and equal + to the cube of the fourth root of the period, to make it possible to + generate independent sequences on different parallel processors. + +

We suggest to use SplitMix64 to initialize + the state of our generators starting from a 64-bit seed, as research + has shown that initialization must be performed with a generator + radically different in nature from the one initialized to avoid + correlation on similar seeds. + + +
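A sketch of that recommendation (ours, not code from this page): expand one 64-bit seed into a 256-bit xoshiro state with SplitMix64.

#include <stdint.h>

/* One SplitMix64 step, then fill a 4-word (256-bit) state from a single seed. */
static uint64_t splitmix64_next(uint64_t *sm)
{
    uint64_t z = (*sm += UINT64_C(0x9E3779B97F4A7C15));
    z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
    return z ^ (z >> 31);
}

static void xoshiro256_seed(uint64_t s[4], uint64_t seed)
{
    for (int i = 0; i < 4; i++)
        s[i] = splitmix64_next(&seed);
}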

32-bit Generators

+ +

xoshiro128++/xoshiro128** are our 32-bit all-purpose generators, whereas xoshiro128+ is for floating-point generation. They are the 32-bit counterpart of xoshiro256++, xoshiro256** and xoshiro256+, so similar comments apply. Their state is too small for large-scale parallelism: their intended usage is inside embedded hardware or GPUs. For an even smaller scale, you can use xoroshiro64** and xoroshiro64*. We do not believe that at this point in time a 32-bit generator with a larger state can be of any use (but there are 32-bit xoroshiro generators of much larger size).

All 32-bit generators pass all tests we are aware of, with the + exception of linearity tests (binary rank and linear complexity) for + xoshiro128+ and xoroshiro64*: in this case, + due to the smaller number of output bits the low linear complexity of the + lowest bits is sufficient to trigger BigCrush tests when the output is bit-reversed. Analogously to + the 64-bit case, generating 32-bit floating-point number using the + upper bits will not use any of the bits with low linear complexity. + +

16-bit Generators

+ +

We do not suggest any particular 16-bit generator, but it is possible + to design relatively good ones using our techniques. For example, + Parallax has embedded in their Propeller 2 microcontroller multiple 16-bit + xoroshiro32++ generators. + +

Congruential Generators

+ +

In case you are interested in 64-bit PRNGs based on congruential arithmetic, I provide three instances of Marsaglia's Multiply-With-Carry generators, MWC128, MWC192, and MWC256, for which I computed good constants. They are some of the fastest generators available, but they need 128-bit operations.

Stronger theoretical guarantees are provided by the generalized multiply-with-carry generators defined by Goresky and Klapper: also in this case I provide two instances, GMWC128 and GMWC256, for which I computed good constants. These generators, however, are about twice as slow as MWC generators.

JavaScript

+ +

xorshift128+ is presently used in the JavaScript engines of + Chrome, + Node.js, + Firefox, + Safari and + Microsoft Edge. + +

Rust

+

The SmallRng from the rand + crate is xoshiro256++ or xoshiro128++, depending + on the platform. + +

java.util.random

+ +

I worked with Guy Steele at the new + family of PRNGs available in Java 17. The family, called LXM, uses new, better + tables of multipliers for LCGs with power-of-two moduli. Moreover, + java.util.random contains ready-to-use implementations of + xoroshiro128++ and xoshiro256++. + +

.NET

+ +

In version 6, Microsoft's .NET framework has adopted + xoshiro256** and xoshiro128** as default PRNGs. + +

Erlang

+ +

The parallel functional language Erlang implements several + variants of xorshift/xoroshiro-based generators adapted in collaboration with Raimo Niskanen for Erlang's + 58/59-bit arithmetic. + +

GNU FORTRAN

+

GNU's implementation of the FORTRAN language uses + xoshiro256** as default PRNG. + +

Julia

+

The Julia programming language uses + xoshiro256++ as default PRNG. + +

Lua

+

The scripting language Lua uses xoshiro256** as default PRNG. + +

IoT

+ +

The IoT operating systems Mbed and Zephyr use + xoroshiro128+ as default PRNG. + +

A PRNG Shootout

+ +

I provide here a shootout of a few recent 64-bit PRNGs that are quite widely used. + The purpose is that of providing a consistent, reproducible assessment of two properties of the generators: speed and quality. + The code used to perform the tests and all the output from statistical test suites is available for download. + +

Speed

+ +

The speed reported in this page is the time required to emit 64 + random bits, and the number of clock cycles required to generate a byte (thanks to the PAPI library). If a generator is 32-bit in nature, I glue two + consecutive outputs. Note that + I do not report results using GPUs or SSE instructions, with an exception for the very common SFMT: for that to be + meaningful, I should have implementations for all generators. + Otherwise, with suitable hardware support I could just use AES in + counter mode and get 64 secure bits in 0.56 ns (or just use Randen). The tests were performed on a + 12th Gen Intel® Core™ i7-12700KF @3.60GHz using gcc 12.2.1. + +

A few caveats: +

+ +

To ease replicability, I distribute a harness performing the measurement. You just + have to define a next() function and include the harness. But the only realistic + suggestion is to try different generators in your application and see what happens. + +

Quality

+ +

This is probably the more elusive property + of a PRNG. Here quality is measured using the powerful + BigCrush suite of tests. BigCrush is part of TestU01, + a monumental framework for testing PRNGs developed by Pierre L'Ecuyer + and Richard Simard (“TestU01: A C library for empirical testing + of random number generators”, ACM Trans. Math. Softw. + 33(4), Article 22, 2007). + +

I run BigCrush starting from 100 equispaced points of the state space + of the generator and collect failures—tests in which the + p-value statistics is outside the interval [0.001..0.999]. A failure + is systematic if it happens at all points. + +

Note that TestU01 is a 32-bit test suite. Thus, two 32-bit integer values + are passed to the test suite for each generated 64-bit value. Floating point numbers + are generated instead by dividing the unsigned output of the generator by 264. + Since this implies a bias towards the high bits (which is anyway a known characteristic + of TestU01), I run the test suite also on the reverse + generator. More detail about the whole process can be found in this paper. + +

Beside BigCrush, I analyzed generators using a test for Hamming-weight dependencies + described in our paper. As I already remarked, our only + generator failing the test (but only after 5 TB of output) is xoroshiro128+. + +

I report the period of each generator and its footprint in bits: a generator gives “bang-for-the-buck” + if the base-2 logarithm of the period is close to the footprint. Note + that the footprint has been always padded to a multiple of 64, and it can + be significantly larger than expected because of padding and + cyclic access indices. + +

+ + +
PRNG | Footprint (bits) | Period | BigCrush Systematic Failures | HWD failure | ns/64 bits | cycles/B
xoroshiro128+ | 128 | 2^128 − 1 | | 5 TB | 0.80 | 0.36
xoroshiro128++ | 128 | 2^128 − 1 | | | 0.90 | 0.40
xoroshiro128** | 128 | 2^128 − 1 | | | 0.78 | 0.36
xoshiro256+ | 256 | 2^256 − 1 | | | 0.61 | 0.27
xoshiro256++ | 256 | 2^256 − 1 | | | 0.75 | 0.34
xoshiro256** | 256 | 2^256 − 1 | | | 0.75 | 0.34
xoshiro512+ | 512 | 2^512 − 1 | | | 0.68 | 0.30
xoshiro512++ | 512 | 2^512 − 1 | | | 0.79 | 0.36
xoshiro512** | 512 | 2^512 − 1 | | | 0.81 | 0.37
xoroshiro1024* | 1068 | 2^1024 − 1 | | | 0.82 | 0.37
xoroshiro1024++ | 1068 | 2^1024 − 1 | | | 1.01 | 0.46
xoroshiro1024** | 1068 | 2^1024 − 1 | | | 0.98 | 0.44
MWC128 | 128 | ≈2^127 | | | 0.83 | 0.37
MWC192 | 192 | ≈2^191 | | | 1.42 | 0.19
MWC256 | 256 | ≈2^255 | | | 0.45 | 0.20
GMWC128 | 128 | ≈2^127 | | | 1.84 | 0.83
GMWC256 | 256 | ≈2^255 | | | 1.85 | 0.83
SFC64 | 256 | ≥2^64 | | | 0.66 | 0.30
SplitMix64 | 64 | 2^64 | | | 0.63 | 0.29
PCG 128 XSH RS 64 (LCG) | 128 | 2^128 | | | 1.70 | 0.77
PCG64-DXSM (NumPy) | 128 | 2^128 | | | 1.11 | 0.50
Ran | 192 | ≈2^191 | | | 1.37 | 0.62
MT19937-64 (Mersenne Twister) | 20032 | 2^19937 − 1 | LinearComp | | 1.36 | 0.62
SFMT19937 (uses SSE2 instructions) | 20032 | 2^19937 − 1 | LinearComp | | 0.93 | 0.42
SFMT607 (uses SSE2 instructions) | 672 | 2^607 − 1 | MatrixRank, LinearComp | 400 MB | 0.78 | 0.34
Tiny Mersenne Twister (64 bits) | 256 | 2^127 − 1 | | 90 TB→ | 2.76 | 1.25
Tiny Mersenne Twister (32 bits) | 224 | 2^127 − 1 | CollisionOver, Run, SimPoker, AppearanceSpacings, MatrixRank, LinearComp, LongestHeadRun, Run of Bits (reversed) | 40 TB→ | 4.27 | 1.92
WELL512a | 544 | 2^512 − 1 | MatrixRank, LinearComp | 3.5 PB | 5.42 | 2.44
WELL1024a | 1056 | 2^1024 − 1 | MatrixRank, LinearComp | | 5.30 | 2.38
+ +

The following table compares instead two ways of generating floating-point numbers, namely the 521-bit dSFMT, which + generates directly floating-point numbers with 52 significant bits, and + xoshiro256+ followed by a standard conversion of its upper bits to a floating-point number with 53 significant bits (see below). + +

+ + +
PRNG +Footprint (bits) +Period + BigCrush Systematic Failures +HWD failure +ns/double +cycles/B +
xoshiro256+ (returns 53 significant bits) 2562256 − 10.923.40 +
dSFMT (uses SSE2 instructions, returns only 52 significant bits)7042521 − 1MatrixRank, LinearComp6 TB0.853.07 +
+ +

xoshiro256+ is ≈8% slower than the dSFMT, but it has a doubled range of output values, does not need any extra SSE instruction (can be programmed in Java, etc.), + has a much smaller footprint, and its upper bits do not fail any test. + +

Remarks

+ +

Vectorization

+ +

Some of the generators can be very easily vectorized, so that multiple instances can be run in parallel to provide + fast bulk generation. Thanks to an interesting discussion with the Julia developers, + I've become aware that AVX2 vectorizations of multiple instances of generators using the +/++ scrambler are impressively fast (links + below point at a speed test to be used with the harness, and the result will be multiplied by 1000): + +

+ + +
PRNG +ns/64 bits +cycles/B +
xoroshiro128+ (4 parallel instances)0.360.14 +
xoroshiro128++ (4 parallel instances)0.450.18 +
xoshiro256+ (8 parallel instances)0.190.08 +
xoshiro256++ (8 parallel instances)0.260.09 +
+ +

Note that sometimes convincing the compiler to vectorize is a slightly quirky process: for example, on gcc 12.2.1 I have to use -O3 -fdisable-tree-cunrolli -march=native to vectorize xoshiro256-based generators (-O3 alone will not vectorize; thanks to Chris Elrod for pointing me at -fdisable-tree-cunrolli).

A long period does not imply high quality

+ +

This is a common misconception. The generator x++ has + period \(2^k\), for any \(k\geq0\), provided that x is + represented using \(k\) bits: nonetheless, it is a horrible generator. + The generator returning \(k-1\) zeroes followed by a one has period + \(k\). + +
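For instance, a plain counter (sketched below) has the full period 2^64 and yet is obviously a terrible generator:

#include <stdint.h>

/* Full period (2^64), zero statistical quality: period alone says nothing
   about how random the output looks. */
static uint64_t counter_next(uint64_t *state)
{
    return ++*state;
}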

It is however important that the period is long enough. A first heuristic rule of thumb + is that if you need to use \(t\) values, you need a generator with period at least \(t^2\). + Moreover, if you run \(n\) independent computations starting at random seeds, + the sequences used by each computation should not overlap. + +

Now, given a generator with period \(P\), the probability that \(n\) subsequences of length \(L\) starting at random points in the state space + overlap is bounded by \(n^2L/P\). If your generator has period \(2^{256}\) and you run + on \(2^{64}\) cores (you will never have them) a computation using \(2^{64}\) pseudorandom numbers (you will never have the time) + the probability of overlap would be less than \(2^{-64}\). + +
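Spelled out for this example, the bound is

\[ \frac{n^2 L}{P} = \frac{(2^{64})^2 \cdot 2^{64}}{2^{256}} = 2^{192-256} = 2^{-64}. \]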

In other words: any generator with a period beyond + \(2^{256}\) has a period that is + sufficient for every imaginable application. Unless there are other motivations (e.g., provably + increased quality), a generator with a larger period is only a waste of + memory (as it needs a larger state), of cache lines, and of + precious high-entropy random bits for seeding (unless you're using + small seeds, but then it's not clear why you would want a very long + period in the first place—the computation above is valid only if you seed all bits of the state + with independent, uniformly distributed random bits). + +

In case the generator provides a jump function that lets you skip through chunks of the output in constant + time, even a period of \(2^{128}\) can be sufficient, as it provides \(2^{64}\) non-overlapping sequences of length \(2^{64}\). + +

Equidistribution

+ +

Every 64-bit generator of ours with n bits of state scrambled with * or ** is n/64-dimensionally equidistributed: every n/64-tuple of consecutive 64-bit values appears exactly once in the output, except for the zero tuple (and this is the largest possible dimension). Generators based on the + or ++ scramblers are however only (n/64 − 1)-dimensionally equidistributed: every (n/64 − 1)-tuple of consecutive 64-bit values appears exactly 2^64 times in the output, except for a missing zero tuple. The same considerations apply to 32-bit generators.

Generating uniform doubles in the unit interval

+ +

A standard double (64-bit) floating-point number in + IEEE floating point format has 52 bits of + significand, plus an implicit bit at the left of the significand. Thus, + the representation can actually store numbers with 53 significant binary digits. + +
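A quick way to confirm the 53-digit figure on your platform is to print DBL_MANT_DIG from <float.h>:

    #include <float.h>
    #include <stdio.h>

    int main(void) {
        printf("%d\n", DBL_MANT_DIG);   /* prints 53 for IEEE 754 doubles */
        return 0;
    }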

Because of this fact, in C99 a 64-bit unsigned integer x should be converted to a 64-bit double + using the expression +

+    #include <stdint.h>
+
+    (x >> 11) * 0x1.0p-53
+
+

In Java you can use almost the same expression for a (signed) 64-bit integer: +

+    (x >>> 11) * 0x1.0p-53
+
+ + +

This conversion guarantees that all dyadic rationals of the form k / 2^53 will be equally likely. Note that this conversion prefers the high bits of x (usually, a good idea), but you can alternatively use the lowest bits.
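If you do want the lowest bits instead, the analogous expression (a sketch, masking rather than shifting) is

    #include <stdint.h>

    (x & ((UINT64_C(1) << 53) - 1)) * 0x1.0p-53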

An alternative, multiplication-free conversion is +

+    #include <stdint.h>
+
+    static inline double to_double(uint64_t x) {
+       const union { uint64_t i; double d; } u = { .i = UINT64_C(0x3FF) << 52 | x >> 12 };
+       return u.d - 1.0;
+    }
+
+

The code above uses bit manipulation to cook up a real number in the interval [1..2), and then subtracts one to obtain a real number in the interval [0..1). If x is chosen uniformly among 64-bit integers, d is chosen uniformly among dyadic rationals of the form k / 2^52. This is the same technique used by generators that provide doubles directly, such as the dSFMT.

This technique is supposed to be fast, but on recent hardware it does not seem to give a significant advantage. More importantly, you will be generating half the values you could actually generate. The same problem plagues the dSFMT. All doubles generated will have the lowest significand bit set to zero (I must thank Raimo Niskanen from the Erlang team for making me notice this—a previous version of this site did not mention this issue).

In Java you can obtain an analogous result using suitable static methods: +

+    Double.longBitsToDouble(0x3FFL << 52 | x >>> 12) - 1.0
+
+ +

To adhere to the principle of least surprise, my implementations now use the multiplicative version, everywhere. + +

Interestingly, these are not the only notions of “uniformity” you can come up with. Another possibility is to generate a 1074-bit integer, normalize it, and return the nearest value representable as a 64-bit double (this is the theory; in practice, you will almost never use more than two integers per double, as the remaining bits would not be representable). This approach guarantees that all representable doubles could in principle be generated, albeit not every returned double will appear with the same probability. A reference implementation can be found here. Note that unless your generator has at least 1074 bits of state and suitable equidistribution properties, the code above will not do what you expect (e.g., it might never return zero).

+ + +
+ + + + diff --git a/references/input?i=N[InverseCDF(normal(0,1),+0.05),{∞,100}] b/references/input?i=N[InverseCDF(normal(0,1),+0.05),{∞,100}] new file mode 100644 index 0000000..fd6be45 --- /dev/null +++ b/references/input?i=N[InverseCDF(normal(0,1),+0.05),{∞,100}] @@ -0,0 +1,143 @@ +N[InverseCDF(normal(0,1), 0.05),{∞,100}] - Wolfram|Alpha
N[InverseCDF(normal(0,1), 0.05),{∞,100}]
\ No newline at end of file diff --git a/references/on-vignas-pcg-critique.html b/references/on-vignas-pcg-critique.html new file mode 100644 index 0000000..f498d1f --- /dev/null +++ b/references/on-vignas-pcg-critique.html @@ -0,0 +1,577 @@ + + + + + +On Vigna's PCG Critique | PCG, A Better Random Number Generator + + + + + + + + + + + + + + + + + + + + + + + + + + +Skip to main content + + + +
+
+ +
+ + +

On Vigna's PCG Critique

+ + + + +
+
+

On 14 May 2018, Sebastiano Vigna added a page to his website (archived here) entitled “The wrap-up on PCG generators” that attempts to persuade readers to avoid various PCG generators.

+

That day, he also submitted a link to his critique to Reddit (archived here). I think it is fair to say that his remarks did not get quite the reception he might have hoped for. Readers mostly seemed to infer a certain animosity in his tone and his criticisms gained little traction with that audience.

+

Although I'm pleased to see readers of Reddit thinking critically about these things, it is worth taking the time to dive in and see what lessons we can learn from all of this.

+ + +

Background

+

We have to feel a little sympathy for Vigna. On May 4, he updated his website to announce a new generation scheme, Xoshiro, and an accompanying paper, the product of two years of work. He posted a link to his work on Reddit (archived here and here), and although he got some praise and thanks for his work, he ended up spending quite a lot of time talking not about his new work, but about flaws in his old work and about my work.

+

Here is an example of the kind of remarks he had to contend with; Reddit user “TomatoCo” wrote:

+
+

I liked xoroshiro a lot until I read all of the dire condemnations of it, so I switched to PCG. I'm not a mathematician, I can't understand your papers and PCG's write ups are a lot easier to understand. I'm sure that you've analyzed the shit out of your previous generator and I can see on your site you've come up with new techniques to measure if xoshiro suffers the same flaws. But once bitten, twice shy. Xoroshiro was defended as great with the sole exception of the lowest bit. But then it was "the lowest bit is just a LSFR, so don't use that. Well, actually, the other low bits are also just really long period LSFRs, well, actually," and new flaws were constantly appearing. +Respectfully, I think you need to explain more and in simpler terms to earn everyone's trust back.

+

The reason I picked PCG was because its author could, in plain language, describe its behavior and why some authors witnessed patterns in your RNG.

+
+

I think it's quite understandable that Vigna would want to look for ways to take PCG (and me) down a peg or two, and in various comment replies he endeavored to express things he didn't like about PCG (and the PCG website).

+

Most of the issues he raised were, I thought, adequately addressed and refuted in the Reddit discussion, but having gone to the effort already to try to articulate the things he did not like, even writing code to do so, it makes sense that he would want to circulate these thoughts more broadly.

+

Reddit Reaction #2

+

Reddit's reaction to Vigna's new PCG-critique page was perhaps not what he hoped for. From what I can tell, pretty much none of the commenters were persuaded by his claims, and much was made of his tone.

+

Regarding tone, user “notfancy” said:

+
+

Take your feud somewhere else. […] theory and practice definitely belong here. The petty squabbling and the name calling definitely don't. Seeing that Vigna himself is posting links to his own site, this is to me self-promoting spam.

+
+

and user “foofel” added:

+
+

the style in which he presents his stuff is always full of hate and despise, that's not a good way to represent it and probably why people are fed up.

+
+

and user “evand” added:

+
+

I would describe a lot of it as written very... condescendingly. There's also a lot that is written to attack her and not PCG

+
+

and user “AntiauthoritarianNow” chimed in, saying;

+
+

Yeah, it's one thing to tease other researchers a little bit, but this guy has a real problem sticking to arguments on the merits rather than derailing into reddit-esque ad-hom.

+
+

But the thread also had plenty of rebuttals. For just about every claim Vigna had made in his critique, there was a comment explaining why the claim was flawed.

+

My Reaction

+

I could settle back into my chair here, and say, “Thank you, Reddit, for keeping your wits about you!”, but since (at the time of writing) Vigna's page remains live with the same claims, it seems sensible for me to create my own writeup (this one) to address his claims directly.

+

Moreover, I believe firmly that although it's never much fun to be on the receiving end of invective or personal attacks, in academia peer critique makes everything stronger. While much of what Vigna says about PCG doesn't hold up to closer scrutiny, it is worth trying to find value of some kind in every criticism. I believe in the approach taken in the world of improvisational comedy, known as “Yes, and…”, which suggests that a participant should accept what another participant has stated (“yes”) and then expand on that line of thinking (“and”).

+

Thus, in the subsequent sections, I'll look at each of Vigna's critiques, first give a defensive response, and then endeavor to find a way to say “Yes, and…” to each one.

+

Correlations Due to Contrived Seeding

+

Vigna's first two claims relate to creating two PCG generators whose outputs are correlated because he has specifically set them up to have internal states that would cause them to be correlated.

+

PCG ext Variants: Single Bit Change to the Extension Array

+

In the first claim, he modifies the code for PCG's extended generation scheme so that he can flip a single bit in the extension array that adds k-dimensional equidistribution to a base generator.

+

Vigna creates two pcg64_k32 generators that are the same in all respects except for a single bit difference in one element of the 32-element extension array, and then observes that 31 of every 32 outputs will remain identical between the generators for some considerable time. Vigna clearly considers this behavior to be problematic and notes multiple LFSR-based PRNGs where such behavior would not occur.

+

Vigna states

+
+

Said otherwise, the whole sequence of the generator is made by an enormous number of strongly correlated, very short sequences. And this makes the correlation tests fail.

+
+

Vigna concludes that no one should use generators like pcg64_k32 as a result.

+
Defensive Response
+

Vigna actually created a custom version of the PCG code to effect his single bit change. The pcg64_k32 generator has 2303 bits of state, 127 bits of LCG increment (which stays constant), 128 bits of LCG current state, and 32 64-bit words in the extension array. The odds of seeding two pcg64_k32 generators each with 2303 bits of seed entropy and finding that they only differ by a single bit in the extension array is 1 in 2^2292, an order of magnitude so vast that it cannot be represented as a floating point double.
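One way to arrive at that figure: the second seeding must match the first in every one of the 2303 bits except exactly one of the 32 × 64 = 2048 extension-array bits, and there are 2048 = 2^11 such states out of 2^2303, giving a probability of 2^11 / 2^2303 = 2^-2292.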

+

If the PRNG were properly initialized (e.g., using std::seed_seq or pcg_extras::seed_seq_from<std::random_device>), Vigna's observed correlation would not have occurred. Likewise, had the single bit change been in the LCG side of the PRNG, it would also not have occurred.

+

But what of Vigna's other claim, that PRNGs that are slow to diffuse single-bit changes to their internal state are necessarily bad? Vigna is right that for LFSR-based designs, the rate of bit diffusion (a.k.a. “avalanche”) matters a lot.

+

However, numerous perfectly good designs for PRNGs would fail Vigna's criteria. All counter-based designs (e.g., SplitMix, Random123, Chacha) will preserve the single bit difference indefinitely if we examine their internal state. In fact, Vigna's collaborator, David Blackman, is author of gjrand, which also includes a counter whose internal state won't diverge significantly over time. But of these designs, only SplitMix would fail a test that looks for output correlations rather than similar internal states.

+

The closest design to PCG's extension array is found in George Marsaglia's venerable XorWow PRNG, shown below (code taken from the Wikipedia page):

+
/* The state array must be initialized to not be all zero in the first four 
+   words */
+uint32_t xorwow(uint32_t state[static 5])
+{
+    /* Algorithm "xorwow" from p. 5 of Marsaglia, "Xorshift RNGs" */
+    uint32_t s, t = state[3];
+    t ^= t >> 2;
+    t ^= t << 1;
+    state[3] = state[2]; state[2] = state[1]; state[1] = s = state[0];
+    t ^= s;
+    t ^= s << 4;
+    state[0] = t;
+    return t + (state[4] += 362437);
+}
+
+ +

In Marsaglia's design, state[4] is a counter in much the same way that PCG's extension array is a “funky counter”. Marsaglia calls this counter a Weyl sequence after Hermann Weyl, who proved the equidistribution theorem in 1916.

+

We can exactly reproduce Vigna's claims about pcg64_k32 producing similar output with XorWow. The program uncxorwow.c is a port of his demonstration program to XorWow. It fails if tested with PractRand, and, if we uncomment the printf statements, after 1 billion iterations we see that the outputs have not become uncorrelated. They continue to differ only in their high bit. And they will continue that way forever:

+
61b0be0f
+e1b0be0f
+c5a003d8
+45a003d8
+20e14479
+a0e14479
+5a5ebe42
+da5ebe42
+99ce85af
+19ce85af
+d2a1aabb
+52a1aabb
+6bf29670
+ebf29670
+948587d6
+148587d6
+e2c0f91c
+62c0f91c
+536fe7eb
+d36fe7eb
+
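Here is a minimal, hypothetical driver (not uncxorwow.c or its source) that reproduces the effect with the xorwow() code quoted above: the two instances share the xorshift words and differ only in the top bit of the Weyl counter, so every pair of outputs differs only in the top bit.

#include <stdint.h>
#include <stdio.h>

/* Same step function as the Wikipedia code quoted above. */
static uint32_t xorwow(uint32_t state[static 5])
{
    uint32_t s, t = state[3];
    t ^= t >> 2;
    t ^= t << 1;
    state[3] = state[2]; state[2] = state[1]; state[1] = s = state[0];
    t ^= s;
    t ^= s << 4;
    state[0] = t;
    return t + (state[4] += 362437);
}

int main(void)
{
    /* Arbitrary illustrative seeds: identical xorshift state, Weyl counters
       differing only in the top bit. */
    uint32_t a[5] = { 0x12345678, 0x9abcdef0, 0x0fedcba9, 0x87654321, 0x00000000 };
    uint32_t b[5] = { 0x12345678, 0x9abcdef0, 0x0fedcba9, 0x87654321, 0x80000000 };

    for (int i = 0; i < 10; i++) {
        /* Both counters advance by the same constant, so the two outputs
           always differ by exactly 0x80000000, i.e. in the high bit only. */
        printf("%08x\n%08x\n", xorwow(a), xorwow(b));
    }
    return 0;
}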
+ +

Similarly, Vigna's complaint about "strongly correlated very short sequences" could likewise be applied to XorWow. It consists of 2^64 very similar sequences (differing only by a constant). It might seem bad at a glance to concatenate a number of very similar sequences but it is worth realizing that the nearest similar sequence is 2^128 − 1 steps away. If Vigna would characterize 2^128 − 1 as "very short", he must be using a mathematician's sense of scale.

+

Marsaglia's design of Xorwow quite deliberately uses a very simple and weak generator (a Weyl sequence) for a specific purpose. We could say “a counter isn't a very good random number generator”, but the key idea is that it doesn't need to be. It's not the whole story. It's a piece with a specific role to play, and it doesn't need to be any better than it is.

+

PCG's extended generation scheme is a similar story. The extension array is a funky counter akin to a Weyl sequence (each array element is like a digit of a counter). It's slightly better than a Weyl sequence (a single bit change will quickly affect all the bits in that array element), but it is essentially the same idea.

+

The pcg64_k32_oneseq and pcg64_k32_fast generators follow XorWow's scheme of just joining together the similar sequences. pcg64_k32 swaps around chunks of size 2^16 from each similar sequence. In all cases, from any starting point you would need 2^128 outputs before the base linear congruential generator lined up to the same place again, and vastly more for the extension array to line up similarly. In short, for pcg64_k32 the correlated states are quite literally unimaginably far away from each other.

+

Talking about his contrived seedings, Vigna notes that, “This is all the decorrelation we get after a billion iterations, and it will not improve (not significantly before the thermodynamical death of the universe).” What he seems to have missed is the corollary to his statements—correlation and decorrelation are sides of the same coin. Two currently uncorrelated pcg64_k32 states will not correlate before the heat death of the universe either.

+

In short, Vigna contrived a seed to show correlation that would never arise in practice with normal seeding, nor could arise by advancing one generator. His critique is not unique to PCG, and should not be a concern for users of PCG.

+
“Yes, and…” Response
+

A rather flippant “Yes, and…” response is that I'm perfectly happy for people to avoid pcg64_k32, as I'm not at all sure it is buying you anything meaningful over and above pcg64— it's a fair amount of added code complexity for something of dubious value. In fact, I didn't even bother to implement it in the C version and only a small number of people who have ported PCG have implemented it. As I see it, k-dimensional equidistribution sounds like a cool property, but the only use case I've found for such a property is performing party tricks. But some people do like k-dimensional equidistribution, so let's press on…

+

First, Vigna went to far too much trouble to create correlated states. He copied the entire C++ source for PCG and hacked it to make a private data member public so he could set a single bit. Had he been more familiar with the features the extended generators provide, he could instead have written.

+
pcg64_k32 rng0;
+pcg64_k32 rng1 = rng0;
+rng1.set(rng0() ^ 1);
+
+

This code uses pcg64_k32's party-trick functionality to leap unimaginably huge distances across the state space to find exactly the correlated generator you want, one that is the same in every respect except for one differing output.

+

In other words, what he sees as a deficiency, I've already highlighted as a feature.

+

But whether it is achieved by the simple method above, or the more convoluted method Vigna used, we have the question of what to do if people are allowed to create very correlated generator states that would not normally arise in practice. One option is to just say “don't do that”, but a more “Yes, and…” perspective would be to allow people to create such states if they choose but provide a means to detect them. More on that in the next section.

+

It's also worth asking whether the slowness with which a single bit change diffuses across the extension array is something inherent in the design of PCG's extended generation scheme, or mere happenstance. In fact, it is the latter.

+

The only cleverness in the extended generation scheme isn't the idea of combining two generators, a strong one and a weaker-but-k-dimensionally-equidistributed one, it's the fact that we can do so without any extra state to keep track of what we're doing.

+

I'm thus not wedded to the particular Weyl-sequence inspired method I used. If it's important that unimaginably distant similar generators do not stay correlated for long, that's a very easy feature to provide.

+

When I designed how the extension array advances, I made a choice to make it “no better than it needs to be”. It doesn't need good avalanche properties, so that wasn't a design concern. But that doesn't mean it couldn't be tweaked to have good avalanche properties, so that a single bit change affects all the bits the next time the extension array advances. In fact, having designed seed_seq_fe for randutils, I'm aware of elegant and amply efficient ways to have better avalanche, so why not?

+

It may not really be necessary, but I actually like this idea. So thanks, Sebastiano, I'll address this issue in a future update to PCG that provides some alternative schemes for updating the extension array!

+

PCG Regular Variants: Contrived seeds for Inter-Stream Correlations

+

In his next concern, Vigna makes correlated generators from two "random looking" seeds. He presents a program, corrpcg.c, that mixes together the two correlated generators and can then be fed into statistical tests (which will fail because of the correlation).

+
Defensive Response
+

We can devise bad seed pairs for just about any PRNG. Here are three example programs, corrxoshiro.c, corrsplitmix.c, and corrxorwow.c, which initialize generators with two “random looking” seeds but create correlated streams that will fail statistical tests if mixed.

+

In all cases, despite being “random looking”, the seeds are carefully contrived. Seeds such as these would be vanishingly unlikely with proper seeding practice.

+

As before, the concerns Vigna expresses apply to many prior generators. We can view XorWow's state[4] value as being a stream selection constant, but this time let's focus in on SplitMix. For SplitMix, different gamma_ values constitute different streams.

+

In corrsplitmix.c the implementation is hard-wired to use a single stream (0x9e3779b97f4a7c15), but in corrsplitmix2.c we mix two streams (0x9e3779b97f4a7c15 and 0xdf67d33dd518d407) and observe correlations. Although these gamma values look random, they are not; they are carefully contrived. In particular, here 0xdf67d33dd518d407 * 3 = 0x9e3779b97f4a7c15 (in 64-bit arithmetic), which means that every third output from the second stream will exactly match an output from the first.
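The mechanism is easy to demonstrate with a minimal sketch (this is not corrsplitmix2.c itself, and it assumes, for simplicity, that both streams start from the same internal state; mix() is the standard SplitMix64 finalizer): after three steps with gamma g2 the second stream's state has advanced by 3 * g2 = g1, exactly one step of the first stream, so the outputs coincide.

#include <stdint.h>
#include <stdio.h>

/* Standard SplitMix64 output finalizer. */
static uint64_t mix(uint64_t z) {
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

int main(void) {
    const uint64_t g1 = 0x9e3779b97f4a7c15ULL;  /* gamma of stream 1        */
    const uint64_t g2 = 0xdf67d33dd518d407ULL;  /* 3 * g2 == g1 (mod 2^64)  */
    uint64_t s1 = 0, s2 = 0;   /* same starting state (an assumption made for illustration) */

    for (int i = 0; i < 5; i++) {
        s1 += g1;                       /* one step of stream 1     */
        s2 += g2; s2 += g2; s2 += g2;   /* three steps of stream 2  */
        printf("%016llx %016llx\n",     /* the two columns match    */
               (unsigned long long) mix(s1), (unsigned long long) mix(s2));
    }
    return 0;
}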

+

Vigna's critique thus applies at least as strongly to SplitMix's streams as it does to PCG's.

+

I have written at length about PCG's streams (and discussed SplitMix's, too). I freely acknowledge that these streams exist in a space of trade-offs where we are choosing to do the cheap thing, leveraging the properties of the underlying LCG (or Weyl sequence for SplitMix). In that article, I say:

+
+

Changing the increment parameter is just barely enough for streams that are actually useful. They aren't statistically independent, far from it, but they are distinct and they do help.

+
+

No one should worry that PCG's streams make anything worse.

+
“Yes, and…” Response
+

Although it is vanishingly unlikely that two randomly seeded pcg64 generators would be correlated (it would only happen with poor/adversarial seeding), it is reasonable to ask if this kind of correlation due to bad seeding can be detected.

+

We can even argue that another checklist feature for a general-purpose PRNG is the ability to tell how independent the sequences from two seeds are likely to be. PCG goes some way towards this goal with its - operator that calculates the distance between two generators, but the functionality was originally designed for generators on the same stream. I've now updated that functionality so that for generators on different streams, it will calculate the distance to their point of closest approach (i.e., where the differences between successive values of the underlying LCG align).

+

So it's now possible with PCG to compare two generators to see whether they have been badly seeded so that they correlate.

+

Here's a short test program:

+
#include "pcg_random.hpp"
+#include "pcg_extras.hpp"
+
+#include <iostream>
+#include <iomanip>
+#include <random>
+
+int main() {
+    using namespace pcg_extras;
+
+#if USE_VIGNA_CONTRIVED_SEEDS
+    pcg64 x(PCG_128BIT_CONSTANT(0x83EED115C9CBCC30, 0x4C55E45838B75647),
+            PCG_128BIT_CONSTANT(0x3E0897751B1A19E7, 0xD9D50DD3E3A454DC));
+    pcg64 y(PCG_128BIT_CONSTANT(0x7C112EEA363433CF, 0xB3AA1BA7C748A9B9),
+            PCG_128BIT_CONSTANT(0x41F7688AE4E5E618, 0x262AF22C1C5BAB23));
+#elif USE_PCG_UNIQUE
+    pcg64_unique x,y;
+#elif USE_SMALL_SEEDS1
+    pcg64 x(0), y(1);
+#elif USE_SMALL_SEEDS2
+    pcg64 x(0,0), y(0,1);
+#elif USE_SMALL_SEEDS3
+    pcg64 x(0,0), y(1,1);
+#elif USE_RANDOM_DEVICE
+    pcg64 x(seed_seq_from<std::random_device>{}), 
+        y(seed_seq_from<std::random_device>{});
+#endif
+
+    std::cout << std::hex;
+    for (int i = 0; i < 10; ++i) {
+        std::cout << (x - y) << ": ";
+        std::cout << x() << ", " << y() << "\n";
+    }
+}
+
+ +

And here are the results of running it (in each case, each line shows the distance between the streams and a value from each PRNG; the distance stays the same because the PRNGs are advancing together):

+
unix% c++ -Wall -std=c++11 -o strmdist strmdist.cpp -Iinclude -DUSE_RANDOM_DEVICE && ./strmdist
+a571d615b08fea47c84f39f0811f04f: c021049beac5efd0, ceaa3596f168e8b6
+a571d615b08fea47c84f39f0811f04f: 573371998db59a67, e5d84a00b37c3556
+a571d615b08fea47c84f39f0811f04f: bc4246c671ef9a1f, 1b13ad2f224707c7
+a571d615b08fea47c84f39f0811f04f: b1f3e4ffcfef569, 11b50b226a67cdbe
+a571d615b08fea47c84f39f0811f04f: 8a378ec693dc1e4, 903ccfd4dc769389
+a571d615b08fea47c84f39f0811f04f: 4799de5c580be6ab, 22d13ce52d83c9cb
+a571d615b08fea47c84f39f0811f04f: e8fdf041a93626e8, f24c8f49866b7b4e
+a571d615b08fea47c84f39f0811f04f: f29e3d08104d7630, b37e5b58ae91d45c
+a571d615b08fea47c84f39f0811f04f: 28f524ad8f57bedb, 52d41d39b1186616
+a571d615b08fea47c84f39f0811f04f: 9be8cb37ea8952b5, e6812ed8f0613d3
+
+unix% c++ -Wall -std=c++11 -o strmdist strmdist.cpp -Iinclude -DUSE_RANDOM_DEVICE && ./strmdist
+25c3990ef6e7766ab543435aa25f4326: 2f76ab68249fd7f5, 4fbfc0ce19119391
+25c3990ef6e7766ab543435aa25f4326: 933845d6c7ad9396, 7572dae64b2cc5a
+25c3990ef6e7766ab543435aa25f4326: d7d1dc18bae0604a, 5b1f8310e1f0dc8a
+25c3990ef6e7766ab543435aa25f4326: 85cd1dcff8830ad5, a1cfea3c01314c8d
+25c3990ef6e7766ab543435aa25f4326: 543ba46266a0b6ba, 7217b15c05cba254
+25c3990ef6e7766ab543435aa25f4326: 5a3bd5d4d6c49a55, a243af7df5cfe287
+25c3990ef6e7766ab543435aa25f4326: 9f2dc30afc3dcead, deaa9d03f7ca1117
+25c3990ef6e7766ab543435aa25f4326: 5856b884c1298dc9, 67502e4490b77bae
+25c3990ef6e7766ab543435aa25f4326: 9b94ebb084cc6fdd, 2e07957697add77c
+25c3990ef6e7766ab543435aa25f4326: efe6b451c262a3fb, 2e94d782daae964d
+
+unix% c++ -Wall -std=c++11 -o strmdist strmdist.cpp -Iinclude -DUSE_RANDOM_DEVICE && ./strmdist
+32982840d1ddcb5e7f1ed57a6d496525: 96ed26957ef938db, 568fe0aa7e9e8a26
+32982840d1ddcb5e7f1ed57a6d496525: 33270d80d24b0965, 44e42e1afc4db710
+32982840d1ddcb5e7f1ed57a6d496525: 6de9ac5272dd1193, 90696d1c4f52e71d
+32982840d1ddcb5e7f1ed57a6d496525: 43c5c899c7123e57, 337b9d25e00fb0de
+32982840d1ddcb5e7f1ed57a6d496525: 753954b73076704d, f4fce4c33756df7e
+32982840d1ddcb5e7f1ed57a6d496525: 3b5dc9402b56584d, fd7ae3c708355dc0
+32982840d1ddcb5e7f1ed57a6d496525: 15a9227305a442d8, 78fa04eb7f881590
+32982840d1ddcb5e7f1ed57a6d496525: b9e58872c3a299, 381a8f851acbc5f4
+32982840d1ddcb5e7f1ed57a6d496525: 1b624879e6cf5128, aa908d3a4f2d8f02
+32982840d1ddcb5e7f1ed57a6d496525: 79d4836bb5a56a77, 1650f74b3ef617f9
+
+unix% c++ -Wall -std=c++11 -o strmdist strmdist.cpp -Iinclude -DUSE_SMALL_SEEDS1 && ./strmdist
+1c31b969dc65d7b0df636de659042bb1: 1070196e695f8f1, e175e32ed3507bfa
+1c31b969dc65d7b0df636de659042bb1: 703ec840c59f4493, c0bf922a0b283109
+1c31b969dc65d7b0df636de659042bb1: e54954914b3a44fa, 140bfa21e68785bb
+1c31b969dc65d7b0df636de659042bb1: 96130ff204b9285e, c5ec8bcc4fe35830
+1c31b969dc65d7b0df636de659042bb1: 7d9fdef535ceb21a, 4dd8ed1ca22869c5
+1c31b969dc65d7b0df636de659042bb1: 666feed42e1219a0, c9bffa29c802ef4c
+1c31b969dc65d7b0df636de659042bb1: 981f685721c8326f, 3aa09aa4e147478b
+1c31b969dc65d7b0df636de659042bb1: ad80710d6eab4dda, 1dfdf6222d06378c
+1c31b969dc65d7b0df636de659042bb1: e202c480b037a029, 5a05dacf4df61d4e
+1c31b969dc65d7b0df636de659042bb1: 5d3390eaedd907e2, 489650b1eb840a26
+
+unix% c++ -Wall -std=c++11 -o strmdist strmdist.cpp -Iinclude -DUSE_SMALL_SEEDS2 && ./strmdist
+151361a7e7368c239a3988178df4d76d: d4feb4e5a4bcfe09, acdbf879b3c73375
+151361a7e7368c239a3988178df4d76d: e85a7fe071b026e6, 7ea754d074e8d88f
+151361a7e7368c239a3988178df4d76d: 3a5b9037fe928c11, f8fc7aec8ae6245a
+151361a7e7368c239a3988178df4d76d: 7b044380d100f216, 7d2ebc3c0b5bedb4
+151361a7e7368c239a3988178df4d76d: 1c7850a6b6d83e6a, cbaf666f55051666
+151361a7e7368c239a3988178df4d76d: 240b82fcc04f0926, 4eba9f04dfb9903b
+151361a7e7368c239a3988178df4d76d: 7e43df85bf9fba26, 4fab6bcf361bd63d
+151361a7e7368c239a3988178df4d76d: 43adf3380b1fe129, 257fcac1ed3817df
+151361a7e7368c239a3988178df4d76d: 3f0fb307287219c, bf6f5515988a494
+151361a7e7368c239a3988178df4d76d: 781f4b84f42a2df, 1081ed38c84c1c9d
+
+unix% c++ -Wall -std=c++11 -o strmdist strmdist.cpp -Iinclude -DUSE_SMALL_SEEDS3 && ./strmdist
+edfe668df810de6e58b8e92e878fefa: d4feb4e5a4bcfe09, d4692f845d3a3706
+edfe668df810de6e58b8e92e878fefa: e85a7fe071b026e6, bb0f09b0eebab6ff
+edfe668df810de6e58b8e92e878fefa: 3a5b9037fe928c11, e26ac904ad283c09
+edfe668df810de6e58b8e92e878fefa: 7b044380d100f216, 83860212b5d92197
+edfe668df810de6e58b8e92e878fefa: 1c7850a6b6d83e6a, 1c3601ed5afd3f49
+edfe668df810de6e58b8e92e878fefa: 240b82fcc04f0926, 5e4fa027be29b47e
+edfe668df810de6e58b8e92e878fefa: 7e43df85bf9fba26, b930e28d59383019
+edfe668df810de6e58b8e92e878fefa: 43adf3380b1fe129, e0d61e1b074df835
+edfe668df810de6e58b8e92e878fefa: 3f0fb307287219c, f42c38b1aca3ac9d
+edfe668df810de6e58b8e92e878fefa: 781f4b84f42a2df, 19e9cc4fa58fd0ad
+
+unix% c++ -Wall -std=c++11 -o strmdist strmdist.cpp -Iinclude -DUSE_PCG_UNIQUE && ./strmdist
+534a7c98f86b50b72fad6990038ba18: af8a07de4c8d67d1, d649257470c0180d
+534a7c98f86b50b72fad6990038ba18: 3789d12fe8e452b1, 1017152e85f732fc
+534a7c98f86b50b72fad6990038ba18: c3c4e780fd60901b, 91a9d78551f0c776
+534a7c98f86b50b72fad6990038ba18: e7257e02f7fa5b40, 46fb62417ebf2f13
+534a7c98f86b50b72fad6990038ba18: 3697948fa9aa8378, 60e44721c6fbc9d0
+534a7c98f86b50b72fad6990038ba18: 7bdbcc91de7efbcf, 21de9d1dc03e2ca6
+534a7c98f86b50b72fad6990038ba18: 9cf598a61c9ad958, 62e8c3dc421f4e58
+534a7c98f86b50b72fad6990038ba18: 5c8a6da6c91b7d35, 3cb08b7e59fd655a
+534a7c98f86b50b72fad6990038ba18: f55a8b190a85c9c0, 5a71766fac52ec8a
+534a7c98f86b50b72fad6990038ba18: 906b1a30904fe59, f71525dc1d91a06e
+
+unix% c++ -Wall -std=c++11 -o strmdist strmdist.cpp -Iinclude -DUSE_PCG_UNIQUE && ./strmdist
+1b7a9a85b5ed2b6a2a92da9e093eba18: a11d6aa92efc9a79, e646943445e368a
+1b7a9a85b5ed2b6a2a92da9e093eba18: 35026a6e1a195a29, 906b9bed756e1667
+1b7a9a85b5ed2b6a2a92da9e093eba18: af1f1193515d9e7b, fe51967d5d532f70
+1b7a9a85b5ed2b6a2a92da9e093eba18: 61baa5620ceeff38, 644345c453ee3b11
+1b7a9a85b5ed2b6a2a92da9e093eba18: 71e88c9c27a7abbf, 1b6a254f565f6c70
+1b7a9a85b5ed2b6a2a92da9e093eba18: 1125753cd420e3c1, 8be4065858e93c57
+1b7a9a85b5ed2b6a2a92da9e093eba18: a53ce57ffaa57eb3, 7f1c546ae9bf7b61
+1b7a9a85b5ed2b6a2a92da9e093eba18: 4cf2c7c152326c4, ada2d31650f07ef8
+1b7a9a85b5ed2b6a2a92da9e093eba18: b731cbec3bfba773, 92ce80f0c8dc855f
+1b7a9a85b5ed2b6a2a92da9e093eba18: b8c449d4872f7971, 44ed4207442550da
+
+unix% c++ -Wall -std=c++11 -o strmdist strmdist.cpp -Iinclude -DUSE_PCG_UNIQUE && ./strmdist
+360981a27aee6d34271feaa80270ba18: 5da8c0afa4330059, 67af26ab1d05ed52
+360981a27aee6d34271feaa80270ba18: ef0ef074871cc9a0, cda2688372cb72b7
+360981a27aee6d34271feaa80270ba18: 6a15c49d4ae8d89d, 3708ddd964f616fe
+360981a27aee6d34271feaa80270ba18: dd8f24112bcbf580, 69309c3ffa6cea2e
+360981a27aee6d34271feaa80270ba18: e8f252a4132fd0e3, e3ff9751773f6db
+360981a27aee6d34271feaa80270ba18: e23a1246ea5980be, 1161fd499cbecafa
+360981a27aee6d34271feaa80270ba18: 1d19a64904134065, a9e31a01b4c51a43
+360981a27aee6d34271feaa80270ba18: 2c3166d304f9dedf, fdd3f540a6859c19
+360981a27aee6d34271feaa80270ba18: 8f73778d1f6133ea, 13a54957b3c65205
+360981a27aee6d34271feaa80270ba18: c8d362ba3d62239, 66db0b2ae6908dc8
+
+unix% c++ -Wall -std=c++11 -o strmdist strmdist.cpp -Iinclude -DUSE_PCG_UNIQUE && ./strmdist
+1266069359d4404d4fe77f291da43a18: 9994872b3cc3104c, 5582722b3f354f4b
+1266069359d4404d4fe77f291da43a18: cec9ae92f2f0a929, 7a2d534e7c3a7281
+1266069359d4404d4fe77f291da43a18: ce777879518e6169, c384bb65c1d4364b
+1266069359d4404d4fe77f291da43a18: 2cb082454d09aa19, 703c5ad7747a9b42
+1266069359d4404d4fe77f291da43a18: a581d3154c60654, b4b9369d997cda6e
+1266069359d4404d4fe77f291da43a18: 5ba66e3d99cd33c9, 80aa887fbb5fdef3
+1266069359d4404d4fe77f291da43a18: 1038e3281dcae11d, 54c304cf2a66182c
+1266069359d4404d4fe77f291da43a18: 9df3df9d27af7148, 7ddd385e114299b9
+1266069359d4404d4fe77f291da43a18: bf1656198867bd08, 7aeae9ba84a17dbe
+1266069359d4404d4fe77f291da43a18: 60aef1418aa1c6f1, 8a7196feda932f06
+
+unix% c++ -Wall -std=c++11 -o strmdist strmdist.cpp -Iinclude -DUSE_VIGNA_CONTRIVED_SEEDS && ./strmdist
+0: e1e4e4b44cca9ade, 43dc3c9c96899953
+0: a3ef563648055140, 2b8a051f7ab1b24
+0: 7aa3dc341221459a, 1a0960a2cd3d51ee
+0: cfa0d055fbe9f476, a0abf5d3e8ed9f41
+0: b69403f2c93f3fce, 807e58a7e7f9d6d2
+0: a2550ed76e8d9ae, 144aa1daedd1b35e
+0: a1f898a64347533b, c532263a99dd0fc4
+0: d483377a20c295f0, bbd10614af86a019
+0: 5c6469b1053d2ce1, 9c2b8c8d2e20a7a5
+0: 5f91b4bd64d5eeb1, 58afc8da4eb26af7
+
+ +

As we can see in the last example, Vigna contrived seeds that had the streams exactly aligned. The values from each stream are distinct in this case, but a statistical test will see that they are correlated.

+

Interpreting the distance value is easy in this case, but not every user will be able to do so, and some distances (e.g., just a single high bit set) would also be bad, so better detection of contrived seeds probably demands a new function, independence_score(), based on this distance metric.

+

Beyond these functions, there is also the question of whether it is wise to allow users to seed generators where they can specify the entire internal state. Vigna's generators (and basically all LFSR-based generators) must avoid the all-zeros state and do not like states with low Hamming weight (so { seed, 0, 0, 0 } is also a poor choice). With these issues in mind, perhaps we should deny users the ability to seed the entire state. That might prevent some contrived seedings like the one Vigna used. I'm not fully sold on this idea, but it is a widely used approach used by other generators (e.g., Blackman's gjrand) and worth considering.

+

Although Vigna's contrived seeding was a bit silly, his example has helped me improve the PCG distance metric, given us another checklist feature that some people might want (detecting bad seed pairs), got me thinking about future features, and returned me to the topic of good seeding. All in all, we can call this a positive contribution. Thanks, Sebastiano!

+

Prediction Difficulty

+

The next two sections relate to predicting PCG.

+

Predicting pcg32_oneseq +

+

Vigna writes:

+
+

To me, it has been always evident that PCG generators are very easy to predict. Of course, no security expert ever tried to to that: it would be like beating 5-year-old kid on a race. It would be embarrassing.

+

So we had this weird chicken-and-egg situation: nobody who could easily predict a PCG generator would write the code, because they are too easy to predict; but since nobody was predicting a PCG generator, Melissa O'Neill kept on the absurd claim that they were challenging to predict.

+
+

Vigna then goes on to show code to predict pcg32_oneseq, a 64-bit PRNG with 32-bit output.

+
Defensive Response
+

As one reddit observer wrote:

+
+

[Vigna's] program needs to totally brute force half of the state, and then some additional overhead to brute force bits of the rest of the state, so runtime is 2^(n/2), exponential, not polynomial.

+
+

Vigna has written an exponential algorithm to brute force 32 bits of state. I hope it was obvious to almost everyone that I never claimed that brute-forcing 32-bits of state was hard. In fact, I have already outlined how to predict pcg32 (more bits to figure out given the unknown stream). I observed that pcg32 is predictable using established techniques (specifically the LLL algorithm), and I have even linked to an implementation of those ideas by Marius Lombard-Platet.

+

I characterize pcg32_oneseq as easy to brute force, and pcg32 as annoying (as Marius Lombard-Platet discovered). Only when we get to pcg64 do we have something where there is a meaningful challenge.

+

If Vigna really believes that all members of the PCG family are easy to predict, he should have predicted pcg64 or pcg64_c32.

+
“Yes, and…” Response
+

The best part of Vigna's critique is these lines:

+
+

Writing the function that performs the prediction, recover(), took maybe half an hour of effort. It's a couple of loops, a couple of if's and a few logical operations. Less than 10 lines of code (of course this can be improved, made faster, etc.).

+
+

and the source code comment that reads:

+
+

Pass an initial state (in decimal or hexadecimal), see it recovered from the output in a few seconds.

+
+

So, here Vigna is essentially endorsing all the practical aspects I've previously noted regarding trivial predictability. Specifically, he's noting that with little time or effort, he can write a simple program that quickly predicts a PRNG and has actually done so. This is very different from taking a purely theoretical perspective (e.g., noting that techniques exist to solve a problem in polynomial time without ever implementing them).

+

In other words, clearly ease of prediction matters to Vigna. So we both agree pcg32_oneseq is easy to predict.

+

Now let's keep that characterization of easiness and move on to some of the other generators.

+

Vigna and I would agree, I think, that I lack the necessary insight to develop fast prediction methods for pcg64 or pcg64_c32 (it's an instance of Schneier's Law). Vigna is also right that, if it is tractable to predict, those who might have the necessary skill lack much incentive to try. For some years I have been intending to have a prediction contest with real prizes and I remain hopeful that I'll find the time to run such a contest this summer. When the contest finally launches, I hope he'll have a go—I'd be delighted to send him a prize.

+

Predicting pcg64_once_insecure +

+

Vigna also notes that he can invert the bijection that serves as the output function for pcg64_once_insecure, which reveals the underlying LCG with all its statistical flaws.

+
Defensive Response
+

I noted this exact issue in 2014 in the PCG paper. It's why pcg64_once_insecure has the name it does. I discourage its use as a general-purpose PRNG precisely because of its invertible output function.

+
“Yes, and…” Response
+

Vigna is at least acknowledging that some people might care about this property.

+

Speed and Comparison against LCGs

+

Finally, Vigna develops a PCG variant using a traditional integer hash function based on MurmurHash (I would call it PCG XS M XS M XS). He claims it is faster than the PCG variants I recommend and notes that he doesn't consider PCG especially fast.

+

Defensive Response

+

I considered this exact idea in the 2014 PCG paper. In my tests, I found that a variant using a very similar general integer hash function was not as fast as the PCG permutations I used.

+

Testing is a finicky business.

+

“Yes, and…” Response

+

I absolutely agree with Vigna's claim that people should run their own speed tests.

+

I also realized long ago that PCG probably won't have the speed crown, because it can't. A simple truncated 128-bit LCG passes all standard statistical tests once we get up to 128 bits, and beats everything, including Vigna's generators. Because pcg64 is built from a 128-bit LCG, it can never beat it in speed.

+

I should write a blog post on speed testing. But here's a taste. We'll use Vigna's hamming-weight test as our benchmark, because it is a real program that puts randomness to actual use but is coded with execution speed in mind.

+

First, let's test the Mersenne Twister. Compiling with Clang, we get

+
processed 1.75e+11 bytes in 130 seconds (1.346 GB/s, 4.847 TB/h). Fri May 25 14:03:25 2018
+
+ +

whereas compiling with GCC, we get

+
processed 1.75e+11 bytes in 73 seconds (2.397 GB/s, 8.631 TB/h). Fri May 25 14:05:44 2018
+
+ +

With GCC, it runs almost twice as fast.

+

Now let's contrast that result with this 128-bit MCG:

+
static uint128_t state = 1;   // can be seeded to any odd number
+
+static inline uint64_t next()
+{
+    constexpr uint128_t MULTIPLIER =
+        (uint128_t(0x0fc94e3bf4e9ab32ULL) << 64) |  0x866458cd56f5e605ULL;
+            // Spectral test: M8 = 0.71005, M16 = 0.66094, M24 = 0.61455
+    state *= MULTIPLIER;
+    return state >> 64;
+}
+
+ +

Compiling with Clang, we get

+
processed 1.75e+11 bytes in 39 seconds (4.488 GB/s, 16.16 TB/h). Fri May 25 14:16:25 2018
+
+ +

whereas with GCC we get

+
processed 1.75e+11 bytes in 58 seconds (3.017 GB/s, 10.86 TB/h). Fri May 25 14:18:14 2018
+
+ +

The GCC code is no slouch, but Clang's code here is much faster. Clang is apparently better at 128-bit math.

+

If we really care about speed though, this 128-bit MCG (which uses a carefully chosen 64-bit multiplier instead of a more typical 128-bit multiplier) is even faster and still passes statistical tests:

+
static uint128_t state = 1;   // can be seeded to any odd number
+
+static inline uint64_t next()
+{
+    return (state *= 0xda942042e4dd58b5ULL) >> 64;
+}
+
+ +

Compiling with Clang, we get

+
processed 1.75e+11 bytes in 37 seconds (4.73 GB/s, 17.03 TB/h). Fri May 25 14:09:26 2018
+
+ +

whereas with GCC we get

+
processed 1.75e+11 bytes in 44 seconds (3.978 GB/s, 14.32 TB/h). Fri May 25 14:11:40 2018
+
+ +

Again, Clang takes the speed crown; its executable generates and checks 1 TB of randomness about every 3.5 minutes.

+

If we test Vigna's latest generator, xoshiro256**, and compile with Clang, it gives us

+
processed 1.75e+11 bytes in 50 seconds (3.5 GB/s, 12.6 TB/h). Fri May 25 14:30:05 2018
+
+ +

whereas with GCC we get

+
processed 1.75e+11 bytes in 43 seconds (4.07 GB/s, 14.65 TB/h). Fri May 25 14:31:52 2018
+
+ +

This result is very fast, but not faster than either 128-bit MCG.

+

Finally, let's look at PCG-style generators. First let's look at Vigna's proposed variant. Compiling with Clang, we get

+
processed 1.75e+11 bytes in 59 seconds (2.966 GB/s, 10.68 TB/h). Fri May 25 14:44:37 2018
+
+ +

and with GCC we get

+
processed 1.75e+11 bytes in 62 seconds (2.823 GB/s, 10.16 TB/h). Fri May 25 14:46:42 2018
+
+ +

This is one of the rare occasions where GCC and Clang actually turn in almost equivalent times.

+

In contrast, with the general-purpose pcg64 generator, compiling with Clang I see:

+
processed 1.75e+11 bytes in 57 seconds (3.07 GB/s, 11.05 TB/h). Fri May 25 14:57:02 2018
+
+ +

whereas with GCC, I see

+
processed 1.75e+11 bytes in 64 seconds (2.735 GB/s, 9.844 TB/h). Fri May 25 14:59:07 2018
+
+ +

Thus, depending on which compiler we choose, Vigna's variant is either slightly faster or slightly slower.

+

Finally, if we look at pcg64_fast, compiling with Clang gives us

+
processed 1.75e+11 bytes in 49 seconds (3.572 GB/s, 12.86 TB/h). Fri May 25 15:00:45 2018
+
+ +

and with GCC we get

+
processed 1.75e+11 bytes in 65 seconds (2.693 GB/s, 9.693 TB/h). Fri May 25 15:02:15 2018
+
+ +

Again the performance of GCC is a bit disappointing; this MCG-based generator is actually running slower than the LCG-based one.

+

From this small amount of testing, we can see that pcg64 is not as fast as xoshiro256**, but a lot depends on the compiler you're using—if you're using Clang (which is the default compiler on OS X), pcg64_fast will beat xoshiro256**.

+

There's plenty of room for speed improvement in PCG. My original goal was to be faster than the Mersenne Twister, which it is, but knowing that it'll always be beaten by the speed of its underlying LCG I haven't put a lot of effort into optimizing the code. I could have used the faster multiplier that I used above, and there is actually a completely different way of handling the LCG increment that reduces dependences and enhances speed, but implementing LCGs that way makes the code more opaque. If PCG's speed is an issue, these design decisions are worth revisiting.

+

But the speed winner is clearly a 128-bit MCG. It's actually what I use when speed is the primary criterion.

+

Conclusion

+

None of Vigna's concerns raise any serious worries about PCG. But critique is useful, and helps spur us to do better.

+

I'm sure Vigna has spent far longer thinking about PCG than he would like, so it is best to say a big thank you to him for all the thought and energy he has expended in these efforts. I'm pleased that I've mostly been able to put the critique to good use—it may be mostly specious for users, but it is certainly helpful for me. Reddit mostly saw vitriol and condescension, but I prefer to see it as a gift of his time and thought.

+

Thanks, Sebastiano!

+
+
+
+
+ + + +
+
+ + + + + diff --git a/references/refs.txt b/references/refs.txt new file mode 100644 index 0000000..a9c2605 --- /dev/null +++ b/references/refs.txt @@ -0,0 +1,19 @@ +$ cat squiggle.c | grep http | sed 's|.*http|http|g' + +https://en.wikipedia.org/wiki/Xorshift +https://stackoverflow.com/questions/53886131/how-does-xorshift32-works +https://www.pcg-random.org/posts/on-vignas-pcg-critique.html +https://prng.di.unimi.it/ +https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform +https://stackoverflow.com/questions/20626994/how-to-calculate-the-inverse-of-the-normal-cumulative-distribution-function-in-p +https://www.wolframalpha.com/input?i=N%5BInverseCDF%28normal%280%2C1%29%2C+0.05%29%2C%7B%E2%88%9E%2C100%7D%5D +https://en.wikipedia.org/wiki/Normal_distribution?lang=en#Operations_on_a_single_normal_variable +https://dl.acm.org/doi/pdf/10.1145/358407.358414 +https://en.wikipedia.org/wiki/Gamma_distribution +https://dl.acm.org/doi/pdf/10.1145/358407.358414 +https://en.wikipedia.org/wiki/Gamma_distribution#Related_distributions +https://en.wikipedia.org/wiki/Beta_distribution?lang=en#Rule_of_succession + +$ cat squiggle_more.c | grep http | sed 's|.*http|http|g' +https://en.wikipedia.org/wiki/Quickselect +