In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] or (0, 1) in terms of two positive parameters, denoted by alpha (α) and beta (β), that appear as exponents of the variable and its complement to 1, respectively, and control the shape of the distribution.
+
The beta distribution has been applied to model the behavior of random variables limited to intervals of finite length in a wide variety of disciplines. The beta distribution is a suitable model for the random behavior of percentages and proportions.
+
The formulation of the beta distribution discussed here is also known as the beta distribution of the first kind, whereas beta distribution of the second kind is an alternative name for the beta prime distribution. The generalization to multiple variables is called a Dirichlet distribution.
+
Several authors, including N. L. Johnson and S. Kotz,[1] use the symbols p and q (instead of α and β) for the shape parameters of the beta distribution, reminiscent of the symbols traditionally used for the parameters of the Bernoulli distribution, because the beta distribution approaches the Bernoulli distribution in the limit when both shape parameters α and β approach the value of zero.
+
In the following, a random variable X beta-distributed with parameters α and β will be denoted by:[2][3]
X ~ Beta(α, β)
Other notations for beta-distributed random variables used in the statistical literature are [4] and .[5]
+
The beta distribution may also be reparameterized in terms of its mean μ (0 < μ < 1) and the sum of the two shape parameters ν = α + β > 0 ([3] p. 83). Denoting by α_Posterior and β_Posterior the shape parameters of the posterior beta distribution resulting from applying Bayes' theorem to a binomial likelihood function and a prior probability, the interpretation of the sum of both shape parameters as sample size, i.e. sample size = ν = α_Posterior + β_Posterior, is only correct for the Haldane prior probability Beta(0,0). Specifically, for the Bayes (uniform) prior Beta(1,1) the correct interpretation would be sample size = α_Posterior + β_Posterior − 2, or ν = (sample size) + 2. For sample sizes much larger than 2, the difference between these two priors becomes negligible. (See section Bayesian inference for further details.) ν = α + β is referred to as the "sample size" of a beta distribution, but one should remember that it is, strictly speaking, the "sample size" of a binomial likelihood function only when using a Haldane Beta(0,0) prior in Bayes' theorem.
+
This parametrization may be useful in Bayesian parameter estimation. For example, one may administer a test to a number of individuals. If it is assumed that each person's score (0 ≤ θ ≤ 1) is drawn from a population-level Beta distribution, then an important statistic is the mean of this population-level distribution. The mean and sample size parameters are related to the shape parameters α and β via[3]
+
+
α = μν, β = (1 − μ)ν
+
Under this parametrization, one may place an uninformative prior probability over the mean, and a vague prior probability (such as an exponential or gamma distribution) over the positive reals for the sample size, if they are independent, and prior data and/or beliefs justify it.
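The mean/sample-size parametrization can be sketched numerically. The following Python snippet (an illustration, not from the source; scipy is assumed, and the values of μ, ν and the data are arbitrary) maps (μ, ν) to (α, β) and performs the conjugate binomial update discussed above:

```python
# Sketch: mean/sample-size parametrization alpha = mu*nu, beta = (1 - mu)*nu.
# Illustrative values only.
from scipy.stats import beta

mu, nu = 0.3, 10.0
a, b = mu * nu, (1 - mu) * nu        # alpha = 3.0, beta = 7.0
rv = beta(a, b)
assert abs(rv.mean() - mu) < 1e-12   # the mean parameter is recovered exactly

# Conjugate update with a binomial likelihood: s successes in n trials turns
# a Beta(a, b) prior into a Beta(a + s, b + n - s) posterior, so nu grows by n.
s, n = 7, 20
nu_post = (a + s) + (b + (n - s))
assert nu_post == nu + n
```

With the Haldane prior Beta(0, 0) the posterior ν would equal n exactly, matching the "sample size" reading in the text; scipy cannot represent Beta(0, 0), so the snippet starts from a proper prior.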
+
Concave beta distributions, which have α, β > 1, can be parametrized in terms of mode and "concentration". The mode, ω = (α − 1)/(α + β − 2), and concentration, κ = α + β, can be used to define the usual shape parameters as follows:[6]

α = ω(κ − 2) + 1, β = (1 − ω)(κ − 2) + 1
For the mode, ω, to be well-defined, we need κ > 2, or equivalently α, β > 1. If instead we define the concentration as c = κ − 2, the condition simplifies to c > 0, and the beta density at α = cω + 1 and β = c(1 − ω) + 1 can be written as:

f(x; ω, c) ∝ x^(cω) (1 − x)^(c(1 − ω))
where c directly scales the sufficient statistics, ln(x) and ln(1 − x). Note also that in the limit c → 0 the distribution becomes flat.
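A small numeric sketch of this parametrization (illustrative Python; scipy assumed, and the values of the mode and concentration are arbitrary choices):

```python
# Sketch: mode/concentration parametrization for concave beta densities,
# omega = (alpha - 1)/(alpha + beta - 2), kappa = alpha + beta > 2.
import numpy as np
from scipy.stats import beta

omega, kappa = 0.25, 10.0
a = omega * (kappa - 2) + 1          # alpha = omega*(kappa - 2) + 1 = 3
b = (1 - omega) * (kappa - 2) + 1    # beta  = (1 - omega)*(kappa - 2) + 1 = 7

mode = (a - 1) / (a + b - 2)         # the mode formula recovers omega exactly
assert abs(mode - omega) < 1e-12

# The argmax of the density on a fine grid agrees with omega.
x = np.linspace(0.0, 1.0, 100001)
assert abs(x[np.argmax(beta.pdf(x, a, b))] - omega) < 1e-3
```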
+
Solving the system of (coupled) equations given in the above sections as the equations for the mean and the variance of the beta distribution in terms of the original parameters α and β, one can express the α and β parameters in terms of the mean (μ) and the variance (var):
α = μ(μ(1 − μ)/var − 1), β = (1 − μ)(μ(1 − μ)/var − 1), if var < μ(1 − μ)
This parametrization of the beta distribution may lead to a more intuitive understanding than the one based on the original parameters α and β. For example, by expressing the mode, skewness, excess kurtosis and differential entropy in terms of the mean and the variance:
+
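The inversion from mean and variance back to the shape parameters can be checked directly. The following sketch (illustrative Python with scipy; the shape values are arbitrary) recovers α and β by the method of moments:

```python
# Sketch: alpha = mu*(mu*(1 - mu)/var - 1), beta = (1 - mu)*(mu*(1 - mu)/var - 1),
# valid when var < mu*(1 - mu). Illustrative values only.
from scipy.stats import beta

a_true, b_true = 2.0, 5.0
rv = beta(a_true, b_true)
mu, var = rv.mean(), rv.var()

nu = mu * (1 - mu) / var - 1          # equals alpha + beta
a_hat, b_hat = mu * nu, (1 - mu) * nu
assert abs(a_hat - a_true) < 1e-9 and abs(b_hat - b_true) < 1e-9
```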
A beta distribution with the two shape parameters α and β is supported on the range [0,1] or (0,1). It is possible to alter the location and scale of the distribution by introducing two further parameters representing the minimum, a, and maximum, c (c > a), values of the distribution,[1] by a linear transformation substituting the non-dimensional variable x in terms of the new variable y (with support [a,c] or (a,c)) and the parameters a and c:

y = a + (c − a)x, therefore x = (y − a)/(c − a)
The probability density function of the four parameter beta distribution is equal to the two parameter distribution, scaled by the range (c − a), (so that the total area under the density curve equals a probability of one), and with the "y" variable shifted and scaled as follows:
f(y; α, β, a, c) = f(x; α, β)/(c − a) = ((y − a)^(α−1) (c − y)^(β−1)) / ((c − a)^(α+β−1) B(α, β))
That a random variable Y is Beta-distributed with four parameters α, β, a, and c will be denoted by:
Y ~ Beta(α, β, a, c)
Some measures of central location are scaled (by (c − a)) and shifted (by a), as follows:
mean(Y) = a + (c − a) α/(α + β)
mode(Y) = a + (c − a) (α − 1)/(α + β − 2), for α, β > 1
median(Y) = a + (c − a) median(Beta(α, β))
Note: the geometric mean and harmonic mean cannot be transformed by a linear transformation in the way that the mean, median and mode can.
+
The shape parameters of Y can be written in terms of its mean μ_Y and variance var_Y as

α = ((μ_Y − a)/(c − a)) ((μ_Y − a)(c − μ_Y)/var_Y − 1), β = ((c − μ_Y)/(c − a)) ((μ_Y − a)(c − μ_Y)/var_Y − 1)
The statistical dispersion measures are scaled (they do not need to be shifted because they are already centered on the mean) by the range (c − a), linearly for the mean deviation and nonlinearly for the variance:
(mean deviation around the mean)(Y) = (c − a) × (mean deviation around the mean)(X)

var(Y) = (c − a)² var(X) = (c − a)² αβ / ((α + β)² (α + β + 1))
Since the skewness and excess kurtosis are non-dimensional quantities (as moments centered on the mean and normalized by the standard deviation), they are independent of the parameters a and c, and therefore equal to the expressions given above in terms of X (with support [0,1] or (0,1)):
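In code, the four-parameter distribution is just the two-parameter one shifted and scaled; for instance, scipy's loc/scale mechanism implements exactly this linear transformation (a sketch with arbitrary values of α, β, a, c):

```python
# Sketch: Beta(alpha, beta, a, c) as scipy's beta with loc=a, scale=c-a.
from scipy.stats import beta

alpha, b, a, c = 2.0, 3.0, -1.0, 4.0       # support [a, c] = [-1, 4]
rv = beta(alpha, b, loc=a, scale=c - a)

# Central location is shifted and scaled: mean(Y) = a + (c - a)*alpha/(alpha + beta)
assert abs(rv.mean() - (a + (c - a) * alpha / (alpha + b))) < 1e-12
# Dispersion scales with the square of the range: var(Y) = (c - a)^2 var(X)
assert abs(rv.var() - (c - a) ** 2 * alpha * b /
           ((alpha + b) ** 2 * (alpha + b + 1))) < 1e-12
```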
+
The mode of a Beta distributed random variable X with α, β > 1 is the most likely value of the distribution (corresponding to the peak in the PDF), and is given by the following expression:[1]
mode = (α − 1)/(α + β − 2)
When both parameters are less than one (α, β < 1), this is the anti-mode: the lowest point of the probability density curve.[7]
+
Letting α = β, the expression for the mode simplifies to 1/2, showing that for α = β > 1 the mode (resp. anti-mode when α, β < 1), is at the center of the distribution: it is symmetric in those cases. See Shapes section in this article for a full list of mode cases, for arbitrary values of α and β. For several of these cases, the maximum value of the density function occurs at one or both ends. In some cases the (maximum) value of the density function occurring at the end is finite. For example, in the case of α = 2, β = 1 (or α = 1, β = 2), the density function becomes a right-triangle distribution which is finite at both ends. In several other cases there is a singularity at one end, where the value of the density function approaches infinity. For example, in the case α = β = 1/2, the Beta distribution simplifies to become the arcsine distribution. There is debate among mathematicians about some of these cases and whether the ends (x = 0, and x = 1) can be called modes or not.[8][2]
+
+
+
One point of this debate is whether the ends (x = 0 and x = 1) are part of the domain of the density function.
For α = 1 and β > 0, median = 1 − 2^(−1/β) (this case is the mirror-image of the power function [0,1] distribution)
+
For α > 0 and β = 1, median = 2^(−1/α) (this case is the power function [0,1] distribution[8])
+
For α = 3 and β = 2, median = 0.6142724318676105..., the real solution to the quartic equation 1 − 8x³ + 6x⁴ = 0, which lies in [0,1].
+
For α = 2 and β = 3, median = 0.38572756813238945... = 1 − median(Beta(3, 2))
+
The following are the limits with one parameter finite (non-zero) and the other approaching these limits:[citation needed]
lim_{α → 0} median = 0, lim_{β → 0} median = 1, lim_{α → ∞} median = 1, lim_{β → ∞} median = 0
A reasonable approximation of the value of the median of the beta distribution, for both α and β greater or equal to one, is given by the formula[9]
median ≈ (α − 1/3)/(α + β − 2/3) for α, β ≥ 1
When α, β ≥ 1, the relative error (the absolute error divided by the median) in this approximation is less than 4% and for both α ≥ 2 and β ≥ 2 it is less than 1%. The absolute error divided by the difference between the mean and the mode is similarly small:
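The quoted error bound is easy to spot-check (illustrative Python with scipy; the parameter pairs are arbitrary):

```python
# Sketch: median ~ (alpha - 1/3)/(alpha + beta - 2/3) for alpha, beta >= 1,
# compared against scipy's exact median (the ppf at 1/2).
from scipy.stats import beta

for a, b in [(1, 1), (2, 3), (3, 2), (5, 1), (10, 30)]:
    approx = (a - 1/3) / (a + b - 2/3)
    exact = beta(a, b).median()
    assert abs(approx - exact) / exact < 0.04   # relative error below 4%
```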
+
The expected value (mean) (μ) of a beta distribution random variable X with two parameters α and β is a function of only the ratio β/α of these parameters:[1]
μ = E[X] = α/(α + β) = 1/(1 + β/α)
Letting α = β in the above expression one obtains μ = 1/2, showing that for α = β the mean is at the center of the distribution: it is symmetric. Also, the following limits can be obtained from the above expression:
lim_{β/α → 0} μ = 1, lim_{β/α → ∞} μ = 0
Therefore, for β/α → 0, or for α/β → ∞, the mean is located at the right end, x = 1. For these limit ratios, the beta distribution becomes a one-point degenerate distribution with a Dirac delta function spike at the right end, x = 1, with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the right end, x = 1.
+
Similarly, for β/α → ∞, or for α/β → 0, the mean is located at the left end, x = 0. The beta distribution becomes a one-point degenerate distribution with a Dirac delta function spike at the left end, x = 0, with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the left end, x = 0. Following are the limits with one parameter finite (non-zero) and the other approaching these limits:
lim_{α → 0} μ = 0, lim_{β → 0} μ = 1, lim_{α → ∞} μ = 1, lim_{β → ∞} μ = 0
While for typical unimodal distributions (with centrally located modes, inflexion points at both sides of the mode, and longer tails) (with Beta(α, β) such that α, β > 2) it is known that the sample mean (as an estimate of location) is not as robust as the sample median, the opposite is the case for uniform or "U-shaped" bimodal distributions (with Beta(α, β) such that α, β ≤ 1), with the modes located at the ends of the distribution. As Mosteller and Tukey remark ([10] p. 207) "the average of the two extreme observations uses all the sample information. This illustrates how, for short-tailed distributions, the extreme observations should get more weight." By contrast, it follows that the median of "U-shaped" bimodal distributions with modes at the edge of the distribution (with Beta(α, β) such that α, β ≤ 1) is not robust, as the sample median drops the extreme sample observations from consideration. A practical application of this occurs for example for random walks, since the probability for the time of the last visit to the origin in a random walk is distributed as the arcsine distribution Beta(1/2, 1/2):[5][11] the mean of a number of realizations of a random walk is a much more robust estimator than the median (which is an inappropriate sample measure estimate in this case).
+
Therefore, the geometric mean of a beta distribution with shape parameters α and β is the exponential of the digamma functions of α and β as follows:
G_X = exp(ψ(α) − ψ(α + β))
While for a beta distribution with equal shape parameters α = β, it follows that skewness = 0 and mode = mean = median = 1/2, the geometric mean is less than 1/2: 0 < G_X < 1/2. The reason for this is that the logarithmic transformation strongly weights the values of X close to zero, as ln(X) tends strongly towards negative infinity as X approaches zero, while ln(X) flattens towards zero as X → 1.
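This can be verified numerically (illustrative Python; scipy assumed, shape values arbitrary): the closed form exp(ψ(α) − ψ(α + β)) matches the geometric mean obtained by integrating ln(x) against the density, and for α = β it stays below 1/2:

```python
# Sketch: G_X = exp(psi(alpha) - psi(alpha + beta)), with psi the digamma function.
import numpy as np
from scipy.integrate import quad
from scipy.special import digamma
from scipy.stats import beta

a, b = 2.0, 3.0
G = np.exp(digamma(a) - digamma(a + b))

# E[ln X] by direct numerical integration against the beta density.
mean_log, _ = quad(lambda x: np.log(x) * beta.pdf(x, a, b), 0, 1)
assert abs(G - np.exp(mean_log)) < 1e-6

# Symmetric case: the geometric mean is strictly below the mean 1/2.
G_sym = np.exp(digamma(4.0) - digamma(8.0))
assert 0 < G_sym < 0.5
```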
+
Along a line α = β, the following limits apply:
lim_{α = β → 0} G_X = 0, lim_{α = β → ∞} G_X = 1/2
Following are the limits with one parameter finite (non-zero) and the other approaching these limits:
lim_{α → 0} G_X = 0, lim_{β → 0} G_X = 1, lim_{α → ∞} G_X = 1, lim_{β → ∞} G_X = 0
The accompanying plot shows the difference between the mean and the geometric mean for shape parameters α and β from zero to 2. Besides the fact that the difference between them approaches zero as α and β approach infinity and that the difference becomes large for values of α and β approaching zero, one can observe an evident asymmetry of the geometric mean with respect to the shape parameters α and β. The difference between the geometric mean and the mean is larger for small values of α in relation to β than when exchanging the magnitudes of β and α.
+
N. L. Johnson and S. Kotz[1] suggest the logarithmic approximation to the digamma function ψ(α) ≈ ln(α − 1/2), which results in the following approximation to the geometric mean:

G_X ≈ (α − 1/2)/(α + β − 1/2) for α, β > 1
Similarly, one can calculate the value of the shape parameters required for the geometric mean to equal 1/2. Given the value of the parameter β, what would be the value of the other parameter, α, required for the geometric mean to equal 1/2? The answer is that (for β > 1) the value of α required tends towards β + 1/2 as β → ∞. For example, all these couples have the same geometric mean of 1/2: [β = 1, α = 1.4427], [β = 2, α = 2.46958], [β = 3, α = 3.47943], [β = 4, α = 4.48449], [β = 5, α = 5.48756], [β = 10, α = 10.4938], [β = 100, α = 100.499].
+
The fundamental property of the geometric mean, which can be proven to be false for any other mean, is
G(X_i / Y_i) = G(X_i) / G(Y_i)
This makes the geometric mean the only correct mean when averaging normalized results, that is, results that are presented as ratios to reference values.[12] This is relevant because the beta distribution is a suitable model for the random behavior of percentages, and it is particularly suitable for the statistical modelling of proportions. The geometric mean plays a central role in maximum likelihood estimation; see section "Parameter estimation, maximum likelihood." Actually, when performing maximum likelihood estimation, besides the geometric mean G_X based on the random variable X, another geometric mean appears naturally: the geometric mean based on the linear transformation (1 − X), the mirror-image of X, denoted by G_(1−X):
G_(1−X) = exp(ψ(β) − ψ(α + β))
Along a line α = β, the following limits apply:
lim_{α = β → 0} G_(1−X) = 0, lim_{α = β → ∞} G_(1−X) = 1/2
Following are the limits with one parameter finite (non-zero) and the other approaching these limits:
lim_{α → 0} G_(1−X) = 1, lim_{β → 0} G_(1−X) = 0, lim_{α → ∞} G_(1−X) = 0, lim_{β → ∞} G_(1−X) = 1
It has the following approximate value:
G_(1−X) ≈ (β − 1/2)/(α + β − 1/2) for α, β > 1
Although both G_X and G_(1−X) are asymmetric, in the case that both shape parameters are equal, α = β, the geometric means are equal: G_X = G_(1−X). This equality follows from the following symmetry displayed between both geometric means:

G_X(Beta(α, β)) = G_(1−X)(Beta(β, α)) = exp(ψ(α) − ψ(α + β))
The inverse of the harmonic mean (H_X) of a distribution with random variable X is the arithmetic mean of 1/X, or, equivalently, its expected value. Therefore, the harmonic mean (H_X) of a beta distribution with shape parameters α and β is:

H_X = (α − 1)/(α + β − 1) for α > 1 and β > 0
The harmonic mean (HX) of a Beta distribution with α < 1 is undefined, because its defining expression is not bounded in [0, 1] for shape parameter α less than unity.
+
Letting α = β in the above expression one obtains
H_X = (α − 1)/(2α − 1)
showing that for α = β the harmonic mean ranges from 0, for α = β = 1, to 1/2, for α = β → ∞.
+
Following are the limits with one parameter finite (non-zero) and the other approaching these limits:
lim_{α → 1} H_X = 0, lim_{β → 0} H_X = 1, lim_{α → ∞} H_X = 1, lim_{β → ∞} H_X = 0
The harmonic mean plays a role in maximum likelihood estimation for the four parameter case, in addition to the geometric mean. Actually, when performing maximum likelihood estimation for the four parameter case, besides the harmonic mean H_X based on the random variable X, another harmonic mean appears naturally: the harmonic mean based on the linear transformation (1 − X), the mirror-image of X, denoted by H_(1−X):
H_(1−X) = (β − 1)/(α + β − 1) for β > 1 and α > 0
The harmonic mean (H_(1−X)) of a beta distribution with β < 1 is undefined, because its defining expression is not bounded in [0, 1] for shape parameter β less than unity.
+
Letting α = β in the above expression one obtains
H_(1−X) = (α − 1)/(2α − 1)
showing that for α = β the harmonic mean ranges from 0, for α = β = 1, to 1/2, for α = β → ∞.
+
Following are the limits with one parameter finite (non-zero) and the other approaching these limits:
lim_{β → 1} H_(1−X) = 0, lim_{α → 0} H_(1−X) = 1, lim_{β → ∞} H_(1−X) = 1, lim_{α → ∞} H_(1−X) = 0
Although both H_X and H_(1−X) are asymmetric, in the case that both shape parameters are equal, α = β, the harmonic means are equal: H_X = H_(1−X). This equality follows from the following symmetry displayed between both harmonic means:

H_X(Beta(α, β)) = H_(1−X)(Beta(β, α)) = (α − 1)/(α + β − 1)
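Both harmonic means can be checked against direct integration (illustrative Python; scipy assumed, the values α = 3 and β = 2 are arbitrary):

```python
# Sketch: H_X = (alpha - 1)/(alpha + beta - 1) for alpha > 1, as the
# reciprocal of E[1/X]; mirrored H_(1-X) = (beta - 1)/(alpha + beta - 1).
from scipy.integrate import quad
from scipy.stats import beta

a, b = 3.0, 2.0
H_X = (a - 1) / (a + b - 1)                     # = 0.5 here
E_inv, _ = quad(lambda x: beta.pdf(x, a, b) / x, 0, 1)
assert abs(1 / E_inv - H_X) < 1e-8

H_1mX = (b - 1) / (a + b - 1)                   # = 0.25 here, needs beta > 1
E_inv_m, _ = quad(lambda x: beta.pdf(x, a, b) / (1 - x), 0, 1)
assert abs(1 / E_inv_m - H_1mX) < 1e-8
```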
The variance (the second moment centered on the mean) of a beta distribution random variable X with parameters α and β is:[1][13]
var(X) = E[(X − μ)²] = αβ / ((α + β)² (α + β + 1))
Letting α = β in the above expression one obtains
var(X) = 1 / (4(2α + 1))
showing that for α = β the variance decreases monotonically as α = β increases. Setting α = β = 0 in this expression, one finds the maximum variance var(X) = 1/4,[1] which is only attained in the limit, as α = β → 0.
+
The beta distribution may also be parametrized in terms of its mean μ(0 < μ < 1) and sample size ν = α + β (ν > 0) (see subsection Mean and sample size):
α = μν, β = (1 − μ)ν
Using this parametrization, one can express the variance in terms of the mean μ and the sample size ν as follows:
var(X) = μ(1 − μ)/(1 + ν)
Since ν = α + β > 0, it follows that var(X) < μ(1 − μ).
+
For a symmetric distribution, the mean is at the middle of the distribution, μ = 1/2, and therefore:
var(X) = 1/(4(1 + ν))
Also, the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:
lim_{ν → 0} var(X) = μ(1 − μ), lim_{ν → ∞} var(X) = 0, lim_{μ → 0} var(X) = 0, lim_{μ → 1} var(X) = 0
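A quick check of the μ, ν form of the variance (illustrative Python; scipy assumed, the values of μ and ν are arbitrary):

```python
# Sketch: var(X) = mu*(1 - mu)/(1 + nu) under alpha = mu*nu, beta = (1 - mu)*nu.
from scipy.stats import beta

mu, nu = 0.2, 8.0
rv = beta(mu * nu, (1 - mu) * nu)
assert abs(rv.var() - mu * (1 - mu) / (1 + nu)) < 1e-12

# The bound var(X) < mu*(1 - mu) holds for every nu > 0.
assert rv.var() < mu * (1 - mu)
```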
The logarithm of the geometric variance, ln(varG_X), of a distribution with random variable X is the second moment of the logarithm of X centered on the geometric mean of X, ln(G_X):
ln varG_X = E[(ln X − ln G_X)²] = ψ₁(α) − ψ₁(α + β), where ψ₁ is the trigamma function
and therefore, the geometric variance is:
varG_X = exp(ψ₁(α) − ψ₁(α + β))
In the Fisher information matrix, and the curvature of the log likelihood function, the logarithm of the geometric variance of the reflected variable 1 − X and the logarithm of the geometric covariance between X and 1 − X appear:
ln varG_(1−X) = ψ₁(β) − ψ₁(α + β), ln covG_(X,(1−X)) = −ψ₁(α + β)
For a beta distribution, higher order logarithmic moments can be derived by using the representation of a beta distribution as a proportion of two Gamma distributions and differentiating through the integral. They can be expressed in terms of higher order poly-gamma functions. See the section § Moments of logarithmically transformed random variables. The variance of the logarithmic variables and covariance of ln X and ln(1−X) are:
var[ln X] = ψ₁(α) − ψ₁(α + β), var[ln(1 − X)] = ψ₁(β) − ψ₁(α + β), cov[ln X, ln(1 − X)] = −ψ₁(α + β)
The accompanying plots show the log geometric variances and log geometric covariance versus the shape parameters α and β. The plots show that the log geometric variances and log geometric covariance are close to zero for shape parameters α and β greater than 2, and that the log geometric variances rapidly rise in value for shape parameter values α and β less than unity. The log geometric variances are positive for all values of the shape parameters. The log geometric covariance is negative for all values of the shape parameters, and it reaches large negative values for α and β less than unity.
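The trigamma expressions can be confirmed by direct integration (illustrative Python; scipy assumed, shape values arbitrary):

```python
# Sketch: ln(varG_X) = psi1(alpha) - psi1(alpha + beta), psi1 = trigamma,
# i.e. the second moment of ln(x) about ln(G_X) = E[ln X].
import numpy as np
from scipy.integrate import quad
from scipy.special import digamma, polygamma
from scipy.stats import beta

a, b = 2.0, 3.0
ln_GX = digamma(a) - digamma(a + b)                 # E[ln X]
ln_varG, _ = quad(lambda x: (np.log(x) - ln_GX) ** 2 * beta.pdf(x, a, b), 0, 1)
assert abs(ln_varG - (polygamma(1, a) - polygamma(1, a + b))) < 1e-6

# The log geometric covariance, -psi1(alpha + beta), is negative for all shapes.
assert -polygamma(1, a + b) < 0
```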
+
Following are the limits with one parameter finite (non-zero) and the other approaching these limits:
+
+
+
Limits with two parameters varying:
+
+
+
Although both ln(varG_X) and ln(varG_(1−X)) are asymmetric, when the shape parameters are equal, α = β, one has: ln(varG_X) = ln(varG_(1−X)). This equality follows from the following symmetry displayed between both log geometric variances:

ln varG_X(Beta(α, β)) = ln varG_(1−X)(Beta(β, α)) = ψ₁(α) − ψ₁(α + β)
The mean absolute deviation around the mean for the beta distribution with shape parameters α and β is:[8]
E[|X − μ|] = 2 α^α β^β / (B(α, β) (α + β)^(α+β+1))
The mean absolute deviation around the mean is a more robust estimator of statistical dispersion than the standard deviation for beta distributions with tails and inflection points at each side of the mode, Beta(α, β) distributions with α, β > 2, as it depends on the linear (absolute) deviations rather than the square deviations from the mean. Therefore, the effect of very large deviations from the mean is not as overly weighted.
+
Using Stirling's approximation to the Gamma function, N. L. Johnson and S. Kotz[1] derived the following approximation for values of the shape parameters greater than unity (the relative error for this approximation is only −3.5% for α = β = 1, and it decreases to zero as α → ∞, β → ∞):
+
+
+
At the limit α → ∞, β → ∞, the ratio of the mean absolute deviation to the standard deviation (for the beta distribution) becomes equal to the ratio of the same measures for the normal distribution: √(2/π) ≈ 0.798. For α = β = 1 this ratio equals √3/2 ≈ 0.866, so that from α = β = 1 to α, β → ∞ the ratio decreases by 8.5%. For α = β = 0 the standard deviation is exactly equal to the mean absolute deviation around the mean. Therefore, this ratio decreases by 15% from α = β = 0 to α = β = 1, and by 25% from α = β = 0 to α, β → ∞. However, for skewed beta distributions such that α → 0 or β → 0, the ratio of the standard deviation to the mean absolute deviation approaches infinity (although each of them, individually, approaches zero) because the mean absolute deviation approaches zero faster than the standard deviation.
+
Using the parametrization in terms of mean μ and sample size ν = α + β > 0:
+
+
α = μν, β = (1−μ)ν
+
one can express the mean absolute deviation around the mean in terms of the mean μ and the sample size ν as follows:
E[|X − μ|] = 2 μ^(μν) (1 − μ)^((1−μ)ν) / (ν B(μν, (1 − μ)ν))
For a symmetric distribution, the mean is at the middle of the distribution, μ = 1/2, and therefore:
E[|X − μ|] = 2^(1−ν) / (ν B(ν/2, ν/2))
Also, the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:
+
The skewness (the third moment centered on the mean, normalized by the 3/2 power of the variance) of the beta distribution is[1]
γ₁ = E[(X − μ)³]/(var(X))^(3/2) = 2(β − α)√(α + β + 1) / ((α + β + 2)√(αβ))
Letting α = β in the above expression one obtains γ₁ = 0, showing once again that for α = β the distribution is symmetric and hence the skewness is zero. The skewness is positive (right-tailed) for α < β and negative (left-tailed) for α > β.
+
Using the parametrization in terms of mean μ and sample size ν = α + β:
α = μν, β = (1 − μ)ν
one can express the skewness in terms of the mean μ and the sample size ν as follows:
γ₁ = (2(1 − 2μ)/(2 + ν)) √((1 + ν)/(μ(1 − μ)))
The skewness can also be expressed just in terms of the variance var and the mean μ as follows:
γ₁ = 2(1 − 2μ)√var / (μ(1 − μ) + var)
The accompanying plot of skewness as a function of variance and mean shows that maximum variance (1/4) is coupled with zero skewness and the symmetry condition (μ = 1/2), and that maximum skewness (positive or negative infinity) occurs when the mean is located at one end or the other, so that the "mass" of the probability distribution is concentrated at the ends (minimum variance).
+
The following expression for the square of the skewness, in terms of the sample size ν = α + β and the variance var, is useful for the method of moments estimation of four parameters:
(γ₁)² = (4/(2 + ν)²) (1 − 4(1 + ν)var)/var
This expression correctly gives a skewness of zero for α = β, since in that case (see § Variance): var = 1/(4(1 + ν)).
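The closed form for the skewness can be spot-checked against scipy (illustrative; the parameter pairs are arbitrary):

```python
# Sketch: gamma1 = 2*(beta - alpha)*sqrt(alpha + beta + 1)
#                  / ((alpha + beta + 2)*sqrt(alpha*beta)).
import numpy as np
from scipy.stats import beta

for a, b in [(2.0, 5.0), (5.0, 2.0), (3.0, 3.0)]:
    g1 = 2 * (b - a) * np.sqrt(a + b + 1) / ((a + b + 2) * np.sqrt(a * b))
    assert abs(g1 - beta(a, b).stats(moments='s')) < 1e-9

# alpha < beta gives positive (right-tailed) skew, alpha = beta gives zero.
assert beta(2.0, 5.0).stats(moments='s') > 0
assert abs(beta(3.0, 3.0).stats(moments='s')) < 1e-12
```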
+
For the symmetric case (α = β), skewness = 0 over the whole range, and the following limits apply:
+
+
+
For the asymmetric cases (α ≠ β) the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:
+
The beta distribution has been applied in acoustic analysis to assess damage to gears, as the kurtosis of the beta distribution has been reported to be a good indicator of the condition of a gear.[14] Kurtosis has also been used to distinguish the seismic signal generated by a person's footsteps from other signals. As persons or other targets moving on the ground generate continuous signals in the form of seismic waves, one can separate different targets based on the seismic waves they generate. Kurtosis is sensitive to impulsive signals, so it is much more sensitive to the signal generated by human footsteps than to signals generated by vehicles, wind, noise, etc.[15] Unfortunately, the notation for kurtosis has not been standardized. Kenney and Keeping[16] use the symbol γ₂ for the excess kurtosis, but Abramowitz and Stegun[17] use different terminology. To prevent confusion[18] between kurtosis (the fourth moment centered on the mean, normalized by the square of the variance) and excess kurtosis, when using symbols, they will be spelled out as follows:[8][19]
excess kurtosis = kurtosis − 3 = 6[(α − β)²(α + β + 1) − αβ(α + β + 2)] / (αβ(α + β + 2)(α + β + 3))
Letting α = β in the above expression one obtains
excess kurtosis = −6/(2α + 3).
+
Therefore, for symmetric beta distributions, the excess kurtosis is negative, increasing from a minimum value of −2 at the limit as {α = β} → 0, and approaching a maximum value of zero as {α = β} → ∞. The value of −2 is the minimum value of excess kurtosis that any distribution (not just beta distributions, but any distribution of any possible kind) can ever achieve. This minimum value is reached when all the probability density is entirely concentrated at each end x = 0 and x = 1, with nothing in between: a 2-point Bernoulli distribution with equal probability 1/2 at each end (a coin toss: see section below "Kurtosis bounded by the square of the skewness" for further discussion). The description of kurtosis as a measure of the "potential outliers" (or "potential rare, extreme values") of the probability distribution is correct for all distributions, including the beta distribution. The more that rare, extreme values can occur in the beta distribution, the higher its kurtosis; otherwise, the kurtosis is lower. For α ≠ β, skewed beta distributions, the excess kurtosis can reach unlimited positive values (particularly for α → 0 for finite β, or for β → 0 for finite α) because the side away from the mode will produce occasional extreme values. Minimum kurtosis takes place when the mass density is concentrated equally at each end (and therefore the mean is at the center), and there is no probability mass density in between the ends.
+
Using the parametrization in terms of mean μ and sample size ν = α + β:
α = μν, β = (1 − μ)ν
one can express the excess kurtosis in terms of the mean μ and the sample size ν as follows:
excess kurtosis = (6/(3 + ν)) [ ((1 − 2μ)²(1 + ν)) / (μ(1 − μ)(2 + ν)) − 1 ]
The excess kurtosis can also be expressed in terms of just the following two parameters: the variance var, and the sample size ν as follows:
excess kurtosis = (6/(3 + ν)) [ (1 − 4(1 + ν)var) / ((2 + ν)var) − 1 ]
and, in terms of the variance var and the mean μ as follows:
excess kurtosis = (6 var/(μ(1 − μ) + 2var)) [ (1 − 2μ)²/(μ(1 − μ) + var) − 1 ]
The plot of excess kurtosis as a function of the variance and the mean shows that the minimum value of the excess kurtosis (−2, which is the minimum possible value for excess kurtosis for any distribution) is intimately coupled with the maximum value of variance (1/4) and the symmetry condition: the mean occurring at the midpoint (μ = 1/2). This occurs for the symmetric case of α = β = 0, with zero skewness. At the limit, this is the 2 point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1 and zero probability everywhere else. (A coin toss: one face of the coin being x = 0 and the other face being x = 1.) Variance is maximum because the distribution is bimodal with nothing in between the two modes (spikes) at each end. Excess kurtosis is minimum: the probability density "mass" is zero at the mean and it is concentrated at the two peaks at each end. Excess kurtosis reaches the minimum possible value (for any distribution) when the probability density function has two spikes at each end: it is bi-"peaky" with nothing in between them.
+
On the other hand, the plot shows that for extreme skewed cases, where the mean is located near one or the other end (μ = 0 or μ = 1), the variance is close to zero, and the excess kurtosis rapidly approaches infinity when the mean of the distribution approaches either end.
+
Alternatively, the excess kurtosis can also be expressed in terms of just the following two parameters: the square of the skewness, and the sample size ν as follows:
excess kurtosis = (6/(3 + ν)) [ ((2 + ν)/4) (skewness)² − 1 ]
From this last expression, one can obtain the same limits published over a century ago by Karl Pearson[20] for the beta distribution (see section below titled "Kurtosis bounded by the square of the skewness"). Setting α + β = ν = 0 in the above expression, one obtains Pearson's lower boundary (values for the skewness and excess kurtosis below the boundary (excess kurtosis + 2 − skewness² = 0) cannot occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the "impossible region"). The limit of α + β = ν → ∞ determines Pearson's upper boundary.
lim_{ν → 0} excess kurtosis = (skewness)² − 2, lim_{ν → ∞} excess kurtosis = (3/2)(skewness)²
therefore:
(skewness)² − 2 < excess kurtosis < (3/2)(skewness)²
Values of ν = α + β such that ν ranges from zero to infinity, 0 < ν < ∞, span the whole region of the beta distribution in the plane of excess kurtosis versus squared skewness.
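The two boundaries can be exercised numerically (illustrative Python; scipy assumed, the parameter pairs are arbitrary): every beta distribution's excess kurtosis falls strictly between skewness² − 2 and (3/2)·skewness²:

```python
# Sketch: Pearson's bounds skew^2 - 2 < excess kurtosis < 1.5*skew^2
# for the beta family (scipy's 'k' moment is the excess kurtosis).
from scipy.stats import beta

for a, b in [(0.5, 0.5), (1.0, 3.0), (2.0, 2.0), (10.0, 0.7)]:
    skew, excess = beta(a, b).stats(moments='sk')
    assert skew ** 2 - 2 < excess < 1.5 * skew ** 2
```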
+
For the symmetric case (α = β), the following limits apply:
lim_{α = β → 0} excess kurtosis = −2, lim_{α = β → ∞} excess kurtosis = 0
For the asymmetric cases (α ≠ β) the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:
+
The characteristic function of the beta distribution is the confluent hypergeometric function (of the first kind), φ_X(t) = 1F1(α; α + β; it). In its series expansion, (α)_k = α(α + 1)⋯(α + k − 1) is the rising factorial, also called the "Pochhammer symbol". The value of the characteristic function for t = 0 is one: φ_X(0) = 1.
+
Also, the real and imaginary parts of the characteristic function enjoy the following symmetries with respect to the origin of variable t:
Re[1F1(α; α + β; it)] = Re[1F1(α; α + β; −it)]

Im[1F1(α; α + β; it)] = −Im[1F1(α; α + β; −it)]
The symmetric case α = β simplifies the characteristic function of the beta distribution to a Bessel function, since in the special case α + β = 2α the confluent hypergeometric function (of the first kind) reduces to a Bessel function (the modified Bessel function of the first kind I_(α−1/2)) using Kummer's second transformation as follows:
1F1(α; 2α; it) = e^(it/2) 0F1(; α + 1/2; −t²/16) = e^(it/2) (it/4)^(1/2−α) Γ(α + 1/2) I_(α−1/2)(it/2)
In the accompanying plots, the real part (Re) of the characteristic function of the beta distribution is displayed for symmetric (α = β) and skewed (α ≠ β) cases.
+
Moments of linearly transformed, product and inverted random variables
+
One can also show the following expectations for a transformed random variable,[1] where the random variable X is Beta-distributed with parameters α and β: X ~ Beta(α, β). The expected value of the variable 1 − X is the mirror-symmetry of the expected value based on X:
E[1 − X] = 1 − E[X] = β/(α + β)
Due to the mirror-symmetry of the probability density function of the beta distribution, the variances based on the variables X and 1 − X are identical, and the covariance of X and 1 − X is the negative of the variance:
var[1 − X] = var[X], cov[X, 1 − X] = −var[X] = −αβ/((α + β)²(α + β + 1))
These are the expected values for inverted variables, (these are related to the harmonic means, see § Harmonic mean):
E[1/X] = (α + β − 1)/(α − 1) if α > 1, E[1/(1 − X)] = (α + β − 1)/(β − 1) if β > 1
The following transformation, dividing the variable X by its mirror-image, X/(1 − X), results in the expected value of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI):[1]
E[X/(1 − X)] = α/(β − 1) if β > 1, E[(1 − X)/X] = β/(α − 1) if α > 1
Variances of these transformed variables can be obtained by integration, as the expected values of the second moments centered on the corresponding variables:
var[1/X] = β(α + β − 1)/((α − 2)(α − 1)²) if α > 2

var[1/(1 − X)] = α(α + β − 1)/((β − 2)(β − 1)²) if β > 2
The following variance of the variable X divided by its mirror-image, X/(1 − X), results in the variance of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI):[1]
var[X/(1 − X)] = α(α + β − 1)/((β − 2)(β − 1)²) if β > 2

var[(1 − X)/X] = β(α + β − 1)/((α − 2)(α − 1)²) if α > 2
The covariances are:
cov[1/X, 1/(1 − X)] = cov[(1 − X)/X, X/(1 − X)] = −(α + β − 1)/((α − 1)(β − 1)) if α, β > 1
These expectations and variances appear in the four-parameter Fisher information matrix (§ Fisher information.)
+
+
Moments of logarithmically transformed random variables
Logit transformations are interesting,[23] as they usually transform various shapes (including J-shapes) into (usually skewed) bell-shaped densities over the logit variable, and they may remove the end singularities over the original variable:
E[ln X] = ψ(α) − ψ(α + β), E[ln(1 − X)] = ψ(β) − ψ(α + β)

E[ln(X/(1 − X))] = ψ(α) − ψ(β), E[ln((1 − X)/X)] = ψ(β) − ψ(α)
Johnson[24] considered the distribution of the logit-transformed variable ln(X/(1 − X)), including its moment generating function and approximations for large values of the shape parameters. This transformation extends the finite support [0, 1] based on the original variable X to infinite support in both directions of the real line (−∞, +∞).
+
Higher order logarithmic moments can be derived by using the representation of a beta distribution as a proportion of two Gamma distributions and differentiating through the integral. They can be expressed in terms of higher order poly-gamma functions as follows:
E[(ln X)²] = (ψ(α) − ψ(α + β))² + ψ₁(α) − ψ₁(α + β)

E[(ln(1 − X))²] = (ψ(β) − ψ(α + β))² + ψ₁(β) − ψ₁(α + β)
therefore the variance of the logarithmic variables and covariance of ln(X) and ln(1−X) are:
var[ln X] = ψ₁(α) − ψ₁(α + β), var[ln(1 − X)] = ψ₁(β) − ψ₁(α + β), cov[ln X, ln(1 − X)] = −ψ₁(α + β)
The variances and covariance of the logarithmically transformed variables X and (1−X) are different, in general, because the logarithmic transformation destroys the mirror-symmetry of the original variables X and (1−X), as the logarithm approaches negative infinity for the variable approaching zero.
+
These logarithmic variances and covariance are the elements of the Fisher information matrix for the beta distribution. They are also a measure of the curvature of the log likelihood function (see section on Maximum likelihood estimation).
+
The variances of the log inverse variables are identical to the variances of the log variables:
+
+
+
It also follows that the variances of the logit transformed variables are:
+
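These closed forms can be checked numerically; the following sketch uses SciPy's psi (digamma) and polygamma (trigamma) functions, cross-checked against numerical integration (SciPy is an assumption here, not part of the article):

```python
# Log-moments of X ~ Beta(a, b): E[ln X] = psi(a) - psi(a+b),
# var[ln X] = psi'(a) - psi'(a+b), var[logit X] = psi'(a) + psi'(b);
# checked against numerical integration via scipy.stats.beta.expect.
import numpy as np
from scipy.stats import beta
from scipy.special import psi, polygamma

a, b = 2.0, 3.0
mean_lnX = psi(a) - psi(a + b)                    # E[ln X]
var_lnX = polygamma(1, a) - polygamma(1, a + b)   # var[ln X]
var_logit = polygamma(1, a) + polygamma(1, b)     # var[ln(X/(1-X))]

num_mean = beta.expect(np.log, args=(a, b))
num_var = beta.expect(lambda x: (np.log(x) - mean_lnX) ** 2, args=(a, b))
assert abs(mean_lnX - num_mean) < 1e-8
assert abs(var_lnX - num_var) < 1e-8
assert var_logit > var_lnX   # adding var[ln(1-X)] minus twice the (negative) covariance
```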
The digamma function ψ appears in the formula for the differential entropy as a consequence of Euler's integral formula for the harmonic numbers, which follows from the integral:
+
+
+
The differential entropy of the beta distribution is negative for all values of α and β greater than zero, except at α = β = 1 (for which values the beta distribution is the same as the uniform distribution), where the differential entropy reaches its maximum value of zero. It is to be expected that the maximum entropy should take place when the beta distribution becomes equal to the uniform distribution, since uncertainty is maximal when all possible events are equiprobable.
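The behavior described above (entropy zero at α = β = 1, negative elsewhere) can be verified with the closed form h = ln B(α, β) − (α − 1)ψ(α) − (β − 1)ψ(β) + (α + β − 2)ψ(α + β); the sketch below cross-checks it against SciPy's built-in entropy (SciPy is an assumption, not part of the article):

```python
# Differential entropy of Beta(a, b) in nats: maximal (zero) at a = b = 1
# (the uniform distribution) and negative for every other beta distribution.
from scipy.stats import beta
from scipy.special import betaln, psi

def beta_entropy(a, b):
    """h = ln B(a,b) - (a-1)psi(a) - (b-1)psi(b) + (a+b-2)psi(a+b)."""
    return (betaln(a, b) - (a - 1) * psi(a) - (b - 1) * psi(b)
            + (a + b - 2) * psi(a + b))

assert abs(beta_entropy(1, 1)) < 1e-12                  # uniform: entropy 0
assert beta_entropy(3, 3) < 0                           # any other beta: negative
assert abs(beta_entropy(3, 3) - beta.entropy(3, 3)) < 1e-6
```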
+
For α or β approaching zero, the differential entropy approaches its minimum value of negative infinity. For (either or both) α or β approaching zero, there is a maximum amount of order: all the probability density is concentrated at the ends, and there is zero probability density at points located between the ends. Similarly for (either or both) α or β approaching infinity, the differential entropy approaches its minimum value of negative infinity, and a maximum amount of order. If either α or β approaches infinity (and the other is finite) all the probability density is concentrated at an end, and the probability density is zero everywhere else. If both shape parameters are equal (the symmetric case), α = β, and they approach infinity simultaneously, the probability density becomes a spike (Dirac delta function) concentrated at the middle x = 1/2, and hence there is 100% probability at the middle x = 1/2 and zero probability everywhere else.
+
+
The (continuous case) differential entropy was introduced by Shannon in his original paper (where he named it the "entropy of a continuous distribution"), as the concluding part of the same paper where he defined the discrete entropy.[26] It has been known since then that the differential entropy may differ from the infinitesimal limit of the discrete entropy by an infinite offset, and therefore the differential entropy can be negative (as it is for the beta distribution). What really matters is the relative value of entropy.
+
Given two beta distributed random variables, X1 ~ Beta(α, β) and X2 ~ Beta(α′, β′), the cross-entropy is (measured in nats)[27]
+
+
+
The cross-entropy has been used as an error metric to measure the distance between two hypotheses.[28][29] Its absolute value is minimal when the two distributions are identical. It is the information measure most closely related to the log maximum likelihood[27] (see section "Parameter estimation, Maximum likelihood estimation").
+
The relative entropy, or Kullback–Leibler divergence DKL(X1 || X2), is a measure of the inefficiency of assuming that the distribution is X2 ~ Beta(α′, β′) when the distribution is really X1 ~ Beta(α, β). It is defined as follows (measured in nats):
+
+
+
The relative entropy, or Kullback–Leibler divergence, is always non-negative. A few numerical examples follow:
+
The Kullback–Leibler divergence is not symmetric, DKL(X1 || X2) ≠ DKL(X2 || X1), even for the case in which the individual beta distributions Beta(1, 1) and Beta(3, 3) are symmetric but have different entropies h(X1) ≠ h(X2). The value of the Kullback–Leibler divergence depends on the direction traveled: whether going from a higher (differential) entropy to a lower (differential) entropy or the other way around. In the numerical example above, the Kullback–Leibler divergence measures the inefficiency of assuming that the distribution is (bell-shaped) Beta(3, 3), rather than (uniform) Beta(1, 1). The entropy h of Beta(1, 1) is higher than the entropy h of Beta(3, 3) because the uniform distribution Beta(1, 1) has the maximum amount of disorder. The Kullback–Leibler divergence is more than two times higher (0.598803 instead of 0.267864) when measured in the direction of decreasing entropy: the direction that assumes that the (uniform) Beta(1, 1) distribution is (bell-shaped) Beta(3, 3), rather than the other way around. In this restricted sense, the Kullback–Leibler divergence is consistent with the second law of thermodynamics.
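The closed-form divergence between two beta distributions reproduces the numerical example above; the following sketch (using SciPy, an assumption on my part) implements the standard formula in terms of ln B and the digamma function ψ:

```python
# D_KL(Beta(a,b) || Beta(a2,b2)) in nats, from the standard closed form:
# ln B(a2,b2) - ln B(a,b) + (a-a2)psi(a) + (b-b2)psi(b) + (a2-a+b2-b)psi(a+b).
from scipy.special import betaln, psi

def kl_beta(a, b, a2, b2):
    return (betaln(a2, b2) - betaln(a, b)
            + (a - a2) * psi(a) + (b - b2) * psi(b)
            + (a2 - a + b2 - b) * psi(a + b))

# Reproduces the two directed divergences quoted in the text:
assert abs(kl_beta(1, 1, 3, 3) - 0.598803) < 1e-5   # assuming Beta(3,3) when really uniform
assert abs(kl_beta(3, 3, 1, 1) - 0.267864) < 1e-5   # the reverse direction
```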
+
The Kullback–Leibler divergence is symmetric DKL(X1 || X2) = DKL(X2 || X1) for the skewed cases Beta(3, 0.5) and Beta(0.5, 3) that have equal differential entropy h(X1) = h(X2).
+
The symmetry condition:
+
+
+
follows from the above definitions and the mirror-symmetry f(x; α, β) = f(1−x; α, β) enjoyed by the beta distribution.
+
If 1 < α < β then mode ≤ median ≤ mean.[9] Expressing the mode (valid only for α, β > 1) and the mean in terms of α and β:
+
+
+
If 1 < β < α then the order of the inequalities is reversed. For α, β > 1 the absolute distance between the mean and the median is less than 5% of the distance between the maximum and minimum values of x. On the other hand, the absolute distance between the mean and the mode can reach 50% of the distance between the maximum and minimum values of x, for the (pathological) case of α = 1 and β = 1, for which values the beta distribution approaches the uniform distribution and the differential entropy approaches its maximum value, and hence maximum "disorder".
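The ordering mode ≤ median ≤ mean for 1 < α < β can be checked directly; this sketch uses the closed forms mode = (α − 1)/(α + β − 2) and mean = α/(α + β), with the median obtained from the inverse CDF (SciPy is an assumption here):

```python
# Verify mode <= median <= mean for a beta distribution with 1 < a < b.
from scipy.stats import beta

a, b = 2.0, 5.0                     # 1 < a < b, positively skewed
mode = (a - 1) / (a + b - 2)        # closed-form mode (valid for a, b > 1)
mean = a / (a + b)                  # closed-form mean
median = beta.ppf(0.5, a, b)        # median from the inverse CDF
assert mode <= median <= mean
```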
+
Mean, geometric mean and harmonic mean relationship
+
+
It is known from the inequality of arithmetic and geometric means that the geometric mean is lower than the mean. Similarly, the harmonic mean is lower than the geometric mean. The accompanying plot shows that for α = β, both the mean and the median are exactly equal to 1/2, regardless of the value of α = β, and the mode is also equal to 1/2 for α = β > 1. However, the geometric and harmonic means are lower than 1/2, and they only approach this value asymptotically as α = β → ∞.
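This ordering can be illustrated numerically; the sketch below uses the closed forms G_X = exp(ψ(α) − ψ(α + β)) for the geometric mean and (α − 1)/(α + β − 1) for the harmonic mean (valid for α > 1), with SciPy's digamma (an assumption on my part):

```python
# For a symmetric beta (a = b) the mean is exactly 1/2, while the geometric
# and harmonic means lie strictly below it: harmonic < geometric < mean.
import math
from scipy.special import psi

a = b = 3.0
mean = a / (a + b)                      # = 1/2 exactly for a = b
gm = math.exp(psi(a) - psi(a + b))      # geometric mean, exp(psi(a) - psi(a+b))
hm = (a - 1) / (a + b - 1)              # harmonic mean, valid for a > 1
assert mean == 0.5
assert hm < gm < mean
```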
+
+
Kurtosis bounded by the square of the skewness
+
+
As remarked by Feller,[5] in the Pearson system the beta probability density appears as type I (any difference between the beta distribution and Pearson's type I distribution is only superficial and makes no difference for the following discussion regarding the relationship between kurtosis and skewness). Karl Pearson showed, in Plate 1 of his paper[20] published in 1916, a graph with the kurtosis as the vertical axis (ordinate) and the square of the skewness as the horizontal axis (abscissa), in which a number of distributions were displayed.[30] The region occupied by the beta distribution is bounded by the following two lines in the (skewness², kurtosis) plane, or the (skewness², excess kurtosis) plane:
+
+
+
or, equivalently,
+
+
+
At a time when there were no powerful digital computers, Karl Pearson accurately computed further boundaries,[31][20] for example, separating the "U-shaped" from the "J-shaped" distributions. The lower boundary line (excess kurtosis + 2 − skewness² = 0) is produced by skewed "U-shaped" beta distributions with both values of shape parameters α and β close to zero. The upper boundary line (excess kurtosis − (3/2) skewness² = 0) is produced by extremely skewed distributions with very large values of one of the parameters and very small values of the other parameter. Karl Pearson showed[20] that this upper boundary line (excess kurtosis − (3/2) skewness² = 0) is also the intersection with Pearson's distribution III, which has unlimited support in one direction (towards positive infinity), and can be bell-shaped or J-shaped. His son, Egon Pearson, showed[30] that the region (in the kurtosis/squared-skewness plane) occupied by the beta distribution (equivalently, Pearson's distribution I) as it approaches this boundary (excess kurtosis − (3/2) skewness² = 0) is shared with the noncentral chi-squared distribution. Karl Pearson[32] (Pearson 1895, pp. 357, 360, 373–376) also showed that the gamma distribution is a Pearson type III distribution. Hence this boundary line for Pearson's type III distribution is known as the gamma line. (This can be shown from the fact that the excess kurtosis of the gamma distribution is 6/k and the square of the skewness is 4/k, hence (excess kurtosis − (3/2) skewness² = 0) is identically satisfied by the gamma distribution regardless of the value of the parameter "k".)
Pearson later noted that the chi-squared distribution is a special case of Pearson's type III and also shares this boundary line (as is apparent from the fact that for the chi-squared distribution the excess kurtosis is 12/k and the square of the skewness is 8/k, hence (excess kurtosis − (3/2) skewness² = 0) is identically satisfied regardless of the value of the parameter "k"). This is to be expected, since the chi-squared distribution X ~ χ²(k) is a special case of the gamma distribution, with parametrization X ~ Γ(k/2, 1/2) where k is a positive integer that specifies the "number of degrees of freedom" of the chi-squared distribution.
+
An example of a beta distribution near the upper boundary (excess kurtosis − (3/2) skewness² = 0) is given by α = 0.1, β = 1000, for which the ratio (excess kurtosis)/(skewness²) = 1.49835 approaches the upper limit of 1.5 from below. An example of a beta distribution near the lower boundary (excess kurtosis + 2 − skewness² = 0) is given by α = 0.0001, β = 0.1, for which values the expression (excess kurtosis + 2)/(skewness²) = 1.01621 approaches the lower limit of 1 from above. In the infinitesimal limit for both α and β approaching zero symmetrically, the excess kurtosis reaches its minimum value of −2. This minimum value occurs at the point at which the lower boundary line intersects the vertical axis (ordinate). (However, in Pearson's original chart, the ordinate is kurtosis, instead of excess kurtosis, and it increases downwards rather than upwards.)
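Both boundary examples can be reproduced from the standard closed forms for the squared skewness and excess kurtosis of a beta distribution; the following sketch implements those formulas directly (the helper name is my own):

```python
# Squared skewness and excess kurtosis of Beta(a, b), used to check the two
# boundary examples: the ratio approaches 3/2 near the upper ("gamma line")
# boundary and (excess kurtosis + 2)/skewness^2 approaches 1 near the lower
# ("impossible") boundary.
def skew2_exkurt(a, b):
    s2 = 4 * (b - a) ** 2 * (a + b + 1) / ((a + b + 2) ** 2 * a * b)
    k = (6 * ((a - b) ** 2 * (a + b + 1) - a * b * (a + b + 2))
         / (a * b * (a + b + 2) * (a + b + 3)))
    return s2, k

s2_up, k_up = skew2_exkurt(0.1, 1000)        # near the upper boundary
assert abs(k_up / s2_up - 1.49835) < 1e-3
s2_lo, k_lo = skew2_exkurt(0.0001, 0.1)      # near the lower boundary
assert abs((k_lo + 2) / s2_lo - 1.01621) < 1e-3
```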
+
Values for the skewness and excess kurtosis below the lower boundary (excess kurtosis + 2 − skewness² = 0) cannot occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the "impossible region". The boundary for this "impossible region" is determined by (symmetric or skewed) bimodal "U"-shaped distributions for which the parameters α and β approach zero and hence all the probability density is concentrated at the ends: x = 0, 1 with practically nothing in between them. Since for α ≈ β ≈ 0 the probability density is concentrated at the two ends x = 0 and x = 1, this "impossible boundary" is determined by a Bernoulli distribution, where the only two possible outcomes occur with respective probabilities p and q = 1 − p. For cases approaching this limit boundary with symmetry α = β, skewness ≈ 0, excess kurtosis ≈ −2 (this is the lowest excess kurtosis possible for any distribution), and the probabilities are p ≈ q ≈ 1/2. For cases approaching this limit boundary with skewness, excess kurtosis ≈ −2 + skewness², and the probability density is concentrated more at one end than the other end (with practically nothing in between), with probabilities at the left end x = 0 and at the right end x = 1.
+
Geometric means: although each is individually asymmetric, the following symmetry applies between the geometric mean based on X and the geometric mean based on its reflection (1 − X):
+
+
Harmonic means: although each is individually asymmetric, the following symmetry applies between the harmonic mean based on X and the harmonic mean based on its reflection (1 − X):
+
+
Variance symmetry
+
+
Geometric variances: although each is individually asymmetric, the following symmetry applies between the log geometric variance based on X and the log geometric variance based on its reflection (1 − X):
For certain values of the shape parameters α and β, the probability density function has inflection points, at which the curvature changes sign. The position of these inflection points can be useful as a measure of the dispersion or spread of the distribution.
+
Defining the following quantity:
+
+
+
Points of inflection occur,[1][7][8][19] depending on the value of the shape parameters α and β, as follows:
+
+
(α > 2, β > 2) The distribution is bell-shaped (symmetric for α = β and skewed otherwise), with two inflection points, equidistant from the mode:
+
+
(α = 2, β > 2) The distribution is unimodal, positively skewed, right-tailed, with one inflection point, located to the right of the mode:
+
+
(α > 2, β = 2) The distribution is unimodal, negatively skewed, left-tailed, with one inflection point, located to the left of the mode:
+
+
(1 < α < 2, β > 2, α+β>2) The distribution is unimodal, positively skewed, right-tailed, with one inflection point, located to the right of the mode:
+
+
(0 < α < 1, 1 < β < 2) The distribution has a mode at the left end x = 0 and it is positively skewed, right-tailed. There is one inflection point, located to the right of the mode:
+
+
(α > 2, 1 < β < 2) The distribution is unimodal negatively skewed, left-tailed, with one inflection point, located to the left of the mode:
+
+
(1 < α < 2, 0 < β < 1) The distribution has a mode at the right end x = 1 and it is negatively skewed, left-tailed. There is one inflection point, located to the left of the mode:
+
+
There are no inflection points in the remaining (symmetric and skewed) regions: U-shaped (α, β < 1); upside-down-U-shaped (1 < α < 2, 1 < β < 2); reverse-J-shaped (α < 1, β > 2); or J-shaped (α > 2, β < 1).
+
The accompanying plots show the inflection point locations (shown vertically, ranging from 0 to 1) versus α and β (the horizontal axes ranging from 0 to 5). There are large cuts at surfaces intersecting the lines α = 1, β = 1, α = 2, and β = 2 because at these values the beta distribution changes from two modes, to one mode, to no mode.
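The inflection points can also be located numerically; the sketch below (using NumPy/SciPy, with an illustrative grid resolution of my choosing) finds sign changes of a finite-difference second derivative of the Beta(3, 3) density and compares them with the closed-form offset (1/(α + β − 2))·sqrt((α − 1)(β − 1)/(α + β − 3)) from the mode:

```python
# Locate inflection points of the Beta(3, 3) density by finding sign changes
# of a numerical second derivative, and compare with the closed form.
import numpy as np
from scipy.stats import beta

a, b = 3.0, 3.0
x = np.linspace(0.01, 0.99, 9999)
f = beta.pdf(x, a, b)
d2 = np.gradient(np.gradient(f, x), x)            # numerical f''(x)
crossings = x[np.where(np.diff(np.sign(d2)) != 0)[0]]  # sign changes of f''

half_width = np.sqrt((a - 1) * (b - 1) / (a + b - 3)) / (a + b - 2)
expected = np.array([0.5 - half_width, 0.5 + half_width])  # symmetric about the mode 1/2
assert len(crossings) == 2
assert np.allclose(crossings, expected, atol=1e-2)
```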
+
The beta density function can take a wide variety of different shapes depending on the values of the two parameters α and β. The ability of the beta distribution to take this great diversity of shapes (using only two parameters) is partly responsible for its wide application in modeling actual measurements:
+
α = β → 0 is a 2-point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1 and zero probability everywhere else. A coin toss: one face of the coin being x = 0 and the other face being x = 1.
+
+
a lower value than this is impossible for any distribution to reach.
α = β > 2 is bell-shaped, with inflection points located to either side of the mode
+
0 < var(X) < 1/20
+
−6/7 < excess kurtosis(X) < 0
+
α = β → ∞ is a 1-point Degenerate distribution with a Dirac delta function spike at the midpoint x = 1/2 with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the single point x = 1/2.
+
The density function is skewed. An interchange of parameter values yields the mirror image (the reverse) of the initial curve; some more specific cases:
+
+
α < 1, β < 1
+
U-shaped
+
Positive skew for α < β, negative skew for α > β.
+
bimodal: left mode = 0, right mode = 1, anti-mode =
If then min + X(max − min) ~ PERT(min, max, m, λ) where PERT denotes a PERT distribution used in PERT analysis, and m = most likely value.[36] Traditionally,[37] λ = 4 in PERT analysis.
Beta(1, 1) ~ U(0, 1) with density 1 on that interval.
+
Beta(n, 1) ~ Maximum of n independent rvs. with U(0, 1) distribution, sometimes called a standard power function distribution, with density nx^(n−1) on that interval.
+
Beta(1, n) ~ Minimum of n independent rvs. with U(0, 1) distribution, with density n(1 − x)^(n−1) on that interval.
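These two order-statistic identities follow from the CDFs x^n (maximum) and 1 − (1 − x)^n (minimum); the sketch below checks them against SciPy's beta CDF (SciPy being an assumption on my part):

```python
# The maximum of n iid U(0,1) variables has CDF x^n, i.e. it is Beta(n, 1);
# the minimum has CDF 1 - (1-x)^n, i.e. it is Beta(1, n).
from scipy.stats import beta

n, x = 5, 0.7
assert abs(beta.cdf(x, n, 1) - x ** n) < 1e-12
assert abs(beta.cdf(x, 1, n) - (1 - (1 - x) ** n)) < 1e-12
```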
Beta(1/2, 1/2) is equivalent to the arcsine distribution. This distribution is also the Jeffreys prior probability for the Bernoulli and binomial distributions. The arcsine probability density appears in several fundamental random-walk theorems. In a fair coin toss random walk, the probability for the time of the last visit to the origin is distributed as a (U-shaped) arcsine distribution.[5][11] In a two-player fair-coin-toss game, a player is said to be in the lead if the random walk (that started at the origin) is above the origin. The most probable number of times that a given player will be in the lead, in a game of length 2N, is not N. On the contrary, N is the least likely number of times that the player will be in the lead. The most likely number of times in the lead is 0 or 2N (following the arcsine distribution).
For large , the normal distribution. More precisely, if then converges in distribution to a normal distribution with mean 0 and variance as n increases.
The Pearson type I distribution is identical to the beta distribution (except for arbitrary shifting and re-scaling that can also be accomplished with the four parameter parametrization of the beta distribution).
Two unknown parameters (of a beta distribution supported on the [0, 1] interval) can be estimated, using the method of moments, with the first two moments (sample mean and sample variance) as follows. Let:
+
When the distribution is required over a known interval other than [0, 1] with random variable X, say [a, c] with random variable Y, then replace with and with in the above couple of equations for the shape parameters (see the "Alternative parametrizations, four parameters" section below),[40] where:
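On the standard [0, 1] support, the method-of-moments inversion has the well-known closed form α̂ = m(m(1 − m)/v − 1), β̂ = (1 − m)(m(1 − m)/v − 1), valid when v < m(1 − m); the sketch below (helper name my own) verifies it by a round trip through the population moments:

```python
# Two-parameter method of moments on [0, 1]: given sample mean m and sample
# variance v (with v < m(1-m)), return the estimated shape parameters.
def beta_mom(m, v):
    assert 0 < v < m * (1 - m), "variance must be below m(1-m)"
    c = m * (1 - m) / v - 1
    return m * c, (1 - m) * c

# Round-trip check using the population mean and variance of Beta(2, 5):
a, b = 2.0, 5.0
m = a / (a + b)
v = a * b / ((a + b) ** 2 * (a + b + 1))
a_hat, b_hat = beta_mom(m, v)
assert abs(a_hat - a) < 1e-12 and abs(b_hat - b) < 1e-12
```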
+
All four parameters (of a beta distribution supported on the [a, c] interval; see section "Alternative parametrizations, Four parameters") can be estimated, using the method of moments developed by Karl Pearson, by equating sample and population values of the first four central moments (mean, variance, skewness and excess kurtosis).[1][41][42] The excess kurtosis was expressed in terms of the square of the skewness, and the sample size ν = α + β (see previous section "Kurtosis") as follows:
+
+
+
One can use this equation to solve for the sample size ν = α + β in terms of the square of the skewness and the excess kurtosis as follows:[41]
+
+
+
+
This is the ratio (multiplied by a factor of 3) between the previously derived limit boundaries for the beta distribution in a space (as originally done by Karl Pearson[20]) defined with coordinates of the square of the skewness in one axis and the excess kurtosis in the other axis (see § Kurtosis bounded by the square of the skewness):
+
The case of zero skewness can be immediately solved, because for zero skewness α = β, and hence ν = 2α = 2β; therefore α = β = ν/2.
+
+
+
+
(The excess kurtosis is negative for the beta distribution with zero skewness, ranging from −2 to 0, so that ν (and therefore the sample shape parameters) is positive, ranging from zero when the shape parameters approach zero and the excess kurtosis approaches −2, to infinity when the shape parameters approach infinity and the excess kurtosis approaches zero.)
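The zero-skewness inversion can be sketched directly, using the fact (consistent with the ranges just stated) that a symmetric beta has excess kurtosis −6/(ν + 3), so ν = −3(excess kurtosis + 2)/(excess kurtosis); the helper name is my own:

```python
# Zero-skewness method of moments: recover a = b = nu/2 from the excess
# kurtosis alone, using excess kurtosis = -6/(nu + 3) for a symmetric beta.
def symmetric_beta_from_kurtosis(excess_kurtosis):
    assert -2 < excess_kurtosis < 0, "symmetric beta excess kurtosis is in (-2, 0)"
    nu = -3 * (excess_kurtosis + 2) / excess_kurtosis
    return nu / 2, nu / 2

# Beta(2, 2) has excess kurtosis -6/7; the inversion recovers a = b = 2:
a_hat, b_hat = symmetric_beta_from_kurtosis(-6 / 7)
assert abs(a_hat - 2) < 1e-12 and abs(b_hat - 2) < 1e-12
```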
+
For non-zero sample skewness one needs to solve a system of two coupled equations. Since the skewness and the excess kurtosis are independent of the parameters , the parameters can be uniquely determined from the sample skewness and the sample excess kurtosis, by solving the coupled equations with two known variables (sample skewness and sample excess kurtosis) and two unknowns (the shape parameters):
+
Where one should take the solutions as follows: for (negative) sample skewness < 0, and for (positive) sample skewness > 0.
+
The accompanying plot shows these two solutions as surfaces in a space with horizontal axes of (sample excess kurtosis) and (sample squared skewness), and the shape parameters as the vertical axis. The surfaces are constrained by the condition that the sample excess kurtosis must be bounded by the sample squared skewness as stipulated in the above equation. The two surfaces meet at the right edge defined by zero skewness. Along this right edge, both parameters are equal and the distribution is symmetric: U-shaped for α = β < 1, uniform for α = β = 1, upside-down-U-shaped for 1 < α = β < 2, and bell-shaped for α = β > 2. The surfaces also meet at the front (lower) edge defined by the "impossible boundary" line (excess kurtosis + 2 − skewness² = 0). Along this front (lower) boundary both shape parameters approach zero, and the probability density is concentrated more at one end than the other end (with practically nothing in between), with probabilities at the left end x = 0 and at the right end x = 1. The two surfaces become further apart towards the rear edge. At this rear edge the shape parameters are quite different from each other. As remarked, for example, by Bowman and Shenton,[43] sampling in the neighborhood of the line (sample excess kurtosis − (3/2)(sample skewness)² = 0) (the just-J-shaped portion of the rear edge where blue meets beige) "is dangerously near to chaos", because at that line the denominator of the expression above for the estimate ν = α + β becomes zero, and hence ν approaches infinity as that line is approached. Bowman and Shenton[43] write that "the higher moment parameters (kurtosis and skewness) are extremely fragile (near that line). However, the mean and standard deviation are fairly reliable." Therefore, the problem arises for four-parameter estimation from very skewed distributions such that the excess kurtosis approaches (3/2) times the square of the skewness.
This boundary line is produced by extremely skewed distributions with very large values of one of the parameters and very small values of the other parameter. See § Kurtosis bounded by the square of the skewness for a numerical example and further comments about this rear edge boundary line (sample excess kurtosis − (3/2)(sample skewness)² = 0). As remarked by Karl Pearson himself,[44] this issue may not be of much practical importance, as this trouble arises only for very skewed J-shaped (or mirror-image J-shaped) distributions with very different values of the shape parameters that are unlikely to occur much in practice. The usual skewed-bell-shape distributions that occur in practice do not have this parameter estimation problem.
+
The remaining two parameters can be determined from the sample mean and the sample variance using a variety of equations.[1][41] One alternative is to calculate the support interval range based on the sample variance and the sample kurtosis. For this purpose one can solve, in terms of the range , the equation expressing the excess kurtosis in terms of the sample variance and the sample size ν (see § Kurtosis and § Alternative parametrizations, four parameters):
+
+
+
to obtain:
+
+
+
Another alternative is to calculate the support interval range based on the sample variance and the sample skewness.[41] For this purpose one can solve, in terms of the range , the equation expressing the squared skewness in terms of the sample variance, and the sample size ν (see section titled "Skewness" and "Alternative parametrizations, four parameters"):
+
The remaining parameter can be determined from the sample mean and the previously obtained parameters:
+
+
+
and finally, .
+
In the above formulas one may take, for example, as estimates of the sample moments:
+
+
+
The estimators G1 for sample skewness and G2 for sample kurtosis are used by DAP/SAS, PSPP/SPSS, and Excel. However, they are not used by BMDP and (according to [45]) they were not used by MINITAB in 1998. Actually, Joanes and Gill in their 1998 study[45] concluded that the skewness and kurtosis estimators used in BMDP and in MINITAB (at that time) had smaller variance and mean-squared error in normal samples, but the skewness and kurtosis estimators used in DAP/SAS, PSPP/SPSS, namely G1 and G2, had smaller mean-squared error in samples from a very skewed distribution. It is for this reason that we have spelled out "sample skewness", etc., in the above formulas, to make it explicit that the user should choose the best estimator according to the problem at hand, as the best estimator for skewness and kurtosis depends on the amount of skewness (as shown by Joanes and Gill[45]).
+
As is also the case for maximum likelihood estimates for the gamma distribution, the maximum likelihood estimates for the beta distribution do not have a general closed form solution for arbitrary values of the shape parameters. If X1, ..., XN are independent random variables each having a beta distribution, the joint log likelihood function for N iid observations is:
+
+
+
Finding the maximum with respect to a shape parameter involves taking the partial derivative with respect to the shape parameter and setting the expression equal to zero, yielding the maximum likelihood estimator of the shape parameters:
+
To ensure that the values with zero tangent slope are indeed a maximum (instead of a saddle point or a minimum), one must also satisfy the condition that the curvature is negative. This amounts to satisfying that the second partial derivative with respect to the shape parameters is negative:
+
+
+
+
using the previous equations, this is equivalent to:
+
These conditions are equivalent to stating that the variances of the logarithmically transformed variables are positive, since:
+
+
+
+
Therefore, the condition of negative curvature at a maximum is equivalent to the statements:
+
+
+
+
Alternatively, the condition of negative curvature at a maximum is also equivalent to stating that the following logarithmic derivatives of the geometric means GX and G(1−X) are positive, since:
+
+
+
+
While these slopes are indeed positive, the other slopes are negative:
+
+
+
The slopes of the mean and the median with respect to α and β display similar sign behavior.
+
From the condition that at a maximum, the partial derivative with respect to the shape parameter equals zero, we obtain the following system of coupled maximum likelihood estimate equations (for the average log-likelihoods) that needs to be inverted to obtain the (unknown) shape parameter estimates in terms of the (known) average of logarithms of the samples X1, ..., XN:[1]
+
+
+
where we recognize as the logarithm of the sample geometric mean and as the logarithm of the sample geometric mean based on (1 − X), the mirror-image of X. For , it follows that .
+
+
+
These coupled equations containing digamma functions of the shape parameter estimates must be solved by numerical methods, as done, for example, by Beckman et al.[46] Gnanadesikan et al. give numerical solutions for a few cases.[47] N. L. Johnson and S. Kotz[1] suggest that for "not too small" shape parameter estimates, the logarithmic approximation to the digamma function may be used to obtain initial values for an iterative solution, since the equations resulting from this approximation can be solved exactly:
+
+
+
+
which leads to the following solution for the initial values (of the estimate shape parameters in terms of the sample geometric means) for an iterative solution:
+
+
+
+
Alternatively, the estimates provided by the method of moments can instead be used as initial values for an iterative solution of the maximum likelihood coupled equations in terms of the digamma functions.
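A numerical solution of the coupled equations can be sketched as follows, using SciPy's digamma function and a generic root finder rather than the specific iterations cited above (both the helper name and the starting point α = β = 1 are my own illustrative choices); the consistency check feeds in the exact log geometric means of a known beta distribution:

```python
# Solve the coupled ML equations psi(a) - psi(a+b) = ln G_X and
# psi(b) - psi(a+b) = ln G_(1-X) numerically for the shape parameters.
import numpy as np
from scipy.special import psi
from scipy.optimize import fsolve

def beta_mle_from_geometric_means(ln_gx, ln_g1mx):
    def equations(logp):
        a, b = np.exp(logp)          # work in log space so a, b stay positive
        return [psi(a) - psi(a + b) - ln_gx,
                psi(b) - psi(a + b) - ln_g1mx]
    return np.exp(fsolve(equations, [0.0, 0.0]))  # start at a = b = 1

# Consistency check: the exact log geometric means of Beta(2, 5) should
# be inverted back to the shape parameters a = 2, b = 5.
a, b = 2.0, 5.0
ln_gx, ln_g1mx = psi(a) - psi(a + b), psi(b) - psi(a + b)
a_hat, b_hat = beta_mle_from_geometric_means(ln_gx, ln_g1mx)
assert np.allclose([a_hat, b_hat], [a, b], atol=1e-6)
```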
+
When the distribution is required over a known interval other than [0, 1] with random variable X, say [a, c] with random variable Y, then replace ln(Xi) in the first equation with
+
+
+
and replace ln(1−Xi) in the second equation with
+
+
+
(see "Alternative parametrizations, four parameters" section below).
+
If one of the shape parameters is known, the problem is considerably simplified. The following logit transformation can be used to solve for the unknown shape parameter (for skewed cases such that , otherwise, if symmetric, both (equal) parameters are known when one is known):
+
+
+
This logit transformation is the logarithm of the transformation that divides the variable X by its mirror image, X/(1 − X), resulting in the "inverted beta distribution" or beta prime distribution (also known as the beta distribution of the second kind or Pearson's Type VI) with support [0, +∞). As previously discussed in the section "Moments of logarithmically transformed random variables," the logit transformation , studied by Johnson,[24] extends the finite support [0, 1] based on the original variable X to infinite support in both directions of the real line (−∞, +∞).
+
If, for example, is known, the unknown parameter can be obtained in terms of the inverse[48] digamma function of the right hand side of this equation:
+
+
+
+
In particular, if one of the shape parameters has a value of unity, for example for (the power function distribution with bounded support [0,1]), using the identity ψ(x + 1) = ψ(x) + 1/x in the equation , the maximum likelihood estimator for the unknown parameter is,[1] exactly:
+
+
+
The beta has support [0, 1], therefore , and hence , and therefore
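The closed-form case β = 1 (the power function distribution, with density αx^(α−1)) can be sketched directly: the log likelihood N ln α + (α − 1)Σ ln Xi is maximized at α̂ = −N/Σ ln Xi. The sample values below are hypothetical, chosen only for illustration:

```python
# Closed-form ML estimate for Beta(a, 1), the power function distribution:
# a_hat = -N / sum(ln X_i). The sample here is hypothetical.
import numpy as np

x = np.array([0.9, 0.8, 0.95, 0.7, 0.85])   # hypothetical observations on (0, 1)
a_hat = -len(x) / np.sum(np.log(x))
assert a_hat > 0                             # the estimate is always positive
```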
+
In conclusion, the maximum likelihood estimates of the shape parameters of a beta distribution are (in general) a complicated function of the sample geometric mean and of the sample geometric mean based on (1 − X), the mirror image of X. One may ask: if the variance (in addition to the mean) is necessary to estimate two shape parameters with the method of moments, why is the (logarithmic or geometric) variance not necessary to estimate two shape parameters with the maximum likelihood method, for which only the geometric means suffice? The answer is that the mean does not provide as much information as the geometric mean. For a beta distribution with equal shape parameters α = β, the mean is exactly 1/2, regardless of the value of the shape parameters, and therefore regardless of the value of the statistical dispersion (the variance). On the other hand, the geometric mean of a beta distribution with equal shape parameters α = β depends on the value of the shape parameters, and therefore it contains more information. Also, the geometric mean of a beta distribution does not satisfy the symmetry conditions satisfied by the mean; therefore, by employing both the geometric mean based on X and the geometric mean based on (1 − X), the maximum likelihood method is able to provide best estimates for both parameters α = β, without need of employing the variance.
+
One can express the joint log likelihood per N iid observations in terms of the sufficient statistics (the sample geometric means) as follows:
+
+
+
We can plot the joint log likelihood per N observations for fixed values of the sample geometric means to see the behavior of the likelihood function as a function of the shape parameters α and β. In such a plot, the shape parameter estimators correspond to the maxima of the likelihood function. See the accompanying graph that shows that all the likelihood functions intersect at α = β = 1, which corresponds to the values of the shape parameters that give the maximum entropy (the maximum entropy occurs for shape parameters equal to unity: the uniform distribution). It is evident from the plot that the likelihood function gives sharp peaks for values of the shape parameter estimators close to zero, but that for values of the shape parameters estimators greater than one, the likelihood function becomes quite flat, with less defined peaks. Obviously, the maximum likelihood parameter estimation method for the beta distribution becomes less acceptable for larger values of the shape parameter estimators, as the uncertainty in the peak definition increases with the value of the shape parameter estimators. One can arrive at the same conclusion by noticing that the expression for the curvature of the likelihood function is in terms of the geometric variances
+
+
+
+
These variances (and therefore the curvatures) are much larger for small values of the shape parameters α and β. However, for shape parameter values α, β > 1, the variances (and therefore the curvatures) flatten out. Equivalently, this result follows from the Cramér–Rao bound, since the Fisher information matrix components for the beta distribution are these logarithmic variances. The Cramér–Rao bound states that the variance of any unbiased estimator of α is bounded by the reciprocal of the Fisher information:
+
+
+
+
so the variance of the estimators increases with increasing α and β, as the logarithmic variances decrease.
+
Also one can express the joint log likelihood per N iid observations in terms of the digamma function expressions for the logarithms of the sample geometric means as follows:
+
$$\frac{1}{N}\ln\mathcal{L}(\alpha,\beta\mid X) = (\alpha-1)\left(\psi(\hat{\alpha})-\psi(\hat{\alpha}+\hat{\beta})\right) + (\beta-1)\left(\psi(\hat{\beta})-\psi(\hat{\alpha}+\hat{\beta})\right) - \ln \mathrm{B}(\alpha,\beta)$$
+
This expression is identical to the negative of the cross-entropy (see section on "Quantities of information (entropy)"). Therefore, finding the maximum of the joint log likelihood of the shape parameters, per N iid observations, is identical to finding the minimum of the cross-entropy for the beta distribution, as a function of the shape parameters.
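The per-observation log likelihood is straightforward to evaluate with math.lgamma, and its maximizer over a set of candidates recovers the true shape parameters. A minimal sketch; the sample size, seed, and candidate grid are our choices for illustration:

```python
import math
import random

def loglik_per_n(a, b, xs):
    # Joint log likelihood per observation for Beta(a, b):
    # (a-1)*mean(ln x) + (b-1)*mean(ln(1-x)) - ln B(a, b),
    # with ln B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b).
    m1 = sum(math.log(x) for x in xs) / len(xs)
    m2 = sum(math.log1p(-x) for x in xs) / len(xs)
    ln_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * m1 + (b - 1.0) * m2 - ln_beta

random.seed(42)
sample = [random.betavariate(2.0, 5.0) for _ in range(20000)]

# The true shape parameters should (approximately) maximize the
# per-observation log likelihood, i.e. minimize the cross-entropy.
candidates = [(1.0, 1.0), (1.0, 5.0), (2.0, 3.0), (2.0, 5.0), (2.0, 7.0), (3.0, 5.0)]
best = max(candidates, key=lambda ab: loglik_per_n(ab[0], ab[1], sample))
```

With a sample this large, the expected log-likelihood gap between the true parameters and any other candidate (a Kullback–Leibler divergence) dominates the sampling noise.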
+
The procedure is similar to the one followed in the two unknown parameter case. If Y1, ..., YN are independent random variables each having a beta distribution with four parameters, the joint log likelihood function for N iid observations is:
+
+
+
Finding the maximum with respect to a shape parameter involves taking the partial derivative with respect to that shape parameter and setting the expression equal to zero, yielding the maximum likelihood estimator of the shape parameters:
+
+
+
+
+
+
These equations can be re-arranged as the following system of four coupled equations (the first two in terms of geometric means and the last two in terms of harmonic means) for the maximum likelihood estimates of the four parameters:
+
+
+
+
+
+
with sample geometric means:
+
+
+
+
The parameters are embedded inside the geometric mean expressions in a nonlinear way (to the power 1/N). This precludes, in general, a closed form solution, even for an initial value approximation for iteration purposes. One alternative is to use as initial values for iteration the values obtained from the method of moments solution for the four parameter case. Furthermore, the expressions for the harmonic means are well-defined only for shape parameter estimates greater than one, which precludes a maximum likelihood solution for shape parameters less than unity in the four-parameter case. Fisher's information matrix for the four parameter case is positive-definite only for α, β > 2 (for further discussion, see section on Fisher information matrix, four parameter case), that is, for bell-shaped (symmetric or unsymmetric) beta distributions with inflection points located to either side of the mode. The following Fisher information components (that represent the expectations of the curvature of the log likelihood function) have singularities at the following values:
+
+
+
+
+
+
(for further discussion see section on Fisher information matrix). Thus, it is not possible to strictly carry out maximum likelihood estimation for some well-known distributions belonging to the four-parameter beta distribution family, like the uniform distribution (Beta(1, 1, a, c)) and the arcsine distribution (Beta(1/2, 1/2, a, c)). N. L. Johnson and S. Kotz[1] ignore the equations for the harmonic means and instead suggest "If a and c are unknown, and maximum likelihood estimators of a, c, α and β are required, the above procedure (for the two unknown parameter case, with X transformed as X = (Y − a)/(c − a)) can be repeated using a succession of trial values of a and c, until the pair (a, c) for which maximum likelihood (given a and c) is as great as possible, is attained" (where, for the purpose of clarity, their notation for the parameters has been translated into the present notation).
+
Let a random variable X have a probability density f(x;α). The partial derivative with respect to the (unknown, and to be estimated) parameter α of the log likelihood function is called the score. The second moment of the score is called the Fisher information:
+
$$\mathcal{I}(\alpha) = \operatorname{E}\left[\left(\frac{\partial}{\partial \alpha} \ln \mathcal{L}(\alpha \mid X)\right)^{\!2}\right]$$
+
The expectation of the score is zero, therefore the Fisher information is also the second moment centered on the mean of the score: the variance of the score.
+
If the log likelihood function is twice differentiable with respect to the parameter α, and under certain regularity conditions,[49] then the Fisher information may also be written as follows (which is often a more convenient form for calculation purposes):
+
$$\mathcal{I}(\alpha) = -\operatorname{E}\left[\frac{\partial^2}{\partial \alpha^2} \ln \mathcal{L}(\alpha \mid X)\right]$$
+
Thus, the Fisher information is the negative of the expectation of the second derivative with respect to the parameter α of the log likelihood function. Therefore, Fisher information is a measure of the curvature of the log likelihood function of α. A low curvature (and therefore high radius of curvature), flatter log likelihood function curve has low Fisher information; while a log likelihood function curve with large curvature (and therefore low radius of curvature) has high Fisher information. When the Fisher information matrix is evaluated at the estimates of the parameters ("the observed Fisher information matrix") it is equivalent to the replacement of the true log likelihood surface by a Taylor series approximation, taken as far as the quadratic terms.[50] The word information, in the context of Fisher information, refers to information about the parameters, in matters such as estimation, sufficiency, and the properties of variances of estimators. The Cramér–Rao bound states that the inverse of the Fisher information is a lower bound on the variance of any unbiased estimator of a parameter α:
+
$$\operatorname{var}[\hat{\alpha}] \ge \frac{1}{\mathcal{I}(\alpha)}$$
+
The precision to which one can estimate a parameter α is limited by the Fisher information of the log likelihood function. The Fisher information is a measure of the minimum error involved in estimating a parameter of a distribution, and it can be viewed as a measure of the resolving power of an experiment needed to discriminate between two alternative hypotheses of a parameter.[51]
+
Under certain regularity conditions,[49] the Fisher information matrix may also be written in the following form, which is often more convenient for computation:
+
$$\mathcal{I}_{i,j}(\theta) = -\operatorname{E}\left[\frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \ln \mathcal{L}(\theta \mid X)\right]$$
+
With X1, ..., XN iid random variables, an N-dimensional "box" can be constructed with sides X1, ..., XN. Costa and Cover[52] show that the (Shannon) differential entropy h(X) is related to the volume of the typical set (having the sample entropy close to the true entropy), while the Fisher information is related to the surface area of the typical set.
+
For X1, ..., XN independent random variables each having a beta distribution parametrized with shape parameters α and β, the joint log likelihood function for N iid observations is:
+
$$\ln \mathcal{L}(\alpha, \beta \mid X) = (\alpha - 1) \sum_{i=1}^N \ln X_i + (\beta - 1) \sum_{i=1}^N \ln(1 - X_i) - N \ln \mathrm{B}(\alpha, \beta)$$
+
therefore the joint log likelihood function per N iid observations is:
+
$$\frac{1}{N} \ln \mathcal{L}(\alpha, \beta \mid X) = (\alpha - 1) \frac{1}{N} \sum_{i=1}^N \ln X_i + (\beta - 1) \frac{1}{N} \sum_{i=1}^N \ln(1 - X_i) - \ln \mathrm{B}(\alpha, \beta)$$
+
For the two parameter case, the Fisher information matrix has 4 components: 2 diagonal and 2 off-diagonal. Since the Fisher information matrix is symmetric, the two off-diagonal components are equal, so only one of them is independent. Therefore, the Fisher information matrix has 3 independent components (2 diagonal and 1 off-diagonal).
+
Aryal and Nadarajah[53] calculated Fisher's information matrix for the four-parameter case, from which the two parameter case can be obtained as follows:
+
$$\mathcal{I}_{\alpha,\alpha} = \operatorname{var}[\ln X], \qquad \mathcal{I}_{\beta,\beta} = \operatorname{var}[\ln(1-X)]$$
+
Since the Fisher information matrix is symmetric
+
$$\mathcal{I}_{\alpha,\beta} = \mathcal{I}_{\beta,\alpha} = \operatorname{cov}[\ln X, \ln(1-X)]$$
+
The Fisher information components are equal to the log geometric variances and log geometric covariance. Therefore, they can be expressed as trigamma functions, denoted ψ1(α), the second of the polygamma functions, defined as the derivative of the digamma function:
+
$$\mathcal{I}_{\alpha,\alpha} = \psi_1(\alpha) - \psi_1(\alpha+\beta), \qquad \mathcal{I}_{\beta,\beta} = \psi_1(\beta) - \psi_1(\alpha+\beta), \qquad \mathcal{I}_{\alpha,\beta} = \mathcal{I}_{\beta,\alpha} = -\psi_1(\alpha+\beta)$$
+
These derivatives are also derived in the § Two unknown parameters and plots of the log likelihood function are also shown in that section. § Geometric variance and covariance contains plots and further discussion of the Fisher information matrix components: the log geometric variances and log geometric covariance as a function of the shape parameters α and β. § Moments of logarithmically transformed random variables contains formulas for moments of logarithmically transformed random variables. Images for the Fisher information components and are shown in § Geometric variance.
+
The determinant of Fisher's information matrix is of interest (for example for the calculation of Jeffreys prior probability). From the expressions for the individual components of the Fisher information matrix, it follows that the determinant of Fisher's (symmetric) information matrix for the beta distribution is:
+
$$\det(\mathcal{I}(\alpha,\beta)) = \mathcal{I}_{\alpha,\alpha}\mathcal{I}_{\beta,\beta} - \mathcal{I}_{\alpha,\beta}\mathcal{I}_{\beta,\alpha} = \psi_1(\alpha)\,\psi_1(\beta) - \left(\psi_1(\alpha) + \psi_1(\beta)\right)\psi_1(\alpha+\beta)$$
+
From Sylvester's criterion (checking whether the leading principal minors are all positive), it follows that the Fisher information matrix for the two parameter case is positive-definite (under the standard condition that the shape parameters are positive, α > 0 and β > 0).
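The 2×2 matrix, its determinant, and the positive-definiteness claim are easy to check numerically from the trigamma expressions above. A minimal sketch with a hand-rolled trigamma (recurrence plus asymptotic series); the implementation and function names are ours:

```python
def trigamma(x):
    # psi_1 via the recurrence psi_1(x) = psi_1(x + 1) + 1/x^2,
    # then an asymptotic series once the argument is large enough.
    acc = 0.0
    while x < 10.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return acc + inv + 0.5 * inv2 + inv2 * inv * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))

def fisher_matrix(a, b):
    # Two-parameter Fisher information matrix in trigamma form.
    c = trigamma(a + b)
    return [[trigamma(a) - c, -c],
            [-c, trigamma(b) - c]]

def fisher_det(a, b):
    # Closed form: psi1(a) psi1(b) - (psi1(a) + psi1(b)) psi1(a + b).
    return trigamma(a) * trigamma(b) - (trigamma(a) + trigamma(b)) * trigamma(a + b)

m = fisher_matrix(2.0, 3.0)
# Sylvester's criterion for a symmetric 2x2 matrix: both leading
# principal minors (the top-left entry and the determinant) positive.
positive_definite = m[0][0] > 0 and (m[0][0] * m[1][1] - m[0][1] * m[1][0]) > 0
```

The closed-form determinant agrees with the direct 2×2 determinant, and both leading minors are positive for any positive shape parameters tried.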
+
If Y1, ..., YN are independent random variables each having a beta distribution with four parameters: the exponents α and β, and also a (the minimum of the distribution range), and c (the maximum of the distribution range) (section titled "Alternative parametrizations", "Four parameters"), with probability density function:
+
$$f(y; \alpha, \beta, a, c) = \frac{(y-a)^{\alpha-1}(c-y)^{\beta-1}}{(c-a)^{\alpha+\beta-1}\, \mathrm{B}(\alpha,\beta)}, \qquad a \le y \le c$$
+
the joint log likelihood function per N iid observations is:
+
$$\frac{1}{N}\ln\mathcal{L}(\alpha,\beta,a,c \mid Y) = (\alpha-1)\frac{1}{N}\sum_{i=1}^N \ln(Y_i - a) + (\beta-1)\frac{1}{N}\sum_{i=1}^N \ln(c - Y_i) - \ln \mathrm{B}(\alpha,\beta) - (\alpha+\beta-1)\ln(c-a)$$
+
For the four parameter case, the Fisher information matrix has 4 × 4 = 16 components, of which 12 are off-diagonal. Since the Fisher information matrix is symmetric, half of the off-diagonal components (12/2 = 6) are independent. Therefore, the Fisher information matrix has 6 independent off-diagonal + 4 diagonal = 10 independent components. Aryal and Nadarajah[53] calculated Fisher's information matrix for the four parameter case as follows:
+
+
+
+
+
In the above expressions, the use of X instead of Y in the expressions var[ln(X)] = ln(varGX) is not an error. The expressions in terms of the log geometric variances and log geometric covariance occur as functions of the two parameter X ~ Beta(α, β) parametrization because when taking the partial derivatives with respect to the exponents (α, β) in the four parameter case, one obtains the identical expressions as for the two parameter case: these terms of the four parameter Fisher information matrix are independent of the minimum a and maximum c of the distribution's range. The only non-zero term upon double differentiation of the log likelihood function with respect to the exponents α and β is the second derivative of the log of the beta function: ln(B(α, β)). This term is independent of the minimum a and maximum c of the distribution's range. Double differentiation of this term results in trigamma functions. The sections titled "Maximum likelihood", "Two unknown parameters" and "Four unknown parameters" also show this fact.
+
The Fisher information for N i.i.d. samples is N times the individual Fisher information (eq. 11.279, page 394 of Cover and Thomas[27]). (Aryal and Nadarajah[53] take a single observation, N = 1, to calculate the following components of the Fisher information, which leads to the same result as considering the derivatives of the log likelihood per N observations. Moreover, below the erroneous expression for in Aryal and Nadarajah has been corrected.)
+
+
+
The lower two diagonal entries of the Fisher information matrix, with respect to the parameter a (the minimum of the distribution's range): , and with respect to the parameter c (the maximum of the distribution's range): are only defined for exponents α > 2 and β > 2 respectively. The Fisher information matrix component for the minimum a approaches infinity for exponent α approaching 2 from above, and the Fisher information matrix component for the maximum c approaches infinity for exponent β approaching 2 from above.
+
The Fisher information matrix for the four parameter case does not depend on the individual values of the minimum a and the maximum c, but only on the total range (c − a). Moreover, the components of the Fisher information matrix that depend on the range (c − a), depend only through its inverse (or the square of the inverse), such that the Fisher information decreases for increasing range (c − a).
+
The accompanying images show the Fisher information components and . Images for the Fisher information components and are shown in § Geometric variance. All these Fisher information components look like a basin, with the "walls" of the basin being located at low values of the parameters.
+
The following four-parameter-beta-distribution Fisher information components can be expressed in terms of the two-parameter: X ~ Beta(α, β) expectations of the transformed ratio ((1 − X)/X) and of its mirror image (X/(1 − X)), scaled by the range (c − a), which may be helpful for interpretation:
+
+
+
+
These are also the expected values of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI) [1] and its mirror image, scaled by the range (c − a).
+
Also, the following Fisher information components can be expressed in terms of the harmonic (1/X) variances or of variances based on the ratio-transformed variables ((1 − X)/X) as follows:
+
+
+
See section "Moments of linearly transformed, product and inverted random variables" for these expectations.
+
The determinant of Fisher's information matrix is of interest (for example for the calculation of Jeffreys prior probability). From the expressions for the individual components, it follows that the determinant of Fisher's (symmetric) information matrix for the beta distribution with four parameters is:
+
+
+
Using Sylvester's criterion (checking whether the leading principal minors are all positive), and since the diagonal components and have singularities at α = 2 and β = 2, it follows that the Fisher information matrix for the four parameter case is positive-definite for α > 2 and β > 2. Since for α > 2 and β > 2 the beta distribution is (symmetric or unsymmetric) bell shaped, it follows that the Fisher information matrix is positive-definite only for bell-shaped (symmetric or unsymmetric) beta distributions, with inflection points located to either side of the mode. Thus, important well-known distributions belonging to the four-parameter beta distribution family, like the parabolic distribution (Beta(2, 2, a, c)) and the uniform distribution (Beta(1, 1, a, c)), have Fisher information components that blow up (approach infinity) in the four-parameter case (although their Fisher information components are all defined for the two parameter case). The four-parameter Wigner semicircle distribution (Beta(3/2, 3/2, a, c)) and arcsine distribution (Beta(1/2, 1/2, a, c)) have negative Fisher information determinants for the four-parameter case.
+
Examples of beta distributions used as prior probabilities to represent ignorance of prior parameter values in Bayesian inference are Beta(1,1), Beta(0,0) and Beta(1/2,1/2).
+
A classic application of the beta distribution is the rule of succession, introduced in the 18th century by Pierre-Simon Laplace[54] in the course of treating the sunrise problem. It states that, given s successes in n conditionally independent Bernoulli trials with probability p, the estimate of the expected value in the next trial is (s + 1)/(n + 2). This estimate is the expected value of the posterior distribution over p, namely Beta(s + 1, n − s + 1), which is given by Bayes' rule if one assumes a uniform prior probability over p (i.e., Beta(1, 1)) and then observes that p generated s successes in n trials. Laplace's rule of succession has been criticized by prominent scientists. R. T. Cox described Laplace's application of the rule of succession to the sunrise problem ([55] p. 89) as "a travesty of the proper use of the principle". Keynes remarks ([56] Ch. XXX, p. 382) "indeed this is so foolish a theorem that to entertain it is discreditable". Karl Pearson[57] showed that the probability that the next (n + 1) trials will be successes, after n successes in n trials, is only 50%, which has been considered too low by scientists like Jeffreys and unacceptable as a representation of the scientific process of experimentation to test a proposed scientific law. As pointed out by Jeffreys ([58] p. 128) (crediting C. D. Broad[59]), Laplace's rule of succession establishes a high probability of success ((n + 1)/(n + 2)) in the next trial, but only a moderate probability (50%) that a further sample (n + 1) comparable in size will be equally successful. As pointed out by Perks,[60] "The rule of succession itself is hard to accept. It assigns a probability to the next trial which implies the assumption that the actual run observed is an average run and that we are always at the end of an average run. It would, one would think, be more reasonable to assume that we were in the middle of an average run. Clearly a higher value for both probabilities is necessary if they are to accord with reasonable belief." These problems with Laplace's rule of succession motivated Haldane, Perks, Jeffreys and others to search for other forms of prior probability (see the next § Bayesian inference). According to Jaynes,[51] the main problem with the rule of succession is that it is not valid when s = 0 or s = n (see rule of succession for an analysis of its validity).
+
The beta distribution achieves maximum differential entropy for Beta(1,1): the uniform probability density, for which all values in the domain of the distribution have equal density. This uniform distribution Beta(1,1) was suggested ("with a great deal of doubt") by Thomas Bayes[61] as the prior probability distribution to express ignorance about the correct prior distribution. This prior distribution was adopted (apparently, from his writings, with little sign of doubt[54]) by Pierre-Simon Laplace, and hence it was also known as the "Bayes-Laplace rule" or the "Laplace rule" of "inverse probability" in publications of the first half of the 20th century. In the later part of the 19th century and early part of the 20th century, scientists realized that the assumption of uniform "equal" probability density depended on the actual functions (for example whether a linear or a logarithmic scale was most appropriate) and parametrizations used. In particular, the behavior near the ends of distributions with finite support (for example near x = 0, for a distribution with initial support at x = 0) required particular attention. Keynes ([56] Ch.XXX, p. 381) criticized the use of Bayes's uniform prior probability (Beta(1,1)) that all values between zero and one are equiprobable, as follows: "Thus experience, if it shows anything, shows that there is a very marked clustering of statistical ratios in the neighborhoods of zero and unity, of those for positive theories and for correlations between positive qualities in the neighborhood of zero, and of those for negative theories and for correlations between negative qualities in the neighborhood of unity. "
+
The Beta(0,0) distribution was proposed by J.B.S. Haldane,[62] who suggested that the prior probability representing complete uncertainty should be proportional to p−1(1−p)−1. The function p−1(1−p)−1 can be viewed as the limit of the numerator of the beta distribution as both shape parameters approach zero: α, β → 0. The Beta function (in the denominator of the beta distribution) approaches infinity for both parameters approaching zero, α, β → 0. Therefore, p−1(1−p)−1 divided by the Beta function approaches a 2-point Bernoulli distribution with equal probability 1/2 at each end, at 0 and 1, and nothing in between, as α, β → 0. A coin-toss: one face of the coin being at 0 and the other face being at 1. The Haldane prior probability distribution Beta(0,0) is an "improper prior" because its integration (from 0 to 1) fails to strictly converge to 1 due to the singularities at each end. However, this is not an issue for computing posterior probabilities unless the sample size is very small. Furthermore, Zellner[63] points out that on the log-odds scale (the logit transformation ln(p/(1 − p))), the Haldane prior is the uniformly flat prior. The fact that a uniform prior probability on the logit transformed variable ln(p/(1 − p)) (with domain (−∞, ∞)) is equivalent to the Haldane prior on the domain [0, 1] was pointed out by Harold Jeffreys in the first edition (1939) of his book Theory of Probability ([58] p. 123). Jeffreys writes "Certainly if we take the Bayes-Laplace rule right up to the extremes we are led to results that do not correspond to anybody's way of thinking. The (Haldane) rule dx/(x(1−x)) goes too far the other way. It would lead to the conclusion that if a sample is of one type with respect to some property there is a probability 1 that the whole population is of that type." The fact that "uniform" depends on the parametrization led Jeffreys to seek a form of prior that would be invariant under different parametrizations.
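Zellner's observation can be verified with a one-line change of variables; this short derivation fills in the step and is a sketch, not part of the source:

```latex
% Flat prior on the log-odds t = ln(p/(1-p)) pulled back to p:
t = \ln\frac{p}{1-p}
\quad\Longrightarrow\quad
\frac{dt}{dp} = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)},
\qquad
f_P(p) = f_T(t)\left|\frac{dt}{dp}\right| \propto \frac{1}{p(1-p)} = p^{-1}(1-p)^{-1}
```

which is exactly the Haldane kernel p−1(1−p)−1.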
+
+
Jeffreys' prior probability (Beta(1/2,1/2) for a Bernoulli or for a binomial distribution)
Harold Jeffreys[58][64] proposed to use an uninformative prior probability measure that should be invariant under reparameterization: proportional to the square root of the determinant of Fisher's information matrix. For the Bernoulli distribution, this can be shown as follows: for a coin that is "heads" with probability p ∈ [0, 1] and is "tails" with probability 1 − p, for a given (H,T) ∈ {(0,1), (1,0)} the probability is p^H(1 − p)^T. Since T = 1 − H, the Bernoulli distribution is p^H(1 − p)^(1 − H). Considering p as the only parameter, it follows that the log likelihood for the Bernoulli distribution is
+
$$\ln \mathcal{L}(p \mid H) = H \ln p + (1 - H) \ln(1 - p)$$
+
The Fisher information matrix has only one component (it is a scalar, because there is only one parameter: p), therefore:
$$\mathcal{I}(p) = -\operatorname{E}\left[\frac{\partial^2}{\partial p^2} \ln \mathcal{L}(p \mid H)\right] = \frac{1}{p(1-p)}, \qquad \sqrt{\mathcal{I}(p)} \propto p^{-1/2}(1-p)^{-1/2}$$
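The Fisher information 1/(p(1 − p)) for the Bernoulli case can be confirmed numerically by averaging a finite-difference second derivative of the log likelihood over H ~ Bernoulli(p). A minimal sketch; the step size and helper names are our choices:

```python
import math

def neg_log_lik_2nd_deriv(p, h, eps=1e-4):
    # Central finite difference of -d^2/dp^2 [h ln p + (1-h) ln(1-p)].
    def ll(q):
        return h * math.log(q) + (1.0 - h) * math.log(1.0 - q)
    return -(ll(p + eps) - 2.0 * ll(p) + ll(p - eps)) / eps ** 2

def fisher_bernoulli(p):
    # Expectation over H ~ Bernoulli(p): H = 1 with probability p,
    # H = 0 with probability 1 - p.
    return p * neg_log_lik_2nd_deriv(p, 1.0) + (1.0 - p) * neg_log_lik_2nd_deriv(p, 0.0)
```

Evaluating fisher_bernoulli at several values of p reproduces 1/(p(1 − p)) to finite-difference accuracy.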
It will be shown in the next section that the normalizing constant for Jeffreys prior is immaterial to the final result because the normalizing constant cancels out in Bayes theorem for the posterior probability. Hence Beta(1/2,1/2) is used as the Jeffreys prior for both Bernoulli and binomial distributions. As shown in the next section, when using this expression as a prior probability times the likelihood in Bayes theorem, the posterior probability turns out to be a beta distribution. It is important to realize, however, that Jeffreys prior is proportional to 1/√(p(1 − p)) for the Bernoulli and binomial distributions, but not for the beta distribution. Jeffreys prior for the beta distribution is given by the square root of the determinant of Fisher's information matrix for the beta distribution, which, as shown in the § Fisher information matrix, is a function of the trigamma function ψ1 of the shape parameters α and β as follows:
+
$$\sqrt{\det(\mathcal{I}(\alpha,\beta))} = \sqrt{\psi_1(\alpha)\,\psi_1(\beta) - \left(\psi_1(\alpha) + \psi_1(\beta)\right)\psi_1(\alpha+\beta)}$$
+
As previously discussed, Jeffreys prior for the Bernoulli and binomial distributions is proportional to the arcsine distribution Beta(1/2,1/2), a one-dimensional curve that looks like a basin as a function of the parameter p of the Bernoulli and binomial distributions. The walls of the basin are formed by p approaching the singularities at the ends p → 0 and p → 1, where Beta(1/2,1/2) approaches infinity. Jeffreys prior for the beta distribution is a 2-dimensional surface (embedded in a three-dimensional space) that looks like a basin with only two of its walls meeting at the corner α = β = 0 (and missing the other two walls) as a function of the shape parameters α and β of the beta distribution. The two adjoining walls of this 2-dimensional surface are formed by the shape parameters α and β approaching the singularities (of the trigamma function) at α, β → 0. It has no walls for α, β → ∞ because in this case the determinant of Fisher's information matrix for the beta distribution approaches zero.
+
It will be shown in the next section that Jeffreys prior probability results in posterior probabilities (when multiplied by the binomial likelihood function) that are intermediate between the posterior probability results of the Haldane and Bayes prior probabilities.
+
Jeffreys prior may be difficult to obtain analytically, and in some cases it does not exist (even for simple distribution functions like the asymmetric triangular distribution). Berger, Bernardo and Sun, in a 2009 paper,[65] defined a reference prior probability distribution that (unlike Jeffreys prior) exists for the asymmetric triangular distribution. They cannot obtain a closed-form expression for their reference prior, but numerical calculations show it to be nearly perfectly fitted by the (proper) prior
+
$$\frac{\theta^{-1/2}(1-\theta)^{-1/2}}{\pi} \sim \operatorname{Beta}\!\left(\tfrac12, \tfrac12\right)$$
+
where θ is the vertex variable for the asymmetric triangular distribution with support [0, 1] (corresponding to the following parameter values in Wikipedia's article on the triangular distribution: vertex c = θ, left end a = 0, and right end b = 1). Berger et al. also give a heuristic argument that Beta(1/2,1/2) could indeed be the exact Berger–Bernardo–Sun reference prior for the asymmetric triangular distribution. Therefore, Beta(1/2,1/2) not only is Jeffreys prior for the Bernoulli and binomial distributions, but also seems to be the Berger–Bernardo–Sun reference prior for the asymmetric triangular distribution (for which the Jeffreys prior does not exist), a distribution used in project management and PERT analysis to describe the cost and duration of project tasks.
+
Clarke and Barron[66] prove that, among continuous positive priors, Jeffreys prior (when it exists) asymptotically maximizes Shannon's mutual information between a sample of size n and the parameter, and therefore Jeffreys prior is the most uninformative prior (measuring information as Shannon information). The proof rests on an examination of the Kullback–Leibler divergence between probability density functions for iid random variables.
+
+
Effect of different prior probability choices on the posterior beta distribution
+
If samples are drawn from the population of a random variable X that result in s successes and f failures in n Bernoulli trials, with n = s + f, then the likelihood function for parameters s and f given x = p (the notation x = p in the expressions below emphasizes that the domain x stands for the value of the parameter p in the binomial distribution) is the following binomial distribution:
+
$$\mathcal{L}(s, f \mid x = p) = \binom{n}{s} x^s (1 - x)^{n - s} = \binom{s + f}{s} x^s (1 - x)^f$$
+
If beliefs about prior probability information are reasonably well approximated by a beta distribution with parameters αPrior and βPrior, then:
+
$$\operatorname{PriorProbability}(x = p; \alpha_{\operatorname{Prior}}, \beta_{\operatorname{Prior}}) = \frac{x^{\alpha_{\operatorname{Prior}} - 1}(1 - x)^{\beta_{\operatorname{Prior}} - 1}}{\mathrm{B}(\alpha_{\operatorname{Prior}}, \beta_{\operatorname{Prior}})}$$
+
According to Bayes' theorem for a continuous event space, the posterior probability is given by the product of the prior probability and the likelihood function (given the evidence s and f = n − s), normalized so that the area under the curve equals one, as follows:
+
The binomial coefficient $\binom{n}{s}$ appears both in the numerator and the denominator of the posterior probability, and it does not depend on the integration variable x, hence it cancels out, and it is irrelevant to the final result. Similarly, the normalizing factor for the prior probability, the beta function B(αPrior, βPrior), cancels out and it is immaterial to the final result. The same posterior probability result can be obtained if one uses an un-normalized prior
+
$$x^{\alpha_{\operatorname{Prior}} - 1}(1 - x)^{\beta_{\operatorname{Prior}} - 1}$$
+
because the normalizing factors all cancel out. Several authors (including Jeffreys himself) thus use an un-normalized prior formula, since the normalization constant cancels out. The numerator of the posterior probability ends up being just the (un-normalized) product of the prior probability and the likelihood function, and the denominator is its integral from zero to one. The beta function in the denominator, B(s + αPrior, n − s + βPrior), appears as a normalization constant to ensure that the total posterior probability integrates to unity.
+
The ratio s/n of the number of successes to the total number of trials is a sufficient statistic in the binomial case, which is relevant for the following results.
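The conjugate update that produces these posteriors is a two-line computation; a minimal sketch, with the observed counts and prior parameters chosen purely for illustration:

```python
def posterior_params(s, n, a_prior, b_prior):
    # Conjugate update: Beta prior x binomial likelihood -> Beta posterior.
    return s + a_prior, (n - s) + b_prior

def beta_mean(a, b):
    # Mean of a Beta(a, b) distribution.
    return a / (a + b)

# Example: 7 successes in 10 trials under the uniform (Bayes) prior Beta(1, 1).
a_post, b_post = posterior_params(7, 10, 1.0, 1.0)
mean = beta_mean(a_post, b_post)   # (7 + 1) / (10 + 2)
```

With s = n and the Beta(1, 1) prior, the same two lines reproduce the rule of succession (n + 1)/(n + 2) discussed above.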
+
For the Bayes' prior probability (Beta(1,1)), the posterior probability is:
+
$$\operatorname{posterior}(x = p \mid s, n - s) = \frac{x^{s}(1 - x)^{n - s}}{\mathrm{B}(s + 1,\, n - s + 1)}$$
+
For the Jeffreys' prior probability (Beta(1/2,1/2)), the posterior probability is:
+
$$\operatorname{posterior}(x = p \mid s, n - s) = \frac{x^{s - 1/2}(1 - x)^{n - s - 1/2}}{\mathrm{B}\!\left(s + \tfrac12,\, n - s + \tfrac12\right)}$$
+
and for the Haldane prior probability (Beta(0,0)), the posterior probability is:
+
$$\operatorname{posterior}(x = p \mid s, n - s) = \frac{x^{s - 1}(1 - x)^{n - s - 1}}{\mathrm{B}(s,\, n - s)}$$
+
From the above expressions it follows that for s/n = 1/2 all three prior probabilities result in the identical location for the posterior probability mean = mode = 1/2. For s/n < 1/2, the means of the posterior probabilities, using the above priors, are ordered as: mean for Bayes prior > mean for Jeffreys prior > mean for Haldane prior. For s/n > 1/2 the order of these inequalities is reversed, such that the Haldane prior probability results in the largest posterior mean. The Haldane prior probability Beta(0,0) results in a posterior probability density with mean (the expected value for the probability of success in the "next" trial) identical to the ratio s/n of the number of successes to the total number of trials. Therefore, the Haldane prior results in a posterior probability with expected value in the next trial equal to the maximum likelihood estimate. The Bayes prior probability Beta(1,1) results in a posterior probability density with mode identical to the ratio s/n (the maximum likelihood estimate).
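The ordering of the posterior means for s/n < 1/2 can be verified directly; a minimal sketch with illustrative counts s = 2, n = 10:

```python
def post_mean(s, n, a_prior, b_prior):
    # Posterior mean under a Beta(a_prior, b_prior) prior,
    # after s successes in n trials.
    return (s + a_prior) / (n + a_prior + b_prior)

s, n = 2, 10   # s/n < 1/2
bayes    = post_mean(s, n, 1.0, 1.0)   # (s + 1) / (n + 2)
jeffreys = post_mean(s, n, 0.5, 0.5)   # (s + 1/2) / (n + 1)
haldane  = post_mean(s, n, 0.0, 0.0)   # s / n: the maximum likelihood estimate
```

For these counts the three means come out ordered Bayes > Jeffreys > Haldane, and the Haldane mean equals s/n exactly.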
+
In the case that 100% of the trials have been successful s = n, the Bayes prior probability Beta(1,1) results in a posterior expected value equal to the rule of succession (n + 1)/(n + 2), while the Haldane prior Beta(0,0) results in a posterior expected value of 1 (absolute certainty of success in the next trial). Jeffreys prior probability results in a posterior expected value equal to (n + 1/2)/(n + 1). Perks[60] (p. 303) points out: "This provides a new rule of succession and expresses a 'reasonable' position to take up, namely, that after an unbroken run of n successes we assume a probability for the next trial equivalent to the assumption that we are about half-way through an average run, i.e. that we expect a failure once in (2n + 2) trials. The Bayes–Laplace rule implies that we are about at the end of an average run or that we expect a failure once in (n + 2) trials. The comparison clearly favours the new result (what is now called Jeffreys prior) from the point of view of 'reasonableness'."
+
Conversely, in the case that 100% of the trials have resulted in failure (s = 0), the Bayes prior probability Beta(1,1) results in a posterior expected value for success in the next trial equal to 1/(n + 2), while the Haldane prior Beta(0,0) results in a posterior expected value of success in the next trial of 0 (absolute certainty of failure in the next trial). Jeffreys prior probability results in a posterior expected value for success in the next trial equal to (1/2)/(n + 1), which Perks[60] (p. 303) points out: "is a much more reasonably remote result than the Bayes-Laplace result 1/(n + 2)".
+
Jaynes[51] questions (for the uniform prior Beta(1,1)) the use of these formulas for the cases s = 0 or s = n because the integrals do not converge (Beta(1,1) is an improper prior for s = 0 or s = n). In practice, the conditions 0 < s < n necessary for a mode to exist between both ends for the Bayes prior are usually met, and therefore the Bayes prior (as long as 0 < s < n) results in a posterior mode located between both ends of the domain.
+
As remarked in the section on the rule of succession, K. Pearson showed that after n successes in n trials the posterior probability (based on the Bayes Beta(1,1) distribution as the prior probability) that the next (n + 1) trials will all be successes is exactly 1/2, whatever the value of n. Based on the Haldane Beta(0,0) distribution as the prior probability, this posterior probability is 1 (absolute certainty that after n successes in n trials the next (n + 1) trials will all be successes). Perks[60] (p. 303) shows that, for what is now known as the Jeffreys prior, this probability is ((n + 1/2)/(n + 1))((n + 3/2)/(n + 2))...((2n + 1/2)/(2n + 1)), which for n = 1, 2, 3 gives 15/24, 315/480, 9009/13440; rapidly approaching a limiting value of 1/√2 ≈ 0.70710678 as n tends to infinity. Perks remarks that what is now known as the Jeffreys prior: "is clearly more 'reasonable' than either the Bayes-Laplace result or the result on the (Haldane) alternative rule rejected by Jeffreys which gives certainty as the probability. It clearly provides a very much better correspondence with the process of induction. Whether it is 'absolutely' reasonable for the purpose, i.e. whether it is yet large enough, without the absurdity of reaching unity, is a matter for others to decide. But it must be realized that the result depends on the assumption of complete indifference and absence of knowledge prior to the sampling experiment."
+
Following are the variances of the posterior distribution obtained with these three prior probability distributions:
+
for the Bayes' prior probability (Beta(1,1)), the posterior variance is:
+
$$\operatorname{var}[p \mid s, n] = \frac{(s + 1)(n - s + 1)}{(n + 2)^2 (n + 3)}$$
+
for the Jeffreys' prior probability (Beta(1/2,1/2)), the posterior variance is:
+
+
+
and for the Haldane prior probability (Beta(0,0)), the posterior variance is:
+
+
+
So, as remarked by Silvey,[49] for large n, the variance is small and hence the posterior distribution is highly concentrated, whereas the assumed prior distribution was very diffuse. This is in accord with what one would hope for, as vague prior knowledge is transformed (through Bayes theorem) into more precise posterior knowledge by an informative experiment. For small n the Haldane Beta(0,0) prior results in the largest posterior variance while the Bayes Beta(1,1) prior results in the most concentrated posterior; the Jeffreys prior Beta(1/2,1/2) results in a posterior variance in between the other two. As n increases, the variance rapidly decreases so that the posterior variance for all three priors converges to approximately the same value (approaching zero variance as n → ∞). Recalling the previous result that the Haldane prior probability Beta(0,0) results in a posterior probability density with mean (the expected value for the probability of success in the "next" trial) identical to the ratio s/n of the number of successes to the total number of trials, it follows from the above expression that the Haldane prior Beta(0,0) also results in a posterior with variance identical to the variance expressed in terms of the maximum likelihood estimate s/n and sample size (in § Variance):

var = μ(1 − μ)/(1 + ν) = (s/n)(1 − s/n)/(1 + n)

with the mean μ = s/n and the sample size ν = n.
+
In Bayesian inference, using a prior distribution Beta(αPrior,βPrior) prior to a binomial distribution is equivalent to adding (αPrior − 1) pseudo-observations of "success" and (βPrior − 1) pseudo-observations of "failure" to the actual number of successes and failures observed, then estimating the parameter p of the binomial distribution by the proportion of successes over both real and pseudo-observations. A uniform prior Beta(1,1) does not add (or subtract) any pseudo-observations since for Beta(1,1) it follows that (αPrior − 1) = 0 and (βPrior − 1) = 0. The Haldane prior Beta(0,0) subtracts one pseudo-observation from each, and the Jeffreys prior Beta(1/2,1/2) subtracts 1/2 pseudo-observation of success and 1/2 pseudo-observation of failure. This subtraction has the effect of smoothing out the posterior distribution. If the proportion of successes is not 50% (s/n ≠ 1/2), values of αPrior and βPrior less than 1 (and therefore negative (αPrior − 1) and (βPrior − 1)) favor sparsity, i.e. distributions where the parameter p is closer to either 0 or 1. In effect, values of αPrior and βPrior between 0 and 1, when operating together, function as a concentration parameter.
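This pseudo-observation bookkeeping can be sketched in a few lines (illustrative C++; the function names are not from any library):

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Posterior shape parameters after observing s successes in n trials, starting
// from a Beta(aPrior, bPrior) prior: Beta(aPrior + s, bPrior + n - s).
std::pair<double, double> posteriorShapes(double aPrior, double bPrior, double s, double n)
{
    return {aPrior + s, bPrior + (n - s)};
}

// Posterior mean. Equivalently: the proportion of successes over real plus
// pseudo-observations, the prior contributing (aPrior - 1) pseudo-successes
// and (bPrior - 1) pseudo-failures.
double posteriorMean(double aPrior, double bPrior, double s, double n)
{
    std::pair<double, double> shapes = posteriorShapes(aPrior, bPrior, s, n);
    return shapes.first / (shapes.first + shapes.second);
}
```

With the Haldane prior the posterior mean reduces to the raw proportion s/n, while the Bayes prior gives (s + 1)/(n + 2), the rule of succession.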
+
The accompanying plots show the posterior probability density functions for sample sizes n ∈ {3,10,50}, successes s ∈ {n/2,n/4} and Beta(αPrior,βPrior) ∈ {Beta(0,0),Beta(1/2,1/2),Beta(1,1)}. Also shown are the cases for n ∈ {4,12,40}, success s ∈ {n/4} and Beta(αPrior,βPrior) ∈ {Beta(0,0),Beta(1/2,1/2),Beta(1,1)}. The first plot shows the symmetric cases, for successes s ∈ {n/2}, with mean = mode = 1/2, and the second plot shows the skewed cases s ∈ {n/4}. The images show that there is little difference between the priors for the posterior with sample size of 50 (characterized by a more pronounced peak near p = 1/2). Significant differences appear for very small sample sizes (in particular for the flatter distribution for the degenerate case of sample size = 3). Therefore, the skewed cases, with successes s ∈ {n/4}, show a larger effect from the choice of prior, at small sample size, than the symmetric cases. For symmetric distributions, the Bayes prior Beta(1,1) results in the most "peaky" and highest posterior distributions and the Haldane prior Beta(0,0) results in the flattest and lowest-peak distribution. The Jeffreys prior Beta(1/2,1/2) lies in between them. For nearly symmetric, not too skewed distributions the effect of the priors is similar. For very small sample size (in this case for a sample size of 3) and skewed distribution (in this example for s ∈ {n/4}) the Haldane prior can result in a reverse-J-shaped distribution with a singularity at the left end.
However, this happens only in degenerate cases (in this example n = 3 and hence s = 3/4 < 1, a degenerate value because s should be greater than unity in order for the posterior of the Haldane prior to have a mode located between the ends, and because s = 3/4 is not an integer number, hence it violates the initial assumption of a binomial distribution for the likelihood) and it is not an issue in generic cases of reasonable sample size (such that the condition 1 < s < n − 1, necessary for a mode to exist between both ends, is fulfilled).
+
In Chapter 12 (p. 385) of his book, Jaynes[51] asserts that the Haldane prior Beta(0,0) describes a prior state of knowledge of complete ignorance, where we are not even sure whether it is physically possible for an experiment to yield either a success or a failure, while the Bayes (uniform) prior Beta(1,1) applies if one knows that both binary outcomes are possible. Jaynes states: "interpret the Bayes-Laplace (Beta(1,1)) prior as describing not a state of complete ignorance, but the state of knowledge in which we have observed one success and one failure...once we have seen at least one success and one failure, then we know that the experiment is a true binary one, in the sense of physical possibility." Jaynes[51] does not specifically discuss the Jeffreys prior Beta(1/2,1/2) (Jaynes's discussion of "Jeffreys prior" on pp. 181 and 423, and in chapter 12 of his book,[51] refers instead to the improper, un-normalized prior "1/p dp" introduced by Jeffreys in the 1939 edition of his book,[58] seven years before he introduced what is now known as Jeffreys' invariant prior: the square root of the determinant of Fisher's information matrix. "1/p" is Jeffreys' (1946) invariant prior for the exponential distribution, not for the Bernoulli or binomial distributions). However, it follows from the above discussion that the Jeffreys Beta(1/2,1/2) prior represents a state of knowledge in between the Haldane Beta(0,0) and Bayes Beta(1,1) priors.
+
Similarly, Karl Pearson in his 1892 book The Grammar of Science[67][68] (p. 144 of the 1900 edition) maintained that the Bayes (Beta(1,1)) uniform prior was not a complete ignorance prior, and that it should be used only when prior information justified the decision to "distribute our ignorance equally". K. Pearson wrote: "Yet the only supposition that we appear to have made is this: that, knowing nothing of nature, routine and anomy (from the Greek ανομία, namely: a- "without", and nomos "law") are to be considered as equally likely to occur. Now we were not really justified in making even this assumption, for it involves a knowledge that we do not possess regarding nature. We use our experience of the constitution and action of coins in general to assert that heads and tails are equally probable, but we have no right to assert before experience that, as we know nothing of nature, routine and breach are equally probable. In our ignorance we ought to consider before experience that nature may consist of all routines, all anomies (normlessness), or a mixture of the two in any proportion whatever, and that all such are equally probable. Which of these constitutions after experience is the most probable must clearly depend on what that experience has been like."
+
If there is sufficient sampling data, and the posterior probability mode is not located at one of the extremes of the domain (x = 0 or x = 1), the three priors of Bayes (Beta(1,1)), Jeffreys (Beta(1/2,1/2)) and Haldane (Beta(0,0)) should yield similar posterior probability densities. Otherwise, as Gelman et al.[69] (p. 65) point out, "if so few data are available that the choice of noninformative prior distribution makes a difference, one should put relevant information into the prior distribution", or as Berger[4] (p. 125) points out, "when different reasonable priors yield substantially different answers, can it be right to state that there is a single answer? Would it not be better to admit that there is scientific uncertainty, with the conclusion depending on prior beliefs?"
+
The beta distribution has an important application in the theory of order statistics. A basic result is that the distribution of the kth smallest of a sample of size n from a continuous uniform distribution has a beta distribution.[38] This result is summarized as:

U₍k₎ ~ Beta(k, n + 1 − k)
In standard logic, propositions are considered to be either true or false. In contradistinction, subjective logic assumes that humans cannot determine with absolute certainty whether a proposition about the real world is absolutely true or false. In subjective logic the a posteriori probability estimates of binary events can be represented by beta distributions.[70]
+
A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to zero. It can typically be visualized as a "brief oscillation" that promptly decays. Wavelets can be used to extract information from many different kinds of data, including – but certainly not limited to – audio signals and images. Thus, wavelets are purposefully crafted to have specific properties that make them useful for signal processing. Wavelets are localized in both time and frequency, whereas the standard Fourier transform is only localized in frequency. Therefore, standard Fourier transforms are only applicable to stationary processes, while wavelets are applicable to non-stationary processes. Continuous wavelets can be constructed based on the beta distribution. Beta wavelets[71] can be viewed as a soft variety of Haar wavelets whose shape is fine-tuned by two shape parameters α and β.
+
where α = μν and β = (1 − μ)ν, with ν = (1 − F)/F; here F is (Wright's) genetic distance between two populations.
+
+
Project management: task cost and schedule modeling
+
The beta distribution can be used to model events which are constrained to take place within an interval defined by a minimum and maximum value. For this reason, the beta distribution — along with the triangular distribution — is used extensively in PERT, critical path method (CPM), Joint Cost Schedule Modeling (JCSM) and other project management/control systems to describe the time to completion and the cost of a task. In project management, shorthand computations are widely used to estimate the mean and standard deviation of the beta distribution:[37]

μ(X) = (a + 4b + c)/6

σ(X) = (c − a)/6
where a is the minimum, c is the maximum, and b is the most likely value (the mode for α > 1 and β > 1).
+
The above estimate for the mean is known as the PERT three-point estimation and it is exact for either of the following values of β (for arbitrary α within these ranges):
+
Otherwise, these can be poor approximations for beta distributions with other values of α and β, exhibiting average errors of 40% in the mean and 549% in the variance.[73][74][75]
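The shorthand PERT formulas quoted above can be wrapped in a small helper (an illustrative sketch; the function names are hypothetical):

```cpp
#include <cassert>

// PERT shorthand estimates for a task on [a, c] with most likely value b
// (the mode when alpha > 1 and beta > 1):
//   mean  ~ (a + 4b + c) / 6
//   sigma ~ (c - a) / 6
double pertMean(double a, double b, double c)
{
    return (a + 4.0 * b + c) / 6.0;
}

double pertSigma(double a, double c)
{
    return (c - a) / 6.0;
}
```

For example, a task with minimum 2, most likely duration 5, and maximum 14 gets mean (2 + 20 + 14)/6 = 6 and standard deviation (14 − 2)/6 = 2.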
+
So one algorithm for generating beta variates is to generate X/(X + Y), where X is a gamma variate with parameters (α, 1) and Y is an independent gamma variate with parameters (β, 1).[76] In fact, here X/(X + Y) and X + Y are independent, and X + Y ~ Gamma(α + β, 1). If Z ~ Gamma(γ, 1) and Z is independent of X and Y, then (X + Y)/(X + Y + Z) ~ Beta(α + β, γ) and (X + Y)/(X + Y + Z) is independent of X/(X + Y). This shows that the product of independent Beta(α, β) and Beta(α + β, γ) random variables is a Beta(α, β + γ) random variable.
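A minimal sketch of the gamma-ratio method, using std::gamma_distribution from the C++ standard library (shape-scale parameterization); the helper names are illustrative:

```cpp
#include <cassert>
#include <cmath>
#include <random>

// One Beta(alpha, beta) draw as X/(X + Y), with X ~ Gamma(alpha, 1)
// and Y ~ Gamma(beta, 1) independent.
double betaVariate(double alpha, double beta, std::mt19937& rng)
{
    std::gamma_distribution<double> gx(alpha, 1.0);
    std::gamma_distribution<double> gy(beta, 1.0);
    double x = gx(rng), y = gy(rng);
    return x / (x + y);
}

// Monte Carlo sample mean over n draws (illustrative check: should be
// close to alpha / (alpha + beta), the mean of Beta(alpha, beta)).
double betaSampleMean(double alpha, double beta, int n, unsigned seed)
{
    std::mt19937 rng(seed);
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += betaVariate(alpha, beta, rng);
    return sum / n;
}
```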
+
Also, the kth order statistic of n uniformly distributed variates is Beta(k, n + 1 − k) distributed,[38] so an alternative if α and β are small integers is to generate α + β − 1 uniform variates and choose the α-th smallest.
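The order-statistic method can be sketched as follows (illustrative C++; std::nth_element performs the partial sort needed to select the α-th smallest value):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// For integer alpha, beta: draw alpha + beta - 1 uniforms on (0,1) and take
// the alpha-th smallest, which is Beta(alpha, beta) distributed.
double betaFromOrderStatistic(int alpha, int beta, std::mt19937& rng)
{
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::vector<double> u(alpha + beta - 1);
    for (double& x : u) x = unif(rng);
    std::nth_element(u.begin(), u.begin() + (alpha - 1), u.end());
    return u[alpha - 1];  // the alpha-th smallest value
}

// Sample mean over n draws (illustrative check against alpha/(alpha+beta)).
double orderStatSampleMean(int alpha, int beta, int n, unsigned seed)
{
    std::mt19937 rng(seed);
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += betaFromOrderStatistic(alpha, beta, rng);
    return sum / n;
}
```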
+
Another way to generate the Beta distribution is by the Pólya urn model. According to this method, one starts with an "urn" containing α "black" balls and β "white" balls and draws uniformly with replacement. At every trial, an additional ball is added matching the color of the last ball drawn. Asymptotically, the proportion of black and white balls will be distributed according to the Beta(α, β) distribution, where each repetition of the experiment produces a different value.
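A sketch of the urn scheme (illustrative C++; the black-ball fraction after many draws is approximately Beta(α, β) distributed, and its expectation is α/(α + β) at every stage, since the fraction is a martingale):

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Pólya urn: start with alpha black and beta white balls; each draw returns
// the ball plus one more of the same color. The black-ball fraction converges
// in distribution to Beta(alpha, beta).
double polyaUrnProportion(int alpha, int beta, int draws, std::mt19937& rng)
{
    double black = alpha, white = beta;
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    for (int i = 0; i < draws; ++i) {
        if (unif(rng) < black / (black + white)) black += 1.0;
        else                                     white += 1.0;
    }
    return black / (black + white);
}

// Average final proportion over many repetitions; its expectation is
// alpha / (alpha + beta) regardless of the number of draws.
double polyaAverageProportion(int alpha, int beta, int runs, int draws, unsigned seed)
{
    std::mt19937 rng(seed);
    double sum = 0.0;
    for (int r = 0; r < runs; ++r) sum += polyaUrnProportion(alpha, beta, draws, rng);
    return sum / runs;
}
```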
+
Normal approximation to the Beta distribution
+
A beta distribution with α ≈ β and α and β ≫ 1 is approximately normal with mean 1/2 and variance 1/(4(2α + 1)). If α ≥ β, the normal approximation can be improved by taking the cube root of the logarithm of the reciprocal of the beta variate.[77]
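The symmetric-case variance identity quoted above is easy to verify numerically (a sketch; the exact Beta(α, β) variance is αβ/((α + β)²(α + β + 1))):

```cpp
#include <cassert>
#include <cmath>

// Exact variance of Beta(a, b): a*b / ((a + b)^2 (a + b + 1)).
double betaVariance(double a, double b)
{
    double s = a + b;
    return a * b / (s * s * (s + 1.0));
}

// For the symmetric case Beta(alpha, alpha) the exact variance reduces to the
// normal-approximation variance quoted above, 1 / (4 (2 alpha + 1)).
double symmetricNormalApproxVariance(double alpha)
{
    return 1.0 / (4.0 * (2.0 * alpha + 1.0));
}
```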
+
Thomas Bayes, in a posthumous paper [61] published in 1763 by Richard Price, obtained a beta distribution as the density of the probability of success in Bernoulli trials (see § Applications, Bayesian inference), but the paper does not analyze any of the moments of the beta distribution or discuss any of its properties.
+
+
+
The first systematic modern discussion of the beta distribution is probably due to Karl Pearson.[78][79] In Pearson's papers[20][32] the beta distribution is couched as a solution of a differential equation: Pearson's Type I distribution, which is essentially identical to the beta distribution except for arbitrary shifting and re-scaling (the beta and Pearson Type I distributions can always be equalized by proper choice of parameters). In fact, in several English books and journal articles in the few decades prior to World War II, it was common to refer to the beta distribution as Pearson's Type I distribution. William P. Elderton in his 1906 monograph "Frequency curves and correlation"[41] further analyzes the beta distribution as Pearson's Type I distribution, including a full discussion of the method of moments for the four-parameter case, and diagrams of (what Elderton describes as) U-shaped, J-shaped, twisted J-shaped, "cocked-hat" shapes, horizontal and angled straight-line cases. Elderton wrote "I am chiefly indebted to Professor Pearson, but the indebtedness is of a kind for which it is impossible to offer formal thanks." Elderton in his 1906 monograph[41] provides an impressive amount of information on the beta distribution, including equations for the origin of the distribution chosen to be the mode, as well as for other Pearson distributions: types I through VII. Elderton also included a number of appendixes, including one appendix ("II") on the beta and gamma functions. In later editions, Elderton added equations for the origin of the distribution chosen to be the mean, and analysis of Pearson distributions VIII through XII.
+
As remarked by Bowman and Shenton,[43] "Fisher and Pearson had a difference of opinion in the approach to (parameter) estimation, in particular relating to (Pearson's method of) moments and (Fisher's method of) maximum likelihood in the case of the Beta distribution." Also according to Bowman and Shenton, "the case of a Type I (beta distribution) model being the center of the controversy was pure serendipity. A more difficult model of 4 parameters would have been hard to find." The long-running public conflict between Fisher and Karl Pearson can be followed in a number of articles in prestigious journals. For example, concerning the estimation of the four parameters for the beta distribution, and Fisher's criticism of Pearson's method of moments as being arbitrary, see Pearson's article "Method of moments and method of maximum likelihood"[44] (published three years after his retirement from University College, London, where his position had been divided between Fisher and Pearson's son Egon), in which Pearson writes "I read (Koshai's paper in the Journal of the Royal Statistical Society, 1933) which as far as I am aware is the only case at present published of the application of Professor Fisher's method. To my astonishment that method depends on first working out the constants of the frequency curve by the (Pearson) Method of Moments and then superposing on it, by what Fisher terms "the Method of Maximum Likelihood" a further approximation to obtain, what he holds, he will thus get, "more efficient values" of the curve constants".
+
David and Edwards's treatise on the history of statistics[80] cites the first modern treatment of the beta distribution, in 1911,[81] using the beta designation that has become standard, due to Corrado Gini, an Italian statistician, demographer, and sociologist, who developed the Gini coefficient. N. L. Johnson and S. Kotz, in their comprehensive and very informative monograph[82] on leading historical personalities in statistical sciences, credit Corrado Gini[83] as "an early Bayesian...who dealt with the problem of eliciting the parameters of an initial Beta distribution, by singling out techniques which anticipated the advent of the so-called empirical Bayes approach."
+
Feller, William (1968). An Introduction to Probability Theory and Its Applications. Vol. 1 (3rd ed.). ISBN 978-0471257080.

Fleming, Philip J.; Wallace, John J. (March 1986). "How not to lie with statistics: the correct way to summarize benchmark results". Communications of the ACM. 29 (3): 218–221.

Oguamanam, D. C. D.; Martin, H. R.; Huissoon, J. P. (1995). "On the application of the beta distribution to gear damage analysis". Applied Acoustics. 45 (3): 247–261. doi:10.1016/0003-682X(95)00001-P.

Kenney, J. F.; Keeping, E. S. (1951). Mathematics of Statistics, Part Two (2nd ed.). D. Van Nostrand Company Inc.

Verdugo Lazo, A. C. G.; Rathie, P. N. (1978). "On the entropy of continuous probability distributions". IEEE Trans. Inf. Theory. 24 (1): 120–122. doi:10.1109/TIT.1978.1055832.

Shannon, Claude E. (1948). "A Mathematical Theory of Communication". Bell System Technical Journal. 27 (4): 623–656. doi:10.1002/j.1538-7305.1948.tb01338.x.

Cover, Thomas M.; Thomas, Joy A. (2006). Elements of Information Theory (2nd ed., Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience. ISBN 978-0471241959.

Herrerías-Velasco, José Manuel; Herrerías-Pleguezuelo, Rafael; van Dorp, Johan René (2011). "Revisiting the PERT mean and variance". European Journal of Operational Research. 210: 448–451.

Malcolm, D. G.; Roseboom, J. H.; Clark, C. E.; Fazar, W. (September–October 1958). "Application of a Technique for Research and Development Program Evaluation". Operations Research. 7 (5): 646–669. doi:10.1287/opre.7.5.646. ISSN 0030-364X.

David, H. A.; Nagaraja, H. N. (2003). Order Statistics (3rd ed.). Wiley, New Jersey. p. 458. ISBN 0-471-38926-9.

Pearson, Karl (June 1936). "Method of moments and method of maximum likelihood". Biometrika. 28 (1/2): 34–59. doi:10.2307/2334123. JSTOR 2334123.

Joanes, D. N.; Gill, C. A. (1998). "Comparing measures of sample skewness and kurtosis". The Statistician. 47 (Part 1): 183–189. doi:10.1111/1467-9884.00122.

Beckman, R. J.; Tietjen, G. L. (1978). "Maximum likelihood estimation for the beta distribution". Journal of Statistical Computation and Simulation. 7 (3–4): 253–258. doi:10.1080/00949657808810232.

Gnanadesikan, R.; Pinkham; Hughes (1967). "Maximum likelihood estimation of the parameters of the beta distribution from smallest order statistics". Technometrics. 9 (4): 607–620. doi:10.2307/1266199. JSTOR 1266199.

Cox, Richard T. (1961). Algebra of Probable Inference. The Johns Hopkins University Press. ISBN 978-0801869822.

Keynes, John Maynard (2010) [1921]. A Treatise on Probability: The Connection Between Philosophy and the History of Science. Wildside Press. ISBN 978-1434406965.

Pearson, Karl (1907). "On the Influence of Past Experience on Future Expectation". Philosophical Magazine. 6 (13): 365–378.

Jeffreys, Harold (1998). Theory of Probability (3rd ed.). Oxford University Press. ISBN 978-0198503682.

Broad, C. D. (October 1918). "On the relation between induction and probability". MIND, A Quarterly Review of Psychology and Philosophy. 27 (New Series) (108): 389–404. doi:10.1093/mind/XXVII.4.389. JSTOR 2249035.

Pearson, Karl (2009). The Grammar of Science. BiblioLife. ISBN 978-1110356119.

Gelman, A.; Carlin, J. B.; Stern, H. S.; Rubin, D. B. (2003). Bayesian Data Analysis. Chapman and Hall/CRC. ISBN 978-1584883883.

de Oliveira, H. M.; Araújo, G. A. A. (2005). "Compactly Supported One-cyclic Wavelets Derived from Beta Distributions". Journal of Communication and Information Systems. 20 (3): 27–33.

Balding, David J.; Nichols, Richard A. (1995). "A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity". Genetica. Springer. 96 (1–2): 3–12. doi:10.1007/BF01441146. PMID 7607457. S2CID 30680826.

Keefer, Donald L.; Verdini, William A. (1993). "Better Estimation of PERT Activity Time Parameters". Management Science. 39 (9): 1086–1091.

Keefer, Donald L.; Bodily, Samuel E. (1983). "Three-point Approximations for Continuous Random Variables". Management Science. 29 (5): 595–609.

David, H. A.; Edwards, A. W. F. (2001). Annotated Readings in the History of Statistics (1st ed.). Springer. ISBN 978-0387988443.

Gini, Corrado (1911). "Considerazioni Sulle Probabilità Posteriori e Applicazioni al Rapporto dei Sessi Nelle Nascite Umane". Studi Economico-Giuridici della Università de Cagliari. Anno III (reproduced in Metron 15, 133, 171, 1949): 5–41.

Johnson, Norman L.; Kotz, Samuel, eds. (1997). Leading Personalities in Statistical Sciences: From the Seventeenth Century to the Present (Wiley Series in Probability and Statistics). Wiley. ISBN 978-0471163817.
+
The Box–Muller transform is commonly expressed in two forms. The basic form as given by Box and Muller takes two samples from the uniform distribution on the interval [0, 1] and maps them to two standard, normally distributed samples. The polar form takes two samples from a different interval, [−1, +1], and maps them to two normally distributed samples without the use of sine or cosine functions.
+
The Box–Muller transform was developed as a more computationally efficient alternative to the inverse transform sampling method.[3] The ziggurat algorithm gives a more efficient method for scalar processors (e.g. old CPUs), while the Box–Muller transform is superior for processors with vector units (e.g. GPUs or modern CPUs).[4]
+
The derivation[5] is based on a property of a two-dimensional Cartesian system: if the X and Y coordinates are described by two independent and normally distributed random variables, then the random variables for R² and Θ (shown above) in the corresponding polar coordinates are also independent and can be expressed as

R² = −2 ln U₁

and

Θ = 2πU₂.
Because R² is the square of the norm of the standard bivariate normal variable (X, Y), it has the chi-squared distribution with two degrees of freedom. In the special case of two degrees of freedom, the chi-squared distribution coincides with the exponential distribution, and the equation for R² above is a simple way of generating the required exponential variate.
+
The polar form was first proposed by J. Bell[6] and then modified by R. Knop.[7] While several different versions of the polar method have been described, the version of R. Knop will be described here because it is the most widely used, in part due to its inclusion in Numerical Recipes. A slightly different form is described as "Algorithm P" by D. Knuth in The Art of Computer Programming.[8]
+
Given u and v, independent and uniformly distributed in the closed interval [−1, +1], set s = R² = u² + v². If s = 0 or s ≥ 1, discard u and v, and try another pair (u, v). Because u and v are uniformly distributed and because only points within the unit circle have been admitted, the values of s will be uniformly distributed in the open interval (0, 1), too. The latter can be seen by calculating the cumulative distribution function for s in the interval (0, 1): this is the area of a circle with radius √s, divided by π. From this we find the probability density function to have the constant value 1 on the interval (0, 1). Equally so, the angle θ divided by 2π is uniformly distributed in the interval [0, 1) and independent of s.
+
We now identify the value of s with that of U₁ and θ/(2π) with that of U₂ in the basic form. As shown in the figure, the values of cos θ and sin θ in the basic form can be replaced with the ratios u/√s and v/√s, respectively. The advantage is that calculating the trigonometric functions directly can be avoided. This is helpful when trigonometric functions are more expensive to compute than the single division that replaces each one.
+
Just as the basic form produces two standard normal deviates, so does this alternate calculation.
+
The polar method differs from the basic method in that it is a type of rejection sampling. It discards some generated random numbers, but can be faster than the basic method because it is simpler to compute (provided that the random number generator is relatively fast) and is more numerically robust.[9] Avoiding the use of expensive trigonometric functions improves speed over the basic form.[6] It discards 1 − π/4 ≈ 21.46% of the total input uniformly distributed random number pairs generated, i.e. discards 4/π − 1 ≈ 27.32% uniformly distributed random number pairs per Gaussian random number pair generated, requiring 4/π ≈ 1.2732 input random numbers per output random number.
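The rejection loop just described can be sketched as follows (illustrative C++ in the spirit of Knop's variant, not a transcription of it):

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <utility>

// Marsaglia polar method: rejection-sample (u, v) in the unit disc;
// s = u*u + v*v plays the role of the uniform for the radius, while
// u/sqrt(s) and v/sqrt(s) replace cos(theta) and sin(theta), so no
// trigonometric function calls are needed.
std::pair<double, double> polarGaussianPair(std::mt19937& rng)
{
    std::uniform_real_distribution<double> unif(-1.0, 1.0);
    double u, v, s;
    do {
        u = unif(rng);
        v = unif(rng);
        s = u * u + v * v;
    } while (s >= 1.0 || s == 0.0);
    const double factor = std::sqrt(-2.0 * std::log(s) / s);
    return {u * factor, v * factor};
}

// Sample mean over 2 * npairs deviates (illustrative check: should be near 0).
double polarSampleMean(int npairs, unsigned seed)
{
    std::mt19937 rng(seed);
    double sum = 0.0;
    for (int i = 0; i < npairs; ++i) {
        std::pair<double, double> z = polarGaussianPair(rng);
        sum += z.first + z.second;
    }
    return sum / (2.0 * npairs);
}
```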
+
The basic form requires two multiplications, 1/2 logarithm, 1/2 square root, and one trigonometric function for each normal variate.[10] On some processors, the cosine and sine of the same argument can be calculated in parallel using a single instruction. Notably for Intel-based machines, one can use the fsincos assembler instruction or the expi instruction (usually available from C as an intrinsic function), to calculate complex

e^(i 2πu₂) = cos(2πu₂) + i sin(2πu₂),

and just separate the real and imaginary parts.
+
Note: To explicitly calculate the complex-polar form, use the following substitutions in the general form. Let r = √(−2 ln u₁) and θ = 2πu₂. Then

z₀ + i z₁ = r e^(iθ) = r cos θ + i r sin θ.
The polar form requires 3/2 multiplications, 1/2 logarithm, 1/2 square root, and 1/2 division for each normal variate. The effect is to replace one multiplication and one trigonometric function with a single division and a conditional loop.
+
When a computer is used to produce a uniform random variable it will inevitably have some inaccuracies because there is a lower bound on how close numbers can be to 0. If the generator uses 32 bits per output value, the smallest non-zero number that can be generated is 2⁻³². When u₁ and u₂ are equal to this, the Box–Muller transform produces a normal random deviate equal to √(−2 ln(2⁻³²)) cos(2π·2⁻³²) ≈ 6.660. This means that the algorithm will not produce random variables more than 6.660 standard deviations from the mean. This corresponds to a proportion of about 2.7 × 10⁻¹¹ lost due to the truncation, computed as 2Φ(−6.660), where Φ is the standard cumulative normal distribution. With 64 bits the limit is pushed to √(−2 ln(2⁻⁶⁴)) ≈ 9.419 standard deviations, for which the lost proportion is about 4.5 × 10⁻²¹.
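The truncation bound is simple arithmetic, since √(−2 ln 2⁻ᵇ) = √(2 · b · ln 2) (a quick check, ignoring the cosine factor, which is essentially 1 at this argument):

```cpp
#include <cassert>
#include <cmath>

// Largest deviate magnitude reachable when the smallest non-zero uniform is
// 2^-bits: sqrt(-2 ln(2^-bits)) = sqrt(2 * bits * ln 2).
double boxMullerMaxDeviate(int bits)
{
    return std::sqrt(2.0 * bits * std::log(2.0));
}
```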
+
The standard Box–Muller transform generates values from the standard normal distribution (i.e. standard normal deviates) with mean 0 and standard deviation 1. The implementation below in standard C++ generates values from any normal distribution with mean μ and variance σ². If Z is a standard normal deviate, then X = Zσ + μ will have a normal distribution with mean μ and standard deviation σ. The random number generator has been seeded to ensure that new, pseudo-random values will be returned from sequential calls to the generateGaussianNoise function.
+
+
#include <cmath>
#include <limits>
#include <random>
#include <utility>

// "mu" is the mean of the distribution, and "sigma" is the standard deviation.
std::pair<double, double> generateGaussianNoise(double mu, double sigma)
{
    constexpr double epsilon = std::numeric_limits<double>::epsilon();
    constexpr double two_pi = 2.0 * M_PI;

    // initialize the random uniform number generator (runif) in a range 0 to 1
    static std::mt19937 rng(std::random_device{}()); // Standard mersenne_twister_engine seeded with rd()
    static std::uniform_real_distribution<> runif(0.0, 1.0);

    // create two random numbers, make sure u1 is greater than epsilon
    double u1, u2;
    do
    {
        u1 = runif(rng);
    } while (u1 <= epsilon);
    u2 = runif(rng);

    // compute z0 and z1
    auto mag = sigma * sqrt(-2.0 * log(u1));
    auto z0 = mag * cos(two_pi * u2) + mu;
    auto z1 = mag * sin(two_pi * u2) + mu;

    return std::make_pair(z0, z1);
}
+
Howes, Lee; Thomas, David (2008). GPU Gems 3 – Efficient Random Number Generation and Application Using CUDA. Pearson Education, Inc. ISBN 978-0-321-51526-1.
With a shape parameter α and an inverse scale parameter β = 1/θ, called a rate parameter.
+
In each of these forms, both parameters are positive real numbers.
+
The gamma distribution is the maximum entropy probability distribution (both with respect to a uniform base measure and with respect to a 1/x base measure) for a random variable X for which E[X] = kθ = α/β is fixed and greater than zero, and E[ln(X)] = ψ(k) + ln(θ) = ψ(α) − ln(β) is fixed (ψ is the digamma function).[1]
+
The parameterization with k and θ appears to be more common in econometrics and other applied fields, where the gamma distribution is frequently used to model waiting times. For instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution. See Hogg and Craig[2] for an explicit motivation.
+
The gamma distribution can be parameterized in terms of a shape parameter α = k and an inverse scale parameter β = 1/θ, called a rate parameter. A random variable X that is gamma-distributed with shape α and rate β is denoted

X ~ Γ(α, β) ≡ Gamma(α, β)
The corresponding probability density function in the shape-rate parameterization is

f(x; α, β) = β^α x^(α−1) e^(−βx) / Γ(α)   for x > 0 and α, β > 0,

where Γ(α) is the gamma function. For all positive integers n, Γ(n) = (n − 1)!.
+
If α is a positive integer (i.e., the distribution is an Erlang distribution), the cumulative distribution function has the following series expansion:[4]

F(x; α, β) = 1 − e^(−βx) Σ_{i=0}^{α−1} (βx)^i / i!
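For integer shape the Erlang series can be summed directly, accumulating the term (βx)^i/i! incrementally (an illustrative sketch):

```cpp
#include <cassert>
#include <cmath>

// CDF of Gamma(alpha, beta) in the shape-rate parameterization, for positive
// integer alpha, via the Erlang series:
//   F(x) = 1 - exp(-beta x) * sum_{i=0}^{alpha-1} (beta x)^i / i!
double erlangCdf(int alpha, double beta, double x)
{
    double term = 1.0;  // current value of (beta x)^i / i!
    double sum = 0.0;
    for (int i = 0; i < alpha; ++i) {
        sum += term;
        term *= beta * x / (i + 1);  // advance to the next series term
    }
    return 1.0 - sum * std::exp(-beta * x);
}
```

For alpha = 1 this reduces to the exponential CDF 1 − e^(−βx).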
Unlike the mode and the mean, which have readily calculable formulas based on the parameters, the median does not have a closed-form equation. The median for this distribution is the value ν such that

(1/(Γ(k) θ^k)) ∫₀^ν x^(k−1) e^(−x/θ) dx = 1/2
A rigorous treatment of the problem of determining an asymptotic expansion and bounds for the median of the gamma distribution was handled first by Chen and Rubin, who proved that (for θ = 1)

k − 1/3 < ν(k) < k,

where μ = k is the mean and ν(k) is the median of the distribution.[5] For other values of the scale parameter, the mean scales to μ = kθ, and the median bounds and approximations would be similarly scaled by θ.
+
K. P. Choi found the first five terms in a Laurent series asymptotic approximation of the median by comparing the median to Ramanujan's θ function.[6] Berg and Pedersen found more terms:[7]
+
+
+
+
+
Partial sums of these series are good approximations for high enough k; they are not plotted in the figure, which is focused on the low-k region that is less well approximated.
+
Berg and Pedersen also proved many properties of the median, showing that it is a convex function of k,[8] and that the asymptotic behavior near k → 0 is ν(k) ≈ e^(−γ) 2^(−1/k) (where γ is the Euler–Mascheroni constant), and that for all k > 0 the median is bounded by k 2^(−1/k) ≤ ν(k) ≤ k e^(−1/(3k)).[7]
+
A closer linear upper bound, for k ≥ 1 only, was provided in 2021 by Gaunt and Merkle,[9] relying on the Berg and Pedersen result that the slope of ν(k) is everywhere less than 1:

ν(k) ≤ k − 1 + log 2   for k ≥ 1 (with equality at k = 1)
+
which can be extended to a bound for all k > 0 by taking the max with the chord shown in the figure, since the median was proved convex.[8]
+
An approximation to the median that is asymptotically accurate at high k and reasonable down to k ≈ 0.5 or a bit lower follows from the Wilson–Hilferty transformation:

ν(k) ≈ k (1 − 1/(9k))³,

which goes negative for k < 1/9.
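The Wilson–Hilferty approximation is a one-liner (a sketch; note that expanding the cube reproduces the Chen–Rubin asymptote k − 1/3 at large k):

```cpp
#include <cassert>
#include <cmath>

// Wilson-Hilferty-based median approximation for Gamma(k, theta = 1):
// median(k) ~ k * (1 - 1/(9k))^3. Accurate for large k; negative for k < 1/9.
double gammaMedianWH(double k)
{
    double t = 1.0 - 1.0 / (9.0 * k);
    return k * t * t * t;
}
```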
+
In 2021, Lyon proposed several approximations of the form ν(k) ≈ 2^(−1/k)(A + Bk). He conjectured values of A and B for which this approximation is an asymptotically tight upper or lower bound for all k > 0.[10] In particular, he proposed these closed-form bounds, which he proved in 2023:[11]

2^(−1/k)(log 2 − 1/3 + k) is a lower bound, asymptotically tight as k → ∞

2^(−1/k)(e^(−γ) + k) is an upper bound, asymptotically tight as k → 0
+
Lyon also showed (informally in 2021, rigorously in 2023) two other lower bounds that are not closed-form expressions, including this one involving the gamma function, based on solving the integral expression substituting 1 for e^(−x):

ν(k) ≥ (Γ(k + 1)/2)^(1/k)   (approaching equality as k → 0)
+
and the tangent line at where the derivative was found to be :
+
Additionally, he showed that interpolations between bounds could provide excellent approximations or tighter bounds to the median, including an approximation that is exact at (where ) and has a maximum relative error less than 0.6%. Interpolated approximations and bounds are all of the form
+
+
+
where is an interpolating function running monotonically from 0 at low to 1 at high , approximating an ideal, or exact, interpolator :
+
+
+
For the simplest interpolating function considered, a first-order rational function
+
+
+
the tightest lower bound has
+
+
+
and the tightest upper bound has
+
+
+
The interpolated bounds are plotted (mostly inside the yellow region) in the log–log plot shown. Even tighter bounds are available using different interpolating functions, but not usually with closed-form parameters like these.[10]
+
If X is an exponential random variable with rate λ, then cX is an exponential random variable with rate λ/c; the same holds for gamma variates (as can be checked using the moment-generating function): multiplication by a positive constant c divides the rate (or, equivalently, multiplies the scale).
+
The Kullback–Leibler divergence (KL-divergence) of Gamma(αp, βp) (the "true" distribution) from Gamma(αq, βq) (the "approximating" distribution) is given by[14]
+
DKL = (αp − αq) ψ(αp) − ln Γ(αp) + ln Γ(αq) + αq (ln βp − ln βq) + αp (βq − βp)/βp
+
Written using the k, θ parameterization, the KL-divergence of Gamma(kp, θp) from Gamma(kq, θq) is given by
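In the shape–rate parameterization the divergence can be evaluated directly; a minimal sketch, with a hand-rolled digamma since the Python standard library lacks one (all names hypothetical):

```python
import math

def digamma(x):
    """Digamma via upward recurrence followed by the asymptotic expansion."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))

def kl_gamma(ap, bp, aq, bq):
    """KL divergence of Gamma(ap, bp) from Gamma(aq, bq), shape-rate form."""
    return ((ap - aq) * digamma(ap)
            - math.lgamma(ap) + math.lgamma(aq)
            + aq * (math.log(bp) - math.log(bq))
            + ap * (bq - bp) / bp)
```

The divergence is zero when the two distributions coincide and positive otherwise.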
+
Let X1, ..., Xn be independent and identically distributed random variables following an exponential distribution with rate parameter λ. Then X1 + ... + Xn ~ Gamma(n, λ), where n is the shape parameter and λ is the rate, and the sample mean satisfies X̄ ~ Gamma(n, nλ).
+
If X ~ Gamma(1, λ) (in the shape–rate parametrization), then X has an exponential distribution with rate parameter λ. In the shape-scale parametrization, X ~ Gamma(1, λ) has an exponential distribution with rate parameter 1/λ.
+
If X ~ Gamma(ν/2, 2) (in the shape–scale parametrization), then X is identical to χ2(ν), the chi-squared distribution with ν degrees of freedom. Conversely, if Q ~ χ2(ν) and c is a positive constant, then cQ ~ Gamma(ν/2, 2c).
+
If θ=1/k, one obtains the Schulz-Zimm distribution, which is most prominently used to model polymer chain lengths.
+
If k is an integer, the gamma distribution is an Erlang distribution and is the probability distribution of the waiting time until the kth "arrival" in a one-dimensional Poisson process with intensity 1/θ.
If X ~ Gamma(k, θ), then log X follows an exponential-gamma (abbreviated exp-gamma) distribution.[15] It is sometimes referred to as the log-gamma distribution.[16] Formulas for its mean and variance are in the section #Logarithmic expectation and variance.
More generally, if X ~ Gamma(k, θ), then X^q for q > 0 follows a generalized gamma distribution with parameters p = 1/q, d = k/q, and scale a = θ^q.
+
If X ~ Gamma(k, θ) with shape k and scale θ, then 1/X ~ Inv-Gamma(k, θ^−1) (see Inverse-gamma distribution for derivation).
+
Parametrization 1 (shape–scale): if Xi ~ Gamma(ki, θ) are independent, then ΣXi ~ Gamma(Σki, θ); that is, shape parameters add at a common scale.
+
Parametrization 2 (shape–rate): if Xi ~ Gamma(αi, β) are independent, then ΣXi ~ Gamma(Σαi, β); that is, shape parameters add at a common rate.
+
If X ~ Gamma(α, θ) and Y ~ Gamma(β, θ) are independently distributed, then X/(X + Y) has a beta distribution with parameters α and β, and X/(X + Y) is independent of X + Y, which is Gamma(α + β, θ)-distributed.
+
If Xi ~ Gamma(αi, 1) are independently distributed, then the vector (X1/S, ..., Xn/S), where S = X1 + ... + Xn, follows a Dirichlet distribution with parameters α1, ..., αn.
+
For large k the gamma distribution converges to a normal distribution with mean μ = kθ and variance σ² = kθ².
The matrix gamma distribution and the Wishart distribution are multivariate generalizations of the gamma distribution (samples are positive-definite matrices rather than positive real numbers).
If the shape parameter of the gamma distribution is known, but the inverse-scale parameter is unknown, then a gamma distribution for the inverse scale forms a conjugate prior. The compound distribution, which results from integrating out the inverse scale, has a closed-form solution known as the compound gamma distribution.[18]
+
If, instead, the shape parameter is known but the mean is unknown, with the prior of the mean being given by another gamma distribution, then it results in K-distribution.
+
The gamma distribution can be expressed as the product distribution of a Weibull distribution and a variant form of the stable count distribution. Its shape parameter k can be regarded as the inverse of Lévy's stability parameter in the stable count distribution:
+
where the first factor is a standard stable count distribution and the second is a standard Weibull distribution of shape k.
+
+
The likelihood function for N iid observations (x1, ..., xN) is
+
+
+
from which we calculate the log-likelihood function
+
+
+
Finding the maximum with respect to θ by taking the derivative and setting it equal to zero yields the maximum likelihood estimator of the θ parameter, which equals the sample mean divided by the shape parameter k:
+
+
+
Substituting this into the log-likelihood function gives
+
+
+
We need at least two samples (N ≥ 2), because for N = 1 the log-likelihood increases without bound as k → ∞. For N ≥ 2, the log-likelihood can be verified to be strictly concave in k, using inequality properties of the polygamma function. Finding the maximum with respect to k by taking the derivative and setting it equal to zero yields
+
+
+
where ψ is the digamma function and the other term is the sample mean of ln(x). There is no closed-form solution for k. The function is numerically very well behaved, so if a numerical solution is desired, it can be found using, for example, Newton's method. An initial value of k can be found either using the method of moments, or using the approximation
+
+
+
If we let
+
s = ln(x̄) − (1/N) Σ ln(xi),
+
then k is approximately
+
k ≈ (3 − s + √((s − 3)² + 24s)) / (12s),
+
which is within 1.5% of the correct value.[19] An explicit form for the Newton–Raphson update of this initial guess is:[20]
+
+
+
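The closed-form initial approximation above is easy to sketch; here it is applied to simulated data (names hypothetical; tolerances in the comments are loose because the data are random):

```python
import math
import random

def gamma_mle_initial(data):
    """Closed-form starting value for the shape k (accurate to ~1.5%),
    with the corresponding scale estimate theta = mean(x) / k."""
    n = len(data)
    mean_x = sum(data) / n
    mean_lnx = sum(math.log(x) for x in data) / n
    s = math.log(mean_x) - mean_lnx
    k = (3.0 - s + math.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)
    return k, mean_x / k

random.seed(42)
sample = [random.gammavariate(3.0, 2.0) for _ in range(5000)]
k_hat, theta_hat = gamma_mle_initial(sample)
# k_hat should be near the true shape 3, theta_hat near the true scale 2.
```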
At the maximum-likelihood estimate (k̂, θ̂), the expected values of x and ln(x) agree with the empirical averages:
+
For data x1, ..., xN represented in a floating-point format that underflows to 0 for values smaller than some threshold, the logarithms needed for the maximum-likelihood estimate will cause failure if any value underflows. If we assume the data were generated by a gamma distribution with cdf F(x; k, θ), then the probability that there is at least one underflow is:
+
+
+
This probability will approach 1 for small and large . For example, at , and , . A workaround is to instead have the data in logarithmic format.
+
In order to test an implementation of a maximum-likelihood estimator that takes logarithmic data as input, it is useful to be able to generate non-underflowing logarithms of random gamma variates, when . Following the implementation in scipy.stats.loggamma, this can be done as follows:[21] sample and independently. Then the required logarithmic sample is , so that .
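A minimal sketch of that sampling trick, using only the standard library (names hypothetical; the identity used is that Gamma(k) is distributed as Gamma(k + 1) · U^(1/k) for U uniform on (0, 1)):

```python
import math
import random

def log_gamma_variate(k, rng=random):
    """Return ln(X) where X ~ Gamma(k, 1), computed without underflow
    even for very small k, via Gamma(k) =d Gamma(k + 1) * U**(1/k)."""
    g = rng.gammavariate(k + 1.0, 1.0)
    u = rng.random()
    return math.log(g) + math.log(u) / k

random.seed(0)
logs = [log_gamma_variate(1.0) for _ in range(20000)]
mean_log = sum(logs) / len(logs)
# For k = 1, E[ln X] is minus the Euler-Mascheroni constant, about -0.5772.
```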
+
Using the sample mean of x, the sample mean of ln(x), and the sample mean of the product x·ln(x) simplifies the expressions to:
+
+
+
+
If the rate parameterization is used, the rate is estimated as the reciprocal of the scale estimate, β̂ = 1/θ̂.
+
These estimators are not strictly maximum likelihood estimators, but are instead referred to as mixed type log-moment estimators. They have, however, efficiency similar to that of the maximum likelihood estimators.
+
Although these estimators are consistent, they have a small bias. A bias-corrected variant of the estimator for the scale θ is
+
θ̃ = θ̂ · N/(N − 1),
+
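The log-moment estimators can be sketched as follows (names hypothetical; the N/(N − 1) factor is the small-sample scale correction discussed above):

```python
import math
import random

def gamma_log_moment_estimates(data):
    """Mixed type log-moment estimators for Gamma(k, theta):
    theta = mean(x ln x) - mean(x) * mean(ln x),  k = mean(x) / theta."""
    n = len(data)
    mean_x = sum(data) / n
    mean_lnx = sum(math.log(x) for x in data) / n
    mean_xlnx = sum(x * math.log(x) for x in data) / n
    theta = mean_xlnx - mean_x * mean_lnx
    k = mean_x / theta
    theta_corrected = theta * n / (n - 1)  # small-sample bias correction
    return k, theta_corrected

random.seed(7)
sample = [random.gammavariate(2.5, 1.5) for _ in range(5000)]
k_hat, theta_hat = gamma_log_moment_estimates(sample)
# Estimates should be near the true parameters k = 2.5, theta = 1.5.
```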
A bias correction for the shape parameter k is given as[23]
+
Consider a sequence of events, with the waiting time for each event being exponentially distributed with rate λ. Then the waiting time for the nth event to occur is gamma-distributed with integer shape α = n. This construction of the gamma distribution allows it to model a wide variety of phenomena where several sub-events, each taking an exponentially distributed time, must happen in sequence for a major event to occur.[25] Examples include the waiting time of cell-division events,[26] the number of compensatory mutations for a given mutation,[27] the waiting time until a repair is necessary for a hydraulic system,[28] and so on.
+
In biophysics, the dwell time between steps of a molecular motor like ATP synthase is nearly exponential at constant ATP concentration, suggesting that each step of the motor takes a single ATP hydrolysis. If a step instead required n ATP hydrolysis events, the dwell time would follow a gamma distribution with shape n.[29]
+
In oncology, the age distribution of cancer incidence often follows the gamma distribution, wherein the shape and scale parameters predict, respectively, the number of driver events and the time interval between them.[32][33]
+
In phylogenetics, the gamma distribution is the most commonly used approach to model among-sites rate variation[39] when maximum likelihood, Bayesian, or distance matrix methods are used to estimate phylogenetic trees. Phylogenetic analyses that use the gamma distribution to model rate variation estimate a single parameter from the data, because they limit consideration to distributions where α = β. This parameterization means that the mean of the distribution is 1 and the variance is 1/α. Maximum likelihood and Bayesian methods typically use a discrete approximation to the continuous gamma distribution.[40][41]
+
Given the scaling property above, it is enough to generate gamma variables with θ = 1, as a draw can later be converted to any scale by multiplying by θ (or to any rate β by dividing).
+
Suppose we wish to generate random variables from Gamma(n + δ, 1), where n is a non-negative integer and 0 < δ < 1. Using the fact that a Gamma(1, 1) distribution is the same as an Exp(1) distribution, and noting the method of generating exponential variables, we conclude that if U is uniformly distributed on (0, 1], then −ln(U) is distributed Gamma(1, 1) (i.e. inverse transform sampling). Now, using the "α-addition" property of gamma distribution, we expand this result:
+
+
+
where Uk are all uniformly distributed on (0, 1] and independent. All that is left now is to generate a variable distributed as Gamma(δ, 1) for 0 < δ < 1 and apply the "α-addition" property once more. This is the most difficult part.
+
Random generation of gamma variates is discussed in detail by Devroye,[42]: 401–428  who notes that none are uniformly fast for all shape parameters. For small values of the shape parameter, the algorithms are often not valid.[42]: 406  For arbitrary values of the shape parameter, one can apply the Ahrens and Dieter[43] modified acceptance–rejection method Algorithm GD (shape k ≥ 1), or a transformation method[44] when 0 < k < 1. Also see Cheng and Feast Algorithm GKM 3[45] or Marsaglia's squeeze method.[46]
+
Generate U, V and W as iid uniform (0, 1] variates.
+
If U ≤ e/(e + δ), then ξ = V^(1/δ) and η = W ξ^(δ−1). Otherwise, ξ = 1 − ln V and η = W e^(−ξ).
+
If η > ξ^(δ−1) e^(−ξ), then go to step 1.
+
ξ is distributed as Γ(δ, 1).
+
A summary of this is
+
+
+
where ⌊k⌋ is the integer part of k, ξ is generated via the algorithm above with δ = {k} (the fractional part of k), and the Uk are all independent.
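Putting the integer-part sum and the rejection step for the fractional part together gives a complete sampler; a sketch under the assumption that the acceptance test is the Ahrens–Dieter one described above (names hypothetical):

```python
import math
import random

def gamma_variate(k, theta=1.0, rng=random):
    """Gamma(k, theta) variate: sum of Exp(1) draws for the integer part of
    the shape, plus an Ahrens-Dieter rejection step for the fractional part."""
    n = int(k)        # integer part of the shape
    delta = k - n     # fractional part, 0 <= delta < 1
    x = -sum(math.log(rng.random()) for _ in range(n)) if n else 0.0
    if delta > 0.0:
        e = math.e
        while True:
            u, v, w = rng.random(), rng.random(), rng.random()
            if u <= e / (e + delta):
                xi = v ** (1.0 / delta)
                eta = w * xi ** (delta - 1.0)
            else:
                xi = 1.0 - math.log(v)
                eta = w * math.exp(-xi)
            if eta <= xi ** (delta - 1.0) * math.exp(-xi):
                break   # accept xi ~ Gamma(delta, 1)
        x += xi
    return x * theta

random.seed(1)
draws = [gamma_variate(2.7) for _ in range(20000)]
mean_draw = sum(draws) / len(draws)   # should be near k * theta = 2.7
```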
+
While the above approach is technically correct, Devroye notes that it is linear in the value of k and generally is not a good choice. Instead, he recommends using either rejection-based or table-based methods, depending on context.[42]: 401–428
+
For example, Marsaglia's simple transformation-rejection method relying on one normal variate X and one uniform variate U:[21]
+
+
Set d = k − 1/3 and c = 1/√(9d).
+
Set v = (1 + cX)^3.
+
If v > 0 and ln U < X²/2 + d − dv + d ln v, return dv; else go back to step 2.
+
For k ≥ 1, this generates a gamma-distributed random number in time that is approximately constant with k. The acceptance rate does depend on k, with an acceptance rate of 0.95, 0.98, and 0.99 for k = 1, 2, and 4. For k < 1, one can use the identity γk = γ(k+1) U^(1/k) to boost k to be usable with this method.
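A sketch of the Marsaglia–Tsang sampler for k ≥ 1 (names hypothetical; this version uses only the logarithmic acceptance test, omitting the optional fast "squeeze" check):

```python
import math
import random

def gamma_mt(k, rng=random):
    """Marsaglia-Tsang transformation-rejection sampler for Gamma(k, 1), k >= 1."""
    d = k - 1.0 / 3.0
    c = 1.0 / math.sqrt(9.0 * d)
    while True:
        x = rng.gauss(0.0, 1.0)
        v = (1.0 + c * x) ** 3
        if v <= 0.0:
            continue                      # reject: cube must be positive
        u = rng.random()
        if math.log(u) < 0.5 * x * x + d * (1.0 - v + math.log(v)):
            return d * v                  # accept d*v ~ Gamma(k, 1)

random.seed(3)
draws = [gamma_mt(4.0) for _ in range(20000)]
mean_draw = sum(draws) / len(draws)   # should be near k = 4
```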
+
^ Hogg, R. V.; Craig, A. T. (1978). Introduction to Mathematical Statistics (4th ed.). New York: Macmillan. Remark 3.3.1. ISBN 0023557109.
+
+
^Gopalan, Prem; Hofman, Jake M.; Blei, David M. (2013). "Scalable Recommendation with Poisson Factorization". arXiv:1311.1704 [cs.IR].
+
+
^ a b Papoulis, A.; Pillai, S. U., Probability, Random Variables, and Stochastic Processes, Fourth Edition.
+
+
^ Jeesen Chen, Herman Rubin, Bounds for the difference between median and mean of gamma and Poisson distributions, Statistics & Probability Letters, Volume 4, Issue 6, October 1986, Pages 281–283, ISSN 0167-7152.
+
^ Gaunt, Robert E.; Merkle, Milan (2021). "On bounds for the mode and median of the generalized hyperbolic and related distributions". Journal of Mathematical Analysis and Applications. 493 (1): 124508. arXiv:2002.01884. doi:10.1016/j.jmaa.2020.124508. S2CID 221103640.
+
^ Mathai, A. M. (1982). "Storage capacity of a dam with gamma type inputs". Annals of the Institute of Statistical Mathematics. 34 (3): 591–597. doi:10.1007/BF02481056. ISSN 0020-3157. S2CID 122537756.
+
+
^ Moschopoulos, P. G. (1985). "The distribution of the sum of independent gamma random variables". Annals of the Institute of Statistical Mathematics. 37 (3): 541–544. doi:10.1007/BF02481123. S2CID 120066454.
+
+
^ W. D. Penny, "KL-Divergences of Normal, Gamma, Dirichlet, and Wishart densities", www.fil.ion.ucl.ac.uk/~wpenny/publications/densities.ps.[full citation needed]
+
^Choi, S. C.; Wette, R. (1969). "Maximum Likelihood Estimation of the Parameters of the Gamma Distribution and Their Bias". Technometrics. 11 (4): 683–690. doi:10.1080/00401706.1969.10490731.
+
+
^ a b Marsaglia, G.; Tsang, W. W. (2000). "A simple method for generating gamma variables". ACM Transactions on Mathematical Software. 26 (3): 363–372. doi:10.1145/358407.358414. S2CID 2634158.
+
^Fink, D. 1995 A Compendium of Conjugate Priors. In progress report: Extension and enhancement of methods for setting data quality objectives. (DOE contract 95‑831).
+
^J. G. Robson and J. B. Troy, "Nature of the maintained discharge of Q, X, and Y retinal ganglion cells of the cat", J. Opt. Soc. Am. A 4, 2301–2307 (1987)
+
+
^M.C.M. Wright, I.M. Winter, J.J. Forster, S. Bleeck "Response to best-frequency tone bursts in the ventral cochlear nucleus is governed by ordered inter-spike interval statistics", Hearing Research 317 (2014)
+
+
^N. Friedman, L. Cai and X. S. Xie (2006) "Linking stochastic dynamics to population distribution: An analytical framework of gene expression", Phys. Rev. Lett. 97, 168302.
+
The parameter μ is the mean or expectation of the distribution (and also its median and mode), while the parameter σ is its standard deviation. The variance of the distribution is σ². A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate.
+
Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.[2][3] Their importance is partly due to the central limit theorem. It states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable—whose distribution converges to a normal distribution as the number of samples increases. Therefore, physical quantities that are expected to be the sum of many independent processes, such as measurement errors, often have distributions that are nearly normal.[4]
+
Moreover, Gaussian distributions have some unique properties that are valuable in analytic studies. For instance, any linear combination of a fixed collection of independent normal deviates is a normal deviate. Many results and methods, such as propagation of uncertainty and least squares[5] parameter fitting, can be derived analytically in explicit form when the relevant variables are normally distributed.
+
A normal distribution is sometimes informally called a bell curve.[6] However, many other distributions are bell-shaped (such as the Cauchy, Student's t, and logistic distributions). For other names, see Naming.
+
The simplest case of a normal distribution is known as the standard normal distribution or unit normal distribution. This is a special case when μ = 0 and σ = 1, and it is described by this probability density function (or density):
+
φ(z) = e^(−z²/2) / √(2π)
+
The variable z has a mean of 0 and a variance and standard deviation of 1. The density has its peak at z = 0 and inflection points at z = +1 and z = −1.
+
Although the density above is most commonly known as the standard normal, a few authors have used that term to describe other versions of the normal distribution. Carl Friedrich Gauss, for example, once defined the standard normal as
+
f(x) = e^(−x²) / √π,
+
which has a variance of 1/2, and Stephen Stigler[7] once defined the standard normal as
+
f(x) = e^(−πx²),
+
which has a simple functional form and a variance of 1/(2π).
+
Every normal distribution is a version of the standard normal distribution, whose domain has been stretched by a factor σ (the standard deviation) and then translated by μ (the mean value):
+
f(x | μ, σ²) = (1/σ) φ((x − μ)/σ)
+
The probability density must be scaled by 1/σ so that the integral is still 1.
+
If Z is a standard normal deviate, then X = σZ + μ will have a normal distribution with expected value μ and standard deviation σ. This is equivalent to saying that the standard normal distribution Z can be scaled/stretched by a factor of σ and shifted by μ to yield a different normal distribution, called X. Conversely, if X is a normal deviate with parameters μ and σ², then this X distribution can be re-scaled and shifted via the formula Z = (X − μ)/σ to convert it to the standard normal distribution. This variate is also called the standardized form of X.
+
The probability density of the standard Gaussian distribution (standard normal distribution, with zero mean and unit variance) is often denoted with the Greek letter φ (phi).[8] The alternative form of the Greek letter phi, ϕ, is also used quite often.
+
The normal distribution is often referred to as N(μ, σ²) or 𝒩(μ, σ²).[9] Thus when a random variable X is normally distributed with mean μ and standard deviation σ, one may write X ~ N(μ, σ²).
+
Some authors advocate using the precision τ as the parameter defining the width of the distribution, instead of the standard deviation σ or the variance σ². The precision is normally defined as the reciprocal of the variance, τ = 1/σ².[10] The formula for the distribution then becomes
+
f(x) = √(τ/(2π)) e^(−τ(x − μ)²/2)
+
This choice is claimed to have advantages in numerical computations when σ is very close to zero, and it simplifies formulas in some contexts, such as in the Bayesian inference of variables with multivariate normal distribution.
+
Alternatively, the reciprocal of the standard deviation, τ′ = 1/σ, might be defined as the precision, in which case the expression of the normal distribution becomes
+
f(x) = (τ′/√(2π)) e^(−τ′²(x − μ)²/2)
+
According to Stigler, this formulation is advantageous because of a much simpler and easier-to-remember formula, and simple approximate formulas for the quantiles of the distribution.
+
Normal distributions form an exponential family with natural parameters θ1 = μ/σ² and θ2 = −1/(2σ²), and natural statistics x and x². The dual expectation parameters for the normal distribution are η1 = μ and η2 = μ² + σ².
+
The related error function erf(x) gives the probability of a random variable with normal distribution of mean 0 and variance 1/2 falling in the range [−x, x]. That is:
+
erf(x) = (2/√π) ∫₀ˣ e^(−t²) dt
+
These integrals cannot be expressed in terms of elementary functions, and are often said to be special functions. However, many numerical approximations are known; see below for more.
+
The two functions are closely related, namely
+
Φ(x) = ½ [1 + erf(x/√2)]
+
For a generic normal distribution with density f, mean μ and deviation σ, the cumulative distribution function is
+
F(x) = Φ((x − μ)/σ) = ½ [1 + erf((x − μ)/(σ√2))]
+
The complement of the standard normal cumulative distribution function, Q(x) = 1 − Φ(x), is often called the Q-function, especially in engineering texts.[11][12] It gives the probability that the value of a standard normal random variable X will exceed x: P(X > x). Other definitions of the Q-function, all of which are simple transformations of Φ, are also used occasionally.[13]
+
The graph of the standard normal cumulative distribution function Φ has 2-fold rotational symmetry around the point (0, 1/2); that is, Φ(−x) = 1 − Φ(x). Its antiderivative (indefinite integral) can be expressed as follows:
+
∫ Φ(x) dx = x Φ(x) + φ(x) + C
+
The cumulative distribution function of the standard normal distribution can be expanded by integration by parts into a series:
+
A quick approximation to the standard normal distribution's cumulative distribution function can be found by using a Taylor series approximation:
+
+
+
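One such expansion integrates the Maclaurin series of the density term by term; a sketch, checked against the erf-based exact value (function name hypothetical):

```python
import math

def phi_cdf_taylor(x, terms=40):
    """Standard normal CDF via the term-by-term integral of the density's
    Maclaurin series:
    Phi(x) = 1/2 + (1/sqrt(2*pi)) * sum_n (-1)^n x^(2n+1) / (2^n n! (2n+1))."""
    total = 0.0
    for n in range(terms):
        total += ((-1) ** n * x ** (2 * n + 1)
                  / (2 ** n * math.factorial(n) * (2 * n + 1)))
    return 0.5 + total / math.sqrt(2.0 * math.pi)

# Compare against the exact erf-based value Phi(1) ~ 0.8413.
exact = 0.5 * (1.0 + math.erf(1.0 / math.sqrt(2.0)))
```

The series converges rapidly for moderate |x|; it is a power series about 0, so it is not suited to the far tails.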
Recursive computation with Taylor series expansion
+
The recursive nature of the family of derivatives may be used to easily construct a rapidly converging Taylor series expansion using recursive entries about any point of known value of the distribution:
+
+
+
where:
+
+
+
+
, for all n ≥ 2.
+
Using the Taylor series and Newton's method for the inverse function
+
An application for the above Taylor series expansion is to use Newton's method to reverse the computation. That is, if we have a value for the cumulative distribution function, Φ(x), but do not know the x needed to obtain it, we can use Newton's method to find x, and use the Taylor series expansion above to minimize the number of computations. Newton's method is ideal for this problem because the first derivative of Φ(x) is simply the standard normal density φ(x), which is readily available to use in the Newton's method solution.
+
To solve, select a known approximate solution, x₀, to the desired Φ(x). x₀ may be a value from a distribution table, or an intelligent estimate followed by a computation of Φ(x₀) using any desired means. Use this value of x₀ and the Taylor series expansion above to minimize computations.
+
Repeat the following process until the difference between the computed Φ(xₙ) and the desired value Φ(x), which we will call the error, is below a chosen acceptably small value, such as 10⁻⁵, 10⁻¹⁵, etc.:
+
+
where
+
+
is the CDF value computed from the Taylor series solution above using the current estimate, and
+
+
When the repeated computations converge to an error below the chosen acceptably small value, x will be the value needed to obtain a Φ(x) of the desired value. If x₀ is a good beginning estimate, convergence should be rapid, with only a small number of iterations needed.[citation needed]
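The iteration can be sketched with the exact Φ from math.erf standing in for the Taylor-series evaluation (function names hypothetical):

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def inv_Phi_newton(p, x0=0.0, tol=1e-12, max_iter=50):
    """Invert the standard normal CDF by Newton's method:
    x_{n+1} = x_n - (Phi(x_n) - p) / phi(x_n)."""
    x = x0
    for _ in range(max_iter):
        err = Phi(x) - p
        if abs(err) < tol:
            break
        x -= err / phi(x)
    return x

x975 = inv_Phi_newton(0.975)   # should be close to 1.95996
```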
+
About 68% of values drawn from a normal distribution are within one standard deviation σ of the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations.[6] This fact is known as the 68–95–99.7 (empirical) rule, or the 3-sigma rule.
+
More precisely, the probability that a normal deviate lies in the range between μ − nσ and μ + nσ is given by
+
P(μ − nσ ≤ X ≤ μ + nσ) = Φ(n) − Φ(−n) = erf(n/√2)
+
To 12 significant digits, the values for are:[citation needed]
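The rule follows from P(|X − μ| ≤ nσ) = erf(n/√2); a sketch:

```python
import math

def prob_within(n):
    """P(|X - mu| <= n*sigma) for a normal random variable."""
    return math.erf(n / math.sqrt(2.0))

p1, p2, p3 = prob_within(1), prob_within(2), prob_within(3)
# Approximately 0.6827, 0.9545, and 0.9973 respectively.
```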
+
The quantile function of a distribution is the inverse of the cumulative distribution function. The quantile function of the standard normal distribution is called the probit function, and can be expressed in terms of the inverse error function:
+
Φ⁻¹(p) = √2 erf⁻¹(2p − 1), for p ∈ (0, 1)
+
For a normal random variable with mean μ and variance σ², the quantile function is
+
F⁻¹(p) = μ + σ Φ⁻¹(p) = μ + σ √2 erf⁻¹(2p − 1), for p ∈ (0, 1)
+
The quantile Φ⁻¹(p) of the standard normal distribution is commonly denoted as z_p. These values are used in hypothesis testing, construction of confidence intervals and Q–Q plots. A normal random variable X will exceed μ + z_pσ with probability 1 − p, and will lie outside the interval μ ± z_pσ with probability 2(1 − p). In particular, the quantile z_0.975 is 1.96; therefore a normal random variable will lie outside the interval μ ± 1.96σ in only 5% of cases.
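These quantiles are available directly in Python's standard library; a sketch:

```python
from statistics import NormalDist

# Standard normal quantile: z_0.975 ~ 1.96
z = NormalDist().inv_cdf(0.975)

# Quantile of a general normal distribution: mu + sigma * z
x = NormalDist(mu=10, sigma=2).inv_cdf(0.975)
```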
+
The following table gives the quantile z_p such that X will lie in the range μ ± z_pσ with a specified probability p. These values are useful to determine tolerance intervals for sample averages and other statistical estimators with normal (or asymptotically normal) distributions.[citation needed] Note that the table shows the two-sided quantile Φ⁻¹((1 + p)/2), not Φ⁻¹(p) as defined above.
+
The normal distribution is the only distribution whose cumulants beyond the first two (i.e., other than the mean and variance) are zero. It is also the continuous distribution with the maximum entropy for a specified mean and variance.[15][16] Geary has shown, assuming that the mean and variance are finite, that the normal distribution is the only distribution where the mean and variance calculated from a set of independent draws are independent of each other.[17][18]
+
The normal distribution is a subclass of the elliptical distributions. The normal distribution is symmetric about its mean, and is non-zero over the entire real line. As such it may not be a suitable model for variables that are inherently positive or strongly skewed, such as the weight of a person or the price of a share. Such variables may be better described by other distributions, such as the log-normal distribution or the Pareto distribution.
+
The value of the normal distribution is practically zero when the value lies more than a few standard deviations away from the mean (e.g., a spread of three standard deviations covers all but 0.27% of the total distribution). Therefore, it may not be an appropriate model when one expects a significant fraction of outliers—values that lie many standard deviations away from the mean—and least squares and other statistical inference methods that are optimal for normally distributed variables often become highly unreliable when applied to such data. In those cases, a more heavy-tailed distribution should be assumed and the appropriate robust statistical inference methods applied.
+
The Gaussian distribution belongs to the family of stable distributions, which are the attractors of sums of independent, identically distributed random variables, whether or not the mean or variance is finite. Except for the Gaussian, which is a limiting case, all stable distributions have heavy tails and infinite variance. The Gaussian is one of the few stable distributions that have probability density functions that can be expressed analytically, the others being the Cauchy distribution and the Lévy distribution.
+
The normal distribution with density f(x) (mean μ and standard deviation σ > 0) has the following properties:
+
+
It is symmetric around the point x = μ, which is at the same time the mode, the median and the mean of the distribution.[19]
+
It is unimodal: its first derivative is positive for x < μ, negative for x > μ, and zero only at x = μ.
+
The area bounded by the curve and the x-axis is unity (i.e. equal to one).
+
Its first derivative is f′(x) = −((x − μ)/σ²) f(x).
+
Its second derivative is f″(x) = (((x − μ)² − σ²)/σ⁴) f(x).
+
Its density has two inflection points (where the second derivative of f is zero and changes sign), located one standard deviation away from the mean, namely at x = μ − σ and x = μ + σ.[19]
Furthermore, the density φ of the standard normal distribution (i.e. μ = 0 and σ = 1) also has the following properties:
+
+
Its first derivative is φ′(x) = −x φ(x).
+
Its second derivative is φ″(x) = (x² − 1) φ(x).
+
More generally, its nth derivative is φ⁽ⁿ⁾(x) = (−1)ⁿ Heₙ(x) φ(x), where Heₙ(x) is the nth (probabilist) Hermite polynomial.[21]
+
The probability that a normally distributed variable X with known μ and σ² is in a particular set can be calculated by using the fact that the fraction Z = (X − μ)/σ has a standard normal distribution.
The plain and absolute moments of a variable X are the expected values of Xᵖ and |X|ᵖ, respectively. If the expected value μ of X is zero, these parameters are called central moments; otherwise, these parameters are called non-central moments. Usually we are interested only in moments with integer order p.
+
If X has a normal distribution, the non-central moments exist and are finite for any p whose real part is greater than −1. For any non-negative integer p, the plain central moments are:[22]
+
E[(X − μ)ᵖ] = 0 if p is odd, and σᵖ (p − 1)!! if p is even.
+
Here n!! denotes the double factorial, that is, the product of all numbers from n to 1 that have the same parity as n.
+
The central absolute moments coincide with plain moments for all even orders, but are nonzero for odd orders. For any non-negative integer p,
+
The expectation of X conditioned on the event that X lies in an interval [a, b] is given by
+
E[X | a < X < b] = μ − σ² (f(b) − f(a)) / (F(b) − F(a))
+
where f and F respectively are the density and the cumulative distribution function of X. For b = ∞ this is known as the inverse Mills ratio. Note that above, the density f of X is used instead of the standard normal density as in the inverse Mills ratio, so here we have σ² instead of σ.
+
+
Fourier transform and characteristic function
where i is the imaginary unit. If the mean μ = 0, the first factor is 1, and the Fourier transform is, apart from a constant factor, a normal density on the frequency domain, with mean 0 and standard deviation 1/σ. In particular, the standard normal distribution φ is an eigenfunction of the Fourier transform.
+
In probability theory, the Fourier transform of the probability distribution of a real-valued random variable X is closely connected to the characteristic function of that variable, which is defined as the expected value of e^(itX), as a function of the real variable t (the frequency parameter of the Fourier transform). This definition can be analytically extended to a complex-valued variable t.[24] The relation between both is:
+
The moment generating function of a real random variable X is the expected value of e^(tX), as a function of the real parameter t. For a normal distribution with density f, mean μ and deviation σ, the moment generating function exists and is equal to
+
In the limit when σ tends to zero, the probability density f(x) eventually tends to zero at any x ≠ μ, but grows without limit if x = μ, while its integral remains equal to 1. Therefore, the normal distribution cannot be defined as an ordinary function when σ = 0.
+
However, one can define the normal distribution with zero variance as a generalized function; specifically, as a Dirac delta function δ translated by the mean μ, that is f(x) = δ(x − μ).
Its cumulative distribution function is then the Heaviside step function translated by the mean μ, namely F(x) = H(x − μ).
+
where p(x) ln p(x) is understood to be zero whenever p(x) = 0. This functional can be maximized, subject to the constraints that the distribution is properly normalized and has a specified mean and variance, by using variational calculus. A function with three Lagrange multipliers is defined:
+
+
+
At maximum entropy, a small variation δp(x) about p(x) will produce a variation δL about L which is equal to 0:
+
+
+
Since this must hold for any small δp(x), the factor multiplying δp(x) must be zero, and solving for p(x) yields:
+
+
+
The Lagrange constraints that p(x) is properly normalized and has the specified mean and variance are satisfied if and only if λ₀, λ₁, and λ₂ are chosen so that
+
+
+
The entropy of a normal distribution X ~ N(μ, σ²) is equal to
+
H(X) = ½ ln(2πeσ²),
+
If the characteristic function φ_X of some random variable X is of the form φ_X(t) = exp(Q(t)), where Q(t) is a polynomial, then the Marcinkiewicz theorem (named after Józef Marcinkiewicz) asserts that Q can be at most a quadratic polynomial, and therefore X is a normal random variable.[29] The consequence of this result is that the normal distribution is the only distribution with a finite number (two) of non-zero cumulants.
If X and Y are jointly normal and uncorrelated, then they are independent. The requirement that X and Y should be jointly normal is essential; without it the property does not hold.[30][31] For non-normal random variables uncorrelatedness does not imply independence.
The conjugate prior of the mean of a normal distribution is another normal distribution.[33] Specifically, if x₁, ..., xₙ are iid N(μ, σ²) and the prior on the mean is μ ~ N(μ₀, σ₀²), then the posterior distribution for the estimator of μ will be
The family of normal distributions not only forms an exponential family (EF), but in fact forms a natural exponential family (NEF) with quadratic variance function (NEF-QVF). Many properties of normal distributions generalize to properties of NEF-QVF distributions, NEF distributions, or EF distributions generally. NEF-QVF distributions comprise six families, including the Poisson, gamma, binomial, and negative binomial distributions, while many of the common families studied in probability and statistics are NEF or EF.
The central limit theorem states that under certain (fairly common) conditions, the sum of many random variables will have an approximately normal distribution. More specifically, suppose X₁, ..., Xₙ are independent and identically distributed random variables with the same arbitrary distribution, zero mean, and variance σ², and Z is their mean scaled by √n:
+
Z = √n ((1/n) Σ Xᵢ)
+
Then, as n increases, the probability distribution of Z will tend to the normal distribution with zero mean and variance σ².
+
The theorem can be extended to variables that are not independent and/or not identically distributed if certain constraints are placed on the degree of dependence and the moments of the distributions.
+
Many test statistics, scores, and estimators encountered in practice contain sums of certain random variables in them, and even more estimators can be represented as sums of random variables through the use of influence functions. The central limit theorem implies that those statistical parameters will have asymptotically normal distributions.
+
The central limit theorem also implies that certain distributions can be approximated by the normal distribution, for example:
+
Whether these approximations are sufficiently accurate depends on the purpose for which they are needed, and the rate of convergence to the normal distribution. It is typically the case that such approximations are less accurate in the tails of the distribution.
+
A general upper bound for the approximation error in the central limit theorem is given by the Berry–Esseen theorem; improvements of the approximation are given by the Edgeworth expansions.
+
This theorem can also be used to justify modeling the sum of many uniform noise sources as Gaussian noise. See AWGN.
+
+
Operations and functions of normal variables
If X is distributed normally with mean μ and variance σ2, then

aX + b, for any real numbers a and b, is also normally distributed, with mean aμ + b and standard deviation |a|σ. That is, the family of normal distributions is closed under linear transformations.
Operations on two independent normal variables
+
If X1 and X2 are two independent normal random variables, with means μ1, μ2 and standard deviations σ1, σ2, then their sum X1 + X2 will also be normally distributed,[proof] with mean μ1 + μ2 and variance σ12 + σ22.
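This closure under addition is easy to check empirically. The sketch below, with arbitrarily chosen parameters, sums draws from two independent normals and compares the sample moments against μ1 + μ2 and σ12 + σ22:

```python
import random
import statistics

rng = random.Random(1)
mu1, s1, mu2, s2 = 2.0, 3.0, -1.0, 4.0

# Sum of two independent normals: mean mu1 + mu2 = 1, variance s1^2 + s2^2 = 25.
sums = [rng.gauss(mu1, s1) + rng.gauss(mu2, s2) for _ in range(100000)]
print(statistics.mean(sums), statistics.variance(sums))
```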
+
In particular, if X and Y are independent normal deviates with zero mean and variance σ2, then X + Y and X − Y are also independent and normally distributed, with zero mean and variance 2σ2. This is a special case of the polarization identity.[37]
+
If X1, X2 are two independent normal deviates with mean μ and deviation σ, and a, b are arbitrary real numbers, then the variable
X3 = (aX1 + bX2)/√(a2 + b2)
is also normally distributed with mean μ and deviation σ. It follows that the normal distribution is stable (with exponent α = 2).
+
If f1, f2 are normal probability densities, then their normalized geometric mean is again a normal density.
+
Operations on two independent standard normal variables
+
If X and Y are two independent standard normal random variables with mean 0 and variance 1, then
+
+
Their sum X + Y and difference X − Y are each normally distributed with mean zero and variance two: X ± Y ~ N(0, 2).
If X1, …, Xn, Y1, …, Ym are independent standard normal random variables, then the ratio of their normalized sums of squares will have the F-distribution with (n, m) degrees of freedom:[41]
+
Operations on multiple correlated normal variables
+
A quadratic form of a normal vector, i.e. a quadratic function of multiple independent or correlated normal variables, is a generalized chi-square variable.
The split normal distribution is most directly defined in terms of joining scaled sections of the density functions of different normal distributions and rescaling the density to integrate to one. The truncated normal distribution results from rescaling a section of a single density function.
+
For any positive integer n, any normal distribution with mean μ and variance σ2 is the distribution of the sum of n independent normal deviates, each with mean μ/n and variance σ2/n. This property is called infinite divisibility.[42]
+
Conversely, if X1 and X2 are independent random variables and their sum X1 + X2 has a normal distribution, then both X1 and X2 must be normal deviates.[43]
+
This result is known as Cramér's decomposition theorem, and is equivalent to saying that the convolution of two distributions is normal if and only if both are normal. Cramér's theorem implies that a linear combination of independent non-Gaussian variables will never have an exactly normal distribution, although it may approach it arbitrarily closely.[29]
+
Bernstein's theorem states that if X and Y are independent and X + Y and X − Y are also independent, then both X and Y must necessarily have normal distributions.[44][45]
+
More generally, if X1, …, Xn are independent random variables, then two distinct linear combinations Σ ak Xk and Σ bk Xk will be independent if and only if all Xk are normal and Σ ak bk σ2k = 0, where σ2k denotes the variance of Xk.[44]
+
The notion of normal distribution, being one of the most important distributions in probability theory, has been extended far beyond the standard framework of the univariate (that is, one-dimensional) case. All these extensions are also called normal or Gaussian laws, so a certain ambiguity in names exists.
+
+
The multivariate normal distribution describes the Gaussian law in the k-dimensional Euclidean space. A vector X ∈ Rk is multivariate-normally distributed if any linear combination of its components a1X1 + ⋯ + akXk has a (univariate) normal distribution. The variance of X is a k×k symmetric positive-definite matrix V. The multivariate normal distribution is a special case of the elliptical distributions. As such, its iso-density loci in the k = 2 case are ellipses and in the case of arbitrary k are ellipsoids.
Complex normal distribution deals with the complex normal vectors. A complex vector X ∈ Ck is said to be normal if both its real and imaginary components jointly possess a 2k-dimensional multivariate normal distribution. The variance-covariance structure of X is described by two matrices: the variance matrix Γ, and the relation matrix C.
Gaussian processes are the normally distributed stochastic processes. These can be viewed as elements of some infinite-dimensional Hilbert space H, and thus are the analogues of multivariate normal vectors for the case k = ∞. A random element h ∈ H is said to be normal if for any constant a ∈ H the scalar product (a, h) has a (univariate) normal distribution. The variance structure of such a Gaussian random element can be described in terms of the linear covariance operator K: H → H. Several Gaussian processes became popular enough to have their own names:
+
A random variable X has a two-piece normal distribution if it has a distribution
+
+
+
+
where μ is the mean and σ1 and σ2 are the standard deviations of the distribution to the left and right of the mean respectively.
+
The mean, variance and third central moment of this distribution have been determined[46]
+
+
+
+
+
where E(X), V(X) and T(X) are the mean, variance, and third central moment respectively.
+
One of the main practical uses of the Gaussian law is to model the empirical distributions of many different random variables encountered in practice. In such case a possible extension would be a richer family of distributions, having more than two parameters and therefore being able to fit the empirical distribution more accurately. The examples of such extensions are:
+
+
Pearson distribution — a four-parameter family of probability distributions that extend the normal law to include different skewness and kurtosis values.
+
The generalized normal distribution, also known as the exponential power distribution, allows for distribution tails with thicker or thinner asymptotic behaviors.
It is often the case that we do not know the parameters of the normal distribution, but instead want to estimate them. That is, having a sample (x1, …, xn) from a normal N(μ, σ2) population we would like to learn the approximate values of the parameters μ and σ2. The standard approach to this problem is the maximum likelihood method, which requires maximization of the log-likelihood function:
+
+
+
Taking derivatives with respect to μ and σ2 and solving the resulting system of first order conditions yields the maximum likelihood estimates:
+
The variance of this estimator is equal to the (μ, μ)-element of the inverse Fisher information matrix. This implies that the estimator is finite-sample efficient. Of practical importance is the fact that its standard error is proportional to 1/√n, that is, if one wishes to decrease the standard error by a factor of 10, one must increase the number of points in the sample by a factor of 100. This fact is widely used in determining sample sizes for opinion polls and the number of trials in Monte Carlo simulations.
+
The estimator is called the sample variance, since it is the variance of the sample. In practice, another estimator is often used instead. This other estimator is denoted s2, and is also called the sample variance, which represents a certain ambiguity in terminology; its square root s is called the sample standard deviation. The estimator s2 differs from the first by having (n − 1) instead of n in the denominator (the so-called Bessel's correction):
+
+
+
The difference between the two estimators becomes negligibly small for large n. In finite samples, however, the motivation behind the use of s2 is that it is an unbiased estimator of the underlying parameter σ2, whereas the other estimator is biased. Also, by the Lehmann–Scheffé theorem the estimator s2 is uniformly minimum variance unbiased (UMVU),[47] which makes it the "best" estimator among all unbiased ones. However it can be shown that the biased estimator is better than s2 in terms of the mean squared error (MSE) criterion. In finite samples both estimators have a scaled chi-squared distribution with (n − 1) degrees of freedom:
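The effect of Bessel's correction can be seen numerically. In this sketch (sample size, variance, and seed are arbitrary choices), the n-denominator estimator is biased low by the factor (n − 1)/n, while the (n − 1)-denominator estimator is unbiased:

```python
import random
import statistics

def biased_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)          # divides by n

def unbiased_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)    # divides by n - 1

rng = random.Random(2)
n = 5
# Many samples of size n from N(0, 4); average each estimator across samples.
trials = [[rng.gauss(0, 2) for _ in range(n)] for _ in range(50000)]
mean_biased = statistics.mean(biased_var(t) for t in trials)
mean_unbiased = statistics.mean(unbiased_var(t) for t in trials)
# E[biased] = (n-1)/n * sigma^2 = 3.2, while E[unbiased] = sigma^2 = 4.
```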
+
+
+
The first of these expressions shows that the variance of s2 is equal to 2σ4/(n − 1), which is slightly greater than the (σ, σ)-element of the inverse Fisher information matrix, 2σ4/n. Thus, s2 is not an efficient estimator for σ2, and moreover, since s2 is UMVU, we can conclude that the finite-sample efficient estimator for σ2 does not exist.
+
Applying the asymptotic theory, both estimators are consistent, that is, they converge in probability to σ2 as the sample size n → ∞. The two estimators are also both asymptotically normal:
+
+
+
In particular, both estimators are asymptotically efficient for σ2.
+
By Cochran's theorem, for normal distributions the sample mean and the sample variance s2 are independent, which means there can be no gain in considering their joint distribution. There is also a converse theorem: if in a sample the sample mean and sample variance are independent, then the sample must have come from the normal distribution. The independence between the sample mean and s can be employed to construct the so-called t-statistic:
+
+
+
This quantity t has the Student's t-distribution with (n − 1) degrees of freedom, and it is an ancillary statistic (independent of the values of the parameters). Inverting the distribution of this t-statistic allows us to construct the confidence interval for μ;[48] similarly, inverting the χ2 distribution of the statistic s2 will give us the confidence interval for σ2:[49]
+
+
+
+
where tk,p and χ2k,p are the pth quantiles of the t- and χ2-distributions respectively. These confidence intervals are of the confidence level 1 − α, meaning that the true values μ and σ2 fall outside of these intervals with probability (or significance level) α. In practice people usually take α = 5%, resulting in the 95% confidence intervals.
+
Approximate formulas can be derived from the asymptotic distributions of and s2:
+
+
+
+
The approximate formulas become valid for large values of n, and are more convenient for manual calculation since the standard normal quantiles zα/2 do not depend on n. In particular, the most popular value α = 5% results in |z0.025| = 1.96.
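A minimal sketch of these large-n approximate intervals, using the standard normal quantile from Python's statistics module (the function name approx_cis is ours, not a library API; the exact small-sample intervals would use t- and χ2-quantiles instead):

```python
import math
from statistics import NormalDist

def approx_cis(xs, alpha=0.05):
    """Large-sample normal-approximation confidence intervals for mu and sigma^2."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)   # unbiased sample variance
    z = NormalDist().inv_cdf(1 - alpha / 2)           # e.g. ~1.96 for alpha = 5%
    mu_ci = (mean - z * math.sqrt(s2 / n), mean + z * math.sqrt(s2 / n))
    # For large n, Var(s^2) is approximately 2 sigma^4 / n.
    half = z * s2 * math.sqrt(2 / n)
    var_ci = (s2 - half, s2 + half)
    return mu_ci, var_ci
```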
+
Normality tests assess the likelihood that the given data set {x1, ..., xn} comes from a normal distribution. Typically the null hypothesis H0 is that the observations are distributed normally with unspecified mean μ and variance σ2, versus the alternative Ha that the distribution is arbitrary. Many tests (over 40) have been devised for this problem. The more prominent of them are outlined below:
+
Diagnostic plots are more intuitively appealing but subjective at the same time, as they rely on informal human judgement to accept or reject the null hypothesis.
+
+
Q–Q plot, also known as normal probability plot or rankit plot—is a plot of the sorted values from the data set against the expected values of the corresponding quantiles from the standard normal distribution. That is, it's a plot of point of the form (Φ−1(pk), x(k)), where plotting points pk are equal to pk = (k − α)/(n + 1 − 2α) and α is an adjustment constant, which can be anything between 0 and 1. If the null hypothesis is true, the plotted points should approximately lie on a straight line.
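Computing the plotting points of a normal Q–Q plot requires only the probit function. A sketch using Python's statistics.NormalDist, with the adjustment constant set to 0.375 as one common choice:

```python
from statistics import NormalDist

def qq_points(data, a=0.375):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot.

    Plotting positions p_k = (k - a) / (n + 1 - 2a); a = 0.375 is a common
    choice, but any constant between 0 and 1 may be used.
    """
    xs = sorted(data)
    n = len(xs)
    nd = NormalDist()
    return [(nd.inv_cdf((k - a) / (n + 1 - 2 * a)), x)
            for k, x in enumerate(xs, start=1)]
```

If the data are close to normal, the returned pairs lie near a straight line whose slope estimates σ and whose intercept estimates μ.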
+
P–P plot – similar to the Q–Q plot, but used much less frequently. This method consists of plotting the points (Φ(z(k)), pk), where . For normally distributed data this plot should lie on a 45° line between (0, 0) and (1, 1).
Shapiro–Wilk test: This is based on the fact that the line in the Q–Q plot has the slope of σ. The test compares the least squares estimate of that slope with the value of the sample variance, and rejects the null hypothesis if these two quantities differ significantly.
+
Tests based on the empirical distribution function:
+
Bayesian analysis of the normal distribution
+
Bayesian analysis of normally distributed data is complicated by the many different possibilities that may be considered:
+
+
Either the mean, or the variance, or neither, may be considered a fixed quantity.
+
When the variance is unknown, analysis may be done directly in terms of the variance, or in terms of the precision, the reciprocal of the variance. The reason for expressing the formulas in terms of precision is that the analysis of most cases is simplified.
+
Both univariate and multivariate cases need to be considered.
The following auxiliary formula is useful for simplifying the posterior update equations, which otherwise become fairly tedious.
+
+
+
This equation rewrites the sum of two quadratics in x by expanding the squares, grouping the terms in x, and completing the square. Note the following about the complex constant factors attached to some of the terms:
+
This shows that this factor can be thought of as resulting from a situation where the reciprocals of quantities a and b add directly, so to combine a and b themselves, it's necessary to reciprocate, add, and reciprocate the result again to get back into the original units. This is exactly the sort of operation performed by the harmonic mean, so it is not surprising that is one-half the harmonic mean of a and b.
A similar formula can be written for the sum of two vector quadratics: If x, y, z are vectors of length k, and A and B are symmetric, invertible matrices of size k × k, then
+
In other words, it sums up all possible combinations of products of pairs of elements from x, with a separate coefficient for each. In addition, since , only the sum matters for any off-diagonal elements of A, and there is no loss of generality in assuming that A is symmetric. Furthermore, if A is symmetric, then the form
+
For a set of i.i.d. normally distributed data points X of size n where each individual point x follows x ~ N(μ, σ2) with known variance σ2, the conjugate prior distribution is also normally distributed.
+
This can be shown more easily by rewriting the variance as the precision, i.e. using τ = 1/σ2. Then if the data follow x ~ N(μ, 1/τ) and the prior is μ ~ N(μ0, 1/τ0), we proceed as follows.
+
First, the likelihood function is (using the formula above for the sum of differences from the mean):
+
+
+
Then, we proceed as follows:
+
+
+
In the above derivation, we used the formula above for the sum of two quadratics and eliminated all constant factors not involving μ. The result is the kernel of a normal distribution, with mean and precision , i.e.
+
+
+
This can be written as a set of Bayesian update equations for the posterior parameters in terms of the prior parameters:
+
+
+
That is, to combine n data points with total precision of nτ (or equivalently, total variance of σ2/n) and mean of values x̄, derive a new total precision simply by adding the total precision of the data to the prior total precision, and form a new mean through a precision-weighted average, i.e. a weighted average of the data mean and the prior mean, each weighted by the associated total precision. This makes logical sense if the precision is thought of as indicating the certainty of the observations: In the distribution of the posterior mean, each of the input components is weighted by its certainty, and the certainty of this distribution is the sum of the individual certainties. (For the intuition of this, compare the expression "the whole is (or is not) greater than the sum of its parts". In addition, consider that the knowledge of the posterior comes from a combination of the knowledge of the prior and likelihood, so it makes sense that we are more certain of it than of either of its components.)
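These update equations amount to a few lines of arithmetic. A sketch for the known-precision case (the function name is ours, not a library API):

```python
def posterior_normal_known_variance(prior_mean, prior_precision, data, tau):
    """Conjugate update for the mean of a normal with known precision tau = 1/sigma^2.

    Posterior precision is the sum of prior and data precisions; the posterior
    mean is the precision-weighted average of the prior mean and the data mean.
    """
    n = len(data)
    data_mean = sum(data) / n
    post_precision = prior_precision + n * tau
    post_mean = (prior_precision * prior_mean + n * tau * data_mean) / post_precision
    return post_mean, post_precision

# Example: prior N(0, 1), two observations equal to 2, unit data precision.
m, p = posterior_normal_known_variance(0.0, 1.0, [2.0, 2.0], tau=1.0)
# Posterior precision 1 + 2 = 3; posterior mean (1*0 + 2*2)/3 = 4/3.
```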
+
The above formula reveals why it is more convenient to do Bayesian analysis of conjugate priors for the normal distribution in terms of the precision. The posterior precision is simply the sum of the prior and likelihood precisions, and the posterior mean is computed through a precision-weighted average, as described above. The same formulas can be written in terms of variance by reciprocating all the precisions, yielding the more ugly formulas
+
For a set of i.i.d. normally distributed data points X of size n where each individual point x follows x ~ N(μ, σ2) with known mean μ, the conjugate prior of the variance has an inverse gamma distribution or a scaled inverse chi-squared distribution. The two are equivalent except for having different parameterizations. Although the inverse gamma is more commonly used, we use the scaled inverse chi-squared for the sake of convenience. The prior for σ2 is as follows:
+
For a set of i.i.d. normally distributed data points X of size n where each individual point x follows x ~ N(μ, σ2) with unknown mean μ and unknown variance σ2, a combined (multivariate) conjugate prior is placed over the mean and variance, consisting of a normal-inverse-gamma distribution.
+Logically, this originates as follows:
+
+
From the analysis of the case with unknown mean but known variance, we see that the update equations involve sufficient statistics computed from the data consisting of the mean of the data points and the total variance of the data points, computed in turn from the known variance divided by the number of data points.
+
From the analysis of the case with unknown variance but known mean, we see that the update equations involve sufficient statistics over the data consisting of the number of data points and sum of squared deviations.
+
Keep in mind that the posterior update values serve as the prior distribution when further data is handled. Thus, we should logically think of our priors in terms of the sufficient statistics just described, with the same semantics kept in mind as much as possible.
+
To handle the case where both mean and variance are unknown, we could place independent priors over the mean and variance, with fixed estimates of the average mean, total variance, number of data points used to compute the variance prior, and sum of squared deviations. Note however that in reality, the total variance of the mean depends on the unknown variance, and the sum of squared deviations that goes into the variance prior (appears to) depend on the unknown mean. In practice, the latter dependence is relatively unimportant: Shifting the actual mean shifts the generated points by an equal amount, and on average the squared deviations will remain the same. This is not the case, however, with the total variance of the mean: As the unknown variance increases, the total variance of the mean will increase proportionately, and we would like to capture this dependence.
+
This suggests that we create a conditional prior of the mean on the unknown variance, with a hyperparameter specifying the mean of the pseudo-observations associated with the prior, and another parameter specifying the number of pseudo-observations. This number serves as a scaling parameter on the variance, making it possible to control the overall variance of the mean relative to the actual variance parameter. The prior for the variance also has two hyperparameters, one specifying the sum of squared deviations of the pseudo-observations associated with the prior, and another specifying once again the number of pseudo-observations. Each of the priors has a hyperparameter specifying the number of pseudo-observations, and in each case this controls the relative variance of that prior. These are given as two separate hyperparameters so that the variance (aka the confidence) of the two priors can be controlled separately.
+
This leads immediately to the normal-inverse-gamma distribution, which is the product of the two distributions just defined, with conjugate priors used (an inverse gamma distribution over the variance, and a normal distribution over the mean, conditional on the variance) and with the same four parameters just defined.
+
The priors are normally defined as follows:
+
+
+
The update equations can be derived, and look as follows:
+
+
+
The number of actual observations is added to each of the respective numbers of pseudo-observations. The new mean hyperparameter is once again a weighted average, this time weighted by the relative numbers of observations. Finally, the update for the sum-of-squared-deviations hyperparameter is similar to the case with known mean, but in this case the sum of squared deviations is taken with respect to the observed data mean rather than the true mean, and as a result a new interaction term needs to be added to take care of the additional error source stemming from the deviation between the prior mean and the data mean.
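As a sketch of these updates, assuming one common normal-inverse-gamma parameterization (μ0 with ν0 pseudo-observations for the mean, and α0, β0 for the variance), which need not match the exact hyperparameters used above:

```python
def nig_update(mu0, nu0, alpha0, beta0, data):
    """Normal-inverse-gamma conjugate update, in one common parameterization."""
    n = len(data)
    xbar = sum(data) / n
    ss = sum((x - xbar) ** 2 for x in data)       # squared deviations from data mean
    nu_n = nu0 + n                                 # pseudo-observation counts add
    mu_n = (nu0 * mu0 + n * xbar) / nu_n           # weighted average of means
    alpha_n = alpha0 + n / 2
    # The last term is the interaction between the prior mean and the data mean.
    beta_n = beta0 + 0.5 * ss + 0.5 * (n * nu0 / nu_n) * (xbar - mu0) ** 2
    return mu_n, nu_n, alpha_n, beta_n
```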
+
Writing it in terms of variance rather than precision, we get:
+
+
+
where
+
Therefore, the posterior is (dropping the hyperparameters as conditioning factors):
+
+
+
In other words, the posterior distribution has the form of a product of a normal distribution over times an inverse gamma distribution over , with parameters that are the same as the update equations above.
+
The position of a particle that experiences diffusion. If initially the particle is located at a specific point (that is, its probability distribution is the Dirac delta function), then after time t its location is described by a normal distribution with variance t, which satisfies the diffusion equation. If the initial location is given by a certain density function g, then the density at time t is the convolution of g and the normal probability density function.
Approximately normal distributions occur in many situations, as explained by the central limit theorem. When the outcome is produced by many small effects acting additively and independently, its distribution will be close to normal. The normal approximation will not be valid if the effects act multiplicatively (instead of additively), or if there is a single external influence that has a considerably larger magnitude than the rest of the effects.
+
+
In counting problems, where the central limit theorem includes a discrete-to-continuum approximation and where infinitely divisible and decomposable distributions are involved, such as
+
Thermal radiation has a Bose–Einstein distribution on very short time scales, and a normal distribution on longer timescales due to the central limit theorem.
I can only recognize the occurrence of the normal curve – the Laplacian curve of errors – as a very abnormal phenomenon. It is roughly approximated to in certain distributions; for this reason, and on account of its beautiful simplicity, we may, perhaps, use it as a first approximation, particularly in theoretical investigations.
There are statistical methods to empirically test that assumption; see the above Normality tests section.
+
+
In biology, the logarithm of various variables tends to have a normal distribution; that is, the variables tend to have a log-normal distribution (after separation on male/female subpopulations), with examples including:
+
Measures of size of living tissue (length, height, skin area, weight);[50]
+
The length of inert appendages (hair, claws, nails, teeth) of biological specimens, in the direction of growth; presumably the thickness of tree bark also falls under this category;
+
Certain physiological measurements, such as blood pressure of adult humans.
+
In finance, in particular the Black–Scholes model, changes in the logarithm of exchange rates, price indices, and stock market indices are assumed normal (these variables behave like compound interest, not like simple interest, and so are multiplicative). Some mathematicians such as Benoit Mandelbrot have argued that log-Lévy distributions, which possess heavy tails, would be a more appropriate model, in particular for the analysis of stock market crashes. The use of the assumption of normal distribution occurring in financial models has also been criticized by Nassim Nicholas Taleb in his works.
+
Measurement errors in physical experiments are often modeled by a normal distribution. This use of a normal distribution does not imply that one is assuming the measurement errors are normally distributed; rather, using the normal distribution produces the most conservative predictions possible given only knowledge about the mean and variance of the errors.[51]
+
In standardized testing, results can be made to have a normal distribution by either selecting the number and difficulty of questions (as in the IQ test) or transforming the raw test scores into output scores by fitting them to the normal distribution. For example, the SAT's traditional range of 200–800 is based on a normal distribution with a mean of 500 and a standard deviation of 100.
+
+
Many scores are derived from the normal distribution, including percentile ranks (percentiles or quantiles), normal curve equivalents, stanines, z-scores, and T-scores. Additionally, some behavioral statistical procedures assume that scores are normally distributed; for example, t-tests and ANOVAs. Bell curve grading assigns relative grades based on a normal distribution of scores.
John Ioannidis argues that using normally distributed standard deviations as standards for validating research findings leaves falsifiable predictions about phenomena that are not normally distributed untested. This includes, for example, phenomena that only appear when all necessary conditions are present and one cannot be a substitute for another in an addition-like way, and phenomena that are not randomly distributed. Ioannidis argues that standard-deviation-centered validation gives a false appearance of validity to hypotheses and theories where some but not all falsifiable predictions are normally distributed, since the portion of falsifiable predictions that there is evidence against may be, and in some cases is, in the non-normally distributed parts of the range of falsifiable predictions, as well as baselessly dismissing hypotheses for which none of the falsifiable predictions are normally distributed as if they were unfalsifiable, when in fact they do make falsifiable predictions. Ioannidis argues that many cases of mutually exclusive theories being accepted as validated by research journals are caused by the journals' failure to take in empirical falsifications of non-normally distributed predictions, not because the mutually exclusive theories are true, which they cannot be, although two mutually exclusive theories can both be wrong and a third one correct.[53]
+
In computer simulations, especially in applications of the Monte-Carlo method, it is often desirable to generate values that are normally distributed. The algorithms listed below all generate standard normal deviates, since a N(μ, σ2) can be generated as X = μ + σZ, where Z is standard normal. All these algorithms rely on the availability of a random number generator U capable of producing uniform random variates.
+
+
The most straightforward method is based on the probability integral transform property: if U is distributed uniformly on (0,1), then Φ−1(U) will have the standard normal distribution. The drawback of this method is that it relies on calculation of the probit function Φ−1, which cannot be done analytically. Some approximate methods are described in Hart (1968) and in the erf article. Wichura gives a fast algorithm for computing this function to 16 decimal places,[54] which is used by R to compute random variates of the normal distribution.
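A sketch of this inverse transform method, using Python's statistics.NormalDist.inv_cdf as the probit function:

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(3)
probit = NormalDist().inv_cdf   # the probit function Phi^{-1}

# Probability integral transform: if U ~ Uniform(0,1), then probit(U) ~ N(0,1).
samples = []
while len(samples) < 100000:
    u = rng.random()            # random() returns values in [0, 1)
    if u > 0.0:                 # skip the (measure-zero) endpoint 0.0
        samples.append(probit(u))
```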
+
An easy-to-program approximate approach that relies on the central limit theorem is as follows: generate 12 uniform U(0,1) deviates, add them all up, and subtract 6 – the resulting random variable will have approximately standard normal distribution. In truth, the distribution will be Irwin–Hall, which is a 12-section eleventh-order polynomial approximation to the normal distribution. This random deviate will have a limited range of (−6, 6).[55] Note that in a true normal distribution, only 0.00034% of all samples will fall outside ±6σ.
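The twelve-uniform approximation in a few lines (seed and sample count are arbitrary):

```python
import random
import statistics

def approx_gauss(rng):
    """Irwin-Hall approximation: the sum of 12 U(0,1) deviates minus 6 is
    roughly N(0, 1), since each uniform contributes mean 1/2 and variance 1/12."""
    return sum(rng.random() for _ in range(12)) - 6.0

rng = random.Random(4)
zs = [approx_gauss(rng) for _ in range(100000)]
# Mean ~ 0, variance ~ 1, and every value lies inside (-6, 6) by construction.
```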
+
The Box–Muller method uses two independent random numbers U and V distributed uniformly on (0,1). Then the two random variables X = √(−2 ln U) cos(2πV) and Y = √(−2 ln U) sin(2πV) will both have the standard normal distribution, and will be independent. This formulation arises because for a bivariate normal random vector (X, Y) the squared norm X2 + Y2 will have the chi-squared distribution with two degrees of freedom, which is an easily generated exponential random variable corresponding to the quantity −2 ln(U) in these equations; and the angle is distributed uniformly around the circle, chosen by the random variable V.
+
The Marsaglia polar method is a modification of the Box–Muller method which does not require computation of the sine and cosine functions. In this method, U and V are drawn from the uniform (−1,1) distribution, and then S = U2 + V2 is computed. If S is greater than or equal to 1, then the method starts over; otherwise the two quantities X = U√(−2 ln S / S) and Y = V√(−2 ln S / S) are returned. Again, X and Y are independent, standard normal random variables.
+
The Ratio method[56] is a rejection method. The algorithm proceeds as follows:
+
Generate two independent uniform deviates U and V;
+
Compute X = √(8/e) (V − 0.5)/U;
+
Optional: if X2 ≤ 5 − 4e1/4U then accept X and terminate algorithm;
+
Optional: if X2 ≥ 4e−1.35/U + 1.4 then reject X and start over from step 1;
+
If X2 ≤ −4 ln U then accept X, otherwise start the algorithm over.
+
The two optional steps allow the evaluation of the logarithm in the last step to be avoided in most cases. These steps can be greatly improved[57] so that the logarithm is rarely evaluated.
+
The ziggurat algorithm[58] is faster than the Box–Muller transform and still exact. In about 97% of all cases it uses only two random numbers, one random integer and one random uniform, one multiplication and an if-test. Only in 3% of the cases, where the combination of those two falls outside the "core of the ziggurat" (a kind of rejection sampling using logarithms), do exponentials and more uniform random numbers have to be employed.
+
Integer arithmetic can be used to sample from the standard normal distribution.[59] This method is exact in the sense that it satisfies the conditions of ideal approximation;[60] i.e., it is equivalent to sampling a real number from the standard normal distribution and rounding this to the nearest representable floating point number.
+
There is also some investigation[61] into the connection between the fast Hadamard transform and the normal distribution, since the transform employs just addition and subtraction and by the central limit theorem random numbers from almost any distribution will be transformed into the normal distribution. In this regard a series of Hadamard transforms can be combined with random permutations to turn arbitrary data sets into a normally distributed data.
+
Numerical approximations for the normal cumulative distribution function and normal quantile function
Zelen & Severo (1964) give the approximation for Φ(x) for x > 0 with the absolute error |ε(x)| < 7.5·10−8 (algorithm 26.2.17):
where ϕ(x) is the standard normal probability density function, and b0 = 0.2316419, b1 = 0.319381530, b2 = −0.356563782, b3 = 1.781477937, b4 = −1.821255978, b5 = 1.330274429.
+
Hart (1968) lists some dozens of approximations – by means of rational functions, with or without exponentials – for the erfc() function. His algorithms vary in the degree of complexity and the resulting precision, with maximum absolute precision of 24 digits. An algorithm by West (2009) combines Hart's algorithm 5666 with a continued fraction approximation in the tail to provide a fast computation algorithm with a 16-digit precision.
+
Cody (1969), after noting that Hart's 1968 solution is not suited for erf, gives a solution for both erf and erfc, with a maximal relative error bound, via rational Chebyshev approximation.
Marsaglia (2004) suggested a simple algorithm based on the Taylor-series expansion Φ(x) = 1/2 + ϕ(x)·(x + x³/3 + x⁵/(3·5) + x⁷/(3·5·7) + …) for calculating Φ(x) with arbitrary precision. The drawback of this algorithm is comparatively slow calculation time (for example, it takes over 300 iterations to calculate the function with 16 digits of precision when x = 10).
Dia (2023) proposes a further approximation of Φ(x) with a small maximum relative error in absolute value, given by separate expressions for small and large x.
+
+
Shore (1982) introduced simple approximations that may be incorporated in stochastic optimization models of engineering and operations research, like reliability engineering and inventory analysis. Denoting p = Φ(z), the simplest approximation for the quantile function is:
+
+
This approximation delivers for z a maximum absolute error of 0.026 (for 0.5 ≤ p ≤ 0.9999, corresponding to 0 ≤ z ≤ 3.719). For p < 1/2 replace p by 1 − p and change sign. Another approximation, somewhat less accurate, is the single-parameter approximation:
+
+
The latter had served to derive a simple approximation for the loss integral of the normal distribution, defined by
+
+
This approximation is particularly accurate for the far right tail (maximum error of 10^−3 for z ≥ 1.4). Highly accurate approximations for the cumulative distribution function, based on Response Modeling Methodology (RMM; Shore, 2011, 2012), are shown in Shore (2005).
+
Some more approximations can be found at: Error function § Approximation with elementary functions. In particular, an explicitly invertible formula by Sergei Winitzki (2008) achieves small relative error over the whole domain for both the cumulative distribution function and the quantile function.
+
Some authors[62][63] attribute the credit for the discovery of the normal distribution to de Moivre, who in 1738[note 2] published in the second edition of his The Doctrine of Chances the study of the coefficients in the binomial expansion of (a + b)^n. De Moivre proved that the middle term in this expansion has the approximate magnitude of 2/√(2πn), and that "If m or ½n be a Quantity infinitely great, then the Logarithm of the Ratio, which a Term distant from the middle by the Interval ℓ, has to the middle Term, is −2ℓℓ/n."[64] Although this theorem can be interpreted as the first obscure expression for the normal probability law, Stigler points out that de Moivre himself did not interpret his results as anything more than the approximate rule for the binomial coefficients, and in particular de Moivre lacked the concept of the probability density function.[65]
+
+
+
In 1809 Gauss published his monograph "Theoria motus corporum coelestium in sectionibus conicis solem ambientium", where among other things he introduces several important statistical concepts, such as the method of least squares, the method of maximum likelihood, and the normal distribution. Gauss used M, M′, M′′, ... to denote the measurements of some unknown quantity V, and sought the most probable estimator of that quantity: the one that maximizes the probability φ(M − V) · φ(M′ − V) · φ(M′′ − V) · ... of obtaining the observed experimental results. In his notation, φΔ is the probability density function of the measurement errors of magnitude Δ. Not knowing what the function φ is, Gauss requires that his method should reduce to the well-known answer: the arithmetic mean of the measured values.[note 3] Starting from these principles, Gauss demonstrates that the only law that rationalizes the choice of arithmetic mean as an estimator of the location parameter is the normal law of errors:[66]
+
+φΔ = (h/√π)·e^(−h²Δ²), where h is "the measure of the precision of the observations". Using this normal law as a generic model for errors in the experiments, Gauss formulates what is now known as the non-linear weighted least squares method.[67]
+
+
+
Although Gauss was the first to suggest the normal distribution law, Laplace made significant contributions.[note 4] It was Laplace who first posed the problem of aggregating several observations in 1774,[68] although his own solution led to the Laplacian distribution. It was Laplace who first calculated the value of the integral ∫ e−t2dt = √π in 1782, providing the normalization constant for the normal distribution.[69] Finally, it was Laplace who in 1810 proved and presented to the Academy the fundamental central limit theorem, which emphasized the theoretical importance of the normal distribution.[70]
+
It is of interest to note that in 1809 an Irish-American mathematician Robert Adrain published two insightful but flawed derivations of the normal probability law, simultaneously and independently from Gauss.[71] His works remained largely unnoticed by the scientific community, until in 1871 they were exhumed by Abbe.[72]
+
In the middle of the 19th century Maxwell demonstrated that the normal distribution is not just a convenient mathematical tool, but may also occur in natural phenomena:[73] "The number of particles whose velocity, resolved in a certain direction, lies between x and x + dx is N (1/(α√π)) e^(−x²/α²) dx."
+
Today, the concept is usually known in English as the normal distribution or Gaussian distribution. Other less common names include Gauss distribution, Laplace-Gauss distribution, the law of error, the law of facility of errors, Laplace's second law, and Gaussian law.
+
Gauss himself apparently coined the term with reference to the "normal equations" involved in its applications, with normal having its technical meaning of orthogonal rather than usual.[74] However, by the end of the 19th century some authors[note 5] had started using the name normal distribution, where the word "normal" was used as an adjective – the term now being seen as a reflection of the fact that this distribution was seen as typical, common – and thus normal. Peirce (one of those authors) once defined "normal" thus: "...the 'normal' is not the average (or any other kind of mean) of what actually occurs, but of what would, in the long run, occur under certain circumstances."[75] Around the turn of the 20th century Pearson popularized the term normal as a designation for this distribution.[76]
+
+
Many years ago I called the Laplace–Gaussian curve the normal curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another 'abnormal'.
Also, it was Pearson who first wrote the distribution in terms of the standard deviation σ as in modern notation. Soon after this, in 1915, Fisher added the location parameter to the formula for the normal distribution, expressing it in the way it is written nowadays:
+
+
The term "standard normal", which denotes the normal distribution with zero mean and unit variance came into general use around the 1950s, appearing in the popular textbooks by P. G. Hoel (1947) Introduction to Mathematical Statistics and A. M. Mood (1950) Introduction to the Theory of Statistics.[77]
+
^De Moivre first published his findings in 1733, in a pamphlet Approximatio ad Summam Terminorum Binomii (a + b)n in Seriem Expansi that was designated for private circulation only. But it was not until the year 1738 that he made his results publicly available. The original pamphlet was reprinted several times, see for example Walker (1985).
+
+
^"It has been customary certainly to regard as an axiom the hypothesis that if any quantity has been determined by several direct observations, made under the same circumstances and with equal care, the arithmetical mean of the observed values affords the most probable value, if not rigorously, yet very nearly at least, so that it is always most safe to adhere to it." — Gauss (1809, section 177)
+
+
^"My custom of terming the curve the Gauss–Laplacian or normal curve saves us from proportioning the merit of discovery between the two great astronomer mathematicians." quote from Pearson (1905, p. 189)
+
^ Geary, R. C. (1936). "The distribution of 'Student's' ratio for non-normal samples". Supplement to the Journal of the Royal Statistical Society. 3 (2): 178–184.
+
^ Melnick, Edward L.; Tenenbein, Aaron (November 1982). "Misspecifications of the Normal Distribution". The American Statistician. 36 (4): 372–373.
+
^ John, S. (1982). "The three parameter two-piece normal family of distributions and its fitting". Communications in Statistics – Theory and Methods. 11 (8): 879–885. doi:10.1080/03610928208828279.
+
^ Ioannidis, John P. A. (2005). "Why Most Published Research Findings Are False".
+
+
^ Wichura, Michael J. (1988). "Algorithm AS 241: The Percentage Points of the Normal Distribution". Applied Statistics. 37 (3): 477–484. doi:10.2307/2347330. JSTOR 2347330.
+
Halperin, Max; Hartley, Herman O.; Hoel, Paul G. (1965). "Recommended Standards for Statistical Symbols and Notation. COPSS Committee on Symbols and Notation". The American Statistician. 19 (3): 12–14. doi:10.2307/2681417. JSTOR 2681417.
+
Hart, John F.; et al. (1968). Computer Approximations. New York, NY: John Wiley & Sons, Inc. ISBN 978-0-88275-642-4.
Karney, C. F. F. (2016). "Sampling exactly from the normal distribution". ACM Transactions on Mathematical Software. 42 (1): 3:1–14. arXiv:1303.6257. doi:10.1145/2710016. S2CID 14252035.
Krishnamoorthy, Kalimuthu (2006). Handbook of Statistical Distributions with Applications. Chapman & Hall/CRC. ISBN 978-1-58488-635-8.
+
Kruskal, William H.; Stigler, Stephen M. (1997). Spencer, Bruce D. (ed.). Normative Terminology: 'Normal' in Statistics and Elsewhere. Statistics and Public Policy. Oxford University Press. ISBN 978-0-19-852341-3.
Lexis, Wilhelm (1878). "Sur la durée normale de la vie humaine et sur la théorie de la stabilité des rapports statistiques". Annales de Démographie Internationale. Paris. II: 447–462.
Maxwell, James Clerk (1860). "V. Illustrations of the dynamical theory of gases. — Part I: On the motions and collisions of perfectly elastic spheres". Philosophical Magazine. Series 4. 19 (124): 19–32. doi:10.1080/14786446008642818.
Shore, H. (1982). "Simple Approximations for the Inverse Cumulative Function, the Density Function and the Loss Integral of the Normal Distribution". Journal of the Royal Statistical Society. Series C (Applied Statistics). 31 (2): 108–114. doi:10.2307/2347972. JSTOR 2347972.
+
Shore, H. (2005). "Accurate RMM-Based Approximations for the CDF of the Normal Distribution". Communications in Statistics – Theory and Methods. 34 (3): 507–513. doi:10.1081/sta-200052102. S2CID 122148043.
Stigler, Stephen M. (1982). "A Modest Proposal: A New Standard for the Normal". The American Statistician. 36 (2): 137–138. doi:10.2307/2684031. JSTOR 2684031.
In computer science, quickselect is a selection algorithm to find the kth smallest element in an unordered list, also known as the kth order statistic. Like the related quicksort sorting algorithm, it was developed by Tony Hoare, and thus is also known as Hoare's selection algorithm.[1] Like quicksort, it is efficient in practice and has good average-case performance, but has poor worst-case performance. Quickselect and its variants are the selection algorithms most often used in efficient real-world implementations.
+
Quickselect uses the same overall approach as quicksort, choosing one element as a pivot and partitioning the data in two based on the pivot, accordingly as less than or greater than the pivot. However, instead of recursing into both sides, as in quicksort, quickselect only recurses into one side – the side with the element it is searching for. This reduces the average complexity from O(n log n) to O(n), with a worst case of O(n²).
+
As with quicksort, quickselect is generally implemented as an in-place algorithm, and beyond selecting the kth element, it also partially sorts the data. See selection algorithm for further discussion of the connection with sorting.
+
In quicksort, there is a subprocedure called partition that can, in linear time, group a list (ranging from indices left to right) into two parts: those less than a certain element, and those greater than or equal to the element. Here is pseudocode that performs a partition about the element list[pivotIndex]:
+
+
function partition(list, left, right, pivotIndex) is
+ pivotValue := list[pivotIndex]
+ swap list[pivotIndex] and list[right] // Move pivot to end
+ storeIndex := left
+ for i from left to right − 1 do
+ if list[i] < pivotValue then
+ swap list[storeIndex] and list[i]
+ increment storeIndex
+ swap list[right] and list[storeIndex] // Move pivot to its final place
+ return storeIndex
+
In quicksort, we recursively sort both branches, leading to O(n log n) best-case time. However, when doing selection, we already know which partition our desired element lies in, since the pivot is in its final sorted position, with all those preceding it in an unsorted order and all those following it in an unsorted order. Therefore, a single recursive call locates the desired element in the correct partition, and we build upon this for quickselect:
+
+
// Returns the k-th smallest element of list within left..right inclusive
+// (i.e. left <= k <= right).
+function select(list, left, right, k) is
+ if left = right then// If the list contains only one element,
+ return list[left] // return that element
+ pivotIndex := ... // select a pivotIndex between left and right,
+ // e.g., left + floor(rand() % (right − left + 1))
+ pivotIndex := partition(list, left, right, pivotIndex)
+ // The pivot is in its final sorted position
+ if k = pivotIndex then
+ return list[k]
+ else if k < pivotIndex then
+ return select(list, left, pivotIndex − 1, k)
+ else
+ return select(list, pivotIndex + 1, right, k)
+
+
Note the resemblance to quicksort: just as the minimum-based selection algorithm is a partial selection sort, this is a partial quicksort, generating and partitioning only O(log n) of its O(n) partitions. This simple procedure has expected linear performance and, like quicksort, quite good performance in practice. It is also an in-place algorithm, requiring only constant memory overhead if tail call optimization is available, or if the tail recursion is eliminated with a loop:
+
function select(list, left, right, k) is
+ loop
+ if left = right then
+ return list[left]
+ pivotIndex := ... // select pivotIndex between left and right
+ pivotIndex := partition(list, left, right, pivotIndex)
+ if k = pivotIndex then
+ return list[k]
+ else if k < pivotIndex then
+ right := pivotIndex − 1
+ else
+ left := pivotIndex + 1
+
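The iterative pseudocode above maps directly onto C; a sketch with a simple random pivot (the function and helper names are mine):

```c
#include <stdlib.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

static int partition(int *list, int left, int right, int pivotIndex)
{
    int pivotValue = list[pivotIndex];
    swap_int(&list[pivotIndex], &list[right]);  /* move pivot to end */
    int storeIndex = left;
    for (int i = left; i < right; i++)
        if (list[i] < pivotValue)
            swap_int(&list[storeIndex++], &list[i]);
    swap_int(&list[right], &list[storeIndex]);  /* pivot to final place */
    return storeIndex;
}

/* Returns the k-th smallest (0-based) element of list[left..right]. */
int quickselect(int *list, int left, int right, int k)
{
    for (;;) {
        if (left == right)
            return list[left];
        int pivotIndex = left + rand() % (right - left + 1);
        pivotIndex = partition(list, left, right, pivotIndex);
        if (k == pivotIndex)
            return list[k];
        else if (k < pivotIndex)
            right = pivotIndex - 1;
        else
            left = pivotIndex + 1;
    }
}
```

As in the pseudocode, the array is partially sorted in place as a side effect.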
Like quicksort, quickselect has good average performance, but is sensitive to the pivot that is chosen. If good pivots are chosen, meaning ones that consistently decrease the search set by a given fraction, then the search set decreases in size exponentially, and by induction (or summing the geometric series) one sees that performance is linear, as each step is linear and the overall time is a constant times this (depending on how quickly the search set reduces). However, if bad pivots are consistently chosen, such as decreasing by only a single element each time, then worst-case performance is quadratic, O(n²). This occurs, for example, in searching for the maximum element of a set, using the first element as the pivot, and having sorted data. However, for randomly chosen pivots, this worst case is very unlikely: the probability of using more than Cn comparisons, for any sufficiently large constant C, is superexponentially small as a function of C.[2]
+
The easiest solution is to choose a random pivot, which yields almost certain linear time. Deterministically, one can use median-of-3 pivot strategy (as in the quicksort), which yields linear performance on partially sorted data, as is common in the real world. However, contrived sequences can still cause worst-case complexity; David Musser describes a "median-of-3 killer" sequence that allows an attack against that strategy, which was one motivation for his introselect algorithm.
+
One can assure linear performance even in the worst case by using a more sophisticated pivot strategy; this is done in the median of medians algorithm. However, the overhead of computing the pivot is high, and thus this is generally not used in practice. One can combine basic quickselect with median of medians as fallback to get both fast average case performance and linear worst-case performance; this is done in introselect.
+
Finer computations of the average time complexity yield a worst case of n(2 + 2 ln 2) + o(n) ≈ 3.39n + o(n) for random pivots (in the case of the median; other k are faster).[3] The constant can be improved to 3/2 by a more complicated pivot strategy, yielding the Floyd–Rivest algorithm, which has average complexity of 1.5n + O(n^(1/2)) for the median, with other k being faster.
+
Xorshift random number generators, also called shift-register generators, are a class of pseudorandom number generators that were invented by George Marsaglia.[1] They are a subset of linear-feedback shift registers (LFSRs) which allow a particularly efficient implementation in software without the excessive use of sparse polynomials.[2] They generate the next number in their sequence by repeatedly taking the exclusive or of a number with a bit-shifted version of itself. This makes execution extremely efficient on modern computer architectures, but it does not benefit efficiency in a hardware implementation. Like all LFSRs, the parameters have to be chosen very carefully in order to achieve a long period.[3]
+
For execution in software, xorshift generators are among the fastest PRNGs, requiring very small code and state. However, they do not pass every statistical test without further refinement. This weakness is amended by combining them with a non-linear function, as described in the original paper. Because plain xorshift generators (without a non-linear step) fail some statistical tests, they have been accused of being unreliable.[3]: 360
+
A C version[a] of three xorshift algorithms[1]: 4,5 is given here. The first has one 32-bit word of state and period 2^32−1. The second has one 64-bit word of state and period 2^64−1. The last one has four 32-bit words of state and period 2^128−1. The 128-bit algorithm passes the diehard tests. However, it fails the MatrixRank and LinearComp tests of the BigCrush test suite from the TestU01 framework.
+
All use three shifts and three or four exclusive-or operations:
+
+
#include <stdint.h>
+
+struct xorshift32_state {
+    uint32_t a;
+};
+
+/* The state must be initialized to non-zero */
+uint32_t xorshift32(struct xorshift32_state *state)
+{
+    /* Algorithm "xor" from p. 4 of Marsaglia, "Xorshift RNGs" */
+    uint32_t x = state->a;
+    x ^= x << 13;
+    x ^= x >> 17;
+    x ^= x << 5;
+    return state->a = x;
+}
+
+struct xorshift64_state {
+    uint64_t a;
+};
+
+uint64_t xorshift64(struct xorshift64_state *state)
+{
+    uint64_t x = state->a;
+    x ^= x << 13;
+    x ^= x >> 7;
+    x ^= x << 17;
+    return state->a = x;
+}
+
+/* struct xorshift128_state can alternatively be defined as a pair
+   of uint64_t or a uint128_t where supported */
+struct xorshift128_state {
+    uint32_t x[4];
+};
+
+/* The state must be initialized to non-zero */
+uint32_t xorshift128(struct xorshift128_state *state)
+{
+    /* Algorithm "xor128" from p. 5 of Marsaglia, "Xorshift RNGs" */
+    uint32_t t = state->x[3];
+
+    uint32_t s = state->x[0];  /* Perform a contrived 32-bit shift. */
+    state->x[3] = state->x[2];
+    state->x[2] = state->x[1];
+    state->x[1] = s;
+
+    t ^= t << 11;
+    t ^= t >> 8;
+    return state->x[0] = t ^ s ^ (s >> 19);
+}
+
All xorshift generators fail some tests in the BigCrush test suite. This is true for all generators based on linear recurrences, such as the Mersenne Twister or WELL. However, it is easy to scramble the output of such generators to improve their quality.
+
The scramblers known as + and * still leave a weakness in the low bits,[4] so they are intended for floating-point use, where only the 53 high-order bits of each 64-bit word are consumed and the lower 11 bits are discarded. For general-purpose use, the scrambler ** (pronounced starstar) makes the LFSR generators pass statistical tests in all bits.
+
Marsaglia suggested scrambling the output by combining it with a simple additive counter modulo 2^32 (which he calls a "Weyl sequence" after Weyl's equidistribution theorem). This also increases the period by a factor of 2^32, to 2^192−2^32:
+
+
#include <stdint.h>
+
+struct xorwow_state {
+    uint32_t x[5];
+    uint32_t counter;
+};
+
+/* The state array must be initialized to not be all zero in the first four words */
+uint32_t xorwow(struct xorwow_state *state)
+{
+    /* Algorithm "xorwow" from p. 5 of Marsaglia, "Xorshift RNGs" */
+    uint32_t t = state->x[4];
+
+    uint32_t s = state->x[0];  /* Perform a contrived 32-bit shift. */
+    state->x[4] = state->x[3];
+    state->x[3] = state->x[2];
+    state->x[2] = state->x[1];
+    state->x[1] = s;
+
+    t ^= t >> 2;
+    t ^= t << 1;
+    t ^= s ^ (s << 4);
+    state->x[0] = t;
+    state->counter += 362437;
+    return t + state->counter;
+}
+
+
This performs well, but fails a few tests in BigCrush.[5] This generator is the default in Nvidia's CUDA toolkit.[6]
+
An xorshift* generator applies an invertible multiplication (modulo the word size) as a non-linear transformation to the output of an xorshift generator, as suggested by Marsaglia.[1] All xorshift* generators emit a sequence of values that is equidistributed in the maximum possible dimension (except that they will never output zero for 16 calls, i.e. 128 bytes, in a row).[7]
+
The following 64-bit generator has a maximal period of 2^64−1.[7]
+
+
#include <stdint.h>
+
+/* xorshift64s, variant A_1(12,25,27) with multiplier M_32 from line 3 of table 5 */
+uint64_t xorshift64star(void)
+{
+    /* initial seed must be nonzero; don't use a static variable for the state if multithreaded */
+    static uint64_t x = 1;
+    x ^= x >> 12;
+    x ^= x << 25;
+    x ^= x >> 27;
+    return x * 0x2545F4914F6CDD1DULL;
+}
+
+
The generator fails only the MatrixRank test of BigCrush; however, if the generator is modified to return only the high 32 bits, then it passes BigCrush with zero failures.[8]: 7 In fact, a reduced version with only 40 bits of internal state passes the suite, suggesting a large safety margin.[8]: 19 A similar generator, suggested in Numerical Recipes[9] as RanQ1, also fails the BirthdaySpacings test.
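The high-32-bit variant described here is a one-line change to the 64-bit routine above; a sketch (the function name and global state are mine):

```c
#include <stdint.h>

static uint64_t xs64s_state = 1;  /* must be seeded to a nonzero value */

/* xorshift64* step, keeping only the high half of the scrambled output */
uint32_t xorshift64star_high32(void)
{
    uint64_t x = xs64s_state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    xs64s_state = x;
    return (uint32_t)((x * 0x2545F4914F6CDD1DULL) >> 32);
}
```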
+
Vigna[7] suggests the following xorshift1024* generator with 1024 bits of state and a maximal period of 2^1024−1; however, it does not always pass BigCrush.[4] xoshiro256** is therefore a much better option.
+
+
#include <stdint.h>
+
+/* The state must be seeded so that there is at least one non-zero element in the array */
+struct xorshift1024s_state {
+    uint64_t x[16];
+    int index;
+};
+
+uint64_t xorshift1024s(struct xorshift1024s_state *state)
+{
+    int index = state->index;
+    uint64_t const s = state->x[index++];
+    uint64_t t = state->x[index &= 15];
+    t ^= t << 31;        // a
+    t ^= t >> 11;        // b -- Again, the shifts and the multipliers are tunable
+    t ^= s ^ (s >> 30);  // c
+    state->x[index] = t;
+    state->index = index;
+    return t * 1181783497276652981ULL;
+}
+
An xorshift+ generator can achieve an order of magnitude fewer failures than Mersenne Twister or WELL. A native C implementation of an xorshift+ generator that passes all tests from the BigCrush suite can typically generate a random number in fewer than 10 clock cycles on x86, thanks to instruction pipelining.[10]
+
Rather than using multiplication, it is possible to use addition as a faster non-linear transformation. The idea was first proposed by Saito and Matsumoto (also responsible for the Mersenne Twister) in the XSadd generator, which adds two consecutive outputs of an underlying xorshift generator based on 32-bit shifts.[11] However, one disadvantage of adding consecutive outputs is that, while the underlying xorshift128 generator is 2-dimensionally equidistributed, the xorshift128+ generator is only 1-dimensionally equidistributed.[12]
+
XSadd has some weakness in the low-order bits of its output; it fails several BigCrush tests when the output words are bit-reversed. To correct this problem, Vigna introduced the xorshift+ family,[12] based on 64-bit shifts. xorshift+ generators, even as large as xorshift1024+, exhibit some detectable linearity in the low-order bits of their output;[4] xorshift128+ passes BigCrush, but fails when the 32 lowest-order bits are used in reverse order from each 64-bit word.[4] This generator is one of the fastest generators passing BigCrush.[10]
+
The following xorshift128+ generator uses 128 bits of state and has a maximal period of 2^128−1.
+
+
#include <stdint.h>
+
+struct xorshift128p_state {
+    uint64_t x[2];
+};
+
+/* The state must be seeded so that it is not all zero */
+uint64_t xorshift128p(struct xorshift128p_state *state)
+{
+    uint64_t t = state->x[0];
+    uint64_t const s = state->x[1];
+    state->x[0] = s;
+    t ^= t << 23;       // a
+    t ^= t >> 18;       // b -- Again, the shifts and the multipliers are tunable
+    t ^= s ^ (s >> 5);  // c
+    state->x[1] = t;
+    return t + s;
+}
+
xoshiro and xoroshiro use rotations in addition to shifts. According to Vigna, they are faster and produce better quality output than xorshift.[13][14]
+
This class of generator has variants for 32-bit and 64-bit integer and floating point output; for floating point, one takes the upper 53 bits (for binary64) or the upper 23 bits (for binary32), since the upper bits are of better quality than the lower bits in the floating point generators. The algorithms also include a jump function, which sets the state forward by some number of steps – usually a power of two that allows many threads of execution to start at distinct initial states.
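The 53-bit extraction for binary64 output is conventionally written as follows (a sketch; the helper name is mine):

```c
#include <stdint.h>

/* Map a 64-bit generator output to a double in [0, 1),
   using only the 53 high-order bits. */
double u64_to_double(uint64_t x)
{
    return (x >> 11) * 0x1.0p-53;  /* multiply by 2^-53 */
}
```

Applied to the output of any 64-bit generator, this yields doubles whose quality depends only on the high bits of the word.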
+
For 32-bit output, xoshiro128** and xoshiro128+ are exactly equivalent to xoshiro256** and xoshiro256+, with uint32_t in place of uint64_t, and with different shift/rotate constants.
+
More recently, the xoshiro++ generators have been introduced as an alternative to the xoshiro** generators. They are used in some language implementations, such as GNU Fortran, Java, and Julia.[15]
+
xoshiro256** is the family's general-purpose random 64-bit number generator. It is used in the GNU Fortran compiler, in Lua (as of Lua 5.4), and in the .NET framework (as of .NET 6.0).[15]
+
+
/* Adapted from the code included on Sebastiano Vigna's website */
+
+#include <stdint.h>
+
+uint64_t rol64(uint64_t x, int k)
+{
+    return (x << k) | (x >> (64 - k));
+}
+
+struct xoshiro256ss_state {
+    uint64_t s[4];
+};
+
+uint64_t xoshiro256ss(struct xoshiro256ss_state *state)
+{
+    uint64_t *s = state->s;
+    uint64_t const result = rol64(s[1] * 5, 7) * 9;
+    uint64_t const t = s[1] << 17;
+
+    s[2] ^= s[0];
+    s[3] ^= s[1];
+    s[1] ^= s[2];
+    s[0] ^= s[3];
+
+    s[2] ^= t;
+    s[3] = rol64(s[3], 45);
+
+    return result;
+}
+
xoshiro256+ is approximately 15% faster than xoshiro256**, but the lowest three bits have low linear complexity; therefore, it should be used only for floating point results by extracting the upper 53 bits.
+
If space is at a premium, xoroshiro128** and xoroshiro128+ are analogous to xoshiro256** and xoshiro256+. These have smaller state spaces, and thus are less useful for massively parallel programs. xoroshiro128+ also exhibits a mild dependency in the population count, generating a failure after 5 TB of output. The authors do not believe that this can be detected in real world programs.
+
xoroshiro64** and xoroshiro64* are analogous to xoroshiro128** and xoroshiro128+. Unlike the xoshiro generators, they are not straightforward ports of their higher-precision counterparts.
+
In the xoshiro paper, it is recommended to initialize the state of the generators using a generator which is radically different from the initialized generators, as well as one which will never give the "all-zero" state; for shift-register generators, this state is impossible to escape from.[14][16] The authors specifically recommend using the SplitMix64 generator, from a 64-bit seed, as follows:
+
+
#include <stdint.h>
+
+struct splitmix64_state {
+    uint64_t s;
+};
+
+uint64_t splitmix64(struct splitmix64_state *state)
+{
+    uint64_t result = (state->s += 0x9E3779B97F4A7C15);
+    result = (result ^ (result >> 30)) * 0xBF58476D1CE4E5B9;
+    result = (result ^ (result >> 27)) * 0x94D049BB133111EB;
+    return result ^ (result >> 31);
+}
+
+struct xorshift128_state {
+    uint32_t x[4];
+};
+
+// one could do the same for any of the other generators
+void xorshift128_init(struct xorshift128_state *state, uint64_t seed)
+{
+    struct splitmix64_state smstate = { seed };
+
+    uint64_t tmp = splitmix64(&smstate);
+    state->x[0] = (uint32_t)tmp;
+    state->x[1] = (uint32_t)(tmp >> 32);
+
+    tmp = splitmix64(&smstate);
+    state->x[2] = (uint32_t)tmp;
+    state->x[3] = (uint32_t)(tmp >> 32);
+}
+
^ a b c d Lemire, Daniel; O'Neill, Melissa E. (April 2019). "Xorshift1024*, Xorshift1024+, Xorshift128+ and Xoroshiro128+ Fail Statistical Tests for Linearity". Journal of Computational and Applied Mathematics. 350: 139–142. arXiv:1810.05313. doi:10.1016/j.cam.2018.10.019. S2CID 52983294. "We report that these scrambled generators systematically fail Big Crush—specifically the linear-complexity and matrix-rank tests that detect linearity—when taking the 32 lowest-order bits in reverse order from each 64-bit word."
+
I have this homework where I need to implement xorshift32 (I can't use anything else) so I can generate some numbers, but I don't understand how the algorithm works or how to implement it.
+
+
I am trying to print the generated number, but I don't know how to call the xorshift32 function because of the state[static 1] argument.
+
+
uint32_t xorshift32(uint32_t state[static 1])
+{
+ uint32_t x = state[0];
+ x ^= x << 13;
+ x ^= x >> 17;
+ x ^= x << 5;
+ state[0] = x;
+ return x;
+}
+
+
+
I do not have much information on xorshift32 other than what is on Wikipedia (en.wikipedia.org/wiki/Xorshift).
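For reference, the `state[static 1]` parameter is standard C99 array-parameter syntax: it only tells the compiler the argument must point to at least one element (and must not be null), so the function is called like any pointer-taking function. A minimal sketch of the same routine:

```c
#include <stdint.h>

/* Same routine as in the question; state[static 1] behaves
   like uint32_t *state at the call site. */
uint32_t xorshift32(uint32_t state[static 1])
{
    uint32_t x = state[0];
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    state[0] = x;
    return x;
}
```

It can then be called as, e.g., `uint32_t state[1] = {12345}; uint32_t r = xorshift32(state);` — the array name decays to a pointer, so no special syntax is needed at the call site.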
+
+ The state[static 1] syntax is valid C99: in a function parameter, static 1 declares that the caller must pass a pointer to at least one element, and that the pointer must not be null. In other words, the parameter behaves exactly like uint32_t *state, and you call the function by passing an array (which decays to a pointer) or the address of a uint32_t. If your compiler rejects it, it is probably running in C90 mode.
+
+
+
+ @Jabberwocky here is the Wikipedia article: en.wikipedia.org/wiki/Xorshift . I don't even know how to explain it to you. It's a number generator using xor and shift made by a guy. The teacher didn't tell us much either.
+
+
The Xorshift variants, rand(), and basically all random number generator functions are actually pseudorandom number generators. They are not "truly random", because the sequence of numbers they generate depends on their internal state; but they are "pseudorandom" because, if you do not know the generator's internal state, the sequence of numbers they generate is random in the statistical sense.
+
+
George Marsaglia, the author of the Xorshift family of pseudorandom number generators, also developed a set of statistical tools called Diehard tests that can be used to analyse the "randomness" of the sequences generated. Currently, the TestU01 tests are probably the most widely used and trusted; in particular, the 160-test BigCrush set.
+
+
The sequence generated by ordinary pseudorandom number generators often allows one to determine the internal state of the generator. This means that observing a long enough generated sequence allows one to fairly reliably predict the future sequence. Cryptographically secure pseudorandom number generators avoid that, usually by applying a cryptographically secure hash function to the output; one would need a catalog of the entire sequence to be able to follow it. When the periods are longer than 2^256 or so, there is not enough baryonic matter in the entire observable universe to store the sequence.
+
+
My own favourite PRNG is Xorshift64*, which has a period of 2^64−1, and passes all but the MatrixRank test in BigCrush. In C99 and later, you can implement it using
+
+
#include <inttypes.h>
+
+typedef struct {
+ uint64_t state;
+} prng_state;
+
+static inline uint64_t prng_u64(prng_state *const p)
+{
+ uint64_t state = p->state;
+ state ^= state >> 12;
+ state ^= state << 25;
+ state ^= state >> 27;
+ p->state = state;
+ return state * UINT64_C(2685821657736338717);
+}
+
+
+
The state can be initialized to any nonzero uint64_t. (A zero state would cause the generator to output zero forever. The period is 2^64−1 because the generator visits each of the 2^64−1 nonzero 64-bit states exactly once during each period.)
+
+
It is good enough for most use cases, and extremely fast. It belongs to the class of linear-feedback shift register pseudorandom number generators.
+
+
Note that the variant which returns a uniform distribution between 0 and 1 uses the high bits; the high 32 bits of the sequence do pass all BigCrush tests in the TestU01 suite, so this is a surprisingly good (in both randomness and efficiency) generator for double-precision uniform random numbers -- my typical use case.
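That conversion can be sketched as follows (the function name is mine; the hex-float literal 0x1.0p-53 requires C99):

```c
#include <stdint.h>

/* Map a 64-bit pseudorandom value to a double in [0, 1), keeping only
   the top 53 bits -- the statistically strongest ones -- since a
   double has 53 significant bits. */
static inline double u64_to_unit_double(uint64_t x)
{
    return (x >> 11) * 0x1.0p-53;
}
```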
+
+
The format above allows multiple independent generators in a single process, by specifying the generator state as a parameter. If the basic generator is implemented in a header file (thus the static inline; it is a preprocessor macro-like function), you can switch between generators by switching between header files, and recompiling the binary.
+
+
(You are usually better off using a single generator, unless you use multiple threads in a pseudorandom-number-heavy simulator, in which case a separate generator for each thread helps a lot; in particular, it avoids cacheline ping-pong between threads competing for the generator state.)
+
+
The rand() function in most C standard library implementations is a linear-congruential generator. These often suffer from poor choices of the coefficients and, nowadays, also from the relative slowness of the modulo operator (when the modulus is not a power of two).
+
+
The most widely used pseudorandom number generator is the Mersenne Twister, by Makoto Matsumoto (松本 眞) and Takuji Nishimura (西村 拓士). It is a twisted generalized linear feedback shift register, and has quite a large state (about 2500 bytes) and a very long period (2^19937 − 1).
+
+
+
+
When we talk of true random number generators, we usually mean a combination of a pseudorandom number generator (usually a cryptographically secure one), and a source of entropy; random bits with at least some degree of true physical randomness.
+
+
In Linux, macOS, and the BSDs at least, the operating system kernel exposes a source of pseudorandom numbers (getentropy() in Linux and OpenBSD, getrandom() in Linux, /dev/urandom, /dev/arandom, /dev/random in many Unixes, and so on). Entropy is gathered from physical electronic sources, like internal processor latencies, physical interrupt line timings, (spinning disk) hard drive timings, possibly even keyboards and mice. Many motherboards and some processors even have hardware random number sources that can be used as sources of entropy (or even directly as "trusted randomness sources").
+
+
The exclusive-or operation (^ in C) is used to mix randomness into the generator state. This works because exclusive-or between a known bit and a random bit results in a random bit; XOR preserves randomness. When mixing entropy pools (each with some degree of randomness in the bit states) using XOR, the result will have at least as much entropy as the most random of the independent sources.
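A tiny sketch of that property (the mix_entropy helper name is mine, purely illustrative):

```c
#include <stdint.h>

/* XOR-mix a fresh entropy word into a pool word.  If either input is
   uniformly random and independent of the other, the result is
   uniformly random too; the mix never destroys entropy. */
static inline uint64_t mix_entropy(uint64_t pool, uint64_t fresh)
{
    return pool ^ fresh;
}
```

Note that XOR is its own inverse, so mixing the same word in twice cancels it out; sources mixed into a pool should therefore be independent.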
+
+
Note that this does not mean that you get "better" random numbers by mixing the output of two or more generators. The statistics of true randomness is hard for humans to grok (just look at how poor the common early rand() implementations were!). It is better to pick a generator (or a set of generators to switch between at compile time or at run time) that passes the BigCrush tests, and to ensure it has a good random initial state on every run. That way you leverage the work of the many mathematicians and others who have worked on these things for decades, and can concentrate on the stuff you yourself are good at.
diff --git a/references/how-to-calculate-the-inverse-of-the-normal-cumulative-distribution-function-in-p b/references/how-to-calculate-the-inverse-of-the-normal-cumulative-distribution-function-in-p
new file mode 100644
index 0000000..5087b72
--- /dev/null
+++ b/references/how-to-calculate-the-inverse-of-the-normal-cumulative-distribution-function-in-p
@@ -0,0 +1,2498 @@
+ scipy - How to calculate the inverse of the normal cumulative distribution function in python? - Stack Overflow
NORMSINV (mentioned in a comment) is the inverse of the CDF of the standard normal distribution. Using scipy, you can compute this with the ppf method of the scipy.stats.norm object. The acronym ppf stands for percent point function, which is another name for the quantile function.
+
In [20]: from scipy.stats import norm
+
+In [21]: norm.ppf(0.95)
+Out[21]: 1.6448536269514722
+
+
Check that it is the inverse of the CDF:
+
In [34]: norm.cdf(norm.ppf(0.95))
+Out[34]: 0.94999999999999996
+
+
By default, norm.ppf uses mean=0 and stddev=1, which is the "standard" normal distribution. You can use a different mean and standard deviation by specifying the loc and scale arguments, respectively.
+
In [35]: norm.ppf(0.95, loc=10, scale=2)
+Out[35]: 13.289707253902945
+
+
If you look at the source code for scipy.stats.norm, you'll find that the ppf method ultimately calls scipy.special.ndtri. So to compute the inverse of the CDF of the standard normal distribution, you could use that function directly:
+
In [43]: from scipy.special import ndtri
+
+In [44]: ndtri(0.95)
+Out[44]: 1.6448536269514722
+
+
ndtri is much faster than norm.ppf:
+
In [46]: %timeit norm.ppf(0.95)
+240 µs ± 1.75 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
+
+In [47]: %timeit ndtri(0.95)
+1.47 µs ± 1.3 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
+
+
+ @bones.felipe, the "standard" normal distribution has mean 0 and standard deviation 1. These are the default values for the location and scale of the scipy.stats.norm methods.
+
+
+
+ Right, I thought I saw this norm.cdf(norm.ppf(0.95, loc=10, scale=2)) and I thought it was weird norm.cdf did not have loc=10 and scale=2 too, I guess it should.
+
+
Starting with Python 3.8, the standard library provides the NormalDist object as part of the statistics module.
+
+
It can be used to get the inverse cumulative distribution function (inv_cdf - inverse of the cdf), also known as the quantile function or the percent-point function for a given mean (mu) and standard deviation (sigma):
+
+
from statistics import NormalDist
+
+NormalDist(mu=10, sigma=2).inv_cdf(0.95)
+# 13.289707253902943
+
+
+
This can be simplified for the standard normal distribution, since mu = 0 and sigma = 1 are the defaults: NormalDist().inv_cdf(0.95).
# given random variable X (house price) with population mu = 60, sigma = 40
+import scipy as sc
+import scipy.stats as sct
+sc.version.full_version # 0.15.1
+
+#a. Find P(X<50)
+sct.norm.cdf(x=50,loc=60,scale=40) # 0.4012936743170763
+
+#b. Find P(X>=50)
+sct.norm.sf(x=50,loc=60,scale=40) # 0.5987063256829237
+
+#c. Find P(60<=X<=80)
+sct.norm.cdf(x=80,loc=60,scale=40) - sct.norm.cdf(x=60,loc=60,scale=40)
+
+#d. how much do the top 5% most expensive houses cost at least? i.e., find x where P(X>=x) = 0.05
+sct.norm.isf(q=0.05,loc=60,scale=40)
+
+#e. how much do the cheapest 5% of houses cost at most? i.e., find x where P(X<=x) = 0.05
+sct.norm.ppf(q=0.05,loc=60,scale=40)
+
diff --git a/references/how-to-calculate-the-inverse-of-the-normal-cumulative-distribution-function-in-p.1 b/references/how-to-calculate-the-inverse-of-the-normal-cumulative-distribution-function-in-p.1
new file mode 100644
index 0000000..159dcb0
--- /dev/null
+++ b/references/how-to-calculate-the-inverse-of-the-normal-cumulative-distribution-function-in-p.1
@@ -0,0 +1,2498 @@
This page describes some new pseudorandom number generators (PRNGs) we (David Blackman and I) have been working on recently, and
+ a shootout comparing them with other generators. Details about the generators can
+ be found in our paper. Information about my previous xorshift-based
+ generators can be found here, but they have been entirely superseded by the new ones, which
+ are faster and better. As part of our study, we developed a very strong test for Hamming-weight dependencies
+ that gave a number of surprising results.
+
+
64-bit Generators
+
+
xoshiro256++/xoshiro256**
+ (XOR/shift/rotate) are our all-purpose
+ generators (not cryptographically secure generators, though,
+ like all PRNGs in these pages). They have excellent (sub-ns) speed, a state
+ space (256 bits) that is large enough for any parallel application, and
+ they pass all tests we are aware of. See the paper
+ for a discussion of their differences.
+
+
If, however, one has to generate only 64-bit floating-point numbers
+ (by extracting the upper 53 bits) xoshiro256+ is a slightly (≈15%)
+ faster generator with analogous statistical properties. For general
+ usage, one has to consider that its lowest bits have low linear
+ complexity and will fail linearity tests; however, low linear
+ complexity of the lowest bits can have hardly any impact in practice, and certainly has no
+ impact at all if you generate floating-point numbers using the upper bits (we computed a precise
+ estimate of the linear complexity of the lowest bits).
+
+
If you are tight on space, xoroshiro128++/xoroshiro128**
+ (XOR/rotate/shift/rotate) and xoroshiro128+ have the same
+ speed and use half of the space; the same comments apply. They are suitable only for
+ low-scale parallel applications; moreover, xoroshiro128+
+ exhibits a mild dependency in Hamming weights that generates a failure
+ after 5 TB of output in our test. We believe
+ this slight bias cannot affect any application.
+
+
All generators, being based on linear recurrences, provide jump
+ functions that make it possible to simulate any number of calls to
+ the next-state function in constant time, once a suitable jump
+ polynomial has been computed. We provide ready-made jump functions for
+ a number of calls equal to the square root of the period, to make it easy to
+ generate non-overlapping sequences for parallel computations, and equal
+ to the cube of the fourth root of the period, to make it possible to
+ generate independent sequences on different parallel processors.
+
+
We suggest using SplitMix64 to initialize
+ the state of our generators starting from a 64-bit seed, as research
+ has shown that initialization must be performed with a generator
+ radically different in nature from the one initialized to avoid
+ correlation on similar seeds.
+
+
+
32-bit Generators
+
+
xoshiro128++/xoshiro128** are our
+ 32-bit all-purpose generators, whereas xoshiro128+ is
+ for floating-point generation. They are the 32-bit counterpart of
+ xoshiro256++, xoshiro256** and xoshiro256+, so similar comments apply.
+ Their state is too small for
+ large-scale parallelism: their intended usage is inside embedded
+ hardware or GPUs. For an even smaller scale, you can use xoroshiro64** and xoroshiro64*. We do not believe
+ that, at this point in time, a 32-bit generator with a larger state can be of
+ any use (but there are 32-bit xoroshiro generators of much larger size).
+
+
All 32-bit generators pass all tests we are aware of, with the
+ exception of linearity tests (binary rank and linear complexity) for
+ xoshiro128+ and xoroshiro64*: in this case,
+ due to the smaller number of output bits the low linear complexity of the
+ lowest bits is sufficient to trigger failures in BigCrush tests when the output is bit-reversed. Analogously to
+ the 64-bit case, generating 32-bit floating-point numbers using the
+ upper bits will not use any of the bits with low linear complexity.
+
+
16-bit Generators
+
+
We do not suggest any particular 16-bit generator, but it is possible
+ to design relatively good ones using our techniques. For example,
+ Parallax has embedded in their Propeller 2 microcontroller multiple 16-bit
+ xoroshiro32++ generators.
+
+
Congruential Generators
+
+
In case you are interested in 64-bit PRNGs based on congruential arithmetic, I provide
+ three instances of
+ Marsaglia's Multiply-With-Carry generators,
+ MWC128, MWC192, and MWC256, for which I computed good constants. They are some
+ of the fastest generators available, but they need 128-bit operations.
+
+
I provide here a shootout of a few recent 64-bit PRNGs that are quite widely used.
+ The purpose is that of providing a consistent, reproducible assessment of two properties of the generators: speed and quality.
+ The code used to perform the tests and all the output from statistical test suites is available for download.
+
+
The speed reported in this page is the time required to emit 64
+ random bits, and the number of clock cycles required to generate a byte (thanks to the PAPI library). If a generator is 32-bit in nature, I glue two
+ consecutive outputs. Note that
+ I do not report results using GPUs or SSE instructions, with an exception for the very common SFMT: for that to be
+ meaningful, I should have implementations for all generators.
+ Otherwise, with suitable hardware support I could just use AES in
+ counter mode and get 64 secure bits in 0.56 ns (or just use Randen). The tests were performed on a
+ 12th Gen Intel® Core™ i7-12700KF @3.60GHz using gcc 12.2.1.
+
+
A few caveats:
+
+
There is some looping overhead, but subtracting it from the timings is not going to
+ be particularly meaningful due to instruction rescheduling, etc.
+
Relative speed might be different on different CPUs and on different scenarios.
+
I do not use -march=native, which can improve the timing of some generators
+ by vectorization or special instructions, because those improvements might not be possible
+ when the generator is embedded in user code.
+
Code has been compiled using gcc's -fno-unroll-loops
option. This option is essential to get a sensible result: without it, the compiler
+ may perform different loop unrolling depending on the generator. Previously I was also using
+ -fno-move-loop-invariants, which was essential in not giving generators using several
+ large constants an advantage by preventing the compiler from loading them into registers. However,
+ as of gcc 12.2.1 the compiler loads the constants into registers anyway, so the
+ option is no longer used. Timings
+ with clang at the time of this writing
+ are very close to those obtained with gcc.
+ If you find timings that are significantly better than those shown here on
+ comparable hardware, they are likely to be due to compiler artifacts (e.g., vectorization).
+
Timings are taken running a generator for billions of times in a loop; but this is not the way you use generators. Register
+ allocation might be very different when the generator is embedded in an application, leading to constants being reloaded
+ or part of the state space being written to main memory at each iteration. These costs do not appear in the benchmarks below.
+
+
+
To ease replicability, I distribute a harness performing the measurement. You just
+ have to define a next() function and include the harness. But the only realistic
+ suggestion is to try different generators in your application and see what happens.
+
+
This is probably the most elusive property
+ of a PRNG. Here quality is measured using the powerful
+ BigCrush suite of tests. BigCrush is part of TestU01,
+ a monumental framework for testing PRNGs developed by Pierre L'Ecuyer
+ and Richard Simard (“TestU01: A C library for empirical testing
+ of random number generators”, ACM Trans. Math. Softw.
+ 33(4), Article 22, 2007).
+
+
I run BigCrush starting from 100 equispaced points of the state space
+ of the generator and collect failures—tests in which the
+ p-value statistic is outside the interval [0.001..0.999]. A failure
+ is systematic if it happens at all points.
+
+
Note that TestU01 is a 32-bit test suite. Thus, two 32-bit integer values
+ are passed to the test suite for each generated 64-bit value. Floating point numbers
+ are generated instead by dividing the unsigned output of the generator by 2^64.
+ Since this implies a bias towards the high bits (which is anyway a known characteristic
+ of TestU01), I run the test suite also on the reverse
+ generator. More detail about the whole process can be found in this paper.
+
+
Beside BigCrush, I analyzed generators using a test for Hamming-weight dependencies
+ described in our paper. As I already remarked, our only
+ generator failing the test (but only after 5 TB of output) is xoroshiro128+.
+
+
I report the period of each generator and its footprint in bits: a generator gives “bang-for-the-buck”
+ if the base-2 logarithm of the period is close to the footprint. Note
+ that the footprint has been always padded to a multiple of 64, and it can
+ be significantly larger than expected because of padding and
+ cyclic access indices.
+
+
The following table compares instead two ways of generating floating-point numbers, namely the 521-bit dSFMT, which
+ generates directly floating-point numbers with 52 significant bits, and
+ xoshiro256+ followed by a standard conversion of its upper bits to a floating-point number with 53 significant bits (see below).
+
+
dSFMT (uses SSE2 instructions, returns only 52 significant bits): footprint 704 bits; period 2^521 − 1; systematic failures: MatrixRank, LinearComp; fails the Hamming-weight test after 6 TB; 0.85 ns/64 bits; 3.07 cycles/B.
+
+
+
xoshiro256+ is ≈8% slower than the dSFMT, but it has a doubled range of output values, does not need any extra SSE instruction (can be programmed in Java, etc.),
+ has a much smaller footprint, and its upper bits do not fail any test.
+
+
Some of the generators can be very easily vectorized, so that multiple instances can be run in parallel to provide
+ fast bulk generation. Thanks to an interesting discussion with the Julia developers,
+ I've become aware that AVX2 vectorizations of multiple instances of generators using the +/++ scrambler are impressively fast (links
+ below point at a speed test to be used with the harness, and the result will be multiplied by 1000):
+
+
Note that sometimes convincing the compiler to vectorize is a
+ slightly quirky process: for example, on gcc 12.2.1 I have to use -O3 -fdisable-tree-cunrolli -march=native
+ to vectorize xoshiro256-based generators
(-O3 alone will not vectorize; thanks to Chris Elrod for pointing me at -fdisable-tree-cunrolli).
+
+
A long period does not imply high quality
+
+
This is a common misconception. The generator x++ has
+ period \(2^k\), for any \(k\geq0\), provided that x is
+ represented using \(k\) bits: nonetheless, it is a horrible generator.
+ The generator returning \(k-1\) zeroes followed by a one has period
+ \(k\).
+
+
It is however important that the period is long enough. A first heuristic rule of thumb
+ is that if you need to use \(t\) values, you need a generator with period at least \(t^2\).
+ Moreover, if you run \(n\) independent computations starting at random seeds,
+ the sequences used by each computation should not overlap.
+
+
Now, given a generator with period \(P\), the probability that \(n\) subsequences of length \(L\) starting at random points in the state space
+ overlap is bounded by \(n^2L/P\). If your generator has period \(2^{256}\) and you run
+ on \(2^{64}\) cores (you will never have them) a computation using \(2^{64}\) pseudorandom numbers (you will never have the time)
+ the probability of overlap would be less than \(2^{-64}\).
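That arithmetic can be checked in log2 space (a sketch; the helper name is mine):

```c
/* log2 of the overlap bound n^2 * L / P, with every quantity given as
   a base-2 exponent: log2(n^2 L / P) = 2*log2(n) + log2(L) - log2(P). */
static int overlap_bound_log2(int log2_n, int log2_L, int log2_P)
{
    return 2 * log2_n + log2_L - log2_P;
}
```

With the numbers above, overlap_bound_log2(64, 64, 256) evaluates to -64, i.e., the overlap probability is below 2^-64.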
+
+
In other words: any generator with a period beyond
+ \(2^{256}\) has a period that is
+ sufficient for every imaginable application. Unless there are other motivations (e.g., provably
+ increased quality), a generator with a larger period is only a waste of
+ memory (as it needs a larger state), of cache lines, and of
+ precious high-entropy random bits for seeding (unless you're using
+ small seeds, but then it's not clear why you would want a very long
+ period in the first place—the computation above is valid only if you seed all bits of the state
+ with independent, uniformly distributed random bits).
+
+
In case the generator provides a jump function that lets you skip through chunks of the output in constant
+ time, even a period of \(2^{128}\) can be sufficient, as it provides \(2^{64}\) non-overlapping sequences of length \(2^{64}\).
+
+
Equidistribution
+
+
Every 64-bit generator of ours with n bits of state scrambled
+ with * or ** is n/64-dimensionally
+ equidistributed: every n/64-tuple of consecutive 64-bit
+ values appears exactly once in the output, except for the zero tuple
+ (and this is the largest possible dimension). Generators based on the
+ + or ++ scramblers are however only (n/64 −
+ 1)-dimensionally equidistributed: every (n/64 −
+ 1)-tuple of consecutive 64-bit values appears exactly 2^64
+ times in the output, except for a missing zero tuple. The same considerations
+ apply to 32-bit generators.
+
+
Generating uniform doubles in the unit interval
+
+
A standard double (64-bit) floating-point number in
+ IEEE floating point format has 52 bits of
+ significand, plus an implicit bit at the left of the significand. Thus,
+ the representation can actually store numbers with 53 significant binary digits.
+
+
Because of this fact, in C99 a 64-bit unsigned integer x should be converted to a 64-bit double
+ using the expression
+
+ #include <stdint.h>
+
+ (x >> 11) * 0x1.0p-53
+
+
In Java you can use almost the same expression for a (signed) 64-bit integer:
+
+ (x >>> 11) * 0x1.0p-53
+
+
+
+
This conversion guarantees that all dyadic rationals of the form k / 2^53
+ will be equally likely. Note that this conversion prefers the high bits of x (usually, a good idea), but you can alternatively
+ use the lowest bits.
+
+
An alternative, multiplication-free conversion is
+
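A sketch of such a multiplication-free conversion (assuming the usual trick of installing the exponent bits 0x3FF so the value lies in [1, 2); the function name is mine):

```c
#include <stdint.h>
#include <string.h>

/* Build a double in [1, 2) by pairing the exponent bits 0x3FF with the
   top 52 bits of x as the significand, then shift down to [0, 1). */
static inline double u64_to_unit_double_bits(uint64_t x)
{
    uint64_t bits = (UINT64_C(0x3FF) << 52) | (x >> 12);
    double d;
    memcpy(&d, &bits, sizeof d);  /* well-defined type pun in C */
    return d - 1.0;
}
```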
The code above cooks up by bit manipulation
+ a real number in the interval [1..2), and then subtracts
+ one to obtain a real number in the interval [0..1). If x is chosen uniformly among 64-bit integers,
d is chosen uniformly among dyadic rationals of the form k / 2^52. This
+ is the same technique used by generators providing directly doubles, such as the
+ dSFMT.
+
+
This technique is supposed to be fast, but on recent hardware it does not seem to give a significant advantage.
+ More importantly, you will be generating half the values you could actually generate.
+ The same problem plagues the dSFMT. All doubles generated will have the lowest significand bit set to zero (I must
+ thank Raimo Niskanen from the Erlang team for making me notice this—a previous version of this site
+ did not mention this issue).
+
+
In Java you can obtain an analogous result using suitable static methods such as Double.longBitsToDouble.
+
To adhere to the principle of least surprise, my implementations now use the multiplicative version, everywhere.
+
+
Interestingly, these are not the only notions of “uniformity” you can come up with. Another possibility
is that of generating 1074-bit integers, normalizing, and returning the nearest value representable as a
+ 64-bit double (this is the theory—in practice, you will almost never
+ use more than two integers per double as the remaining bits would not be representable). This approach guarantees that all
+ representable doubles could be in principle generated, albeit not every
+ returned double will appear with the same probability. A reference
+ implementation can be found here. Note that unless your generator has
+ at least 1074 bits of state and suitable equidistribution properties, the code above will not do what you expect
+ (e.g., it might never return zero).
+
+
+
On 14 May 2018, Sebastiano Vigna added a page to his website (archived here) entitled “The wrap-up on PCG generators” that attempts to persuade readers to avoid various PCG generators.
+
That day, he also submitted a link to his critique to Reddit (archived here). I think it is fair to say that his remarks did not get quite the reception he might have hoped for. Readers mostly seemed to infer a certain animosity in his tone and his criticisms gained little traction with that audience.
+
Although I'm pleased to see readers of Reddit thinking critically about these things, it is worth taking the time to dive in and see what lessons we can learn from all of this.
+
+
+
Background
+
We have to feel a little sympathy for Vigna. On May 4, he updated his website to announce a new generation scheme, Xoshiro, and an accompanying paper, the product of two years of work. He posted a link to his work on Reddit (archived here and here), and although he got some praise and thanks for his work, he ended up spending quite a lot of time talking not about his new work, but about flaws in his old work and about my work.
+
Here is an example of the kind of remarks he had to contend with; Reddit user “TomatoCo” wrote:
+
+
I liked xoroshiro a lot until I read all of the dire condemnations of it, so I switched to PCG. I'm not a mathematician, I can't understand your papers and PCG's write ups are a lot easier to understand. I'm sure that you've analyzed the shit out of your previous generator and I can see on your site you've come up with new techniques to measure if xoshiro suffers the same flaws. But once bitten, twice shy. Xoroshiro was defended as great with the sole exception of the lowest bit. But then it was "the lowest bit is just a LSFR, so don't use that. Well, actually, the other low bits are also just really long period LSFRs, well, actually," and new flaws were constantly appearing.
+Respectfully, I think you need to explain more and in simpler terms to earn everyone's trust back.
+
The reason I picked PCG was because its author could, in plain language, describe its behavior and why some authors witnessed patterns in your RNG.
+
+
I think it's quite understandable that Vigna would want to look for ways to take PCG (and me) down a peg or two, and in various comment replies he endeavored to express things he didn't like about PCG (and the PCG website).
+
Most of the issues he raised were, I thought, adequately addressed and refuted in the Reddit discussion, but having already gone to the effort of articulating the things he did not like, even writing code to do so, it makes sense that he would want to circulate these thoughts more broadly.
+
Reddit Reaction #2
+
Reddit's reaction to Vigna's new PCG-critique page was perhaps not what he hoped for. From what I can tell, pretty much none of the commenters were persuaded by his claims, and much was made of his tone.
+
Regarding tone, user “notfancy” said:
+
+
Take your feud somewhere else. […] theory and practice definitely belong here. The petty squabbling and the name calling definitely don't. Seeing that Vigna himself is posting links to his own site, this is to me self-promoting spam.
+
+
and user “foofel” added:
+
+
the style in which he presents his stuff is always full of hate and despise, that's not a good way to represent it and probably why people are fed up.
+
+
and user “evand” added:
+
+
I would describe a lot of it as written very... condescendingly. There's also a lot that is written to attack her and not PCG
+
+
and user “AntiauthoritarianNow” chimed in, saying;
+
+
Yeah, it's one thing to tease other researchers a little bit, but this guy has a real problem sticking to arguments on the merits rather than derailing into reddit-esque ad-hom.
+
+
But the thread also had plenty of rebuttals. For just about every claim Vigna had made in his critique, there was a comment explaining why the claim was flawed.
+
My Reaction
+
I could settle back into my chair here, and say, “Thank you, Reddit, for keeping your wits about you!”, but since (at the time of writing) Vigna's page remains live with the same claims, it seems sensible for me to create my own writeup (this one) to address his claims directly.
+
Moreover, I believe firmly that although it's never much fun to be on the receiving end of invective or personal attacks, in academia peer critique makes everything stronger. While much of what Vigna says about PCG doesn't hold up to closer scrutiny, it is worth trying to find value of some kind in every criticism. I believe in the approach taken in the world of improvisational comedy, known as “Yes, and…”, which suggests that a participant should accept what another participant has stated (“yes”) and then expand on that line of thinking (“and”).
+
Thus, in the subsequent sections, I'll look at each of Vigna's critiques, first give a defensive response, and then endeavor to find a way to say “Yes, and…” to each one.
+
Correlations Due to Contrived Seeding
+
Vigna's first two claims relate to creating two PCG generators whose outputs are correlated because he has specifically set them up to have internal states that would cause them to be correlated.
+
PCG ext Variants: Single Bit Change to the Extension Array
+
In the first claim, he modifies the code for PCG's extended generation scheme so that he can flip a single bit in the extension array that adds k-dimensional equidistribution to a base generator.
+
Vigna creates two pcg64_k32 generators that are the same in all respects except for a single bit difference in one element of the 32-element extension array, and then observes that 31 of every 32 outputs will remain identical between the generators for some considerable time. Vigna clearly considers this behavior to be problematic and notes multiple LFSR-based PRNGs where such behavior would not occur.
+
Vigna states
+
+
Said otherwise, the whole sequence of the generator is made by an enormous number of strongly correlated, very short sequences. And this makes the correlation tests fail.
+
+
Vigna concludes that no one should use generators like pcg64_k32 as a result.
+
Defensive Response
+
Vigna actually created a custom version of the PCG code to effect his single-bit change. The pcg64_k32 generator has 2303 bits of state: 127 bits of LCG increment (which stays constant), 128 bits of LCG current state, and 32 64-bit words in the extension array. The odds of seeding two pcg64_k32 generators, each with 2303 bits of seed entropy, and finding that they differ by only a single bit in the extension array are 1 in 2^2292 (there are 2048 = 2^11 possible single-bit differences in the extension array, out of 2^2303 equally likely states), a number so vast that it cannot even be represented as a floating-point double.
+
If the PRNG were properly initialized (e.g., using std::seed_seq or pcg_extras::seed_seq_from<std::random_device>), Vigna's observed correlation would not have occurred. Likewise, had the single-bit change been made on the LCG side of the PRNG, the correlation would not have occurred.
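As a concrete illustration (my own sketch using only the standard library, not PCG's actual seeding code), std::seed_seq spreads a few words of entropy across an arbitrarily large state array, so two independently seeded generators essentially never end up a single bit apart:

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <random>
#include <vector>

// Fill `nwords` 32-bit words of generator state from a seed sequence.
// In real use, the entropy values would come from std::random_device.
std::vector<std::uint32_t> make_state(std::initializer_list<std::uint32_t> entropy,
                                      std::size_t nwords) {
    std::seed_seq seq(entropy);
    std::vector<std::uint32_t> state(nwords);
    seq.generate(state.begin(), state.end());  // scrambles entropy into every word
    return state;
}
```

Seeding all 2303 bits of pcg64_k32 this way (e.g., via pcg_extras::seed_seq_from) makes Vigna's one-bit-apart scenario unreachable in practice.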
+
But what of Vigna's other claim, that PRNGs that are slow to diffuse single-bit changes to their internal state are necessarily bad? Vigna is right that for LFSR-based designs, the rate of bit diffusion (a.k.a. “avalanche”) matters a lot.
+
However, numerous perfectly good designs for PRNGs would fail Vigna's criteria. All counter-based designs (e.g., SplitMix, Random123, ChaCha) will preserve a single-bit difference indefinitely if we examine their internal state. In fact, Vigna's collaborator, David Blackman, is the author of gjrand, which also includes a counter whose internal state won't diverge significantly over time. But of these designs, only SplitMix would fail a test that looks for output correlations rather than similar internal states.
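To see why a persistent internal-state difference need not be visible in the output, here is a minimal SplitMix64-style sketch (constants taken from the public-domain splitmix64 reference implementation): two states that start an arithmetic constant apart stay exactly that far apart forever, yet the mixing function makes the outputs unrelated.

```cpp
#include <cstdint>

// SplitMix64-style step: a Weyl counter followed by a mixing function.
// Constants are from the public-domain splitmix64 reference code.
std::uint64_t splitmix64(std::uint64_t& state) {
    std::uint64_t z = (state += 0x9e3779b97f4a7c15ULL);  // the Weyl counter
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;         // mixing (a bijection)
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}
```

Because the counter only ever adds a constant, the arithmetic difference between two states never changes; the mixer alone is responsible for making the outputs look unrelated.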
+
The closest design to PCG's extension array is found in George Marsaglia's venerable XorWow PRNG, shown below (code taken from the Wikipedia page):
+
/* The state array must be initialized to not be all zero in the first four
   words */
uint32_t xorwow(uint32_t state[static 5])
{
    /* Algorithm "xorwow" from p. 5 of Marsaglia, "Xorshift RNGs" */
    uint32_t s, t = state[3];
    t ^= t >> 2;
    t ^= t << 1;
    state[3] = state[2]; state[2] = state[1]; state[1] = s = state[0];
    t ^= s;
    t ^= s << 4;
    state[0] = t;
    return t + (state[4] += 362437);
}
+
+
+
In Marsaglia's design, state[4] is a counter in much the same way that PCG's extension array is a “funky counter”. Marsaglia calls this counter a Weyl sequence after Hermann Weyl, who proved the equidistribution theorem in 1916.
+
We can exactly reproduce Vigna's claims about pcg64_k32 producing similar output with XorWow. The program uncxorwow.c is a port of his demonstration program to XorWow. It fails if tested with PractRand, and, if we uncomment the printf statements, we see that after 1 billion iterations the outputs have not become uncorrelated. They continue to differ only in their high bit, and they will continue that way forever.
Vigna's complaint about “strongly correlated very short sequences” could likewise be applied to XorWow. It consists of 2^32 very similar sequences (differing only by a constant). It might seem bad at a glance to concatenate a number of very similar sequences, but it is worth realizing that the nearest similar sequence is 2^128 − 1 steps away. If Vigna would characterize 2^128 − 1 as “very short”, he must be using a mathematician's sense of scale.
+
Marsaglia's design of XorWow quite deliberately uses a very simple and weak generator (a Weyl sequence) for a specific purpose. We could say “a counter isn't a very good random number generator”, but the key idea is that it doesn't need to be. It's not the whole story; it's a piece with a specific role to play, and it doesn't need to be any better than it is.
+
PCG's extended generation scheme is a similar story. The extension array is a funky counter akin to a Weyl sequence (each array element is like a digit of a counter). It's slightly better than a Weyl sequence (a single bit change will quickly affect all the bits in that array element), but it is essentially the same idea.
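The digit-of-a-counter idea can be sketched as follows. This is my own illustration, not PCG's actual extension-array code: each element acts as one digit of a huge counter, advanced by a full-period affine map (multiplier ≡ 1 mod 4, odd increment, per the Hull–Dobell conditions) rather than by +1, with a carry propagating when a digit wraps to zero. The constants are an arbitrary illustrative LCG pair.

```cpp
#include <cstddef>
#include <cstdint>

// Advance a "funky counter": each element is one digit of a huge counter,
// but each digit steps through all 2^64 values via an affine map rather
// than via +1. A digit "carries" into the next when it wraps to zero.
void funky_advance(std::uint64_t* ext, std::size_t n) {
    const std::uint64_t a = 0x5851f42d4c957f2dULL;  // multiplier, ≡ 1 (mod 4)
    const std::uint64_t c = 0x14057b7ef767814fULL;  // odd increment
    for (std::size_t i = 0; i < n; i++) {
        ext[i] = ext[i] * a + c;
        if (ext[i] != 0)    // no wrap, so no carry into the next digit
            break;
    }
}
```

Like an ordinary counter, the low digit changes every step while higher digits move only on a carry, which is exactly why a single-bit difference planted in a high digit persists for so long.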
+
The pcg64_k32_oneseq and pcg64_k32_fast generators follow XorWow's scheme of just joining together the similar sequences. pcg64_k32 swaps around chunks of size 2^16 from each similar sequence. In all cases, from any starting point you would need 2^128 outputs before the base linear congruential generator lined up to the same place again, and vastly more for the extension array to line up similarly. In short, for pcg64_k32 the correlated states are quite literally unimaginably far away from each other.
+
Talking about his contrived seedings, Vigna notes that, “This is all the decorrelation we get after a billion iterations, and it will not improve (not significantly before the thermodynamical death of the universe).” What he seems to have missed is the corollary to his statement—correlation and decorrelation are sides of the same coin. Two currently uncorrelated pcg64_k32 states will not correlate before the heat death of the universe either.
+
In short, Vigna contrived a seed to show correlation that would never arise in practice with normal seeding, nor could arise by advancing one generator. His critique is not unique to PCG, and should not be a concern for users of PCG.
+
“Yes, and…” Response
+
A rather flippant “Yes, and…” response is that I'm perfectly happy for people to avoid pcg64_k32, as I'm not at all sure it is buying you anything meaningful over and above pcg64: it's a fair amount of added code complexity for something of dubious value. In fact, I didn't even bother to implement it in the C version, and only a small number of people who have ported PCG have implemented it. As I see it, k-dimensional equidistribution sounds like a cool property, but the only use case I've found for such a property is performing party tricks. But some people do like k-dimensional equidistribution, so let's press on…
+
First, Vigna went to far too much trouble to create correlated states. He copied the entire C++ source for PCG and hacked it to make a private data member public so he could set a single bit. Had he been more familiar with the features the extended generators provide, he could instead have written a few lines using pcg64_k32's party-trick functionality to leap unimaginably huge distances across the state space and find exactly the correlated generator you want, one that is the same in every respect except for one differing output.
+
In other words, what he sees as a deficiency, I've already highlighted as a feature.
+
But whether it is achieved by the simple method above, or the more convoluted method Vigna used, we have the question of what to do if people are allowed to create very correlated generator states that would not normally arise in practice. One option is to just say “don't do that”, but a more “Yes, and…” perspective would be to allow people to create such states if they choose but provide a means to detect them. More on that in the next section.
+
It's also worth asking whether the slowness with which a single bit change diffuses across the extension array is something inherent in the design of PCG's extended generation scheme, or mere happenstance. In fact, it is the latter.
+
The real cleverness in the extended generation scheme isn't the idea of combining two generators, a strong one and a weaker-but-k-dimensionally-equidistributed one; it's the fact that we can do so without any extra state to keep track of what we're doing.
+
I'm thus not wedded to the particular Weyl-sequence inspired method I used. If it's important that unimaginably distant similar generators do not stay correlated for long, that's a very easy feature to provide.
+
When I designed how the extension array advances, I made a choice to make it “no better than it needs to be”. It doesn't need good avalanche properties, so that wasn't a design concern. But that doesn't mean it couldn't be tweaked to have good avalanche properties, so that a single bit change affects all the bits the next time the extension array advances. In fact, having designed seed_seq_fe for randutils, I'm aware of elegant and amply efficient ways to have better avalanche, so why not?
+
It may not really be necessary, but I actually like this idea. So thanks, Sebastiano, I'll address this issue in a future update to PCG that provides some alternative schemes for updating the extension array!
+
PCG Regular Variants: Contrived seeds for Inter-Stream Correlations
+
In his next concern, Vigna makes correlated generators from two “random looking” seeds. He presents a program, corrpcg.c, that mixes together the two correlated generators and can then be fed into statistical tests (which will fail because of the correlation).
+
Defensive Response
+
We can devise bad seed pairs for just about any PRNG. Here are three example programs, corrxoshiro.c, corrsplitmix.c, and corrxorwow.c, which initialize generators with two “random looking” seeds but create correlated streams that will fail statistical tests if mixed.
+
In all cases, despite being “random looking”, the seeds are carefully contrived. Seeds such as these would be vanishingly unlikely with proper seeding practice.
+
As before, the concerns Vigna expresses apply to many prior generators. We can view XorWow's state[4] value as a stream-selection constant, but this time let's focus on SplitMix. For SplitMix, different gamma_ values constitute different streams.
+
In corrsplitmix.c the implementation is hard-wired to use a single stream (0x9e3779b97f4a7c15), but in corrsplitmix2.c we mix two streams (0x9e3779b97f4a7c15 and 0xdf67d33dd518d407) and observe correlations. Although these gamma values look random, they are not; they are carefully contrived. In particular, 0xdf67d33dd518d407 * 3 = 0x9e3779b97f4a7c15 (in 64-bit arithmetic), which means that every third output from the second stream will exactly match an output from the first.
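We can check both facts directly. This sketch (using the public-domain splitmix64 mixing constants, with the gamma made an explicit parameter) confirms the 3× relationship between the two gammas and the resulting lockstep: starting from a shared state, output m of the first stream equals output 3m + 2 of the second.

```cpp
#include <cstdint>

// SplitMix-style generator with an explicit gamma (stream constant).
// The state advances by gamma each step; the rest is a fixed bijective mixer.
std::uint64_t splitmix_step(std::uint64_t& state, std::uint64_t gamma) {
    std::uint64_t z = (state += gamma);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}
```

Since stream 1's gamma is exactly three times stream 2's (mod 2^64), stream 1's states form a subsequence of stream 2's states, and the shared mixer then forces the outputs to coincide.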
+
Vigna's critique thus applies at least as strongly to SplitMix's streams as it does to PCG's.
+
I have written at length about PCG's streams (and discussed SplitMix's, too). I freely acknowledge that these streams exist in a space of trade-offs where we are choosing to do the cheap thing, leveraging the properties of the underlying LCG (or Weyl sequence for SplitMix). In that article, I say:
+
+
Changing the increment parameter is just barely enough for streams that are actually useful. They aren't statistically independent, far from it, but they are distinct and they do help.
+
+
No one should worry that PCG's streams make anything worse.
+
“Yes, and…” Response
+
Although it is vanishingly unlikely that two randomly seeded pcg64 generators would be correlated (it would only happen with poor/adversarial seeding), it is reasonable to ask if this kind of correlation due to bad seeding can be detected.
+
We can even argue that another checklist feature for a general-purpose PRNG is the ability to tell how independent the sequences from two seeds are likely to be. PCG goes some way towards this goal with its subtraction (-) operator, which calculates the distance between two generators, but the functionality was originally designed for generators on the same stream. I've now updated that functionality so that, for generators on different streams, it will calculate the distance to their point of closest approach (i.e., where the differences between successive values of the underlying LCG align).
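The point-of-closest-approach idea rests on a simple LCG fact, which we can verify with a toy 64-bit LCG (illustrative constants; pcg64's underlying LCG is actually 128-bit): the differences between successive values evolve as d' = a·d regardless of the increment, so two streams with different increments share the same delta dynamics and can be compared on that basis.

```cpp
#include <cstdint>

// One step of a 64-bit LCG with a per-stream increment.
// The multiplier is Knuth's well-known MMIX constant, used here
// purely for illustration.
std::uint64_t lcg_step(std::uint64_t x, std::uint64_t inc) {
    const std::uint64_t a = 6364136223846793005ULL;
    return x * a + inc;
}
```

Because x2 − x1 = a·(x1 − x0) no matter what the increment is, the successive-difference sequences of two streams are shifts of one another, and finding where they align is what the distance computation does.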
+
So it's now possible with PCG to compare two generators to see whether they have been badly seeded so that they correlate.
Running this comparison on Vigna's contrived seed pairs shows, in each case, the distance between the streams alongside a value from each PRNG; the distance stays the same because the PRNGs are advancing together.
As we can see in the last example, Vigna contrived seeds that had the streams exactly aligned. The values from each stream are distinct in this case, but a statistical test will see that they are correlated.
+
Interpreting the distance value is easy in this case, but not every user will be able to do so, and some distances (e.g., just a single high bit set) would also be bad, so better detection of contrived seeds probably demands a new function, independence_score(), based on this distance metric.
+
Beyond these functions, there is also the question of whether it is wise to allow users to seed generators where they can specify the entire internal state. Vigna's generators (and basically all LFSR-based generators) must avoid the all-zeros state and do not like states with low Hamming weight (so { seed, 0, 0, 0 } is also a poor choice). With these issues in mind, perhaps we should deny users the ability to seed the entire state. That might prevent some contrived seedings like the one Vigna used. I'm not fully sold on this idea, but it is a widely used approach taken by other generators (e.g., Blackman's gjrand) and worth considering.
+
Although Vigna's contrived seeding was a bit silly, his example has helped me improve the PCG distance metric, given us another checklist feature that some people might want (detecting bad seed pairs), gotten me thinking about future features, and returned me to the topic of good seeding. All in all, we can call this a positive contribution. Thanks, Sebastiano!
+
Prediction Difficulty
+
The next two sections relate to predicting PCG.
+
Predicting pcg32_oneseq
+
+
Vigna writes:
+
+
To me, it has been always evident that PCG generators are very easy to predict. Of course, no security expert ever tried to do that: it would be like beating a 5-year-old kid on a race. It would be embarrassing.
+
So we had this weird chicken-and-egg situation: nobody who could easily predict a PCG generator would write the code, because they are too easy to predict; but since nobody was predicting a PCG generator, Melissa O'Neill kept on the absurd claim that they were challenging to predict.
+
+
Vigna then goes on to show code to predict pcg32_oneseq, a 64-bit PRNG with 32-bit output.
+
Defensive Response
+
As one reddit observer wrote:
+
+
[Vigna's] program needs to totally brute force half of the state, and then some additional overhead to brute force bits of the rest of the state, so runtime is 2^(n/2), exponential, not polynomial.
+
+
Vigna has written an exponential algorithm to brute force 32 bits of state. I hope it was obvious to almost everyone that I never claimed that brute-forcing 32-bits of state was hard. In fact, I have already outlined how to predict pcg32 (more bits to figure out given the unknown stream). I observed that pcg32 is predictable using established techniques (specifically the LLL algorithm), and I have even linked to an implementation of those ideas by Marius Lombard-Platet.
+
I characterize pcg32_oneseq as easy to brute force, and pcg32 as annoying (as Marius Lombard-Platet discovered). Only when we get to pcg64 do we have something where there is a meaningful challenge.
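To make the scale concrete, here is a deliberately tiny sketch of the same brute-force idea (my own toy, not Vigna's code): a 32-bit LCG truncated to 16 output bits, where recovering the state means enumerating the 2^16 hidden low bits and filtering candidates against subsequent outputs.

```cpp
#include <cstdint>

// Toy truncated LCG: 32 bits of state, only the top 16 bits are output.
// Constants are arbitrary illustrative choices (odd multiplier, odd increment).
const std::uint32_t MULT = 747796405u;
const std::uint32_t INC  = 2891336453u;

std::uint32_t lcg32(std::uint32_t s) { return s * MULT + INC; }
std::uint16_t out16(std::uint32_t s) { return std::uint16_t(s >> 16); }

// Brute-force the hidden low 16 bits from three successive outputs.
// Returns a state consistent with all three observations, or 0 on failure.
std::uint32_t recover(std::uint16_t o0, std::uint16_t o1, std::uint16_t o2) {
    for (std::uint32_t low = 0; low < (1u << 16); low++) {
        std::uint32_t cand = (std::uint32_t(o0) << 16) | low;
        std::uint32_t n1 = lcg32(cand), n2 = lcg32(n1);
        if (out16(n1) == o1 && out16(n2) == o2)
            return cand;                  // candidate explains the outputs
    }
    return 0;  // no candidate matched
}
```

The loop is 2^16 work here and 2^32 for pcg32_oneseq's 64-bit state; doubling the state size to pcg64's 128 bits squares the cost, which is why the brute-force approach stops being interesting there.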
+
If Vigna really believes that all members of the PCG family are easy to predict, he should have predicted pcg64 or pcg64_c32.
+
“Yes, and…” Response
+
The best part of Vigna's critique is these lines:
+
+
Writing the function that performs the prediction, recover(), took maybe half an hour of effort. It's a couple of loops, a couple of if's and a few logical operations. Less than 10 lines of code (of course this can be improved, made faster, etc.).
+
+
and the source code comment that reads:
+
+
Pass an initial state (in decimal or hexadecimal), see it recovered from the output in a few seconds.
+
+
So, here Vigna is essentially endorsing all the practical aspects I've previously noted regarding trivial predictability. Specifically, he's noting that with little time or effort, he can write a simple program that quickly predicts a PRNG and has actually done so. This is very different from taking a purely theoretical perspective (e.g., noting that techniques exist to solve a problem in polynomial time without ever implementing them).
+
In other words, clearly ease of prediction matters to Vigna. So we both agree—pcg32_oneseq is easy to predict.
+
Now let's keep that characterization of easiness and move on to some of the other generators.
+
Vigna and I would agree, I think, that I lack the necessary insight to develop fast prediction methods for pcg64 or pcg64_c32 (it's an instance of Schneier's Law). Vigna is also right that, even if it is tractable to predict, those who might have the necessary skill lack much incentive to try. For some years I have been intending to hold a prediction contest with real prizes, and I remain hopeful that I'll find the time to run such a contest this summer. When the contest finally launches, I hope he'll have a go—I'd be delighted to send him a prize.
+
Predicting pcg64_once_insecure
+
+
Vigna also notes that he can invert the bijection that serves as the output function for pcg64_once_insecure, which reveals the underlying LCG with all its statistical flaws.
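Inverting this kind of output function is standard. As a hedged sketch, here is the splitmix64 finalizer used as a stand-in (pcg64_once_insecure's actual output function differs, but is invertible in the same spirit): each xor-by-right-shift is undone by a short telescoping xor, and each odd multiplier is undone by its inverse mod 2^64, computed by Newton iteration.

```cpp
#include <cstdint>

// Undo y = x ^ (x >> s): x = y ^ (y >> s) ^ (y >> 2s) ^ ... (telescoping).
std::uint64_t unxorshift(std::uint64_t y, unsigned s) {
    std::uint64_t x = 0;
    for (unsigned shift = 0; shift < 64; shift += s)
        x ^= y >> shift;
    return x;
}

// Inverse of an odd constant mod 2^64 via Newton's method.
std::uint64_t mulinv(std::uint64_t a) {
    std::uint64_t x = a;               // correct to 3 bits for any odd a
    for (int i = 0; i < 5; i++)
        x *= 2 - a * x;                // each step doubles the correct bits
    return x;
}

// The splitmix64 finalizer (a bijection on 64-bit values)...
std::uint64_t mix(std::uint64_t z) {
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

// ...and its exact inverse, undoing each step in reverse order.
std::uint64_t unmix(std::uint64_t y) {
    y = unxorshift(y, 31);
    y = unxorshift(y * mulinv(0x94d049bb133111ebULL), 27);
    y = unxorshift(y * mulinv(0xbf58476d1ce4e5b9ULL), 30);
    return y;
}
```

Once the output function is peeled off, what remains is the raw LCG state, which is why an invertible output function hands an attacker (or a statistical test) the underlying generator on a plate.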
+
Defensive Response
+
I noted this exact issue in 2014 in the PCG paper. It's why pcg64_once_insecure has the name it does. I discourage its use as a general-purpose PRNG precisely because of its invertible output function.
+
“Yes, and…” Response
+
Vigna is at least acknowledging that some people might care about this property.
+
Speed and Comparison against LCGs
+
Finally, Vigna develops a PCG variant using a traditional integer hash function based on MurmurHash (I would call it PCG XS M XS M XS). He claims it is faster than the PCG variants I recommend and notes that he doesn't consider PCG especially fast.
+
Defensive Response
+
I considered this exact idea in the 2014 PCG paper. In my tests, I found that a variant using a very similar general integer hash function was not as fast as the PCG permutations I used.
+
Testing is a finicky business.
+
“Yes, and…” Response
+
I absolutely agree with Vigna's claim that people should run their own speed tests.
+
I also realized long ago that PCG probably won't have the speed crown, because it can't. A simple truncated 128-bit LCG passes all standard statistical tests once we get up to 128 bits, and beats everything, including Vigna's generators. Because pcg64 is built from a 128-bit LCG, it can never beat it in speed.
+
I should write a blog post on speed testing, but here's a taste. We'll use Vigna's Hamming-weight test as our benchmark, because it is a real program that puts randomness to actual use but is coded with execution speed in mind.
+
First, let's test the Mersenne Twister. Compiling with Clang, we get
+
processed 1.75e+11 bytes in 130 seconds (1.346 GB/s, 4.847 TB/h). Fri May 25 14:03:25 2018
+
+
+
whereas compiling with GCC, we get
+
processed 1.75e+11 bytes in 73 seconds (2.397 GB/s, 8.631 TB/h). Fri May 25 14:05:44 2018
+
+
+
With GCC, it runs almost twice as fast.
+
Now let's contrast that result with this 128-bit MCG:
+
typedef __uint128_t uint128_t;   // GCC/Clang 128-bit integer extension

static uint128_t state = 1;      // can be seeded to any odd number

static inline uint64_t next()
{
    constexpr uint128_t MULTIPLIER =
        (uint128_t(0x0fc94e3bf4e9ab32ULL) << 64) | 0x866458cd56f5e605ULL;
    // Spectral test: M8 = 0.71005, M16 = 0.66094, M24 = 0.61455
    state *= MULTIPLIER;
    return state >> 64;
}
+
+
+
Compiling with Clang, we get
+
processed 1.75e+11 bytes in 39 seconds (4.488 GB/s, 16.16 TB/h). Fri May 25 14:16:25 2018
+
+
+
whereas with GCC we get
+
processed 1.75e+11 bytes in 58 seconds (3.017 GB/s, 10.86 TB/h). Fri May 25 14:18:14 2018
+
+
+
The GCC code is no slouch, but Clang's code here is much faster. Clang is apparently better at 128-bit math.
+
If we really care about speed though, this 128-bit MCG (which uses a carefully chosen 64-bit multiplier instead of a more typical 128-bit multiplier) is even faster and still passes statistical tests:
+
typedef __uint128_t uint128_t;   // GCC/Clang 128-bit integer extension

static uint128_t state = 1;      // can be seeded to any odd number

static inline uint64_t next()
{
    return (state *= 0xda942042e4dd58b5ULL) >> 64;
}
+
+
+
Compiling with Clang, we get
+
processed 1.75e+11 bytes in 37 seconds (4.73 GB/s, 17.03 TB/h). Fri May 25 14:09:26 2018
+
+
+
whereas with GCC we get
+
processed 1.75e+11 bytes in 44 seconds (3.978 GB/s, 14.32 TB/h). Fri May 25 14:11:40 2018
+
+
+
Again, Clang takes the speed crown; its executable generates and checks 1 TB of randomness about every 3.5 minutes.
+
If we test Vigna's latest generator, xoshiro256**, and compile with Clang, it gives us
+
processed 1.75e+11 bytes in 50 seconds (3.5 GB/s, 12.6 TB/h). Fri May 25 14:30:05 2018
+
+
+
whereas with GCC we get
+
processed 1.75e+11 bytes in 43 seconds (4.07 GB/s, 14.65 TB/h). Fri May 25 14:31:52 2018
+
+
+
This result is very fast, but not faster than either 128-bit MCG.
+
Finally, let's look at PCG-style generators. First let's look at Vigna's proposed variant. Compiling with Clang, we get
+
processed 1.75e+11 bytes in 59 seconds (2.966 GB/s, 10.68 TB/h). Fri May 25 14:44:37 2018
+
+
+
and with GCC we get
+
processed 1.75e+11 bytes in 62 seconds (2.823 GB/s, 10.16 TB/h). Fri May 25 14:46:42 2018
+
+
+
This is one of the rare occasions where GCC and Clang actually turn in almost equivalent times.
+
In contrast, with the general-purpose pcg64 generator, compiling with Clang I see:
+
processed 1.75e+11 bytes in 57 seconds (3.07 GB/s, 11.05 TB/h). Fri May 25 14:57:02 2018
+
+
+
whereas with GCC, I see
+
processed 1.75e+11 bytes in 64 seconds (2.735 GB/s, 9.844 TB/h). Fri May 25 14:59:07 2018
+
+
+
Thus, depending on which compiler we choose, Vigna's variant is either slightly faster or slightly slower than pcg64.
+
Finally, if we look at pcg64_fast, compiling with Clang gives us
+
processed 1.75e+11 bytes in 49 seconds (3.572 GB/s, 12.86 TB/h). Fri May 25 15:00:45 2018
+
+
+
and with GCC we get
+
processed 1.75e+11 bytes in 65 seconds (2.693 GB/s, 9.693 TB/h). Fri May 25 15:02:15 2018
+
+
+
Again the performance of GCC is a bit disappointing; this MCG-based generator is actually running slower than the LCG-based one.
+
From this small amount of testing, we can see that pcg64 is not as fast as xoshiro256**, but a lot depends on the compiler you're using—if you're using Clang (which is the default compiler on OS X), pcg64_fast will beat xoshiro256**.
+
There's plenty of room for speed improvement in PCG. My original goal was to be faster than the Mersenne Twister, which it is, but knowing that it'll always be beaten by the speed of its underlying LCG, I haven't put a lot of effort into optimizing the code. I could have used the faster multiplier that I used above, and there is actually a completely different way of handling the LCG increment that reduces dependencies and enhances speed, but implementing LCGs that way makes the code more opaque. If PCG's speed is ever an issue, these design decisions are worth revisiting.
+
But the speed winner is clearly a 128-bit MCG. It's actually what I use when speed is the primary criterion.
+
Conclusion
+
None of Vigna's concerns raise any serious worries about PCG. But critique is useful, and helps spur us to do better.
+
I'm sure Vigna has spent far longer thinking about PCG than he would like, so it is best to say a big thank you to him for all the thought and energy he has expended in these efforts. I'm pleased that I've mostly been able to put the critique to good use—it may be mostly specious for users, but it is certainly helpful for me. Reddit mostly saw vitriol and condescension, but I prefer to see it as a gift of his time and thought.