...one of the most highly
regarded and expertly designed C++ library projects in the
world.
— Herb Sutter and Andrei
Alexandrescu, C++
Coding Standards
All the code in this library is inside namespace boost::math.
In order to use a distribution my_distribution you will need to include either the header <boost/math/distributions/my_distribution.hpp> or the "include everything" header: <boost/math/distributions.hpp>.
For example, to use the Students-t distribution include either <boost/math/distributions/students_t.hpp> or <boost/math/distributions.hpp>
Each kind of distribution in this library is a class type.
Policies provide fine-grained control of the behaviour of these classes, allowing the user to customise behaviour such as how errors are handled, or how the quantiles of discrete distribtions behave.
Tip | |
---|---|
If you are familiar with statistics libraries using functions, and 'Distributions as Objects' seem alien, see the comparison to other statistics libraries. |
Making distributions class types does two things:
Although the distribution classes in this library are templates, there
are typedefs on type double that mostly take the usual
name of the distribution (except where there is a clash with a function
of the same name: beta and gamma, in which case using the default template
arguments - RealType =
double
- is nearly as convenient).
Probably 95% of uses are covered by these typedefs:
using namespace boost::math; // Construct a students_t distribution with 4 degrees of freedom: students_t d1(4); // Construct a double-precision beta distribution // with parameters a = 10, b = 20 beta_distribution<> d2(10, 20); // Note: _distribution<> suffix !
If you need to use the distributions with a type other than double
, then you can instantiate the template
directly: the names of the templates are the same as the double
typedef but with _distribution
appended, for example: Students
t Distribution or Binomial
Distribution:
// Construct a students_t distribution, of float type, // with 4 degrees of freedom: students_t_distribution<float> d3(4); // Construct a binomial distribution, of long double type, // with probability of success 0.3 // and 20 trials in total: binomial_distribution<long double> d4(20, 0.3);
The parameters passed to the distributions can be accessed via getter member functions:
d1.degrees_of_freedom(); // returns 4.0
This is all well and good, but not very useful so far. What we often want is to be able to calculate the cumulative distribution functions and quantiles etc for these distributions.
Want to calculate the PDF (Probability Density Function) of a distribution? No problem, just use:
pdf(my_dist, x); // Returns PDF (density) at point x of distribution my_dist.
Or how about the CDF (Cumulative Distribution Function):
cdf(my_dist, x); // Returns CDF (integral from -infinity to point x) // of distribution my_dist.
And quantiles are just the same:
quantile(my_dist, p); // Returns the value of the random variable x // such that cdf(my_dist, x) == p.
If you're wondering why these aren't member functions, it's to make the library more easily extensible: if you want to add additional generic operations - let's say the n'th moment - then all you have to do is add the appropriate non-member functions, overloaded for each implemented distribution type.
Tip | |
---|---|
Random numbers that approximate Quantiles of Distributions If you want random numbers that are distributed in a specific way, for example in a uniform, normal or triangular, see Boost.Random. Whilst in principal there's nothing to prevent you from using the quantile function to convert a uniformly distributed random number to another distribution, in practice there are much more efficient algorithms available that are specific to random number generation. |
For example, the binomial distribution has two parameters: n (the number of trials) and p (the probability of success on one trial).
The binomial_distribution
constructor therefore has two parameters:
binomial_distribution(RealType n, RealType
p);
For this distribution the random variate is k: the number of successes observed. The probability density/mass function (pdf) is therefore written as f(k; n, p).
Note | |
---|---|
Random Variates and Distribution Parameters Random variates and distribution parameters are conventionally distinguished (for example in Wikipedia and Wolfram MathWorld by placing a semi-colon (or sometimes vertical bar) after the random variate (whose value you 'choose'), to separate the variate from the parameter(s) that defines the shape of the distribution. |
As noted above the non-member function pdf
has one parameter for the distribution object, and a second for the random
variate. So taking our binomial distribution example, we would write:
pdf(binomial_distribution<RealType>(n, p), k);
The distribution (effectively the random variate) is said to be 'supported'
over a range that is "the
smallest closed set whose complement has probability zero".
MathWorld uses the word 'defined' for this range. Non-mathematicians might
say it means the 'interesting' smallest range of random variate x that
has the cdf going from zero to unity. Outside are uninteresting zones where
the pdf is zero, and the cdf zero or unity. Mathematically, the functions
may make sense with an (+ or -) infinite value, but except for a few special
cases (in the Normal and Cauchy distributions) this implementation limits
random variates to finite values from the max
to min
for the RealType
. (See Handling
of Floating-Point Infinity for rationale).
The range of random variate values that is permitted and supported can
be tested by using two functions range
and support
.
Note | |
---|---|
Discrete Probability Distributions
Note that the discrete
distributions, including the binomial, negative binomial, Poisson
& Bernoulli, are all mathematically defined as discrete functions:
that is to say the functions However, because the method of calculation often uses continuous functions it is convenient to treat them as if they were continuous functions, and permit non-integral values of their parameters.
Users wanting to enforce a strict mathematical model may use The quantile functions for these distributions are hard to specify in a manner that will satisfy everyone all of the time. The default behaviour is to return an integer result, that has been rounded outwards: that is to say, lower quantiles - where the probablity is less than 0.5 are rounded down, while upper quantiles - where the probability is greater than 0.5 - are rounded up. This behaviour ensures that if an X% quantile is requested, then at least the requested coverage will be present in the central region, and no more than the requested coverage will be present in the tails. This behaviour can be changed so that the quantile functions are rounded differently, or return a real-valued result using Policies. It is strongly recommended that you read the tutorial Understanding Quantiles of Discrete Distributions before using the quantile function on a discrete distribtion. The reference docs describe how to change the rounding policy for these distributions. For similar reasons continuous distributions with parameters like "degrees of freedom" that might appear to be integral, are treated as real values (and are promoted from integer to floating-point if necessary). In this case however, there are a small number of situations where non-integral degrees of freedom do have a genuine meaning. |
Often you don't want the value of the CDF, but its complement, which is
to say 1-p
rather than p
.
You could calculate the CDF and subtract it from 1
,
but if p
is very close
to 1
then cancellation error
will cause you to lose significant digits. In extreme cases, p
may actually be equal to 1
, even though the true value of the complement
is non-zero.
In this library, whenever you want to receive a complement, just wrap all
the function arguments in a call to complement(...)
, for example:
students_t dist(5); cout << "CDF at t = 1 is " << cdf(dist, 1.0) << endl; cout << "Complement of CDF at t = 1 is " << cdf(complement(dist, 1.0)) << endl;
But wait, now that we have a complement, we have to be able to use it as
well. Any function that accepts a probability as an argument can also accept
a complement by wrapping all of its arguments in a call to complement(...)
,
for example:
students_t dist(5); for(double i = 10; i < 1e10; i *= 10) { // Calculate the quantile for a 1 in i chance: double t = quantile(complement(dist, 1/i)); // Print it out: cout << "Quantile of students-t with 5 degrees of freedom\n" "for a 1 in " << i << " chance is " << t << endl; }
Tip | |
---|---|
Critical values are just quantiles Some texts talk about quantiles, others about critical values, the basic rule is: Lower critical values are the same as the quantile. Upper critical values are the same as the quantile from the complement of the probability. For example, suppose we have a Bernoulli process, giving rise to a binomial distribution with success ratio 0.1 and 100 trials in total. The lower critical value for a probability of 0.05 is given by:
and the upper critical value is given by:
which return 4.82 and 14.63 respectively. |
Tip | |
---|---|
Why bother with complements anyway?
It's very tempting to dispense with complements, and simply subtract
the probability from 1 when required. However, consider what happens
when the probability is very close to 1: let's say the probability expressed
at float precision is Or to look at this another way: consider that we want the risk of falsely rejecting the null-hypothesis in the Student's t test to be 1 in 1 billion, for a sample size of 10,000. This gives a probability of 1 - 10^{-9}, which is exactly 1 when calculated at float precision. In this case calculating the quantile from the complement neatly solves the problem, so for example:
returns the expected t-statistic
raises an overflow error, since it is the same as:
Which has no finite result. |
Sometimes it's the parameters that define the distribution that you need to find. Suppose, for example, you have conducted a Students-t test for equal means and the result is borderline. Maybe your two samples differ from each other, or maybe they don't; based on the result of the test you can't be sure. A legitimate question to ask then is "How many more measurements would I have to take before I would get an X% probability that the difference is real?" Parameter finders can answer questions like this, and are necessarily different for each distribution. They are implemented as static member functions of the distributions, for example:
students_t::find_degrees_of_freedom( 1.3, // difference from true mean to detect 0.05, // maximum risk of falsely rejecting the null-hypothesis. 0.1, // maximum risk of falsely failing to reject the null-hypothesis. 0.13); // sample standard deviation
Returns the number of degrees of freedom required to obtain a 95% probability that the observed differences in means is not down to chance alone. In the case that a borderline Students-t test result was previously obtained, this can be used to estimate how large the sample size would have to become before the observed difference was considered significant. It assumes, of course, that the sample mean and standard deviation are invariant with sample size.
complement(...)
.
complement(...)
.
Now that you have the basics, the next section looks at some worked examples.