7. Common Probability Distributions#

There are several common probability distributions that can be used to describe random events. These distributions have an analytical formulation which generally depends on one or more parameters.

When we have enough evidence that a given random variable is well described by one of these distributions, we can simply “fit” the distribution to the data (i.e., choose the correct parameters for the distribution) and use the analytical formulation to deal with the random variable.

It is hence useful to know the most common probability distributions so that we can recognize the cases in which they can be used.

7.1. Discrete Uniform Distribution#

The discrete uniform distribution is controlled by a parameter \(k \in \mathbb{N}\) and assumes that all outcomes have the same probability of occurring:

\[P(X=a_i) = \frac{1}{k}\]

Where \(\Omega = \{a_1,\ldots,a_k\}\).

When the outcomes are the integers \(a_i = i\) (so that \(\Omega = \{1,\ldots,k\}\)), it can be shown that:

\[E[X] = \frac{k+1}{2}\]
\[Var[X] = \frac{1}{12}(k^2-1)\]

7.1.1. Example#

The outcomes of rolling a fair die follow a uniform distribution with \(k=6\), as shown in the diagram below:

../_images/689d5dc078fb3b115fe23860123251ca54e9328aa880529a13032e78754d263a.png
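As a quick check (a minimal sketch, assuming NumPy is available), we can simulate a large number of rolls of a fair die and compare the empirical mean and variance with the formulas above:

```python
import numpy as np

rng = np.random.default_rng(42)
k = 6
rolls = rng.integers(1, k + 1, size=100_000)  # fair die: outcomes 1..6, equally likely

print(rolls.mean())   # ≈ (k + 1) / 2 = 3.5
print(rolls.var())    # ≈ (k**2 - 1) / 12 ≈ 2.9167
```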

7.2. Bernoulli Distribution#

The Bernoulli distribution is a distribution over a single binary random variable, i.e., the variable \(X\) can take only two values: \(\left\{ 0,1 \right\}\).

The distribution is controlled by a single parameter \(\phi \in \lbrack 0,1\rbrack\), which gives the probability that the variable is equal to 1.

The analytical formulation of the Bernoulli distribution is very simple:

  • \(P(X = 1) = \phi\);

  • \(P(X = 0) = 1 - \phi\)

The expected value and variance of the associated random variable are:

  • \(E\lbrack x\rbrack = \phi\);

  • \(Var(x) = \phi(1 - \phi)\).

7.2.1. Example#

A skewed coin lands on “head” \(60\%\) of the time. If we define \(X = 1\) when the outcome is head and \(X = 0\) when the outcome is tail, then the variable follows a Bernoulli distribution with \(\phi = 0.6\).
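We can quickly verify the expected value and variance of this example by simulation (a minimal sketch using NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 0.6
x = rng.random(100_000) < phi   # X = 1 with probability phi, 0 otherwise

print(x.mean())   # ≈ E[X] = phi = 0.6
print(x.var())    # ≈ Var[X] = phi * (1 - phi) = 0.24
```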

7.3. Binomial Distribution#

../_images/b4ba486ffca81fed577fcc5d0d825a6576c5dcd4433a1eed434d02c99df6b8aa.png

The binomial distribution is a discrete probability distribution (PMF) over natural numbers with parameters \(n\) and \(p\).

It models the probability of obtaining \(k\) successes in a sequence of \(n\) independent experiments, each following a Bernoulli distribution with parameter \(p\) (i.e., \(\phi = p\)).

The probability mass function of the distribution is given by:

\[P(k) = \binom{n}{k}p^{k}(1 - p)^{n - k}\]

Where:

  • \(k\) is the number of successes

  • \(n\) is the number of independent trials

  • \(p\) is the probability of a success in a single trial

The expected value is \(E\lbrack k\rbrack = np\);

The variance is \(Var\lbrack k\rbrack = np(1 - p)\);

7.3.1. Example#

What is the probability of tossing a coin three times and obtaining three heads? We have:

  • \(k = 3\): number of successes (three times head)

  • \(n = 3\): number of trials

  • \(p = 0.5\): the probability of getting a head when tossing a coin

The required probability will be given by:

\[P(3) = \binom{3}{3}{0.5}^{3}(1 - 0.5)^{3 - 3} = {0.5}^{3} = 0.125\]
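The same value can be obtained with scipy.stats.binom (a sketch, assuming SciPy is available); the same call, with the appropriate parameters, can also be used for the exercise below:

```python
from scipy.stats import binom

# P(k successes in n trials with success probability p)
print(binom.pmf(3, n=3, p=0.5))   # 0.125, as computed above
```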

Exercise

What is the probability of tossing an unfair coin (\(P(\text{head}) = 0.6\)) 7 times and obtaining \(2\) tails?

7.4. Poisson Distribution#

The Poisson distribution expresses the probability of a given number of events occurring in a fixed interval of time or space if they occur independently with a known rate. It is useful to model situations in which the number of potential events is very large and the probability of any given event happening is relatively small. The Poisson distribution is controlled by a parameter \(\lambda>0\) (the average rate at which events occur) and defined as:

\[P(X=x) = \frac{\lambda^x}{x!}\exp(-\lambda)\]

with \(x=0,1,2,3,\ldots\) representing the number of events occurring in the given interval of time/space. The plot below shows the Poisson distribution for different choices of the \(\lambda\) parameter:

../_images/26b32f3ff10869b84d807ff3b8402495ab558285601402006112a858df2434d1.png
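A plot similar to the one above can be produced with SciPy and Matplotlib (a minimal sketch; the specific \(\lambda\) values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

x = np.arange(0, 20)
for lam in [1, 4, 10]:   # illustrative rate values
    plt.plot(x, poisson.pmf(x, mu=lam), marker='o', label=f'$\\lambda={lam}$')
plt.xlabel('number of events x')
plt.ylabel('P(X = x)')
plt.legend()
plt.show()
```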

7.4.1. Example#

During World War II, German forces bombed the south of London many times. People started to believe that some areas were more heavily targeted than others, and hence started moving from one part of the city to another in an attempt to avoid those areas.

After the end of the war, R. D. Clarke showed that the bombings were actually random and independent. To do so, he divided the area of interest into \(576\) squares of \(0.24\ km^2\) each. The total number of bombs was \(537\), hence less than one per square on average (a rare event). Assuming that the bombing was uniform and random, the average number of bombs per square would be:

\[\lambda = \frac{537}{576}\]

We can hence compute the probability that a given square was hit \(x\) times as \(P(X=x)\), where \(P\) is a Poisson PMF with \(\lambda=\frac{537}{576}\). The expected number of squares hit \(x\) times will then be: \(P(X=x)\times576\)

Comparing the expected numbers with the measured ones, we obtain the following table:

  Hits (x)   Expected     Real
  0          226.742723   229
  1          211.390351   211
  2           98.538731    93
  3           30.622279    35
  4            7.137224     8

Since the expected and observed counts are very close, we can conclude that the bombing was indeed random.
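The expected counts in the table can be reproduced with a few lines of code (a sketch using scipy.stats.poisson; the values 537 and 576 come from the text above):

```python
from scipy.stats import poisson

n_squares = 576
lam = 537 / n_squares   # average number of bombs per square

for x in range(5):
    expected = poisson.pmf(x, mu=lam) * n_squares
    print(x, round(expected, 6))   # ≈ 226.74, 211.39, 98.54, 30.62, 7.14
```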

7.5. Categorical Distribution#

The multinoulli or categorical distribution is a distribution of a single discrete variable with \(k\) different states, where \(k\) is finite.

  • The distribution is parametrized by a vector \(\mathbf{p} \in \lbrack 0,1\rbrack^k\), where \(p_{i}\) gives the probability of the \(i\)-th state.

  • \(\mathbf{p}\) must be such that \(\sum_{i = 1}^kp_{i} = 1\) to obtain a valid probability distribution.

  • The analytical form of the distribution is given by: \(p(x = i) = p_{i}\);

This distribution is the generalization of the Bernoulli distribution to the case of multiple states.

Example:

Rolling a fair die. In this case, \(k = 6\) and \(p_{i} = \frac{1}{k},\ i = 1,\ldots,k\).
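Sampling from a categorical distribution can be done, for instance, with NumPy (a minimal sketch for the fair-die case):

```python
import numpy as np

rng = np.random.default_rng(1)
states = np.arange(1, 7)    # the k = 6 faces of the die
p = np.full(6, 1 / 6)       # p_i = 1/k for a fair die

samples = rng.choice(states, size=10_000, p=p)
print(np.bincount(samples, minlength=7)[1:] / len(samples))  # each ≈ 1/6
```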

7.5.1. Multinomial Distribution#

The multinomial distribution generalizes the binomial distribution to the case in which the experiments are not binary but can have multiple outcomes (e.g., a die vs. a coin).

In particular, the multinomial distribution models the probability of obtaining exactly \((n_{1},\ldots,n_{k})\) occurrences (with \(n = \sum_{i}n_{i}\)) of each of the \(k\) possible outcomes in a sequence of \(n\) independent experiments which follow a Categorical distribution with probabilities \(p_{1},\ldots,p_{k}\).

The parameters of the distribution are:

  • \(n\): the number of trials

  • \(k\): the number of possible outcomes

  • \(p_{1},\ldots,p_{k}\): the probabilities of obtaining each of the outcomes in a single trial (with \(\sum_{i = 1}^{k}p_{i} = 1\))

The PMF of the distribution is:

\[P\left( n_{1},\ldots,n_{k} \right) = \frac{n!}{n_{1}!\ldots n_{k}!}p_{1}^{n_{1}} \cdot \ldots \cdot p_{k}^{n_{k}}\]

The mean is: \(E\left\lbrack n_{i} \right\rbrack = np_{i}\).

The variance is: \(Var\left\lbrack n_{i} \right\rbrack = np_{i}(1 - p_{i})\).

The covariance between two of the input variables is: \(Cov\left( n_{i},n_{j} \right) = - np_{i}p_{j}\ (i \neq j)\).

Example

Given a fair die with 6 possible outcomes, what is the probability of getting 1 three times, 2 twice, 3 four times, 4 five times, 5 zero times, and 6 once, when rolling the die 15 times?

We have:

  • \(n = 15\)

  • \(k = 6\)

  • \(p_{1} = p_{2} = \ldots = p_{6} = \frac{1}{6}\)

The required probability is given by:

\[P(3,2,4,5,0,1) = \frac{15!}{3!2!4!5!0!1!} \cdot \frac{1}{6^{3}} \cdot \frac{1}{6^{2}} \cdot \frac{1}{6^{4}} \cdot \frac{1}{6^{5}} \cdot \frac{1}{6^{0}} \cdot \frac{1}{6^{1}} = 8.04 \cdot 10^{- 5}\]
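The same value can be computed with scipy.stats.multinomial (a sketch using the counts and probabilities of the example):

```python
from scipy.stats import multinomial

counts = [3, 2, 4, 5, 0, 1]   # occurrences of faces 1..6
p = [1 / 6] * 6

print(multinomial.pmf(counts, n=15, p=p))   # ≈ 8.04e-05
```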

7.6. Gaussian Distribution#

The Bernoulli and Categorical distributions are PMFs, i.e., distributions over discrete random variables.

A common PDF when dealing with real values is the Gaussian distribution, also known as Normal Distribution.

The distribution is characterized by two parameters:

  • The mean \(\mu \in \mathbb{R}\)

  • The standard deviation \(\sigma \in (0, + \infty)\)

In practice, the distribution is often seen in terms of \(\mu\) and \(\sigma^{2}\) rather than \(\sigma\), where \(\sigma^{2}\) is called the variance.

The analytical formulation of the Normal distribution is as follows:

\[N\left( x;\mu,\sigma^{2} \right) = \sqrt{\frac{1}{2\pi\sigma^{2}}}e^{- \frac{1}{2\sigma^{2}}(x - \mu)^{2}}\]

The square root term is a normalization factor which ensures that the distribution integrates to 1.

The expectation and variance of a variable following the Normal distribution are as follows:

  • \(E\lbrack x\rbrack = \mu\)

  • \(Var\lbrack x\rbrack = \sigma^{2}\)

The Gaussian distribution is widely used when we do not have much prior knowledge of the real distribution we wish to model. This is mainly due to the central limit theorem, which states that the sum of many independent random variables with the same distribution is approximately normally distributed.
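As a quick numerical check (a sketch using scipy.stats.norm), we can verify that the density integrates to 1 and peaks at \(x = \mu\):

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0, 1
x = np.linspace(-5, 5, 1001)
pdf = norm.pdf(x, loc=mu, scale=sigma)

# numerical integral of the density over a wide interval: close to 1
print(np.sum(pdf) * (x[1] - x[0]))
# the density is maximal at x = mu
print(x[np.argmax(pdf)])
```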

7.6.1. Interpretation#

../_images/ff3fc200b1dac4e599d0357ac8790cdeb0b07329a61bda14eceb69124445e94e.png

If we plot the PDF of a Normal distribution, we can find that it is easy to interpret the meaning of its parameters:

  • The resulting curve has a maximum (highest probability) when \(x = \mu\)

  • The curve is symmetric, with the inflection points at \(x = \mu \pm \sigma\)

  • The example shows a normal distribution for \(\mu = 0\) and \(\sigma = 1\)

../_images/f881bd32ce8ea398c660d4f1b689083eceb9caa2865df155d8674291b720b160.png

Another notable property of the Normal distribution is that:

  • About \(68\%\) of the density is comprised in the interval \(\lbrack \mu - \sigma, \mu + \sigma\rbrack\);

  • About \(95\%\) of the density is comprised in the interval \(\lbrack \mu - 2\sigma, \mu + 2\sigma\rbrack\);

  • Almost \(100\%\) (about \(99.7\%\)) of the density is comprised in the interval \(\lbrack \mu - 3\sigma, \mu + 3\sigma\rbrack\).
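These percentages can be verified directly from the Gaussian CDF (a sketch using scipy.stats.norm):

```python
from scipy.stats import norm

# probability mass within mu ± n_sigma standard deviations (standard normal)
for n_sigma in (1, 2, 3):
    mass = norm.cdf(n_sigma) - norm.cdf(-n_sigma)
    print(n_sigma, round(mass, 4))   # ≈ 0.6827, 0.9545, 0.9973
```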

7.6.2. Multivariate Gaussian#

The formulation of the Gaussian distribution generalizes to the multivariate case, i.e., the case in which \(X\) is n-dimensional.

In that case, the distribution is parametrized by an n-dimensional vector \(\mathbf{\mu}\) and an \(n \times n\) positive definite symmetric matrix \(\mathbf{\Sigma}\). The formulation of the multivariate Gaussian is:

\[N\left( \mathbf{x};\mathbf{\mu},\mathbf{\Sigma} \right) = \sqrt{\frac{1}{(2\pi)^{n}\det(\Sigma)}}e^{- \frac{1}{2}\left( \mathbf{x} - \mathbf{\mu} \right)^{T}\Sigma^{- 1}\left( \mathbf{x} - \mathbf{\mu} \right)}\]

In the 2D case, \(\mathbf{\mu}\) is a 2D point representing the center of the Gaussian (the position of the mode), whereas the matrix \(\Sigma\) influences the “shape” of the Gaussian.

Examples of bivariate Gaussian distributions are shown below.

../_images/1f251b1c4d04abd9ad75d11b2b34c3263168f41d5ce8fa485598f7bf30f7bc9e.png ../_images/3616cdd0ac2346f211bdc4356b38ff015a138159409832c1a5774354c77fa7a1.png

The two plots above are common representations for bivariate continuous distributions:

  • The first plot shows a 3D representation of the PDF, in which the X and Y axes are the values of the variables, while the third axis reports the probability density.

  • Since 3D graphs are often hard to read, we often use a contour plot to represent the same information. In the contour plot, each curve connects points which have the same density in the 3D plot.

7.6.2.1. Effect of \(\Sigma\)#

Similar to how variance affects the dispersion of a 1D Gaussian, the covariance matrix \(\Sigma\) affects the dispersion in both axes. As a result, changing the values of the matrix will affect the shape of the distribution. Let’s consider the general covariance matrix:

\[\begin{split} \Sigma = \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{yx} & \sigma_y^2 \end{bmatrix} \end{split}\]

If the matrix is diagonal (\(\sigma_{xy}=\sigma_{yx}=0\)), the contours of the Gaussian are ellipses aligned with the axes; if, in addition, \(\sigma_x^2=\sigma_y^2\), the Gaussian is isotropic (circular contours). Placing values different from zero on the secondary diagonal tilts the ellipses. Some examples are shown below:

../_images/c405ad8432a9efe42acc83cf7d6fbe84f621d3ae27880a0196c8cbe0320eea8b.png
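Contour plots like the ones above can be reproduced with scipy.stats.multivariate_normal (a minimal sketch; the covariance values below are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# grid over which the density is evaluated
x, y = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
pos = np.dstack((x, y))

covariances = [
    [[1.0, 0.0], [0.0, 1.0]],   # isotropic: circular contours
    [[1.0, 0.0], [0.0, 0.3]],   # diagonal, unequal variances: axis-aligned ellipses
    [[1.0, 0.6], [0.6, 1.0]],   # non-zero covariance: tilted ellipses
]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, cov in zip(axes, covariances):
    density = multivariate_normal(mean=[0, 0], cov=cov).pdf(pos)
    ax.contour(x, y, density)
    ax.set_title(str(cov))
plt.show()
```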

7.6.3. Estimation of the Parameters of a Gaussian Distribution#

We have noted that in many cases we can assume a random variable follows a Gaussian distribution. However, it is not yet clear how to choose the parameters of the Gaussian distribution.

Given some data (remember, data is values assumed by random variables!), we can obtain the parameters of the Gaussian distribution related to the data with a maximum likelihood estimation.

This consists of computing the mean and variance parameters using the following formulas (in the univariate case):

  • \(\mu = \frac{1}{n}\sum_{j}^{}x_{j}\)

  • \(\sigma^{2} = \frac{1}{n}\sum_{j}^{}\left( x_{j} - \mu \right)^{2}\)

Where the \(x_{j}\) represent the different data points.

In the multi-variate case, the computation of the multi-dimensional \(\mathbf{\mu}\) vector is similar:

  • \(\mathbf{\mu} = \frac{1}{n}\sum_{j}^{}\mathbf{x}_{j}\)

  • \(\Sigma\) is instead computed as the covariance matrix related to \(X\): \(\Sigma = Cov(\mathbf{X})\), i.e., \(\Sigma_{ij} = Cov\left( \mathbf{X}_{i},\mathbf{X}_{j} \right)\)

The diagram below shows an example in which we fit a Gaussian to a set of data and compare it with a 2D KDE of the data.

../_images/407ed8c3dbc3b9d75b2781cb4de5256c8657b136c8c7f5196ef720d3bbc78fe1.png
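In practice, the estimation reduces to a mean and a covariance computation (a sketch using NumPy on synthetic data; the true parameters are chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
# synthetic 2D data, for illustration only
data = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=5_000)

mu_hat = data.mean(axis=0)              # estimate of the mean vector
sigma_hat = np.cov(data, rowvar=False)  # estimate of the covariance matrix
# note: np.cov uses 1/(n-1); for large n this is essentially the 1/n ML estimate

print(mu_hat)      # ≈ [1, -2]
print(sigma_hat)   # ≈ [[2, 0.5], [0.5, 1]]
```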

7.6.4. Central Limit Theorem#

The Central Limit Theorem is a statistical principle that states that the distribution of the sum (or average) of a large number of independent, identically distributed (i.i.d.) random variables \(\{X_i\}_{i=1}^n\) approaches a normal distribution as \(n \to \infty\), regardless of the shape of the original population’s distribution.

While we will not see this theorem formally, it is a fundamental result which in some sense “justifies” the pervasive use of the Gaussian distribution in data analysis.

In the plot below, we plot the distributions of the average outcome of rolling a given number of dice. We can see each die as described by an independent random variable with the same distribution; hence, the distribution of the average gets closer to a Gaussian as we increase the number of dice.

../_images/d3b0dcfd55343902762bdbfdf6de40e0d4d72373e1bf9592ba8247dffffb8d78.png
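The dice experiment can be reproduced with a few lines of NumPy and Matplotlib (a minimal sketch; the numbers of dice are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, n_dice in zip(axes, (1, 2, 10)):          # illustrative numbers of dice
    rolls = rng.integers(1, 7, size=(100_000, n_dice))
    averages = rolls.mean(axis=1)                 # average outcome of n_dice dice
    ax.hist(averages, bins=30, density=True)
    ax.set_title(f'{n_dice} dice')
plt.show()
```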

7.7. Laplacian Distribution#

The Gaussian distribution assumes that the probability of an observation deviating from the mean decreases exponentially as the square of the deviation. For some types of data, this assumption is not accurate: in some cases, deviating from the mean is much more likely than prescribed by the Gaussian model. An alternative mathematical model, introduced by Laplace, posits that the probability of an observation deviating from the mean decreases exponentially with the absolute value of the deviation:

\[L(x; M, b) = \frac{1}{2b} e^{-\frac{|x - M|}{b}}\]

Where \(M\) represents the central/mean value of the distribution, and \(b\) is a scale parameter known as the diversity. It should be noted that, due to the absolute value involved, this function is not differentiable at the central value.

The best fit of this function to the data occurs when M is chosen as the median of the data, and b is chosen as the mean of the absolute differences between the data points and the median:

\[M=median(\{x_i\}_{i=1}^n)\]
\[b=\frac{\sum_{i=1}^n|x_i-M|}{n}\]
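Both estimates are straightforward to compute (a sketch using NumPy on synthetic data generated with an illustrative \(M\) and \(b\)):

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.laplace(loc=2.0, scale=1.5, size=10_000)   # synthetic data, for illustration

M = np.median(data)              # estimate of the central value
b = np.mean(np.abs(data - M))    # mean absolute deviation from the median

print(M, b)                      # ≈ 2.0 and ≈ 1.5
```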

In this model, values far from the central value occur more frequently than they would in a Gaussian model. This phenomenon is referred to as fat tails in contrast to the Gaussian model, which is described as having thin tails.

Expectation and variance of \(X \sim L(M, b)\) are:

\[E[X] = M\]
\[Var(X) = 2b^2\]

The following plot shows some examples of Laplacian distributions:

../_images/364732194241ad5e066822a4c7dc4a56aebae6f81d906fac0afbead5a714d0c2.png

7.8. References#

  • Parts of chapter 1 of [1];

  • Most of chapter 3 of [2];

  • Parts of chapter 8 of [3].

[1] Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006. https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf

[2] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. https://www.deeplearningbook.org/

[3] Heumann, Christian, Michael Schomaker, and Shalabh. Introduction to Statistics and Data Analysis. Springer International Publishing Switzerland, 2016.