Gaussian Distribution

Univariate Gaussian Distribution

Overview

The Gaussian (or normal) distribution is one of the most important distributions in probability and statistics. It is a continuous probability distribution that is symmetric about the mean, with the highest density at the mean.

Probability density function

The probability density function for a univariate Gaussian distribution \(X \sim \mathcal{N}(\mu, \sigma^2)\), where \(X \in \mathbb{R}\), is given by: \[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\]
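The density above can be evaluated directly from the formula; a minimal sketch in Python (using NumPy, with SciPy only as a cross-check):

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Cross-check against scipy.stats.norm at a few points
xs = np.array([-1.0, 0.0, 2.5])
assert np.allclose(gaussian_pdf(xs, mu=1.0, sigma=2.0),
                   norm.pdf(xs, loc=1.0, scale=2.0))
```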

Where \(\mu\) can be any real number and \(\sigma\) must be a positive real number.

Mean and Variance

The mean and variance are given by: \[\mathbb{E}[X] = \mu \quad \text{and} \quad \operatorname{Var}[X] = \sigma^2\]

KL Divergence

The KL divergence between two univariate normals \(p = \mathcal{N}(\mu_0, \sigma^2_0)\) and \(q = \mathcal{N}(\mu_1, \sigma^2_1)\) is given by: \[D_{\mathrm{KL}}(p \,\|\, q) = \frac{1}{2}\left[\frac{\sigma_0^2}{\sigma_1^2} + \frac{(\mu_1 - \mu_0)^2}{\sigma_1^2} - 1 + \ln\frac{\sigma_1^2}{\sigma_0^2}\right]\]
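The closed form is easy to implement; a small sketch that also illustrates that KL divergence is not symmetric in its arguments:

```python
import numpy as np

def kl_univariate_normal(mu0, var0, mu1, var1):
    """D_KL( N(mu0, var0) || N(mu1, var1) ) in closed form."""
    return 0.5 * (var0 / var1 + (mu1 - mu0) ** 2 / var1
                  - 1.0 + np.log(var1 / var0))

# KL is zero when the two distributions coincide...
assert kl_univariate_normal(0.0, 1.0, 0.0, 1.0) == 0.0
# ...and asymmetric in general
assert not np.isclose(kl_univariate_normal(0.0, 1.0, 1.0, 2.0),
                      kl_univariate_normal(1.0, 2.0, 0.0, 1.0))
```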

Mean and Variance Estimation

If we have a sample \(x_1, x_2, \ldots, x_n\) from a univariate Gaussian distribution \(X \sim \mathcal{N}(\mu, \sigma^2)\), then the mean and variance can be estimated by: \[\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat{\mu})^2\] The \(n - 1\) denominator makes \(\hat{\sigma}^2\) an unbiased estimator of \(\sigma^2\).
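A quick numerical sanity check of these estimators on synthetic data (note `ddof=1` selects the \(n-1\) denominator in NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
# Draw a large sample from N(mu=3, sigma^2=4), i.e. sigma = 2
sample = rng.normal(loc=3.0, scale=2.0, size=100_000)

mu_hat = sample.mean()         # sample mean
var_hat = sample.var(ddof=1)   # sample variance with n - 1 denominator

# Estimates should be close to the true parameters
assert abs(mu_hat - 3.0) < 0.05
assert abs(var_hat - 4.0) < 0.1
```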

Multivariate Gaussian Distribution

Overview

The multivariate Gaussian distribution generalizes the univariate Gaussian to multiple jointly distributed variables. It is a continuous probability distribution that is symmetric about its mean vector, with the highest density at the mean.

Probability density function

The probability density function for a multivariate Gaussian distribution \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), where \(\mathbf{X} \in \mathbb{R}^k\), is given by: \[f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)\]
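A direct implementation of this density, cross-checked against `scipy.stats.multivariate_normal` (using `np.linalg.solve` instead of explicitly inverting \(\boldsymbol{\Sigma}\)):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, for positive-definite Sigma."""
    k = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x-mu)^T Sigma^{-1} (x-mu)
    norm_const = (2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])
assert np.isclose(mvn_pdf(x, mu, Sigma),
                  multivariate_normal(mu, Sigma).pdf(x))
```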

Mean vector and covariance matrix

Mean vector and covariance matrix are given by: \[\mathbb{E}[\mathbf{X}] = \boldsymbol{\mu} = \begin{pmatrix}\mu_1 \\ \vdots \\ \mu_k\end{pmatrix} \qquad \boldsymbol{\Sigma} = \begin{pmatrix}\sigma_1^2 & \cdots & \sigma_{1k} \\ \vdots & \ddots & \vdots \\ \sigma_{k1} & \cdots & \sigma_k^2\end{pmatrix}\]

where \(\sigma_{ij} = \operatorname{Cov}(X_i, X_j) = \rho_{ij}\sigma_i\sigma_j\) and \(\boldsymbol{\Sigma}\) must be symmetric positive semi-definite. Note that the density written above exists only when \(\boldsymbol{\Sigma}\) is strictly positive definite, so that \(\boldsymbol{\Sigma}^{-1}\) and \(|\boldsymbol{\Sigma}|^{-1/2}\) are defined.
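These conditions can be checked numerically; a small sketch that builds \(\boldsymbol{\Sigma}\) from standard deviations and correlations and verifies symmetry and positive semi-definiteness via eigenvalues:

```python
import numpy as np

# Build a covariance matrix from standard deviations and correlations
sd = np.array([1.0, 2.0, 0.5])
rho = np.array([[1.0,  0.4,  0.1],
                [0.4,  1.0, -0.2],
                [0.1, -0.2,  1.0]])
Sigma = np.outer(sd, sd) * rho   # sigma_ij = rho_ij * sigma_i * sigma_j

assert np.allclose(Sigma, Sigma.T)                   # symmetric
assert np.all(np.linalg.eigvalsh(Sigma) >= -1e-12)   # PSD (up to round-off)
```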

Mahalanobis distance

The squared Mahalanobis distance is given by: \[\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\] When \(\mathbf{x}\) is drawn from \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), \(\Delta^2\) follows a \(\chi^2\) distribution with \(k\) degrees of freedom.

Contours of constant density are ellipsoids satisfying \(\Delta^2 = c\).

KL divergence

The KL divergence between two multivariate normals \(p = \mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)\) and \(q = \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)\) is given by: \[D_{\mathrm{KL}}(p \,\|\, q) = \frac{1}{2}\left[\operatorname{tr}(\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_0) + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^\top\boldsymbol{\Sigma}_1^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0) - k + \ln\frac{|\boldsymbol{\Sigma}_1|}{|\boldsymbol{\Sigma}_0|}\right]\]
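A direct implementation of this closed form; for \(k = 1\) it reduces to the univariate expression:

```python
import numpy as np

def kl_mvn(mu0, S0, mu1, S1):
    """D_KL( N(mu0, S0) || N(mu1, S1) ) in closed form."""
    k = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0)
                  + diff @ S1_inv @ diff
                  - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

mu = np.zeros(2)
S = np.eye(2)
assert np.isclose(kl_mvn(mu, S, mu, S), 0.0)   # KL(p || p) = 0

# k = 1 case reduces to the univariate formula: here 0.5 * ln 2
assert np.isclose(kl_mvn(np.array([0.0]), np.array([[1.0]]),
                         np.array([1.0]), np.array([[2.0]])),
                  0.5 * np.log(2.0))
```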

Marginal distributions

Any subset \(\mathbf{X}_A \subset \mathbf{X}\) of size \(m\) is itself multivariate normal: \[\mathbf{X}_A \sim \mathcal{N}(\boldsymbol{\mu}_A,\, \boldsymbol{\Sigma}_{AA})\] where \(\boldsymbol{\mu}_A\) and \(\boldsymbol{\Sigma}_{AA}\) keep only the entries (rows and columns) indexed by \(A\).
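Marginalizing is therefore just indexing into \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\); a minimal sketch using NumPy's `np.ix_` to select the matching rows and columns:

```python
import numpy as np

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.1],
                  [0.5, 1.0, 0.3],
                  [0.1, 0.3, 1.5]])

A = [0, 2]                       # indices of the retained components
mu_A = mu[A]                     # marginal mean
Sigma_AA = Sigma[np.ix_(A, A)]   # marginal covariance: rows and columns in A

assert np.allclose(mu_A, [0.0, -1.0])
assert np.allclose(Sigma_AA, [[2.0, 0.1],
                              [0.1, 1.5]])
```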

Conditional distribution

Partitioning \(\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2)\), the distribution of \(\mathbf{X}_1\) given \(\mathbf{X}_2 = \mathbf{x}_2\) is given by: \[\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \;\sim\; \mathcal{N}\!\left(\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\;\; \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\right)\]

The term \(\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\) is the Schur complement of \(\boldsymbol{\Sigma}_{22}\) in \(\boldsymbol{\Sigma}\), and represents the reduction in uncertainty about \(\mathbf{X}_1\) after observing \(\mathbf{X}_2\).
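With scalar blocks the conditioning formulas become simple arithmetic; a worked sketch for \(k = 2\):

```python
import numpy as np

mu = np.array([0.0, 1.0])        # (mu_1, mu_2), each block scalar here
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # blocks: S11 = 2, S12 = S21 = 0.8, S22 = 1

x2 = 2.0                         # observed value of X_2
S11, S12, S22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]

cond_mean = mu[0] + S12 / S22 * (x2 - mu[1])   # mu_1 + S12 S22^{-1} (x2 - mu_2)
cond_var = S11 - S12 / S22 * S12               # Schur complement of S22

assert np.isclose(cond_mean, 0.8)          # 0 + 0.8 * (2 - 1)
assert np.isclose(cond_var, 2.0 - 0.64)    # 1.36
# Conditioning never increases the variance of X_1
assert cond_var <= S11
```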

Linear transformation

If \(\mathbf{Y} = A\mathbf{X} + \mathbf{b}\) for a matrix \(A \in \mathbb{R}^{m \times k}\) and vector \(\mathbf{b} \in \mathbb{R}^m\): \[\mathbf{Y} \sim \mathcal{N}(A\boldsymbol{\mu} + \mathbf{b},\;\; A\boldsymbol{\Sigma} A^\top)\]
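This property can be verified empirically by transforming samples and comparing moments; a sketch where \(A\) sums the two components:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
A = np.array([[1.0, 1.0]])       # sums the two components (m = 1, k = 2)
b = np.array([0.5])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b

# Empirical moments of Y should match A mu + b and A Sigma A^T
assert np.allclose(Y.mean(axis=0), A @ mu + b, atol=0.02)
assert np.allclose(np.cov(Y.T), A @ Sigma @ A.T, atol=0.05)
```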

Empirical rule

For a univariate normal random variable \(X \sim \mathcal{N}(\mu, \sigma^2)\), the empirical rule (also called the 68–95–99.7 rule or three-sigma rule) states that nearly all probability mass lies within a few standard deviations of the mean (see the illustration at the top of the page):

  • About 68.2% of values fall in the interval \(\mu \pm 1\sigma\) (between one standard deviation below and above the mean).
  • About 95.4% fall in \(\mu \pm 2\sigma\).
  • About 99.7% fall in \(\mu \pm 3\sigma\).

These figures are rounded, not exact: to more decimal places they are 68.27%, 95.45%, and 99.73%, and the rule is often summarized in teaching as roughly 68%, 95%, and 99.7%. The illustration above marks those intervals along the horizontal axis under the bell curve. The rule is useful for quick mental checks (e.g., outliers beyond \(\mu \pm 3\sigma\) are very rare under exact normality) but applies only when the data-generating process is normal. For heavy-tailed or skewed distributions, tail probabilities can differ markedly.
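The percentages follow directly from the normal CDF, since the mass in \(\mu \pm n\sigma\) is \(\Phi(n) - \Phi(-n)\) for a standard normal; a quick check with SciPy:

```python
from scipy.stats import norm

# Probability mass within mu +/- n*sigma for a standard normal
for n, expected in [(1, 0.6827), (2, 0.9545), (3, 0.9973)]:
    mass = norm.cdf(n) - norm.cdf(-n)
    assert abs(mass - expected) < 5e-4
```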