A (Formal) Probabilistic Interpretation of Linear Regression

Linear regression is a ubiquitous and natural algorithm, often justified probabilistically by assuming that the error in the relationship between the target and input variables is Gaussian. Here, we make this justification precise by stating the required assumptions and deriving least-squares formally from them.

Linear regression is a canonical problem often introduced at the start of courses in machine learning. Ng (2000) justifies the model by endowing the data with certain probabilistic assumptions, from which the least-squares cost function is derived. However, the justifications provided are at times informal, leaving the reader to fill in the gaps. Here, we explicitly state the assumptions needed to derive least-squares and give a formal account of the derivation. Our treatment is inspired by those notes.

Necessary Assumptions

We begin by detailing the assumptions required to derive the linear regression model. Let our dataset $\mathcal{D}$ consist of input-target pairs $(x_i, y_i)$. Let us assume that:

  1. The target variables $y_i$ and inputs $x_i$ are realizations of random variables $Y_i, X_i$ with a common joint density $f_{X_i, Y_i}$. The pairs $Z_i = (X_i, Y_i)$ are mutually independent.

  2. The target variables and input variables are related via the equation \(y_i = \theta^T x_i + \epsilon_i\) where the error terms $\epsilon_i$ capture random noise or unmodeled effects.

  3. The error terms $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ are independent and identically distributed, and each $\epsilon_i$ is independent of its corresponding input $X_i$.

These assumptions are quite reasonable: in English, we’re simply requiring that the datapoints $(x_i, y_i)$ are drawn independently from the data distribution $f_{X, Y}$, and that each target variable is related to its input variable by a linear transformation plus a component of random noise that is independent of the input.
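Before proceeding, it may help to see what data satisfying these assumptions looks like. The sketch below simulates such a dataset; the specific values of $\theta$ and $\sigma$, and the choice of a standard normal input distribution, are arbitrary illustrations rather than part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 1000, 3                      # number of datapoints, input dimension
theta = np.array([2.0, -1.0, 0.5])  # arbitrary "true" parameters
sigma = 0.3                         # noise standard deviation

# Assumption 1: the pairs (x_i, y_i) are drawn independently from a common distribution.
X = rng.normal(size=(n, d))         # input distribution chosen arbitrarily for illustration

# Assumption 3: errors are i.i.d. Gaussian and independent of the inputs.
eps = rng.normal(scale=sigma, size=n)

# Assumption 2: targets are a linear function of the inputs plus noise.
y = X @ theta + eps
```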

Representing $f_{X_i, Y_i}$

Our next step is to derive the likelihood of our dataset under the aforementioned assumptions. To do so, we must represent the joint density of each input-target pair, $f_{X_i, Y_i}$. We’ll approach this problem by first representing the joint density of each input-error pair, $f_{\epsilon_i, X_i}$, and then applying the change-of-variables formula.

By assumption 3, $\epsilon_i \perp X_i$. We can therefore write the joint density of each input-error pair as

\[f_{\epsilon_i, X_i}(\epsilon, x) = f_{\epsilon_i}(\epsilon) f_{X_i}(x)\]

Furthermore, by assumption 2, there exists a linear relationship between $y_i$ and $x_i$. This relationship allows us to define a transformation $\phi : (\epsilon_i, X_i) \to (Y_i, X_i)$ such that

\[\begin{align} \phi(\epsilon_i, x_i) &= (\theta^Tx_i + \epsilon_i, x_i) \\ \phi^{-1}(y_i, x_i) &= (y_i - \theta^T x_i, x_i) \end{align}\]

We are now ready to apply the change-of-variables formula from $f_{\epsilon_i, X_i}$ to $f_{Y_i, X_i}$. Specifically, for an invertible mapping $\phi : \mathbf{R}^n \to \mathbf{R}^n$ between random variables $A_1 \dots A_n$ and $B_1 \dots B_n$ such that $\mathbf{B} = \phi(\mathbf{A})$ and $\mathbf{A} = \phi^{-1}(\mathbf{B})$, we have that

\[p_{B_1 \dots B_n}(B_1 \dots B_n) = p_{A_1 \dots A_n} (\phi^{-1} (B_1 \dots B_n)) \left| \text{det} \left( \frac{\partial \phi^{-1} (B_1 \dots B_n)}{\partial B_1 \dots B_n} \right) \right|\]

In our case, $B_1 = Y_i$, $B_2 = X_i$, $A_1 = \epsilon_i$, $A_2 = X_i$. We first compute

\[\frac{\partial \phi^{-1}(y_i, x_i)}{\partial (y_i, x_i)} = \begin{bmatrix} 1 & - \theta^T \\ 0 & I \end{bmatrix}\]

which is block upper-triangular with ones on the diagonal and hence has determinant 1. We therefore have

\[f_{Y_i, X_i}(y_i, x_i) = f_{\epsilon_i, X_i} (y_i - \theta^T x_i, x_i) = f_{\epsilon_i} (y_i - \theta^T x_i) f_{X_i}(x_i)\]

again due to $\epsilon_i \perp X_i$.
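As a quick numerical sanity check of the Jacobian computation, the sketch below builds the block matrix above for a hypothetical three-dimensional input and an arbitrary $\theta$, and confirms that its determinant is 1.

```python
import numpy as np

d = 3
theta = np.array([2.0, -1.0, 0.5])  # same illustrative parameters as above

# Jacobian of phi^{-1}(y, x) = (y - theta^T x, x) with respect to (y, x):
# block upper-triangular with ones on the diagonal.
J = np.block([
    [np.ones((1, 1)), -theta.reshape(1, d)],
    [np.zeros((d, 1)), np.eye(d)],
])

print(np.linalg.det(J))  # 1.0 (up to floating point), so no extra factor appears
```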

Deriving the Likelihood

Since linear regression is a discriminative model, we do not model the marginal density of the input variables $f_{X_i}$; instead, we focus solely on maximizing the conditional likelihood $f_{Y_i \mid X_i}$ across our dataset. We can write the conditional as

\[f_{Y_i \mid X_i}(y_i \mid x_i) = \frac{f_{Y_i, X_i} (y_i, x_i)}{f_{X_i}(x_i)} = f_{\epsilon_i} (y_i - \theta^T x_i)\]
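In other words, conditioned on $X_i = x$, the target is distributed as $\mathcal{N}(\theta^T x, \sigma^2)$. The snippet below checks this with a quick Monte Carlo simulation; the particular input vector and parameter values are, again, arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma = np.array([2.0, -1.0, 0.5]), 0.3  # illustrative values

x = np.array([0.4, -1.2, 2.0])               # hold one input fixed
eps = rng.normal(scale=sigma, size=100_000)
y = x @ theta + eps                          # samples from f_{Y_i | X_i}( . | x)

print(y.mean(), x @ theta)                   # sample mean ≈ theta^T x
print(y.std(), sigma)                        # sample std  ≈ sigma
```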

By assumption 1, the pairs $(X_i, Y_i)$ are independent, so the conditional likelihood of the dataset factorizes as

\[f_{Y \mid X} = \prod_{(x_i, y_i) \in \mathcal{D}} f_{\epsilon_i}(y_i - \theta^T x_i)\]

Explicitly, this is

\[f_{Y \mid X} = \prod_{(x_i, y_i) \in \mathcal{D}} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( - \frac{(y_i - \theta^T x_i)^2}{2 \sigma^2} \right)\]
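To make this concrete, here is a minimal sketch of the corresponding log-likelihood, with $\sigma$ treated as known; the function name and argument layout (a design matrix with one row per $x_i$) are our own choices for illustration.

```python
import numpy as np

def log_likelihood(theta, X, y, sigma):
    """Conditional log-likelihood log f_{Y | X} under assumptions 1-3,
    for a design matrix X (one row per x_i) and target vector y."""
    resid = y - X @ theta
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)
```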

Maximizing $f_{Y \mid X}$ with respect to $\theta$ is now a simple exercise in calculus; since the logarithm is monotonic, one typically maximizes $\log f_{Y \mid X}$ instead, which transforms the product into a sum. Dropping the terms that do not depend on $\theta$, as in Ng (2000), we conclude that maximizing the log-likelihood is equivalent to minimizing

\[\sum_{(x_i, y_i) \in \mathcal{D}} (y_i - \theta^T x_i)^2\]

which is the canonical least-squares cost function for linear regression, as desired.
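As a final sanity check, the sketch below maximizes the conditional log-likelihood numerically on simulated data and confirms that the resulting $\theta$ matches the least-squares solution; all data-generating values are arbitrary illustrations.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
theta_true, sigma = np.array([2.0, -1.0, 0.5]), 0.3   # illustrative values
X = rng.normal(size=(500, 3))
y = X @ theta_true + rng.normal(scale=sigma, size=500)

# Negative conditional log-likelihood (sigma treated as known); the constant
# term does not affect the minimizer.
def nll(theta):
    resid = y - X @ theta
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

theta_mle = minimize(nll, x0=np.zeros(3)).x          # maximum-likelihood estimate
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares estimate

print(np.allclose(theta_mle, theta_ls, atol=1e-4))   # True, up to optimizer tolerance
```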

  1. Ng, A. (2000). CS229 Lecture Notes, 1(1), 1–178.