A (Formal) Probabilistic Interpretation of Linear Regression

Linear regression is a ubiquitous and natural algorithm, often justified probabilistically by assuming that the error in the relationship between the target and input variables is Gaussian. Here, I provide a formal proof of this justification.

Linear regression is a canonical problem often introduced at the start of courses in machine intelligence. Ng (2000) justifies the model by endowing the data with certain probabilistic assumptions, from which the least-squares cost function is derived. However, the justification given there is at times handwavy, leaving the reader wanting a more rigorous treatment. Here, we explicitly state the assumptions needed to derive least-squares and provide a formal derivation. Our work is inspired by those notes and the discussions surrounding them.

Necessary Assumptions

We begin by detailing the assumptions required to derive the linear regression model. Let our dataset $\mathcal{D}$ consist of input-target pairs $(x_i, y_i)$. Let us assume that:

  1. The target variables $y_i$ and inputs $x_i$ are realizations of random variables $Y_i, X_i$ with a common joint density $f_{X_i, Y_i}$. The pairs $Z_i = (X_i, Y_i)$ are mutually independent.

  2. The target variables and input variables are related via the linear equation $y_i = \theta^T x_i + \epsilon_i$, where the error terms $\epsilon_i$ capture random noise or unmodeled effects.

  3. The error terms $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ are independent and identically distributed, and each $\epsilon_i$ is independent of the corresponding input $X_i$.

These assumptions are quite reasonable: in English, we’re simply requiring that the datapoints $(x_i, y_i)$ are drawn independently from the data distribution $f_{X, Y}$, and that each target variable is related to its input variable by a linear transformation plus a component of random noise that is independent of the input.
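
As a concrete illustration, here is a minimal sketch of the assumed data-generating process. The dimensions, parameter values, and variable names are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 100, 3                        # number of datapoints and input dimension (illustrative)
theta_true = rng.normal(size=d)      # hypothetical "true" parameter vector
sigma = 0.5                          # standard deviation of the Gaussian noise

X = rng.normal(size=(n, d))              # inputs drawn i.i.d. (assumption 1)
eps = rng.normal(0.0, sigma, size=n)     # i.i.d. Gaussian errors, independent of X (assumption 3)
y = X @ theta_true + eps                 # linear relationship plus noise (assumption 2)
```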

Representing $f_{X_i, Y_i}$

Our next step is to derive the likelihood of our dataset under the aforementioned assumptions. To do so, we must represent the joint distribution of each input-target pair, $f_{X_i, Y_i}$. We’ll approach this problem by first representing the joint distribution of each input-error pair, $f_{\epsilon_i, X_i}$, and then applying the change-of-variables formula.

By assumption 3, $\epsilon_i \perp X_i$. We can therefore write the joint distribution of each input-error pair as

$$f_{\epsilon_i, X_i}(\epsilon_i, x_i) = f_{\epsilon_i}(\epsilon_i)\, f_{X_i}(x_i).$$

Furthermore, by assumption 2, there exists a linear relationship between $y_i$ and $x_i$. This relationship allows us to define a transformation $\phi : (\epsilon_i, X_i) \to (Y_i, X_i)$ such that

$$\phi(\epsilon_i, x_i) = (\theta^T x_i + \epsilon_i,\; x_i) = (y_i, x_i).$$

We are now ready to apply the change-of-variables formula to go from $f_{\epsilon_i, X_i}$ to $f_{Y_i, X_i}$. Specifically, for an invertible mapping $\phi : \mathbf{R}^n \to \mathbf{R}^n$ between random variables $A_1, \dots, A_n$ and $B_1, \dots, B_n$ such that $\mathbf{B} = \phi(\mathbf{A})$ and $\mathbf{A} = \phi^{-1}(\mathbf{B})$, we have that

$$f_{\mathbf{B}}(\mathbf{b}) = f_{\mathbf{A}}\!\left(\phi^{-1}(\mathbf{b})\right) \left| \det \frac{\partial \phi^{-1}(\mathbf{b})}{\partial \mathbf{b}} \right|.$$

In our case, $B_1 = Y_i$, $B_2 = X_i$, $A_1 = \epsilon_i$, $A_2 = X_i$, and the inverse transformation is $\phi^{-1}(y_i, x_i) = (y_i - \theta^T x_i,\; x_i)$. We first compute the Jacobian

$$\frac{\partial \phi^{-1}(y_i, x_i)}{\partial (y_i, x_i)} = \begin{pmatrix} 1 & -\theta^T \\ 0 & I \end{pmatrix},$$

which has determinant $1$. We therefore have

$$f_{Y_i, X_i}(y_i, x_i) = f_{\epsilon_i, X_i}(y_i - \theta^T x_i,\; x_i) = f_{\epsilon_i}(y_i - \theta^T x_i)\, f_{X_i}(x_i),$$

again due to $\epsilon_i \perp X_i$.
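
For concreteness, consider the hypothetical case of a scalar input $x_i$ with a single scalar parameter $\theta$ (an illustrative special case of the above):

$$\phi^{-1}(y_i, x_i) = (y_i - \theta x_i,\; x_i), \qquad \frac{\partial \phi^{-1}(y_i, x_i)}{\partial (y_i, x_i)} = \begin{pmatrix} 1 & -\theta \\ 0 & 1 \end{pmatrix}, \qquad \det \begin{pmatrix} 1 & -\theta \\ 0 & 1 \end{pmatrix} = 1.$$

The determinant is $1$ because the matrix is triangular with unit diagonal; the same argument applies for vector-valued inputs.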

Deriving the Likelihood

Since linear regression is a discriminative model, we do not model the prior density of the input variables $f_{X_i}$ and instead focus our efforts solely on maximizing the conditional likelihood $f_{Y_i \mid X_i}$ across our dataset. We can write the conditional as

$$f_{Y_i \mid X_i}(y_i \mid x_i) = \frac{f_{Y_i, X_i}(y_i, x_i)}{f_{X_i}(x_i)} = f_{\epsilon_i}(y_i - \theta^T x_i).$$

By assumption 1, the pairs $(X_i, Y_i)$ are independent, and so we have

$$f_{Y \mid X}(y_1, \dots, y_n \mid x_1, \dots, x_n) = \prod_{i=1}^{n} f_{Y_i \mid X_i}(y_i \mid x_i).$$

Explicitly, this is

$$f_{Y \mid X}(y_1, \dots, y_n \mid x_1, \dots, x_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right).$$

Maximizing $f_{Y \mid X}$ with respect to $\theta$ is now a simple exercise in calculus; one typically maximizes $\log f_{Y \mid X}$ as a proxy to transform the product into a sum. After some calculus, as in (Ng, 2000), and noting that $\sigma^2$ enters the log-likelihood only through an additive constant and a positive constant factor, we conclude that maximizing the log-likelihood is equivalent to minimizing

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left( y_i - \theta^T x_i \right)^2,$$

which is the canonical least-squares cost function for linear regression, as desired.
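
As a rough numerical sanity check of this equivalence (under the same illustrative setup sketched earlier; the variable names and values are assumptions for demonstration only), maximizing the Gaussian log-likelihood and solving ordinary least squares should recover the same $\theta$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, sigma = 100, 3, 0.5
theta_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta_true + rng.normal(0.0, sigma, size=n)

def neg_log_likelihood(theta):
    # Negative log of the product of N(y_i; theta^T x_i, sigma^2) densities.
    residuals = y - X @ theta
    return n * np.log(np.sqrt(2.0 * np.pi) * sigma) + np.sum(residuals ** 2) / (2.0 * sigma ** 2)

theta_mle = minimize(neg_log_likelihood, np.zeros(d)).x   # maximize the likelihood
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)         # minimize least squares
print(np.allclose(theta_mle, theta_ols, atol=1e-4))       # expected: True
```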

  1. Ng, A. (2000). CS229 Lecture Notes. Stanford University, 1–178.