3 minute read

Bias-Variance Trade-off

In many cases, it’s desired to trade off a bit larger bias for a smaller variance for better estimations of the generalisation error. One way to do so is by adding a regularizer. In the case of linear regression, L1 (Lasso) and L2 (Ridge) regularization are the most common ones. Lasso suppresses weights of small magnitudes to zero, making the feature space sparse, whilst Ridge “condense” all weights to smaller values. Both can restrict the norm of the weights, and therefore, mitigate overfitting.

However, regularization is only one way to strike the balance. Another way is to introduce more bias into the equation is through the Bayesian lens. Specifically, we can impose prior knowledge by adding a prior distribution to constrain the norm of the leaned parameters. For example, if we know the weights are small and centred, then we can set a prior to be $\vec{w} \sim \mathcal{N}(0, \beta\textbf{I})$. Then, by Bayes rule, we have:

\[\mathbb{P}(\vec{w} | X, \vec{y}) = \cfrac{\mathbb{P}(\vec{w}, X, \vec{y})}{\mathbb{P}(X, \vec{y})} = \cfrac{\mathbb{P}(\vec{w}, \vec{y} | X) {\mathbb{P}(X)}}{\mathbb{P}(\vec{y} | X) {\mathbb{P}(X)}} = \cfrac{\mathbb{P}(\vec{w}, \vec{y} | X)}{\mathbb{P}(\vec{y} | X)}\]

Thus, both regularization and Bayesian modelling can achieve the same goal, which begs the question: are they connected?

Laplace is to Lasso as Gaussian is to Ridge

The answer to the above question turns out to be yes! To illustrate this further, let’s use two common prior distributions, Lapace and Gaussian, as our running examples.

Firstly, we assume the following general setting for regression: ${y} = X{\theta}$ and $f_X = y + \epsilon$ where $\theta \sim \text{Laplace}(0, s) = 1/2s \cdot\exp{-\mid\theta\mid / s}$ and $\epsilon \sim \mathcal{N}(0, \delta^2_\epsilon)$.

Then, we obtain the maximum a posteriori (MAP) esitmation as:

\[\begin{align} \arg\max_\theta\mathbb{P}({\theta} | X, {y}) &= \arg\max_\theta\cfrac{\mathbb{P}(X, y | \theta)\mathbb{P}(\theta)} {\mathbb{P}(X, y)} \nonumber\\ &\propto \arg\max_\theta\mathbb{P}(y |X, \theta)\mathbb{P}(X| \theta)\mathbb{P}(\theta) \nonumber\\ &\propto \arg\max_\theta\mathbb{P}(y | X, \theta)\mathbb{P}(\theta) \nonumber\\ &\propto \arg\max_\theta\mathbb{P}(\theta) \prod^n_i \mathbb{P}(y_i | X_i, \theta) \nonumber\\ &\propto \arg\min_\theta -\log \mathbb{P}(\theta) - \sum_i^n \log \mathbb{P}_\theta(y_i | X_i) \end{align}\]

Then, we can substitute both the likelihood and prior into Eq. (1).

\[\begin{align} \arg\min_\theta -\log\cfrac{1}{2s} \exp\left\{-\cfrac{|\theta|}{s}\right\} - \sum^n_i \log \cfrac{1}{Z} \exp\left\{-\cfrac{1}{2}\left(\cfrac{y-f}{\delta_\epsilon}\right)^2\right\} \end{align}\]

where $Z$ is the Gaussian normalising constant. By simplifying Eq. (2), we obtain the following form:

\[\begin{align} & \arg\min_\theta - \cfrac{|\theta|}{s} + \cfrac{1}{2\delta^2_\epsilon} \sum^n_i(y - f)^2 \\ =& \arg\min_\theta - \sum^n_i(y - f)^2 - \cfrac{2\delta^2_\epsilon}{s}||\theta||_1 \end{align}\]

Now, we have recovered the exact form of Lasso, where $\cfrac{2\delta^2_\epsilon}{s}$ is the coefficient of the L1 regularizor $\lambda$ that controls the strength of the constraint.

Next, let’s play the same trick in the same setting but with a Gaussian prior instead, i.e., $\theta \sim \mathcal{N}(0, \delta_\theta^2)$.

Starting from Eq. (1), we subsitute in the likelihood and prior as above:

\[\begin{align} \arg\max_\theta\mathbb{P}({\theta} | X, {y}) &\propto \arg\min_\theta -\log \mathbb{P}(\theta) - \sum_i^n \log \mathbb{P}_\theta(y_i | X_i) \nonumber \\ &\propto \arg\min_\theta -\log \cfrac{1}{Z'} \exp\left\{-\cfrac{1}{2}\left(\cfrac{\theta-0}{\delta_\theta}\right)^2 \right\} \\ & \quad - \sum^n_i \log \cfrac{1}{Z} \exp\left\{-\cfrac{1}{2}\left(\cfrac{y-f}{\delta_\epsilon}\right)^2\right\} \nonumber \end{align}\]

Finally, letting go all the fluff in Eq. (5), we have:

\[\begin{align} & \arg\min_\theta - \cfrac{||\theta||_2^2}{2\delta^2_\theta} + \cfrac{1}{2\delta^2_\epsilon} \sum^n_i(y - f)^2 \nonumber\\ =& \arg\min_\theta - \sum^n_i(y - f)^2 - \cfrac{\delta^2_\epsilon}{\delta^2_\theta}||\theta||_2^2 \end{align}\]

By Eq. (6), we have recovered Ridge regression where the fraction $\cfrac{\delta_{\epsilon}^2}{\delta_{\theta}^2}$ denotes regularization constant $\lambda$.


By working out the above two examples, we found that regularised regression is nothing but Bayesian modelling in disguise. In fact, imposing various priors has the same effect as using corresponding regularizers. By the same token, choosing different likelihoods gives us different loss functions. In this post, we used Gaussian likelihood in both examples, and, in turn, recovered the square loss.

There are many other options for prior and likelihood functions. For instance, one can use a student-t as opposed to Gaussian. Lastly, a family of conjugate prior can drastically reduce the cost of Bayesian inference.

Leave a comment