
Bias-Variance Trade-off

In many cases, it’s better to trade a bit more bias for a much smaller variance in order to reduce generalisation error. One way to do so is by adding a regularizer. In linear regression, L1 (Lasso) and L2 (Ridge) regularization are two common options. Lasso drives some weights exactly to zero, making the weight vector sparse, while Ridge shrinks all weights towards zero without eliminating them. Both restrict the norm of the weights and help mitigate overfitting.
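For a concrete feel of the difference, here is a minimal sketch (assuming NumPy and scikit-learn are available; the data and hyperparameters are made up for illustration) that fits both penalties to the same synthetic problem and counts the surviving weights.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.5, 0.5]                  # only 3 features actually matter
y = X @ true_w + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)             # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)             # L2 penalty

print("non-zero Lasso weights:", np.sum(lasso.coef_ != 0))   # typically only a few
print("non-zero Ridge weights:", np.sum(ridge.coef_ != 0))   # typically all 20
```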

However, regularization is only one way to strike this balance. Another is to introduce more bias through the Bayesian lens. Specifically, we can impose prior knowledge by placing a prior distribution over the learnt parameters to constrain their norm. For example, if we know the weights are small and centred around zero, we can set our prior to be $\vec{w} \sim \mathcal{N}(0, \beta\textbf{I})$. Then, by Bayes’ rule, we have:

\[\mathbb{P}(\vec{w} | X, \vec{y}) = \cfrac{\mathbb{P}(\vec{w}, X, \vec{y})}{\mathbb{P}(X, \vec{y})} = \cfrac{\mathbb{P}(\vec{w}, \vec{y} | X) {\mathbb{P}(X)}}{\mathbb{P}(\vec{y} | X) {\mathbb{P}(X)}} = \cfrac{\mathbb{P}(\vec{w}, \vec{y} | X)}{\mathbb{P}(\vec{y} | X)}\]

Thus, both regularization and Bayesian modelling can achieve the same goal, which raises the question: are they connected?

Laplace is to Lasso as Gaussian is to Ridge

The answer to the above question turns out to be yes! To illustrate this further, let’s use two common prior distributions, Laplace and Gaussian, as our running examples.

Firstly, we assume the following general setting for regression: $f = X\theta$ and $y = f + \epsilon$, so that $f_i = X_i\theta$ is the model’s prediction for the $i$-th example. We place a Laplace prior on the weights, $\theta \sim \text{Laplace}(0, s)$ with density $\frac{1}{2s}\exp(-\mid\theta\mid / s)$, and assume Gaussian noise $\epsilon \sim \mathcal{N}(0, \delta^2_\epsilon)$.
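To make this setting concrete, here is a small simulation sketch (names like `sigma_eps` are my own shorthand for the symbols above, and the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
s, sigma_eps = 1.0, 0.5                          # Laplace scale and noise std (delta_eps)

X = rng.normal(size=(n, d))
theta = rng.laplace(loc=0.0, scale=s, size=d)    # weights drawn from the prior
f = X @ theta                                    # noiseless predictions f_i = X_i theta
y = f + rng.normal(scale=sigma_eps, size=n)      # observed targets
```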

Then, we obtain the maximum a posteriori (MAP) estimate as:

\[\begin{align} \arg\max_\theta\mathbb{P}({\theta} | X, {y}) &= \arg\max_\theta\cfrac{\mathbb{P}(X, y | \theta)\mathbb{P}(\theta)} {\mathbb{P}(X, y)} \nonumber\\ &\propto \arg\max_\theta\mathbb{P}(y |X, \theta)\mathbb{P}(X| \theta)\mathbb{P}(\theta) \nonumber\\ &\propto \arg\max_\theta\mathbb{P}(y | X, \theta)\mathbb{P}(\theta) \nonumber\\ &\propto \arg\max_\theta\mathbb{P}(\theta) \prod^n_i \mathbb{P}(y_i | X_i, \theta) \nonumber\\ &\propto \arg\min_\theta -\log \mathbb{P}(\theta) - \sum_i^n \log \mathbb{P}_\theta(y_i | X_i) \end{align}\]

Here, the second line factorises the joint likelihood as $\mathbb{P}(y | X, \theta)\mathbb{P}(X | \theta)$; the third drops $\mathbb{P}(X | \theta) = \mathbb{P}(X)$ since the inputs do not depend on $\theta$; and the product over $i$ assumes the samples are i.i.d. Next, we can substitute both the likelihood and prior into Eq. (1).

\[\begin{align} \arg\min_\theta -\log\cfrac{1}{2s} \exp\left\{-\cfrac{|\theta|}{s}\right\} - \sum^n_i \log \cfrac{1}{Z} \exp\left\{-\cfrac{1}{2}\left(\cfrac{y_i - f_i}{\delta_\epsilon}\right)^2\right\} \end{align}\]

where $Z$ is the Gaussian normalising constant. By simplifying Eq. (2), we obtain the following form:

\[\begin{align} & \arg\min_\theta \cfrac{|\theta|}{s} + \cfrac{1}{2\delta^2_\epsilon} \sum^n_i(y_i - f_i)^2 \\ =& \arg\min_\theta \sum^n_i(y_i - f_i)^2 + \cfrac{2\delta^2_\epsilon}{s}||\theta||_1 \end{align}\]

Now, we have recovered the exact form of Lasso, where $\cfrac{2\delta^2_\epsilon}{s}$ plays the role of the L1 regularization coefficient $\lambda$ that controls the strength of the constraint.
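As a rough numerical sanity check (my own sketch, not part of the derivation), we can minimise this negative log posterior directly and compare against scikit-learn’s Lasso. Note that scikit-learn minimises $\frac{1}{2n}\sum_i(y_i - f_i)^2 + \alpha||\theta||_1$, so the matching coefficient is $\alpha = \delta^2_\epsilon / (ns)$.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, d = 50, 5
s, delta_eps = 0.5, 1.0                 # prior scale and noise std (hyperparameters)

X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -2.0, 0.0, 0.0, 0.3]) + rng.normal(size=n)

def neg_log_posterior(w):
    # Gaussian negative log-likelihood + Laplace negative log-prior, constants dropped
    return np.sum((y - X @ w) ** 2) / (2 * delta_eps ** 2) + np.abs(w).sum() / s

map_est = minimize(neg_log_posterior, np.zeros(d), method="Powell").x
lasso = Lasso(alpha=delta_eps ** 2 / (n * s), fit_intercept=False).fit(X, y)

print("MAP estimate :", np.round(map_est, 3))
print("Lasso weights:", np.round(lasso.coef_, 3))   # should agree up to optimiser tolerance
```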


Next, let’s play the same trick in the same setting but with a Gaussian prior instead, i.e., $\theta \sim \mathcal{N}(0, \delta_\theta^2)$.

Starting from Eq. (1), we substitute in the likelihood and prior as above:

\[\begin{align} \arg\max_\theta\mathbb{P}({\theta} | X, {y}) &\propto \arg\min_\theta -\log \mathbb{P}(\theta) - \sum_i^n \log \mathbb{P}_\theta(y_i | X_i) \nonumber \\ &\propto \arg\min_\theta -\log \cfrac{1}{Z'} \exp\left\{-\cfrac{1}{2}\left(\cfrac{\theta-0}{\delta_\theta}\right)^2 \right\} \\ & \quad - \sum^n_i \log \cfrac{1}{Z} \exp\left\{-\cfrac{1}{2}\left(\cfrac{y_i - f_i}{\delta_\epsilon}\right)^2\right\} \nonumber \end{align}\]

Finally, dropping the normalising constants in Eq. (5), we have:

\[\begin{align} & \arg\min_\theta \cfrac{||\theta||_2^2}{2\delta^2_\theta} + \cfrac{1}{2\delta^2_\epsilon} \sum^n_i(y_i - f_i)^2 \nonumber\\ =& \arg\min_\theta \sum^n_i(y_i - f_i)^2 + \cfrac{\delta^2_\epsilon}{\delta^2_\theta}||\theta||_2^2 \end{align}\]

By Eq. (6), we have recovered Ridge regression, where the fraction $\cfrac{\delta_{\epsilon}^2}{\delta_{\theta}^2}$ is the regularization constant $\lambda$.
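This equivalence is easy to verify numerically: the minimiser of Eq. (6) has the closed form $(X^\top X + \lambda I)^{-1} X^\top y$ with $\lambda = \delta^2_\epsilon / \delta^2_\theta$, which is exactly the objective scikit-learn’s Ridge solves. A small sketch (data and hyperparameters arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n, d = 300, 5
delta_theta, delta_eps = 1.0, 0.5       # prior std and noise std

X = rng.normal(size=(n, d))
theta_true = rng.normal(scale=delta_theta, size=d)
y = X @ theta_true + rng.normal(scale=delta_eps, size=n)

lam = delta_eps ** 2 / delta_theta ** 2
map_closed_form = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)

print("MAP (closed form):", np.round(map_closed_form, 3))
print("Ridge weights    :", np.round(ridge.coef_, 3))   # identical up to numerical error
```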

Summary

By working out the above two examples, we found that regularised regression is nothing but Bayesian MAP estimation in disguise. In fact, imposing different priors has the same effect as using the corresponding regularizers. By the same token, choosing different likelihoods gives us different loss functions. In this post, we used a Gaussian likelihood in both examples and, in turn, recovered the squared loss.

There are many other options for the prior and likelihood. For instance, one can use a Student-t likelihood instead of a Gaussian one to obtain a loss that is more robust to outliers. Lastly, choosing a prior from a conjugate family can drastically reduce the cost of Bayesian inference, since the posterior then has a closed form.
