A nice intuition for L2 regularization comes from placing a prior on the parameters: the prior assumes that the parameters are close to zero. Let's assume that the prior is a zero-mean Gaussian, $\theta \sim \mathcal{N}(0, \sigma^2 I)$. The MAP estimate of the parameters would then be

$$
\theta_{\text{MAP}} = \arg\max_{\theta} \left[ \log p(X \mid \theta) + \log p(\theta) \right]
$$
$\log p(X \mid \theta)$ is simply the log-likelihood of the model, and optimizing that alone would give you the MLE parameters. Incorporating $\log p(\theta)$, however, is what gives us the L2 regularization term:

$$
\log p(\theta) = \log \left[ \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left( -\frac{\lVert \theta \rVert_2^2}{2\sigma^2} \right) \right] = -\frac{d}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \lVert \theta \rVert_2^2
$$

where $d$ is the number of parameters.
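As a quick sanity check on that expansion, here's a minimal sketch (assuming NumPy and SciPy are available; the values are purely illustrative) comparing SciPy's Gaussian log-density against the constant-minus-quadratic form above:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Isotropic Gaussian prior N(0, sigma2 * I) over d parameters (illustrative values).
d, sigma2 = 4, 2.0
theta = np.array([0.3, -1.2, 0.7, 2.0])

# Log-density of the prior computed by SciPy...
log_prior = multivariate_normal(mean=np.zeros(d), cov=sigma2 * np.eye(d)).logpdf(theta)

# ...and the hand-expanded form: a constant minus ||theta||^2 / (2 sigma^2).
by_hand = -0.5 * d * np.log(2 * np.pi * sigma2) - np.sum(theta**2) / (2 * sigma2)

print(np.allclose(log_prior, by_hand))  # True
```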
The constant at the start goes out of the argmax, and we're left with the L2 regularization term. Taking the negative on both sides gives us the negative log-likelihood, which is the unregularized loss function, and turns the argmax into an argmin. We'd then minimize the negative log-likelihood plus the penalty $\frac{1}{2\sigma^2} \lVert \theta \rVert_2^2$. The net expression would look something like this:

$$
\theta_{\text{MAP}} = \arg\min_{\theta} \left[ -\log p(X \mid \theta) + \frac{1}{2\sigma^2} \lVert \theta \rVert_2^2 \right]
$$
Regularization generally features a strength term $\lambda$. We can think of $\lambda$ as the inverse of every term on the diagonal of the covariance matrix (assuming it is a diagonal covariance matrix with equal entries $\sigma^2$), i.e. $\lambda = \frac{1}{\sigma^2}$. We'd then get

$$
\theta_{\text{MAP}} = \arg\min_{\theta} \left[ -\log p(X \mid \theta) + \frac{\lambda}{2} \lVert \theta \rVert_2^2 \right]
$$
And this is the familiar L2 regularized loss function.
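To make the equivalence concrete, here's a minimal sketch for a linear-Gaussian model with unit noise variance (the setup and names are illustrative assumptions, not from the lecture): numerically minimizing the L2 regularized squared-error loss recovers the closed-form MAP/ridge estimate $(X^\top X + \lambda I)^{-1} X^\top y$, with $\lambda = 1/\sigma^2$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + rng.normal(size=n)   # unit-variance Gaussian noise

sigma2 = 0.5         # prior variance on each parameter
lam = 1.0 / sigma2   # lambda = inverse of the prior's diagonal variance

def regularized_nll(theta):
    # Negative log-likelihood (up to constants) plus (lambda / 2) * ||theta||^2.
    return 0.5 * np.sum((y - X @ theta) ** 2) + 0.5 * lam * np.sum(theta ** 2)

# Minimize the L2 regularized loss numerically...
theta_l2 = minimize(regularized_nll, np.zeros(d)).x

# ...and compare against the closed-form MAP/ridge estimate.
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.allclose(theta_l2, theta_map, atol=1e-5))  # True
```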


I got this from one of the Deep Learning lectures; it’s a nice treatment which I couldn’t find in other places (maybe because I haven’t looked hard enough). Goodfellow certainly doesn’t feature it though.