A nice intuition for L2 regularization comes from placing a prior on the
distribution of the parameters: the prior assumes that the parameters are close to
zero. Let’s assume that the prior is a zero-mean Gaussian,

$$p(\theta) = \mathcal{N}(\theta;\, 0,\, \sigma^2 I) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left(-\frac{\lVert\theta\rVert_2^2}{2\sigma^2}\right),$$

where $d$ is the number of parameters, so that the log prior is

$$\log p(\theta) = -\frac{d}{2}\log(2\pi\sigma^2) - \frac{\lVert\theta\rVert_2^2}{2\sigma^2}.$$
The MAP estimate maximizes the log-posterior,

$$\theta_{\text{MAP}} = \arg\max_\theta\, \bigl[\log p(\mathcal{D}\mid\theta) + \log p(\theta)\bigr] = \arg\max_\theta\, \left[\log p(\mathcal{D}\mid\theta) - \frac{d}{2}\log(2\pi\sigma^2) - \frac{\lVert\theta\rVert_2^2}{2\sigma^2}\right].$$

The constant at the start of the log prior drops out of the argmax, and we’re left with the L2
regularization term. Negating the objective (and turning the argmax into an argmin) gives the
negative log-likelihood, which is the unregularized loss function, plus the penalty. We’d then
minimize

$$-\log p(\mathcal{D}\mid\theta) + \frac{\lVert\theta\rVert_2^2}{2\sigma^2}.$$
Regularization generally features a strength term $\lambda$; here $\lambda = \frac{1}{2\sigma^2}$, which gives

$$J(\theta) = -\log p(\mathcal{D}\mid\theta) + \lambda\,\lVert\theta\rVert_2^2.$$

And this is the familiar L2-regularized loss function.
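
As a quick sanity check, here’s a minimal sketch for the linear-regression case (assuming a linear-Gaussian model with known noise variance $\sigma_n^2$ and prior variance $\sigma_p^2$; the variable names below are just illustrative): the MAP estimate under the zero-mean Gaussian prior coincides with the ridge (L2-regularized least-squares) solution. The strength comes out as $\lambda = \sigma_n^2/\sigma_p^2$ here only because the squared error is left unscaled by the noise variance.

```python
import numpy as np

# Minimal sketch: in linear regression with Gaussian noise, the MAP estimate
# under a zero-mean Gaussian prior on the weights equals the ridge solution.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
sigma_n, sigma_p = 0.5, 1.0                      # noise and prior std devs (assumed known)
y = X @ theta_true + sigma_n * rng.normal(size=n)

# Ridge / L2-regularized least squares: argmin ||y - X theta||^2 + lam * ||theta||^2
lam = sigma_n**2 / sigma_p**2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP: mode of the Gaussian posterior given the N(0, sigma_p^2 I) prior
theta_map = np.linalg.solve(
    X.T @ X / sigma_n**2 + np.eye(d) / sigma_p**2,
    X.T @ y / sigma_n**2,
)

print(np.allclose(theta_ridge, theta_map))       # True
```

Shrinking $\sigma_p$ increases $\lambda$: a tighter prior means stronger regularization, which matches the intuition that the prior pulls the parameters toward zero.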
I got this from one of the Deep Learning lectures; it’s a nice treatment which I couldn’t find in other places (maybe because I haven’t looked hard enough). Goodfellow certainly doesn’t feature it though.