L1 regularization
Add an L1 regularization term
to penalize large weights. Because L1 can drive weights to exactly 0 (equivalent to eliminating the corresponding hidden nodes), it produces a sparser, effectively smaller network and reduces overfitting (see the sketch below).
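A minimal sketch of adding an L1 term to the loss by hand in PyTorch; `model`, `criterion`, `inputs`, and `targets` are assumed placeholders, and the coefficient matches the l1 value noted below:

```python
import torch

l1_lambda = 1e-5  # L1 coefficient (assumed value, per the note below)

outputs = model(inputs)
loss = criterion(outputs, targets)

# L1 term: sum of absolute values of all learnable parameters.
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = loss + l1_lambda * l1_penalty
loss.backward()
```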
Try the weight_decay argument
in torch.optim optimizers; note that it applies L2 regularization (L1 has to be added to the loss manually, as above). Typical starting coefficients:
l1=1e-5
, l2=1e-4
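For L2, the optimizer does the work itself, a sketch assuming a model is already defined:

```python
import torch

# weight_decay adds weight_decay * w to each parameter's gradient,
# i.e. L2 regularization; 1e-4 is the l2 value noted above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```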
L2 and L1 penalize weights differently:
L2 penalizes weight^2.
L1 penalizes |weight|.
Consequently, L2 and L1 have different derivatives:
The derivative of L2 is 2 * weight.
The derivative of L1 is k (a constant whose magnitude is independent of the weight; its sign matches the sign of the weight).
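The two derivatives are easy to confirm with autograd; a small sketch using an arbitrary weight value of 3.0:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

(w ** 2).backward()
print(w.grad)    # tensor(6.)  -> d(w^2)/dw = 2 * w

w.grad = None    # reset before the second check
w.abs().backward()
print(w.grad)    # tensor(1.)  -> d|w|/dw = sign(w), constant magnitude
```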
Weight decay: a regularization technique (such as L2 regularization) that results in gradient descent shrinking the weights on every iteration.
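In update-rule form, a sketch with example numbers: each step shrinks the weight multiplicatively before applying the usual gradient update.

```python
lr, decay = 0.01, 1e-4
w, grad = 0.5, 0.2   # example weight and its gradient

# Same as w - lr * (grad + decay * w): the weight is scaled down
# by (1 - lr * decay) on every iteration, hence "decay".
w = (1 - lr * decay) * w - lr * grad
```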