L₂ regularization

Add an L₂ regularization term to penalize large weights (pushing a weight toward 0 is roughly equivalent to eliminating the corresponding hidden node), thus creating an effectively smaller network and reducing overfitting.

Try weight_decay in torch.optim:

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
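
For context, a minimal end-to-end training step might look like the sketch below (the model, dummy batch, and loss function are illustrative assumptions; only the optimizer line comes from above):

import torch
import torch.nn as nn

# Placeholder model and batch, just to exercise the optimizer.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
loss_fn = nn.MSELoss()

x, y = torch.randn(64, 10), torch.randn(64, 1)
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()  # weight_decay shrinks the weights during this update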

Typical penalty coefficients: l1=1e-5, l2=1e-4.
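
Note that weight_decay gives only the L2 penalty; an L1 penalty has to be added to the loss by hand. A minimal sketch using the coefficients above (the model is an assumed placeholder):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model
l1, l2 = 1e-5, 1e-4       # penalty strengths from above

def regularized_loss(base_loss):
    # L1 penalizes |weight|; L2 penalizes weight^2.
    l1_term = sum(p.abs().sum() for p in model.parameters())
    l2_term = sum(p.pow(2).sum() for p in model.parameters())
    return base_loss + l1 * l1_term + l2 * l2_term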

L2 and L1 penalize weights differently:

  • L2 penalizes weight^2.

  • L1 penalizes |weight|.

Consequently, L2 and L1 have different derivatives:

  • The derivative of L2 is 2 * weight.

  • The derivative of L1 is k (a constant, whose value is independent of weight).
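
Both derivatives are easy to verify with autograd (a toy sketch on a single scalar weight, which is an assumption for illustration):

import torch

w = torch.tensor(3.0, requires_grad=True)

(w ** 2).backward()
print(w.grad)   # tensor(6.) = 2 * weight, so L2's pull shrinks as w -> 0

w.grad = None
w.abs().backward()
print(w.grad)   # tensor(1.) = sign(weight), a constant pull toward 0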

Weight decay: a regularization technique (such as L₂ regularization) that results in gradient descent shrinking the weights on every iteration.
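
Written out for plain gradient descent, the shrinkage is explicit: w <- w - lr * (grad + λ * w) = (1 - lr * λ) * w - lr * grad, so every iteration multiplies the weights by (1 - lr * λ) < 1. A hand-rolled sketch (lr, lam, and the model are assumed placeholders):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # placeholder model
lr, lam = 1e-2, 1e-4       # assumed learning rate and decay strength

loss = model(torch.randn(8, 10)).pow(2).mean()  # dummy loss
loss.backward()

# One SGD step with L2 weight decay: w <- (1 - lr*lam) * w - lr * grad
with torch.no_grad():
    for p in model.parameters():
        p -= lr * (p.grad + lam * p)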

Link to Google Developers machine learning glossary
