Left: the learning rate must be chosen carefully, or else the updates will oscillate around the optimal point
Right: with normalized data, a bigger learning rate can be used and learning is faster
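A minimal sketch of this effect, assuming a toy quadratic loss. The curvature values, the learning rates, and the `gradient_descent` helper are illustrative and not from the notes. A learning rate that works on round contours makes the updates blow up on stretched contours.

```python
import numpy as np

def gradient_descent(curvature, lr, steps=50):
    """Run plain gradient descent on loss(w) = 0.5 * sum(curvature * w**2)."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        grad = curvature * w          # gradient of the quadratic loss
        w = w - lr * grad
    return w

elongated = np.array([1.0, 100.0])    # stand-in for unnormalized data: stretched contours
round_ = np.array([1.0, 1.0])         # stand-in for normalized data: round contours

# lr = 0.5 converges on the round loss but diverges on the elongated one
w_round = gradient_descent(round_, lr=0.5)
w_bad = gradient_descent(elongated, lr=0.5)

# the elongated loss needs a much smaller lr, which is slow along the flat direction
w_small = gradient_descent(elongated, lr=0.015)
```

With the small learning rate, the steep coordinate settles quickly while the flat coordinate is still far from the optimum after 50 steps, which is the slow learning the left picture warns about.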
Two steps:
Zero-center: subtract the mean, so all features are centered around 0
Normalize variance: divide by the standard deviation, to make it round (variance around 1)
x = x - np.mean(x, axis=0)
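The two steps together can be sketched as follows, assuming `x` is a 2-D array with one row per example and one column per feature. The `standardize` name, the `eps` guard against division by zero, and the sample data are illustrative additions.

```python
import numpy as np

def standardize(x, eps=1e-8):
    """Zero-center each feature, then scale it to roughly unit variance."""
    mean = np.mean(x, axis=0)         # per-feature mean
    std = np.std(x, axis=0)           # per-feature standard deviation
    return (x - mean) / (std + eps), mean, std

# two features on very different scales
x = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])
x_norm, mean, std = standardize(x)
```

The function returns `mean` and `std` so the same statistics, computed on the training set, can be reused to normalize validation and test data.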