Why is setting all weights and biases to 0 a bad idea?
As $a^l_j=\sigma\!\left(\sum_k w^l_{jk}\,a^{l-1}_k+b^l_j\right)$, all nodes in a layer produce the same activation. Although the weights and biases could later take different values when the scalar components of the input vector differ, all nodes within the same layer receive identical gradients, so the symmetry between them is never broken and this is still not a good idea for general inputs.
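Below is a minimal NumPy sketch of this symmetry problem. The 4-3-1 sigmoid network, the input values, and the squared-error target are all made up for illustration; the point is that the hidden activations come out identical, and so do their gradients.

```python
import numpy as np

# Hypothetical 4-3-1 sigmoid network with squared-error loss, everything
# initialized to 0, to show that hidden units never differentiate.

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, -1.3, 0.7, 0.5])      # arbitrary input with distinct components
y = 1.0                                  # made-up target

W1, b1 = np.zeros((3, 4)), np.zeros(3)   # hidden layer
W2, b2 = np.zeros((1, 3)), np.zeros(1)   # output layer

# Forward pass
a1 = sigmoid(W1 @ x + b1)                # all entries equal 0.5
a2 = sigmoid(W2 @ a1 + b2)

# Backward pass
d2 = (a2 - y) * a2 * (1 - a2)            # dL/dz2
dW2 = np.outer(d2, a1)                   # identical entry for every hidden unit
d1 = (W2.T @ d2) * a1 * (1 - a1)         # exactly zero here, since W2 is zero
dW1 = np.outer(d1, x)                    # identical (here zero) row per hidden unit

print("hidden activations:", a1)                         # [0.5 0.5 0.5]
print("dW2:", dW2)                                       # same value per column
print("rows of dW1 identical:", np.allclose(dW1, dW1[0]))
```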
Why is giving $w^l_j$ a variance of $\frac{1}{n_{\text{in}}}$ a better idea?
Estimation of the variance of $z^l_j$, the pre-activation of node $j$ at layer $l$:
$$
\begin{aligned}
\operatorname{Var}\!\left(z^l_j\right) &= \operatorname{Var}\!\left(\sum_{k=1}^{n_{\text{in}}} w^l_{jk}\,a^{l-1}_k\right) && \text{assuming that } b^l_j = 0 \\
&= \sum_{k=1}^{n_{\text{in}}} \operatorname{Var}\!\left(w^l_{jk}\,a^{l-1}_k\right) && \text{assuming that the terms are independent} \\
&= \sum_{k=1}^{n_{\text{in}}} \operatorname{Var}\!\left(w^l_{jk}\right)\operatorname{Var}\!\left(a^{l-1}_k\right) && \text{assuming that } w^l_{jk} \text{ and } a^{l-1}_k \text{ have zero mean} \\
&= n_{\text{in}}\operatorname{Var}\!\left(w^l_j\right)\operatorname{Var}\!\left(a^{l-1}\right) && \text{assuming that } w^l_{jk} \text{ and } a^{l-1}_k \text{ are identically distributed}
\end{aligned}
$$

For $a^l_j=\sigma(z^l_j)$ to have the same variance as the previous layer's activations, $w^l_j$ should have a variance of $\frac{1}{n_{\text{in}}}$. Here $n_{\text{in}}$ is the number of nodes in layer $l-1$, the layer right before layer $l$.
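As a quick numerical sanity check of this derivation (a sketch with made-up layer sizes, not part of the original argument), the following compares $\operatorname{Var}(z)$ for weights drawn with variance $1$ versus variance $\frac{1}{n_{\text{in}}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 512, 10_000

# Zero-mean "activations" coming out of layer l-1, with unit variance.
a_prev = rng.normal(0.0, 1.0, size=(batch, n_in))

W_naive = rng.normal(0.0, 1.0, size=(n_in, n_out))                    # Var(w) = 1
W_scaled = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))   # Var(w) = 1/n_in

print("Var(a^{l-1}):          ", a_prev.var())               # ~1
print("Var(z) with Var(w)=1:  ", (a_prev @ W_naive).var())   # ~n_in
print("Var(z) with Var(w)=1/n:", (a_prev @ W_scaled).var())  # ~1
```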
If ReLU activation is used:
$$
\begin{aligned}
\operatorname{Var}\!\left(z^l_j\right) &= \operatorname{Var}\!\left(\sum_{k=1}^{n_{\text{in}}} w^l_{jk}\,a^{l-1}_k\right) && \text{assuming that } b^l_j = 0 \\
&= \frac{n_{\text{in}}}{2}\operatorname{Var}\!\left(w^l_j\right)\operatorname{Var}\!\left(a^{l-1}\right) && \text{as half of the activations are killed by ReLU}
\end{aligned}
$$

For $a^l_j=\sigma(z^l_j)$ to have the same variance, $w^l_j$ should have a variance of $\frac{2}{n_{\text{in}}}$.
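The same kind of check for the ReLU case, assuming a deep stack of identical hypothetical layers: with $\operatorname{Var}(w)=\frac{1}{n_{\text{in}}}$ the signal variance shrinks layer by layer, while $\operatorname{Var}(w)=\frac{2}{n_{\text{in}}}$ (He initialization) keeps it roughly constant.

```python
import numpy as np

# Sketch: push a batch through `depth` hypothetical ReLU layers of width `n`
# and report the variance of the surviving activations.

rng = np.random.default_rng(0)
n, depth, batch = 512, 20, 2000
relu = lambda z: np.maximum(z, 0.0)

def forward_variance(weight_var):
    a = rng.normal(0.0, 1.0, size=(batch, n))
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(weight_var), size=(n, n))
        a = relu(a @ W)
    return a.var()

print("Var(w) = 1/n:", forward_variance(1.0 / n))   # shrinks toward 0
print("Var(w) = 2/n:", forward_variance(2.0 / n))   # stays roughly constant
```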
Why are non-zero-centered activations bad?
$$
\frac{\partial L}{\partial w^l_j}
= \frac{\partial L}{\partial a^l_j}\,\frac{\partial a^l_j}{\partial z^l_j}\,\frac{\partial z^l_j}{\partial w^l_j}
= \frac{\partial L}{\partial a^l_j}\,\frac{\partial a^l_j}{\partial z^l_j}\,a^{l-1}
$$

If the activations $a^{l-1}$ are all positive (or all negative), every component of the gradient for $w^l_j$ shares the sign of the scalar $\frac{\partial L}{\partial a^l_j}\frac{\partial a^l_j}{\partial z^l_j}$, so the weights are always updated in the same direction. This forces zig-zagging update paths and can slow down convergence.
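A small sketch with made-up numbers: since the gradient for $w^l_j$ is a single scalar times $a^{l-1}$, an all-positive $a^{l-1}$ forces every component of that gradient to share the scalar's sign.

```python
import numpy as np

# Hypothetical values: a^{l-1} from a sigmoid layer (all positive) and a
# single upstream scalar dL/da^l_j * da^l_j/dz^l_j.
a_prev = np.array([0.9, 0.1, 0.4, 0.7])
upstream = -0.3

grad_w = upstream * a_prev                        # gradient w.r.t. the whole row w^l_j
print(grad_w)                                     # every component is negative
print(np.all(grad_w > 0) or np.all(grad_w < 0))   # True: one common direction
```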