Why is setting all weights and biases to 0 a bad idea?¶
Since $a_j^l = \sigma\big(\sum _k w_{jk}^l a_k^{l-1} +b_j^l\big)$, every node in a layer produces the same activation $\sigma(0)$. Worse, nodes in the same layer also receive identical gradients, so they stay identical after every update. Although the weights feeding a single node may later come to differ across input components when those components differ, the nodes within a layer remain interchangeable and cannot learn distinct features, so this is a bad idea for general inputs. A small experiment illustrating this symmetry is sketched below.
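A minimal numpy sketch of the symmetry argument; the network size, data, and learning rate are arbitrary assumptions for illustration. A one-hidden-layer net with sigmoid hidden units and a linear output is initialized to all zeros and trained with plain gradient descent; the columns of the first weight matrix (one per hidden unit) stay identical no matter how many steps are taken.

```python
# Sketch: zero initialization never breaks the symmetry between hidden units.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))               # 8 samples, 3 inputs (arbitrary)
y = rng.normal(size=(8, 1))

W1 = np.zeros((3, 4)); b1 = np.zeros(4)   # all-zero initialization
W2 = np.zeros((4, 1)); b2 = np.zeros(1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(100):
    # forward pass
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    out = a1 @ W2 + b2
    # backward pass for 0.5 * mean squared error
    d_out = (out - y) / len(X)
    d_a1 = d_out @ W2.T
    d_z1 = d_a1 * a1 * (1 - a1)
    # gradient step
    W2 -= 0.1 * (a1.T @ d_out); b2 -= 0.1 * d_out.sum(0)
    W1 -= 0.1 * (X.T @ d_z1);   b1 -= 0.1 * d_z1.sum(0)

# Every column of W1 (one column per hidden unit) is still identical:
# the hidden units have learned exactly the same feature.
print(W1)
```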
Why is giving $w^l_{jk}$ a variance of $\frac{1}{n_{in}}$ a better idea?¶
Estimate of the variance of the weighted input $z^l_j$ at layer $l$:
\begin{align*}
Var\big(z^l_j\big) &= Var\Big(\sum_{k=1}^{n_{in}} w_{jk}^l a_k^{l-1}\Big) \text{ assuming that $b^l_j=0$}\\
&= \sum_{k=1}^{n_{in}} Var\big(w_{jk}^l a_k^{l-1}\big) \text{ assuming that the terms are independent across $k$} \\
&= \sum_{k=1}^{n_{in}} Var\big(w_{jk}^l\big) Var\big(a_k^{l-1}\big) \text{ assuming that $w_{jk}^l$ and $a_k^{l-1}$ are independent with zero mean} \\
&= n_{in} Var\big(w_{jk}^l\big) Var\big(a^{l-1}\big) \text{ assuming that the $w_{jk}^l$ and the $a_k^{l-1}$ are identically distributed}
\end{align*}

For the variance of the activation $a^l_j = \sigma(z^l_j)$ to stay the same from layer to layer, $Var\big(z^l_j\big)$ should match $Var\big(a^{l-1}\big)$, so $w^l_{jk}$ should have a variance of $\frac{1}{n_{in}}$. Here $n_{in}$ is the number of nodes in layer $l-1$, the layer right before layer $l$.
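A quick numerical check of the last line of the derivation; the input variance and layer width are arbitrary assumptions. With zero-mean weights of variance $\frac{1}{n_{in}}$, the variance of $z^l_j$ comes out approximately equal to the variance of the previous layer's activations.

```python
# Sketch: Var(z) ≈ n_in * Var(w) * Var(a), so Var(w) = 1/n_in preserves variance.
import numpy as np

rng = np.random.default_rng(0)
n_in = 1000
a_prev = rng.normal(scale=2.0, size=(5000, n_in))          # stand-in for a^{l-1}, Var = 4
W = rng.normal(scale=np.sqrt(1.0 / n_in), size=(n_in, 1))  # zero-mean, Var(w) = 1/n_in
z = a_prev @ W                                             # bias assumed 0, as in the derivation

print("Var(a^{l-1}) =", a_prev.var())   # ≈ 4
print("Var(z^l)     =", z.var())        # ≈ n_in * (1/n_in) * 4 = 4
```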
If ReLU activation is used:
\begin{align*}
Var\big(z^l_j\big) &= Var\Big(\sum_{k=1}^{n_{in}} w_{jk}^l a_k^{l-1}\Big) \text{ assuming that $b^l_j=0$}\\
&= n_{in} Var\big(w_{jk}^l\big) E\big[(a^{l-1})^2\big] \text{ as before, except that $a^{l-1} = \max(0, z^{l-1})$ no longer has zero mean}\\
&= \frac{n_{in}}{2} Var\big(w_{jk}^l\big) Var\big(z^{l-1}\big) \text{ as ReLU kills the negative half of $z^{l-1}$, so $E\big[(a^{l-1})^2\big] = \tfrac{1}{2} Var\big(z^{l-1}\big)$}
\end{align*}

For the variance of $z^l_j$ to stay constant across layers, $w^l_{jk}$ should have a variance of $\frac{2}{n_{in}}$.
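A sketch of how this plays out over depth; the width, batch size, and depth are arbitrary assumptions. Propagating a batch through stacked ReLU layers, the pre-activation variance roughly halves per layer with $Var(w) = \frac{1}{n_{in}}$ but stays roughly constant with $Var(w) = \frac{2}{n_{in}}$.

```python
# Sketch: variance propagation through 10 ReLU layers under two weight scales.
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.normal(size=(2000, n))

for var_w, label in [(1.0 / n, "Var(w)=1/n_in"), (2.0 / n, "Var(w)=2/n_in")]:
    a = x
    for layer in range(10):
        W = rng.normal(scale=np.sqrt(var_w), size=(n, n))  # zero-mean weights
        z = a @ W                                          # bias assumed 0
        a = np.maximum(z, 0.0)                             # ReLU kills the negative half
    print(f"{label}: Var(z^10) ≈ {z.var():.4f}")           # ≈ (1/2)^9 vs ≈ constant
```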
Why are non-zero-centered activations bad?¶
\begin{align*}
\frac{\partial L}{\partial w^l_{jk}} &= \frac{\partial L}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}}\\
&= \frac{\partial L}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j} \, a_k^{l-1}
\end{align*}

If the activations $a_k^{l-1}$ are all positive (or all negative), every component of the gradient with respect to $w^l_j$ shares the sign of the common factor $\frac{\partial L}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j}$, so all weights into node $j$ are updated in the same direction. The weight vector can then only move along a restricted set of directions, which leads to inefficient zig-zag updates.
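A tiny sketch of this sign argument for a single node; the upstream factor is just a random scalar standing in for $\frac{\partial L}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j}$. Because the incoming activations are sigmoid outputs and hence all positive, every gradient component inherits the sign of that one factor.

```python
# Sketch: with all-positive inputs, all components of dL/dw_j share one sign.
import numpy as np

rng = np.random.default_rng(0)
a_prev = 1.0 / (1.0 + np.exp(-rng.normal(size=6)))   # sigmoid outputs: all positive
upstream = rng.normal()                              # stand-in for dL/da * da/dz (a scalar)

grad_w = upstream * a_prev                           # dL/dw_{jk} = upstream * a_k^{l-1}
print("a^{l-1} signs: ", np.sign(a_prev))            # all +1
print("gradient signs:", np.sign(grad_w))            # all equal to sign(upstream)
```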