Why is setting all weights and biases to 0 a bad idea?¶
Since $a_j^l = \sigma\big(\sum _k w_{jk}^l a_k^{l-1} +b_j^l\big)$, every node in a layer produces the same activation $\sigma(0)$. Worse, nodes in the same layer also receive identical gradients, so they stay identical after every update. Although the weights feeding a single node may later come to differ across input components when those components differ, the nodes within a layer remain interchangeable and cannot learn distinct features, so this is a bad idea for general inputs. A small experiment illustrating this symmetry is sketched below.
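A minimal numpy sketch of the symmetry argument; the network size, data, and learning rate are arbitrary assumptions for illustration. A one-hidden-layer net with sigmoid hidden units and a linear output is initialized to all zeros and trained with plain gradient descent; the columns of the first weight matrix (one per hidden unit) stay identical no matter how many steps are taken.

```python
# Sketch: zero initialization never breaks the symmetry between hidden units.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))               # 8 samples, 3 inputs (arbitrary)
y = rng.normal(size=(8, 1))

W1 = np.zeros((3, 4)); b1 = np.zeros(4)   # all-zero initialization
W2 = np.zeros((4, 1)); b2 = np.zeros(1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(100):
    # forward pass
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    out = a1 @ W2 + b2
    # backward pass for 0.5 * mean squared error
    d_out = (out - y) / len(X)
    d_a1 = d_out @ W2.T
    d_z1 = d_a1 * a1 * (1 - a1)
    # gradient step
    W2 -= 0.1 * (a1.T @ d_out); b2 -= 0.1 * d_out.sum(0)
    W1 -= 0.1 * (X.T @ d_z1);   b1 -= 0.1 * d_z1.sum(0)

# Every column of W1 (one column per hidden unit) is still identical:
# the hidden units have learned exactly the same feature.
print(W1)
```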
Why is giving $w^l_{jk}$ a variance of $\frac{1}{n_{in}}$ a better idea?¶
Estimate of the variance of the weighted input $z^l_j$ at layer $l$:
\begin{align*}
Var\big(z^l_j\big) &= Var\Big(\sum_{k=1}^{n_{in}} w_{jk}^l a_k^{l-1}\Big) \text{ assuming that $b^l_j=0$}\\
&= \sum_{k=1}^{n_{in}} Var\big(w_{jk}^l a_k^{l-1}\big) \text{ assuming that the terms are independent across $k$} \\
&= \sum_{k=1}^{n_{in}} Var\big(w_{jk}^l\big) Var\big(a_k^{l-1}\big) \text{ assuming that $w_{jk}^l$ and $a_k^{l-1}$ are independent with zero mean} \\
&= n_{in} Var\big(w_{jk}^l\big) Var\big(a^{l-1}\big) \text{ assuming that the $w_{jk}^l$ and the $a_k^{l-1}$ are identically distributed}
\end{align*}

For the variance of the activation $a^l_j = \sigma(z^l_j)$ to stay the same from layer to layer, $Var\big(z^l_j\big)$ should match $Var\big(a^{l-1}\big)$, so $w^l_{jk}$ should have a variance of $\frac{1}{n_{in}}$. Here $n_{in}$ is the number of nodes in layer $l-1$, the layer right before layer $l$.
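A quick numerical check of the last line of the derivation; the input variance and layer width are arbitrary assumptions. With zero-mean weights of variance $\frac{1}{n_{in}}$, the variance of $z^l_j$ comes out approximately equal to the variance of the previous layer's activations.

```python
# Sketch: Var(z) ≈ n_in * Var(w) * Var(a), so Var(w) = 1/n_in preserves variance.
import numpy as np

rng = np.random.default_rng(0)
n_in = 1000
a_prev = rng.normal(scale=2.0, size=(5000, n_in))          # stand-in for a^{l-1}, Var = 4
W = rng.normal(scale=np.sqrt(1.0 / n_in), size=(n_in, 1))  # zero-mean, Var(w) = 1/n_in
z = a_prev @ W                                             # bias assumed 0, as in the derivation

print("Var(a^{l-1}) =", a_prev.var())   # ≈ 4
print("Var(z^l)     =", z.var())        # ≈ n_in * (1/n_in) * 4 = 4
```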
If ReLU activation is used:
\begin{align*}
Var\big(z^l_j\big) &= Var\Big(\sum_{k=1}^{n_{in}} w_{jk}^l a_k^{l-1}\Big) \text{ assuming that $b^l_j=0$}\\
&= n_{in} Var\big(w_{jk}^l\big) E\big[(a^{l-1})^2\big] \text{ as before, except that $a^{l-1} = \max(0, z^{l-1})$ no longer has zero mean}\\
&= \frac{n_{in}}{2} Var\big(w_{jk}^l\big) Var\big(z^{l-1}\big) \text{ as ReLU kills the negative half of $z^{l-1}$, so $E\big[(a^{l-1})^2\big] = \tfrac{1}{2} Var\big(z^{l-1}\big)$}
\end{align*}

For the variance of $z^l_j$ to stay constant across layers, $w^l_{jk}$ should have a variance of $\frac{2}{n_{in}}$.
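A sketch of how this plays out over depth; the width, batch size, and depth are arbitrary assumptions. Propagating a batch through stacked ReLU layers, the pre-activation variance roughly halves per layer with $Var(w) = \frac{1}{n_{in}}$ but stays roughly constant with $Var(w) = \frac{2}{n_{in}}$.

```python
# Sketch: variance propagation through 10 ReLU layers under two weight scales.
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.normal(size=(2000, n))

for var_w, label in [(1.0 / n, "Var(w)=1/n_in"), (2.0 / n, "Var(w)=2/n_in")]:
    a = x
    for layer in range(10):
        W = rng.normal(scale=np.sqrt(var_w), size=(n, n))  # zero-mean weights
        z = a @ W                                          # bias assumed 0
        a = np.maximum(z, 0.0)                             # ReLU kills the negative half
    print(f"{label}: Var(z^10) ≈ {z.var():.4f}")           # ≈ (1/2)^9 vs ≈ constant
```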
Why are non-zero-centered activations bad?¶
\begin{align*}
\frac{\partial L}{\partial w^l_{jk}} &= \frac{\partial L}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j} \frac{\partial z^l_j}{\partial w^l_{jk}}\\
&= \frac{\partial L}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j} \, a_k^{l-1}
\end{align*}

If the activations $a_k^{l-1}$ are all positive (or all negative), every component of the gradient with respect to $w^l_j$ shares the sign of the common factor $\frac{\partial L}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j}$, so all weights into node $j$ are updated in the same direction. The weight vector can then only move along a restricted set of directions, which leads to inefficient zig-zag updates.
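A tiny sketch of this sign argument for a single node; the upstream factor is just a random scalar standing in for $\frac{\partial L}{\partial a^l_j} \frac{\partial a^l_j}{\partial z^l_j}$. Because the incoming activations are sigmoid outputs and hence all positive, every gradient component inherits the sign of that one factor.

```python
# Sketch: with all-positive inputs, all components of dL/dw_j share one sign.
import numpy as np

rng = np.random.default_rng(0)
a_prev = 1.0 / (1.0 + np.exp(-rng.normal(size=6)))   # sigmoid outputs: all positive
upstream = rng.normal()                              # stand-in for dL/da * da/dz (a scalar)

grad_w = upstream * a_prev                           # dL/dw_{jk} = upstream * a_k^{l-1}
print("a^{l-1} signs: ", np.sign(a_prev))            # all +1
print("gradient signs:", np.sign(grad_w))            # all equal to sign(upstream)
```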