Sigmoid and Quadratic Cost
$$a^L = \frac{1}{1 + e^{-z^L}}$$
$$C = \frac{1}{2}\sum_j (a^L_j-y_j)^2$$
Backpropagating the loss to the biases and weights of the output layer:
$$\frac{\partial C}{\partial b_j^L} = \frac{\partial C}{\partial a_j^L}\frac{\partial a_j^L}{\partial b_j^L} = (a_j^L-y_j)\sigma'(z_j^L)$$
$$\frac{\partial C}{\partial w_{jk}^L} = \frac{\partial C}{\partial a_j^L}\frac{\partial a_j^L}{\partial w_{jk}^L} = a_k^{L-1}(a_j^L-y_j)\sigma'(z_j^L)$$
Since $\sigma'(z_j^L) = \sigma(z_j^L)(1- \sigma(z_j^L))$, both $\frac{\partial C}{\partial b_j^L}$ and $\frac{\partial C}{\partial w_{jk}^L}$ become small when $\sigma(z_j^L)\approx 0$ or $\sigma(z_j^L) \approx 1$. This is a problem when the neuron saturates at the wrong extreme: the output is badly wrong, yet the gradient is tiny, so learning is slow.
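As an illustration, here is a minimal NumPy sketch (with made-up values for the input activation, target, weight and bias, not taken from the text) showing that a single sigmoid output neuron saturated at the wrong extreme gets an almost-zero gradient even though its error is large:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Single output neuron: a = sigmoid(w*x + b), quadratic cost C = 0.5*(a - y)**2
x, y = 1.0, 0.0                        # input activation a^{L-1} and target (made-up values)
for w, b in [(0.5, 0.5), (5.0, 5.0)]:  # unsaturated vs. saturated at the wrong extreme
    z = w * x + b
    a = sigmoid(z)
    dC_db = (a - y) * sigmoid_prime(z)       # dC/db^L = (a^L - y) sigma'(z^L)
    dC_dw = x * (a - y) * sigmoid_prime(z)   # dC/dw^L = a^{L-1} (a^L - y) sigma'(z^L)
    print(f"z={z:+.2f}  a={a:.5f}  dC/db={dC_db:.6f}  dC/dw={dC_dw:.6f}")
```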
Sigmoid and Cross-entropy
$$a^L = \frac{1}{1 + e^{-z^L}}$$
$$C = -\sum_j \left[ y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right]$$
Backpropagating the loss to the biases and weights of the output layer:
$$\frac{\partial C}{\partial b_j^L} = \frac{\partial C}{\partial a_j^L}\frac{\partial a_j^L}{\partial b_j^L} = a_j^L-y_j$$
$$\frac{\partial C}{\partial w_{jk}^L} = \frac{\partial C}{\partial a_j^L}\frac{\partial a_j^L}{\partial w_{jk}^L} = a_k^{L-1}(a_j^L-y_j)$$
Here the $\sigma'(z_j^L)$ factor cancels, so learning does not slow down when the output neuron saturates at the wrong extreme.
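A quick sanity check of these expressions (a NumPy sketch with hypothetical random parameters and a hand-picked target): compute the analytic gradients and compare one component against a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(a, y):
    return -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

rng = np.random.default_rng(0)
a_prev = rng.random(3)                              # a^{L-1} (hypothetical previous-layer activations)
W, b = rng.normal(size=(2, 3)), rng.normal(size=2)  # hypothetical output-layer parameters
y = np.array([1.0, 0.0])                            # targets

z = W @ a_prev + b
a = sigmoid(z)

dC_db = a - y                                       # dC/db^L: no sigma'(z) factor
dC_dW = np.outer(a - y, a_prev)                     # dC/dw_{jk}^L = a_k^{L-1} (a_j^L - y_j)

# Finite-difference check on one bias component
eps = 1e-6
b_eps = b.copy(); b_eps[0] += eps
numeric = (cross_entropy(sigmoid(W @ a_prev + b_eps), y) - cross_entropy(a, y)) / eps
print(dC_db[0], numeric)                            # the two numbers should agree closely
```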
Linear Output and Quadratic Cost
$$a^L = z^L$$
$$C = \frac{1}{2}\sum_j (a^L_j-y_j)^2$$
Backpropagating the loss to the biases and weights of the output layer:
$$\frac{\partial C}{\partial b_j^L} = \frac{\partial C}{\partial a_j^L}\frac{\partial a_j^L}{\partial b_j^L} = a_j^L-y_j$$
$$\frac{\partial C}{\partial w_{jk}^L} = \frac{\partial C}{\partial a_j^L}\frac{\partial a_j^L}{\partial w_{jk}^L} = a_k^{L-1}(a_j^L-y_j)$$
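The same kind of finite-difference check works for the linear output layer; again a sketch with hypothetical random parameters, not the author's code:

```python
import numpy as np

rng = np.random.default_rng(1)
a_prev = rng.random(3)                              # a^{L-1}
W, b = rng.normal(size=(2, 3)), rng.normal(size=2)  # hypothetical output-layer parameters
y = np.array([0.5, -1.0])                           # arbitrary real-valued targets

cost = lambda out: 0.5 * np.sum((out - y) ** 2)     # quadratic cost
a = W @ a_prev + b                                  # linear output: a^L = z^L

dC_db = a - y                                       # dC/db^L = a^L - y
dC_dW = np.outer(a - y, a_prev)                     # dC/dw_{jk}^L = a_k^{L-1} (a_j^L - y_j)

# Finite-difference check on one weight
eps = 1e-6
W_eps = W.copy(); W_eps[0, 1] += eps
numeric = (cost(W_eps @ a_prev + b) - cost(a)) / eps
print(dC_dW[0, 1], numeric)                         # should agree closely
```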
Softmax Output and Log-likelihood
$$a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}$$
$$C = -\ln a^L_y$$
where $y$ is the target class, so that in the formulas below $y_j = 1$ for $j = y$ and $y_j = 0$ otherwise.
Backpropagating the loss to the biases and weights of the output layer (because the softmax couples the outputs, $a^L_y$ depends on every $z^L_j$, yet the result takes the same form):
$$\frac{\partial C}{\partial b_j^L} = a_j^L-y_j$$
$$\frac{\partial C}{\partial w_{jk}^L} = a_k^{L-1}(a_j^L-y_j)$$
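And the softmax / log-likelihood pair, again as a sketch with hypothetical random parameters and a one-hot target:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                         # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(2)
a_prev = rng.random(3)                              # a^{L-1}
W, b = rng.normal(size=(4, 3)), rng.normal(size=4)  # hypothetical output-layer parameters
y = np.array([0.0, 1.0, 0.0, 0.0])                  # one-hot target, correct class is index 1

z = W @ a_prev + b
a = softmax(z)

dC_db = a - y                                       # dC/db^L = a^L - y
dC_dW = np.outer(a - y, a_prev)                     # dC/dw_{jk}^L = a_k^{L-1} (a_j^L - y_j)

# Finite-difference check on one bias component
eps = 1e-6
b_eps = b.copy(); b_eps[2] += eps
loglik = lambda out: -np.log(out[np.argmax(y)])     # C = -ln a^L_y
numeric = (loglik(softmax(W @ a_prev + b_eps)) - loglik(a)) / eps
print(dC_db[2], numeric)                            # should agree closely
```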