Basic idea

The basic idea is to treat detection as a pure regression problem. The image is divided into a grid, and each grid cell is responsible for predicting up to 5 objects whose centers lie inside that cell. Two key tricks improve network stability:

Using anchors. The anchors are learned from the target dataset using dimension clustering. A set of 5 anchors is used, one for each of the 5 objects that a grid cell is responsible for predicting.
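A minimal sketch of how such anchors could be obtained, assuming that dimension clustering means k-means over the ground-truth box sizes with 1 - IOU as the distance (the variant used for YOLOv2). The names `iou_wh` and `cluster_anchors` are hypothetical helpers for illustration only, and the boxes are assumed to be given as `(width, height)` pairs:

```python
import random


def iou_wh(a, b):
    """IOU of two (w, h) boxes, assuming both are centered at the same point."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def cluster_anchors(boxes, k, iters=100, seed=0):
    """k-means over (w, h) pairs using 1 - IOU as the distance (assumed variant)."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # start from k distinct training boxes
    for _ in range(iters):
        # Assign each box to the centroid with the highest IOU (lowest 1 - IOU).
        groups = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda c: iou_wh(box, centroids[c]))
            groups[best].append(box)
        # Move each centroid to the mean (w, h) of its group; keep it if the group is empty.
        updated = [
            (sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
        if updated == centroids:  # assignments no longer change anything
            break
        centroids = updated
    return sorted(centroids)
```

For the setting described here, one would call `cluster_anchors(boxes, k=5)` on the widths and heights of all training boxes, expressed in the same units as the network output.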

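Coordinate normalization. The network does not predict x, y, w, h and the objectness confidence directly. It outputs raw values $t_x, t_y, t_w, t_h, t_0$, which are transformed into the final box $b$ as follows:

$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
$$b_w = p_w e^{t_w}$$
$$b_h = p_h e^{t_h}$$
$$P(\text{object}) \times \text{IOU}(b, \text{object}) = \sigma(t_0)$$

Here $\sigma$ is the logistic sigmoid, $(c_x, c_y)$ is the offset of the grid cell from the top-left corner of the image, and $(p_w, p_h)$ are the width and height of the anchor. Because $\sigma$ maps into $(0, 1)$, the predicted center always stays inside the responsible cell, which is what stabilizes training.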
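The transformation above can be sketched in a few lines of plain Python. This is only an illustration: `decode_box` is a hypothetical helper, and it assumes that $c_x, c_y$ are the cell offsets in grid units and that $p_w, p_h$ are the anchor dimensions in the same units.

```python
import math


def sigmoid(t):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, t_0, c_x, c_y, p_w, p_h):
    """Turn raw network outputs into a box, following the equations above."""
    b_x = sigmoid(t_x) + c_x      # center x: stays within the cell starting at c_x
    b_y = sigmoid(t_y) + c_y      # center y: stays within the cell starting at c_y
    b_w = p_w * math.exp(t_w)     # width: anchor width scaled by a positive factor
    b_h = p_h * math.exp(t_h)     # height: anchor height scaled by a positive factor
    confidence = sigmoid(t_0)     # P(object) * IOU(b, object)
    return b_x, b_y, b_w, b_h, confidence


# All-zero raw outputs give the cell center, the unscaled anchor and confidence 0.5.
print(decode_box(0.0, 0.0, 0.0, 0.0, 0.0, c_x=3, c_y=4, p_w=2.0, p_h=1.5))
# → (3.5, 4.5, 2.0, 1.5, 0.5)
```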

Network architecture

Loss function

$$\begin{multline}
\lambda_\text{coord} \sum_{i = 0}^{S^2} \sum_{j = 0}^{B} L_{ij}^{\text{obj}} \left[ \left( x_i - \hat{x}_i \right)^2 + \left( y_i - \hat{y}_i \right)^2 \right] \\
+ \lambda_\text{coord} \sum_{i = 0}^{S^2} \sum_{j = 0}^{B} L_{ij}^{\text{obj}} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\
+ \sum_{i = 0}^{S^2} \sum_{j = 0}^{B} L_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
+ \lambda_\text{noobj} \sum_{i = 0}^{S^2} \sum_{j = 0}^{B} L_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
+ \sum_{i = 0}^{S^2} L_i^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{multline}$$
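To make the five terms concrete, here is a rough pure-Python sketch of the same sum. It rests on several assumptions that the formula itself does not spell out: $S^2$ is the number of grid cells and $B$ the number of boxes per cell, $L_{ij}^{\text{obj}}$ is read as an indicator that box $j$ of cell $i$ is responsible for an object (with $L_{ij}^{\text{noobj}}$ as its complement), and $L_i^{\text{obj}}$ is taken to be 1 when any box of cell $i$ is responsible. The function name `yolo_loss`, the dictionary layout, and the default weights 5.0 and 0.5 are illustrative choices, not something taken from this post.

```python
import math


def yolo_loss(target, pred, responsible, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared-error loss with the five terms of the formula above.

    target[i] and pred[i] describe cell i as a dict with
      "boxes":   list of (x, y, w, h, C) tuples, one per box j (w, h must be >= 0)
      "classes": list of class probabilities
    responsible[i][j] is True when box j of cell i is responsible for an object.
    """
    loss = 0.0
    for i, (t_cell, p_cell) in enumerate(zip(target, pred)):
        for j, (t, p) in enumerate(zip(t_cell["boxes"], p_cell["boxes"])):
            tx, ty, tw, th, tc = t
            px, py, pw, ph, pc = p
            if responsible[i][j]:
                # Term 1: center coordinates.
                loss += lambda_coord * ((tx - px) ** 2 + (ty - py) ** 2)
                # Term 2: square roots of width and height.
                loss += lambda_coord * (
                    (math.sqrt(tw) - math.sqrt(pw)) ** 2
                    + (math.sqrt(th) - math.sqrt(ph)) ** 2
                )
                # Term 3: confidence of responsible boxes.
                loss += (tc - pc) ** 2
            else:
                # Term 4: confidence of boxes without an object, down-weighted.
                loss += lambda_noobj * (tc - pc) ** 2
        # Term 5: class probabilities, only for cells that contain an object (assumed rule).
        if any(responsible[i]):
            loss += sum(
                (tp - pp) ** 2 for tp, pp in zip(t_cell["classes"], p_cell["classes"])
            )
    return loss
```

Since every term is a squared difference, it does not matter for the value of the loss which of the hatted and unhatted quantities is the prediction. A Keras version would express the same sums with tensor operations and masks instead of Python loops.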

Basic Yolo with Keras
