<head> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ tex2jax: { skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'], inlineMath: [['$','$']] } }); </script> </head>
<center>Photo by <a style="background-color:black;color:white;text-decoration:none;padding:4px 6px;font-family:-apple-system, BlinkMacSystemFont, "San Francisco", "Helvetica Neue", Helvetica, Ubuntu, Roboto, Noto, "Segoe UI", Arial, sans-serif;font-size:12px;font-weight:bold;line-height:1.2;display:inline-block;border-radius:3px" href="https://unsplash.com/photos/9O3_JJOT3As" target="_blank" rel="noopener noreferrer" title="Download free do whatever you want high-resolution photos from Jess Barnett"><span style="display:inline-block;padding:2px 3px"><svg xmlns="http://www.w3.org/2000/svg" style="height:12px;width:auto;position:relative;vertical-align:middle;top:-2px;fill:white" viewBox="0 0 32 32"><title>unsplash-logo</title><path d="M10 9V0h12v9H10zm12 5h10v18H0V14h10v9h12v-9z"></path></svg></span><span style="display:inline-block;padding:2px 3px">Jess Barnett</span></a></center>
Origin: You Only Look Once: Unified, Real-Time Object Detection
Abstract
- Frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities
- A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation
- YOLO makes more localization errors but is less likely to predict false positives on background
The YOLO Detection System
Processing images with YOLO is simple and straightforward
- resizes the input image to 448 × 448
- runs a single convolutional network on the image
- thresholds the resulting detections by the model’s confidence
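The last step above can be sketched in a few lines. This is only an illustration: the detection dictionaries, the `filter_detections` helper, and the threshold value of 0.2 are assumptions made for the example, not values taken from the paper.

```python
def filter_detections(detections, threshold=0.2):
    """Keep only detections whose confidence is at least `threshold`.

    `threshold=0.2` is an illustrative assumption, not a value from the paper.
    """
    return [d for d in detections if d["confidence"] >= threshold]


# Hypothetical network output after decoding
detections = [
    {"label": "dog", "confidence": 0.85},
    {"label": "bicycle", "confidence": 0.10},
    {"label": "car", "confidence": 0.45},
]

kept = filter_detections(detections)
kept_labels = [d["label"] for d in kept]
print(kept_labels)
```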
The Model
Procedure
- It divides the image into an S × S grid [448 × 448 -> 7 × 7]<br>
If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
- Each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities.<br>
Bounding Box
: x, y (box center), w, h (box width and height)
<br>Confidence
: $Pr(object) \cdot IoU^{pred}_{truth}$
- Final output tensor: S × S × (B ∗ 5 + C)
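As a rough sketch of the procedure above, the snippet below computes the output tensor shape and finds which grid cell is responsible for an object. The helper names are hypothetical, and C = 20 is an assumption (the number of classes depends on the dataset); B = 2 matches the two boxes per cell used later in these notes.

```python
def output_shape(S, B, C):
    """Shape of the final output tensor: S x S x (B * 5 + C)."""
    return (S, S, B * 5 + C)


def responsible_cell(cx, cy, image_size=448, S=7):
    """Return (row, col) of the grid cell that contains the object center (cx, cy).

    Coordinates are in pixels of the resized image.
    """
    cell = image_size / S
    col = min(int(cx // cell), S - 1)  # clamp centers lying on the right edge
    row = min(int(cy // cell), S - 1)  # clamp centers lying on the bottom edge
    return row, col


print(output_shape(7, 2, 20))      # (7, 7, 30)
print(responsible_cell(224, 100))  # (1, 3), since each cell is 64 px wide
```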
The Loss Function
$$\begin{array}{c} \lambda_{\text{coord}} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left[\left(x_{i}-\hat{x}_{i}\right)^{2}+\left(y_{i}-\hat{y}_{i}\right)^{2}\right] \\ +\lambda_{\text{coord}} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \\ +\sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left(C_{i}-\hat{C}_{i}\right)^{2} \\ +\lambda_{\text{noobj}} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}}\left(C_{i}-\hat{C}_{i}\right)^{2} \\ +\sum_{i=0}^{S^{2}} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2} \end{array}$$
***i***: 0 to ($S^2$-1) [iterates over each grid cell (0 to 48)]<br>
***j***: 0 to (B-1) [iterates over each bbox in a cell (0 to 1)]<br>
$$1_{i j}^{\mathrm{obj}} \text{ and } 1_{i j}^{\mathrm{noobj}}:\left[\begin{array}{lllllll}
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0
\end{array}\right]\left[\begin{array}{lllllll}
1 & 1 & 1 & 1 & 1 & 1 & 0 \\
1 & 1 & 1 & 1 & 1 & 0 & 1 \\
1 & 1 & 1 & 1 & 0 & 1 & 1 \\
1 & 1 & 1 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 1 & 1 & 1 & 1 \\
1 & 0 & 1 & 1 & 1 & 1 & 1 \\
0 & 1 & 1 & 1 & 1 & 1 & 1
\end{array}\right]$$
For $1_{i j}^{\mathrm{obj}}$: each cell makes B predictions, and only the one with the largest IoU with the ground truth is labeled as 1.
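A minimal sketch of how this indicator could be computed for one cell that contains an object. It assumes boxes are given as (center x, center y, width, height) in the same units; the helper names `iou` and `responsible_mask` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def responsible_mask(pred_boxes, truth_box):
    """Return a 0/1 list over the B predictions: 1 only for the largest IoU."""
    ious = [iou(p, truth_box) for p in pred_boxes]
    best = max(range(len(ious)), key=ious.__getitem__)
    return [1 if j == best else 0 for j in range(len(pred_boxes))]


truth = (0.5, 0.5, 0.4, 0.4)
preds = [(0.5, 0.5, 0.2, 0.2), (0.55, 0.5, 0.4, 0.4)]  # B = 2 predictions
mask = responsible_mask(preds, truth)
print(mask)  # [0, 1]: the second box overlaps the ground truth more
```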
Coordinate Loss
$$\begin{array}{l} \lambda_{\text{coord}} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left[\left(x_{i}-\hat{x}_{i}\right)^{2}+\left(y_{i}-\hat{y}_{i}\right)^{2}\right] \\ \quad+\lambda_{\text{coord}} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \end{array}$$
- x, y: predicted bbox center
- w, h: predicted bbox width & height
- $\hat{x}, \hat{y}$: labeled bbox center
- $\hat{w}, \hat{h}$: labeled bbox width & height
- $\sqrt{w}, \sqrt{h}$: the square roots suppress the effect of larger bboxes, so the same absolute error costs less in a large box than in a small one
- $\lambda_{\text {coord }}$: set to 5. The coordinates take up only 8 dimensions per cell (4 values × 2 boxes), far fewer than the other terms, so the coordinate loss is up-weighted. Essentially a weighted loss.
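To make the effect of the square root concrete, here is a small sketch of the coordinate loss for a single responsible box. The function name and the example boxes are hypothetical; widths and heights are assumed to be non-negative (for example, normalized by the image size).

```python
import math


def coord_loss(pred, truth, lambda_coord=5.0):
    """Coordinate loss for one responsible box; boxes are (x, y, w, h)."""
    x, y, w, h = pred
    tx, ty, tw, th = truth
    center_term = (x - tx) ** 2 + (y - ty) ** 2
    size_term = (math.sqrt(w) - math.sqrt(tw)) ** 2 + (math.sqrt(h) - math.sqrt(th)) ** 2
    return lambda_coord * (center_term + size_term)


# Same absolute width error (0.05) on a small box and on a large box
small = coord_loss((0.5, 0.5, 0.15, 0.10), (0.5, 0.5, 0.10, 0.10))
large = coord_loss((0.5, 0.5, 0.85, 0.80), (0.5, 0.5, 0.80, 0.80))
print(small > large)  # True: the small box is penalized more
```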
Confidence Loss
$$\begin{array}{c} +\sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}\left(C_{i}-\hat{C}_{i}\right)^{2} \\ +\lambda_{\text{noobj}} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}}\left(C_{i}-\hat{C}_{i}\right)^{2} \end{array}$$
- $\hat{C}_{i}$: target confidence score [IoU] between the predicted box and the ground truth
- $C_{i}$: predicted confidence score [IoU] generated by the network
Note:
- $\hat{C}_{i}$ is an integer, either 0 or 1
- $\lambda_{\text {noobj }}$ = 0.5, because there are so many non-object bboxes
- Train: confidence = $Pr(object) \cdot IoU^{pred}_{truth}$
- Test: class-specific confidence prediction for each individual box:<br> confidence = $Pr(cls_{i} \mid obj) \cdot Pr(obj) \cdot IoU^{pred}_{truth}$
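The two pieces above can be sketched as follows. The targets follow the note that $\hat{C}_{i}$ is 0 or 1; the function names and the sample numbers are hypothetical, chosen only to illustrate the weighting and the test-time product.

```python
def confidence_loss(pred_conf, target_conf, obj_mask, lambda_noobj=0.5):
    """Confidence loss over a flat list of boxes.

    obj_mask[k] is 1 for a responsible (object) box and 0 otherwise;
    non-object boxes are down-weighted by lambda_noobj.
    """
    loss = 0.0
    for c, c_hat, is_obj in zip(pred_conf, target_conf, obj_mask):
        weight = 1.0 if is_obj else lambda_noobj
        loss += weight * (c - c_hat) ** 2
    return loss


def class_scores(class_probs, box_confidence):
    """Test-time score per class: Pr(cls_i | obj) times the box confidence."""
    return [p * box_confidence for p in class_probs]


loss = confidence_loss([0.8, 0.3, 0.1], [1.0, 0.0, 0.0], [1, 0, 0])
scores = class_scores([0.7, 0.2, 0.1], 0.5)
print(loss, scores)
```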
Classification loss
$$+\sum_{i=0}^{S^{2}} 1_{i}^{\mathrm{obj}} \sum_{c \in \text { classes }}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2}$$
Each cell will only predict 1 object, which is decided by the bbox with the largest IoU.<br>
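For a single cell that contains an object, the classification term reduces to a sum of squared differences over the classes. A minimal sketch, assuming a one-hot label vector and a hypothetical helper name:

```python
def class_loss(pred_probs, label_probs):
    """Sum of squared errors between predicted and labeled class probabilities."""
    return sum((p - p_hat) ** 2 for p, p_hat in zip(pred_probs, label_probs))


# Three classes for brevity; the label is one-hot on the first class
cell_loss = class_loss([0.7, 0.2, 0.1], [1.0, 0.0, 0.0])
print(cell_loss)
```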
▶ Don't forget to apply non-maximum suppression (NMS) after generating bboxes.
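A minimal sketch of greedy NMS, assuming boxes in (x1, y1, x2, y2) corner format and an IoU threshold of 0.5. Both the box format and the threshold are illustrative assumptions rather than values from the paper.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, visiting boxes from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]: the second box overlaps the first too much
```

In practice NMS is usually run per class, on the class-specific scores, so that overlapping boxes of different classes do not suppress each other.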
The YOLOv1 Pros & Cons
Pros:
- one stage, really fast
Cons:
- Bad for crowded objects [1 cell, 1 obj]
- Bad for small objects
- Bad for objects with new width-height ratios
- No batch normalization (BN)