HOME/Articles/

YOLOv1

Article Outline

<head> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ tex2jax: { skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'], inlineMath: [['$','$']] } }); </script> </head>

image

<center>Photo by <a style="background-color:black;color:white;text-decoration:none;padding:4px 6px;font-family:-apple-system, BlinkMacSystemFont, &quot;San Francisco&quot;, &quot;Helvetica Neue&quot;, Helvetica, Ubuntu, Roboto, Noto, &quot;Segoe UI&quot;, Arial, sans-serif;font-size:12px;font-weight:bold;line-height:1.2;display:inline-block;border-radius:3px" href="https://unsplash.com/photos/9O3_JJOT3As" target="_blank" rel="noopener noreferrer" title="Download free do whatever you want high-resolution photos from Jess Barnett"><span style="display:inline-block;padding:2px 3px"><svg xmlns="http://www.w3.org/2000/svg" style="height:12px;width:auto;position:relative;vertical-align:middle;top:-2px;fill:white" viewBox="0 0 32 32"><title>unsplash-logo</title><path d="M10 9V0h12v9H10zm12 5h10v18H0V14h10v9h12v-9z"></path></svg></span><span style="display:inline-block;padding:2px 3px">Jess Barnett</span></a></center>

Origin: You Only Look Once: Unified, Real-Time Object Detection

Abstract

  • As a regression problem to spatially separated bounding boxes and associated class probabilities
  • A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation
  • more localization errors but is less likely to predict false positives on background

The YOLO Detection System

Figure 1 Processing images with YOLO is simple and straightforward

  1. resizes the input image to 448 × 448
  2. runs a single convolutional network on the image
  3. thresholds the resulting detections bythe model’s confidence

The Model

Procedure

Figure 2

  1. It divides the image into an S × S grid [448 × 448 -> 7 x 7]<br>     If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
  2. Each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities.<br>    Bounding Box: x, y, w, h(center)<br>    Confidence: $Pr(object) \cdot IoU^{pred}_{truth}$
  3. Final output tensor: S × S × (B ∗ 5 + C)

The Loss Function

$$\begin{array}{c} \lambda_{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} 1_{i j}^{\text {obj }}\left[\left(x_{i}-\hat{x}{i}\right)^{2}+\left(y{i}-\hat{y}{i}\right)^{2}\right] \ +\lambda{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\text {obj }}\left[(\sqrt{w{i}}-\sqrt{\hat{w}{i}})^{2}+(\sqrt{h{i}}-\sqrt{\hat{h}{i}})^{2}\right] \ +\sum{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\text {obj }}\left(C{i}-\hat{C}{i}\right)^{2} \ +\lambda{\text {noobj }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\text {noobj }}\left(C{i}-\hat{C}{i}\right)^{2} \ \quad+\sum{i=0}^{S^{2}} \mathbb{1}{i}^{\text {obj }} \sum{c \in \text { classes }}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2} \end{array}$$

        i: 0($S^2$-1) `[iterate each grid (048)]<br> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;***j***: 0~(B-1)[iterate each bbox (0~1)]`<br> $$1_{i j}^{\mathrm{obj}} & 1_{i j}^{\mathrm{noobj}}:\left[\begin{array}{lllllll} 0 & 0 & 0 & 0 & 0 & 0 & 1 \ 0 & 0 & 0 & 0 & 0 & 1 & 0 \ 0 & 0 & 0 & 0 & 1 & 0 & 0 \ 0 & 0 & 0 & 1 & 0 & 0 & 0 \ 0 & 0 & 1 & 0 & 0 & 0 & 0 \ 0 & 1 & 0 & 0 & 0 & 0 & 0 \ 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{array}\right]\left[\begin{array}{lllllll} 1 & 1 & 1 & 1 & 1 & 1 & 0 \ 1 & 1 & 1 & 1 & 1 & 0 & 1 \ 1 & 1 & 1 & 1 & 0 & 1 & 1 \ 1 & 1 & 1 & 0 & 1 & 1 & 1 \ 1 & 1 & 0 & 1 & 1 & 1 & 1 \ 1 & 0 & 1 & 1 & 1 & 1 & 1 \ 0 & 1 & 1 & 1 & 1 & 1 & 1 \end{array}\right]$$

        For $1_{i j}^{\mathrm{obj}}$, we have B predictions in each cell, only the one with largest IoU shall be labeled as 1

Coordinate Loss

$$\begin{array}{l} \lambda_{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\text {obj }}\left[\left(x{i}-\hat{x}{i}\right)^{2}+\left(y{i}-\hat{y}{i}\right)^{2}\right] \ \quad+\lambda{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\text {obj }}\left[(\sqrt{w{i}}-\sqrt{\hat{w}{i}})^{2}+(\sqrt{h{i}}-\sqrt{\hat{h}_{i}})^{2}\right] \end{array}$$

  • x, y: predicated bbox center
  • w, h: predicated bbox width & height
  • $\hat{x}, \hat{y}$: labeled bbox center
  • $\hat{w}, \hat{h}$: labeled bbox width & height
  • $\sqrt{w}, \sqrt{h}$: Suppress the effect for larger bbox
  • $\lambda_{\text {coord }}$: 5. because there's only 8 dimensions. Too less comparing to other losses weighted loss essentially.

Confidence Loss

$$\begin{array}{c} +\sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\mathrm{obj}}\left(C{i}-\hat{C}{i}\right)^{2} \ +\lambda{\text {noobj }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\text {noobj }}\left(C{i}-\hat{C}_{i}\right)^{2} \end{array}$$

  • $\hat{C}_{i}$: confidence score [IoU] of predicted and ground truth
  • $C_{i}$: preidcted confidence score [IoU] generated from network

Note:

  • $\hat{C}_{i}$ is 0 or 1 integer
  • $\lambda_{\text {noobj }}$=0.5, because there's so many non-object bboxes
  • Train: confidence = $Pr(object) \cdot IoU^{pred}_{truth}$
  • Test: individual box confidence predicton:<br>      confidence = $Pr(cls_{i}obj)Pr(obj) \cdot IoU^{pred}_{truth}$

Classification loss

$$+\sum_{i=0}^{S^{2}} 1_{i}^{\mathrm{obj}} \sum_{c \in \text { classes }}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2}$$

Each cell will only predict 1 object, which is decided by the bbox with the largest IoU.<br>

Don't forget to do NMS after generating bboxes.


The YOLOv1 Pros & Cons

Pros:

  • one stage, really fast

Cons:

  • Bad for crowed objects[1 cell 1 obj]
  • Bad for small objects
  • Bad for objects with new width-height ratio
  • No BN