YOLOv1 - GitPress.io

<head> <script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script> <script type="text/x-mathjax-config"> MathJax.Hub.Config({ tex2jax: { skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'], inlineMath: [['$','$']] } }); </script> </head>

<center>Photo by <a style="background-color:black;color:white;text-decoration:none;padding:4px 6px;font-family:-apple-system, BlinkMacSystemFont, "San Francisco", "Helvetica Neue", Helvetica, Ubuntu, Roboto, Noto, "Segoe UI", Arial, sans-serif;font-size:12px;font-weight:bold;line-height:1.2;display:inline-block;border-radius:3px" href="https://unsplash.com/photos/9O3_JJOT3As" target="_blank" rel="noopener noreferrer" title="Download free do whatever you want high-resolution photos from Jess Barnett"><span style="display:inline-block;padding:2px 3px"><svg xmlns="http://www.w3.org/2000/svg" style="height:12px;width:auto;position:relative;vertical-align:middle;top:-2px;fill:white" viewBox="0 0 32 32"><title>unsplash-logo</title><path d="M10 9V0h12v9H10zm12 5h10v18H0V14h10v9h12v-9z"></path></svg></span><span style="display:inline-block;padding:2px 3px">Jess Barnett</span></a></center>

Origin: You Only Look Once: Unified, Real-Time Object Detection

Abstract

As a regression problem to spatially separated bounding boxes and associated class probabilities
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation
more localization errors but is less likely to predict false positives on background

The YOLO Detection System

Processing images with YOLO is simple and straightforward

resizes the input image to 448 × 448
runs a single convolutional network on the image
thresholds the resulting detections bythe model’s confidence

The Model

Procedure

It divides the image into an S × S grid [448 × 448 -> 7 x 7]<br> If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities.<br> Bounding Box: x, y, w, h(center)<br> Confidence: $Pr(object) \cdot IoU^{pred}_{truth}$
Final output tensor: S × S × (B ∗ 5 + C)

The Loss Function

$$\begin{array}{c} \lambda_{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} 1_{i j}^{\text {obj }}\left[\left(x_{i}-\hat{x}{i}\right)^{2}+\left(y{i}-\hat{y}{i}\right)^{2}\right] \ +\lambda{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\text {obj }}\left[(\sqrt{w{i}}-\sqrt{\hat{w}{i}})^{2}+(\sqrt{h{i}}-\sqrt{\hat{h}{i}})^{2}\right] \ +\sum{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\text {obj }}\left(C{i}-\hat{C}{i}\right)^{2} \ +\lambda{\text {noobj }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\text {noobj }}\left(C{i}-\hat{C}{i}\right)^{2} \ \quad+\sum{i=0}^{S^{2}} \mathbb{1}{i}^{\text {obj }} \sum{c \in \text { classes }}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2} \end{array}$$

i: 0~~($S^2$-1) `[iterate each grid (0~~48)]<br>         ***j***: 0~(B-1)[iterate each bbox (0~1)]`<br> $$1_{i j}^{\mathrm{obj}} & 1_{i j}^{\mathrm{noobj}}:\left[\begin{array}{lllllll} 0 & 0 & 0 & 0 & 0 & 0 & 1 \ 0 & 0 & 0 & 0 & 0 & 1 & 0 \ 0 & 0 & 0 & 0 & 1 & 0 & 0 \ 0 & 0 & 0 & 1 & 0 & 0 & 0 \ 0 & 0 & 1 & 0 & 0 & 0 & 0 \ 0 & 1 & 0 & 0 & 0 & 0 & 0 \ 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{array}\right]\left[\begin{array}{lllllll} 1 & 1 & 1 & 1 & 1 & 1 & 0 \ 1 & 1 & 1 & 1 & 1 & 0 & 1 \ 1 & 1 & 1 & 1 & 0 & 1 & 1 \ 1 & 1 & 1 & 0 & 1 & 1 & 1 \ 1 & 1 & 0 & 1 & 1 & 1 & 1 \ 1 & 0 & 1 & 1 & 1 & 1 & 1 \ 0 & 1 & 1 & 1 & 1 & 1 & 1 \end{array}\right]$$

For $1_{i j}^{\mathrm{obj}}$, we have B predictions in each cell, only the one with largest IoU shall be labeled as 1

Coordinate Loss

$$\begin{array}{l} \lambda_{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\text {obj }}\left[\left(x{i}-\hat{x}{i}\right)^{2}+\left(y{i}-\hat{y}{i}\right)^{2}\right] \ \quad+\lambda{\text {coord }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\text {obj }}\left[(\sqrt{w{i}}-\sqrt{\hat{w}{i}})^{2}+(\sqrt{h{i}}-\sqrt{\hat{h}_{i}})^{2}\right] \end{array}$$

x, y: predicated bbox center
w, h: predicated bbox width & height
$\hat{x}, \hat{y}$: labeled bbox center
$\hat{w}, \hat{h}$: labeled bbox width & height
$\sqrt{w}, \sqrt{h}$: Suppress the effect for larger bbox
$\lambda_{\text {coord }}$: 5. because there's only 8 dimensions. Too less comparing to other losses weighted loss essentially.

Confidence Loss

$$\begin{array}{c} +\sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\mathrm{obj}}\left(C{i}-\hat{C}{i}\right)^{2} \ +\lambda{\text {noobj }} \sum_{i=0}^{S^{2}} \sum_{j=0}^{B} \mathbb{1}{i j}^{\text {noobj }}\left(C{i}-\hat{C}_{i}\right)^{2} \end{array}$$

$\hat{C}_{i}$: confidence score [IoU] of predicted and ground truth
$C_{i}$: preidcted confidence score [IoU] generated from network

Note:

$\hat{C}_{i}$ is 0 or 1 integer
$\lambda_{\text {noobj }}$=0.5, because there's so many non-object bboxes
Train: confidence = $Pr(object) \cdot IoU^{pred}_{truth}$
Test: individual box confidence predicton:<br> confidence = $Pr(cls_{i}obj)Pr(obj) \cdot IoU^{pred}_{truth}$

Classification loss

$$+\sum_{i=0}^{S^{2}} 1_{i}^{\mathrm{obj}} \sum_{c \in \text { classes }}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2}$$

Each cell will only predict 1 object, which is decided by the bbox with the largest IoU.<br>

▶ Don't forget to do NMS after generating bboxes.

The YOLOv1 Pros & Cons

Pros:

one stage, really fast

Cons:

Bad for crowed objects[1 cell 1 obj]
Bad for small objects
Bad for objects with new width-height ratio
No BN