<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML" type="text/javascript"></script>
<script type="text/x-mathjax-config"> MathJax.Hub.Config({ showProcessingMessages: false, tex2jax: { skipTags: ['script', 'noscript', 'style', 'textarea', 'pre'], inlineMath: [['$','$']] } }); </script>
Origin: YOLO9000: Better, Faster, Stronger
Improvements over YOLOv1
1. Add Batch Normalization (BN)
2. High-Resolution Classifier
a. Train on ImageNet (224 x 224)<br> b. Resize & fine-tune on ImageNet (448 x 448)<br> c. Fine-tune on the detection dataset<br> d. Final feature map is a 13 x 13 grid (7 x 7 before)
3. Use Anchors
4. Fine-Grained Features
a. Lower-level features are concatenated directly onto higher-level features<br> b. A new layer, reorg, is added for that purpose<br>
$$\left[\begin{array}{llll} a & b & c & d \\ e & f & g & h \\ i & j & k & l \\ m & n & o & p \end{array}\right] \quad\Longrightarrow\quad\left[\begin{array}{ll} a & c \\ i & k \end{array}\right] \quad\left[\begin{array}{ll} b & d \\ j & l \end{array}\right] \quad\left[\begin{array}{ll} f & h \\ n & p \end{array}\right] \quad\left[\begin{array}{ll} e & g \\ m & o \end{array}\right]$$
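The reorg (space-to-depth) shuffle above can be sketched in NumPy. The sub-map ordering here is just one convention for illustration; Darknet's actual channel ordering differs slightly:

```python
import numpy as np

def reorg(x, stride=2):
    """Space-to-depth: split an (H, W) map into stride*stride sub-maps of
    shape (H//stride, W//stride) by sampling every stride-th element."""
    return [x[i::stride, j::stride] for i in range(stride) for j in range(stride)]

x = np.array([['a', 'b', 'c', 'd'],
              ['e', 'f', 'g', 'h'],
              ['i', 'j', 'k', 'l'],
              ['m', 'n', 'o', 'p']])
for block in reorg(x):
    print(block.tolist())  # first block: [['a', 'c'], ['i', 'k']]
```

In YOLOv2 this is applied to the 26 x 26 x 512 feature map, turning it into 13 x 13 x 2048 so it can be concatenated with the 13 x 13 detection features.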
5. Multi-Scale Training
- Remove FC layers:
Can accept inputs of any size, which makes the model more robust.
- Sizes range over 320, 352, ..., 608; a new size is chosen every 10 batches<br>
[size % 32 = 0, determined by the network's 32x downsampling]
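The candidate sizes are exactly the multiples of 32 in that range, which a one-liner can enumerate:

```python
# Multi-scale training sizes: multiples of 32 (the network's downsampling
# factor) from 320 up to 608 inclusive.
sizes = list(range(320, 609, 32))
print(sizes)  # [320, 352, 384, 416, 448, 480, 512, 544, 576, 608]
```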
Anchors in YOLOv2
Box Generation | # | Avg IOU
---|---|---
Cluster SSE | 5 | 58.7
Cluster IOU | 5 | 61.0
Anchor Boxes | 9 | 60.9
Cluster IOU | 9 | 67.2
1. Anchor size and number
a. Faster R-CNN: 9 hand-picked anchors<br> b. YOLOv2: 5 anchors from K-Means with distance $d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid})$
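The IoU-distance K-Means can be sketched as below; `kmeans_anchors` and `iou_wh` are my names, and the co-centered IoU (widths/heights only, boxes assumed to share a center) follows the paper's clustering setup:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (N, 2) boxes and (K, 2) centroids given only widths and
    heights, with all boxes assumed to share the same center."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    """K-Means with d(box, centroid) = 1 - IOU(box, centroid)."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Minimizing 1 - IoU is the same as maximizing IoU.
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```

Run on the training set's ground-truth (w, h) pairs, this yields the 5 anchor shapes; Euclidean SSE would favor large boxes, while the IoU distance is scale-free.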
2. Anchors, Truth BBoxes & Predicted bboxes
Anchors: 0.57273, 0.677385, ..., 9.77052, 9.16828<br>
[10 numbers: ($a_{w_{i}}, a_{h_{i}}$) * 5]<br> Anchor values are in grid-cell units, i.e. anchors[0] = $a_{w_{0}} = \frac{w_{\text{pixels}}}{W} \times 13$
Ground-Truth Box:
- original bbox: $\left[x_{o}, y_{o}, w_{o}, h_{o}\right] \in[0, \mathrm{W} \mid \mathrm{H}]$
- normalize original bbox: $[x_{r}, y_{r}, w_{r}, h_{r}]\in[0, 1]$
- $[x_{r}, y_{r}, w_{r}, h_{r}] = [x_{o}/W, y_{o}/H, w_{o}/W, h_{o}/H]$
- Transfer to the 13 x 13 grid: $[x_{s}, y_{s}, w_{s}, h_{s}]\in[0, 13]$
- $\left[x_{s}, y_{s}, w_{s}, h_{s}\right]=\left[x_{r}, y_{r}, w_{r}, h_{r}\right] * (13 \mid 13)$
- save this for computing IoU in the loss
- transfer x, y to each grid cell's local [0, 1) range, and w, h to log-ratios against the matched anchor $k$
- final targets: $[x_{f}, y_{f}, w_{f}, h_{f}]$, with $x_{f}, y_{f}\in[0, 1)$
- $x_{f} = x_{s} - i$ // (i, j) indexes the responsible cell in the 13 x 13 grid
- $y_{f} = y_{s} - j$
- $w_{f} = \log(w_{s}/a_{w_{k}})$
- $h_{f} = \log(h_{s}/a_{h_{k}})$
Predicted Box:
$$\begin{aligned} b_{x} &=\sigma\left(t_{x}\right)+c_{x} \\ b_{y} &=\sigma\left(t_{y}\right)+c_{y} \\ b_{w} &=p_{w} e^{t_{w}} \\ b_{h} &=p_{h} e^{t_{h}} \\ \operatorname{Pr}(\text{object}) * IOU(b, \text{object}) &=\sigma\left(t_{o}\right) \end{aligned}$$
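These formulas transcribe directly into plain Python (function and variable names are mine):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_pred(t, anchor, cell):
    """Map raw outputs (tx, ty, tw, th, to) of one anchor at grid cell
    (cx, cy) to a box (bx, by, bw, bh) in grid units plus confidence."""
    tx, ty, tw, th, to = t
    cx, cy = cell
    pw, ph = anchor
    bx = sigmoid(tx) + cx        # sigmoid keeps the center inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)       # exp keeps width/height positive
    bh = ph * math.exp(th)
    conf = sigmoid(to)           # = Pr(object) * IOU(b, object)
    return bx, by, bw, bh, conf

print(decode_pred((0, 0, 0, 0, 0), (2.0, 3.0), (6, 6)))
# -> (6.5, 6.5, 2.0, 3.0, 0.5)
```

The sigmoid on $t_x, t_y$ is the "location prediction" fix: unlike Faster R-CNN's unconstrained offsets, it bounds each predicted center to its own cell, which stabilizes early training.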
The Model Darknet-19
Output of YOLOv2: 25 values per anchor on VOC
full output: $S^2 \times B \times$ [x, y, w, h, conf, $C_{0}, ..., C_{N}$]
- detection layer
- 3 x 3 x 1024 Conv
- add a passthrough layer
- 1 x 1 Conv output layer (Darknet-19's classification head used a 1 x 1 conv + global average pooling)
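The output dimensions work out as follows (VOC settings assumed):

```python
S, B = 13, 5             # grid size, anchors per cell
N = 20                   # VOC class count
per_anchor = 4 + 1 + N   # box (x, y, w, h) + objectness + class scores = 25
print((S, S, B * per_anchor))     # output tensor shape: (13, 13, 125)
print(S * S * B * per_anchor)     # total values per image: 21125
```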
Loss Function
$$\begin{aligned} \text{loss}_{t} = &\sum_{i=0}^{W} \sum_{j=0}^{H} \sum_{k=0}^{A} 1_{\text{Max IOU}<\text{Thresh}}\, \lambda_{\text{noobj}} * \left(-b_{ijk}^{o}\right)^{2} \\ &+1_{t<12800}\, \lambda_{\text{prior}} * \sum_{r \in(x, y, w, h)}\left(\text{prior}_{k}^{r}-b_{ijk}^{r}\right)^{2} \\ &+1_{k}^{\text{truth}}\Big(\lambda_{\text{coord}} * \sum_{r \in(x, y, w, h)}\left(\text{truth}^{r}-b_{ijk}^{r}\right)^{2} \\ &\quad+\lambda_{\text{obj}} * \left(\text{IOU}_{\text{truth}}^{k}-b_{ijk}^{o}\right)^{2} \\ &\quad+\lambda_{\text{class}} * \sum_{c=1}^{C}\left(\text{truth}^{c}-b_{ijk}^{c}\right)^{2}\Big) \end{aligned}$$
- No longer uses the square root of w, h (YOLOv1 did); w, h are regressed as log-ratios instead
- Confidence target: changed from a constant 1 to the IoU between the predicted box and the ground truth
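A minimal sketch of that confidence target, using a corner-format IoU helper (my naming):

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# YOLOv1 would push this prediction's confidence toward 1;
# YOLOv2 pushes it toward the actual overlap instead:
target = iou((0, 0, 2, 2), (1, 1, 3, 3))  # 1x1 overlap / (4 + 4 - 1) union
print(target)  # -> 0.142857... (1/7)
```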
The path from YOLO to YOLOv2
&nbsp; | YOLO | | | | | | | | YOLOv2
---|---|---|---|---|---|---|---|---|---
batch norm? | | √ | √ | √ | √ | √ | √ | √ | √
hi-res classifier? | | | √ | √ | √ | √ | √ | √ | √
convolutional? | | | | √ | √ | √ | √ | √ | √
anchor boxes? | | | | √ | √ | | | |
new network? | | | | | √ | √ | √ | √ | √
dimension priors? | | | | | | √ | √ | √ | √
location prediction? | | | | | | √ | √ | √ | √
passthrough? | | | | | | | √ | √ | √
multi-scale? | | | | | | | | √ | √
hi-res detector? | | | | | | | | | √
VOC2007 mAP | 63.4 | 65.8 | 69.5 | 69.2 | 69.6 | 74.4 | 75.4 | 76.8 | 78.6