
HRNet


Origin: Deep High-Resolution Representation Learning for Visual Recognition

Abstract


Features

  • High-resolution representations through the whole process
  • The resulting representation is semantically richer and spatially more precise

1 INTRODUCTION

The high-resolution representations learned from HRNet are not only semantically strong but also spatially precise. This comes from two aspects:

  • Our approach connects high-to-low resolution convolution streams in parallel rather than in series. Thus, our approach is able to maintain high resolution instead of recovering it from low resolution, and accordingly the learned representation is potentially spatially more precise.
  • Most existing fusion schemes aggregate high-resolution low-level and high-level representations obtained by upsampling low-resolution representations. Instead, we repeat multiresolution fusions to boost the high-resolution representations with the help of the low-resolution representations, and vice versa. As a result, all the high-to-low resolution representations are semantically strong.

There are three versions of HRNet:

Figure 4. The representation heads: (a) HRNetV1, (b) HRNetV2, (c) HRNetV2p.

  1. HRNetV1: The output is the representation from the high-resolution stream only; the other three representations are ignored. This is illustrated in Figure 4 (a).
  2. HRNetV2: The low-resolution representations are rescaled to the high resolution through bilinear upsampling without changing the number of channels, the four representations are concatenated, and a 1 × 1 convolution mixes them. This is illustrated in Figure 4 (b) and sketched in code after this list.
  3. HRNetV2p: Multi-level representations are constructed by downsampling the high-resolution representation output from HRNetV2 to multiple levels. This is depicted in Figure 4 (c).
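
A minimal sketch of the HRNetV2 head in PyTorch; the class name HRNetV2Head and its signature are hypothetical, not the reference implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class HRNetV2Head(nn.Module):
    # Upsample all four streams to the highest resolution, concatenate,
    # and mix with a 1x1 convolution (Figure 4 (b)).
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # After concatenation, in_channels = C + 2C + 4C + 8C = 15C.
        self.mix = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, feats):
        # feats: four maps at 1/4, 1/8, 1/16, 1/32 of the input resolution.
        h, w = feats[0].shape[2:]
        ups = [feats[0]] + [
            F.interpolate(f, size=(h, w), mode='bilinear', align_corners=False)
            for f in feats[1:]
        ]
        return self.mix(torch.cat(ups, dim=1))

# e.g. head = HRNetV2Head(in_channels=15 * 32, out_channels=num_classes)

The HRNetV1 head simply returns feats[0]; the HRNetV2p head further downsamples the mixed HRNetV2 output to multiple levels.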

3 HIGH-RESOLUTION NETWORKS

The input image first passes through a stem that decreases the resolution to 1/4, and then through the main body that outputs representations at that same resolution. The stem consists of two stride-2 3 × 3 convolutions; the snippet below is excerpted from the reference implementation's __init__:

import torch.nn as nn

BN_MOMENTUM = 0.1             # batch-norm momentum; e.g. 0.1 in the pose repo
BatchNorm2d = nn.BatchNorm2d  # alias; some repos swap in a synchronized variant

# Two stride-2 3x3 convolutions: each halves the spatial size, so the
# stem reduces the input from full resolution to 1/4.
self.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1,
                       bias=False)
self.bn1 = BatchNorm2d(64, momentum=BN_MOMENTUM)
self.conv2 = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1,
                       bias=False)
self.bn2 = BatchNorm2d(64, momentum=BN_MOMENTUM)
self.relu = nn.ReLU(inplace=False)

3.1 Parallel Multi-Resolution Convolutions

The network starts from a high-resolution convolution stream as the first stage, and gradually adds high-to-low resolution streams one by one, forming new stages. The multi-resolution streams are connected in parallel, so the streams of a later stage consist of the resolutions of the previous stage plus an extra lower one. Denoting by Nsr the sub-stream in the s-th stage at resolution index r (its resolution is 1/2^(r-1) of that of the first stream), the four-stage structure is:

N11 → N21 → N31 → N41
      N22 → N32 → N42
            N33 → N43
                  N44
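
A minimal sketch of the parallel streams in PyTorch, assuming the stage-4 widths C, 2C, 4C, 8C from Section 3.4; residual_unit and branches are hypothetical names, and the residual connection of the reference blocks is omitted for brevity:

import torch.nn as nn

C = 32  # width of the highest-resolution stream, e.g. HRNet-W32

def residual_unit(channels):
    # Two 3x3 convolutions per unit, each followed by BN and ReLU.
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
        nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
    )

# Four parallel streams, each kept at its own resolution throughout.
branches = nn.ModuleList(
    [nn.Sequential(*(residual_unit(c) for _ in range(4)))
     for c in (C, 2 * C, 4 * C, 8 * C)]
)
# feats = [branch(x) for branch, x in zip(branches, inputs)]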


3.2 Repeated Multi-Resolution Fusions

The fusion modules exchange information across the parallel streams. Taking three resolutions as an example, a fusion takes inputs {R1, R2, R3} and outputs {R'1, R'2, R'3}, where each output is the sum of transformed inputs: R'r = f1r(R1) + f2r(R2) + f3r(R3) (the fusion after stage 3 also creates a fourth, lower-resolution output). The transform fxr depends on the input and output resolutions: the identity if x = r; (r − x) stride-2 3 × 3 convolutions for downsampling (x < r); and bilinear upsampling followed by a 1 × 1 convolution to align the number of channels for upsampling (x > r).
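
A minimal sketch of one fusion connection in PyTorch, under the assumptions above; fuse_transform is a hypothetical name, and the reference code organizes these layers differently:

import torch.nn as nn

def fuse_transform(in_ch, out_ch, r_in, r_out):
    # One connection f_{r_in -> r_out}; stream r runs at 1/2**(r - 1)
    # of the highest resolution.
    if r_in == r_out:            # same resolution: identity
        return nn.Identity()
    if r_in < r_out:             # high -> low: stride-2 3x3 convolutions
        layers, ch = [], in_ch
        for i in range(r_out - r_in):
            nxt = out_ch if i == r_out - r_in - 1 else ch
            layers += [nn.Conv2d(ch, nxt, 3, stride=2, padding=1, bias=False),
                       nn.BatchNorm2d(nxt)]
            ch = nxt
        return nn.Sequential(*layers)
    # low -> high: bilinear upsampling, then 1x1 conv to align channels
    return nn.Sequential(
        nn.Upsample(scale_factor=2 ** (r_in - r_out), mode='bilinear',
                    align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
    )

# Each fused output sums over all input streams:
#   R_out[r] = sum_x fuse_transform(ch[x], ch[r], x, r)(R_in[x])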


3.4 Instantiation

The main body contains four stages with four parallel convolution streams. The resolutions are 1/4, 1/8, 1/16, and 1/32.

  1. The first stage contains 4 residual units, each formed by a bottleneck of width 64, followed by one 3 × 3 convolution that changes the width of the feature maps to C.
  2. The 2nd, 3rd, and 4th stages contain 1, 4, and 3 modularized blocks, respectively. Each branch of the multi-resolution parallel convolution in a modularized block contains 4 residual units. Each unit contains two 3 × 3 convolutions at its resolution, each convolution followed by batch normalization and the ReLU nonlinearity. The widths (numbers of channels) of the convolutions at the four resolutions are C, 2C, 4C, and 8C, respectively (restated as a config sketch after this list).
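
A hedged restatement of these hyperparameters as a Python dict for HRNet-W32 (C = 32); the reference repos express the same values in YAML configs, and the variable name stage_cfg is hypothetical:

stage_cfg = {
    'STAGE2': dict(NUM_MODULES=1, NUM_BRANCHES=2, NUM_BLOCKS=[4, 4],
                   NUM_CHANNELS=[32, 64]),
    'STAGE3': dict(NUM_MODULES=4, NUM_BRANCHES=3, NUM_BLOCKS=[4, 4, 4],
                   NUM_CHANNELS=[32, 64, 128]),
    'STAGE4': dict(NUM_MODULES=3, NUM_BRANCHES=4, NUM_BLOCKS=[4, 4, 4, 4],
                   NUM_CHANNELS=[32, 64, 128, 256]),
}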

Paper: Deep High-Resolution Representation Learning for Visual Recognition
Code: HRNet-Semantic-Segmentation, HRNet-Image-Classification, HRNet-Object-Detection, HRNet-Human-Pose-Estimation