Blog: Review: SENet — Squeeze-and-Excitation Network, Winner of ILSVRC 2017 (Image Classification)
With SE Blocks, Surpasses ResNet, Inception-v4, PolyNet, ResNeXt, MobileNetV1, DenseNet, PyramidNet, DPN, ShuffleNet V1
In this story, Squeeze-and-Excitation Network (SENet), by Momenta and University of Oxford, is reviewed. SENet is constructed with the “Squeeze-and-Excitation” (SE) block, which adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. It won first place in the ILSVRC 2017 classification challenge, reducing the top-5 error to 2.251%, roughly a 25% relative improvement over the winning entry of 2016. This is a paper in 2018 CVPR with more than 600 citations. (Sik-Ho Tsang @ Medium)
- Squeeze-and-Excitation (SE) Block
- SE-Inception & SE-ResNet
- Comparison with State-of-the-art Approaches
- Analysis and Interpretation
1. Squeeze-and-Excitation (SE) Block
- where Ftr is the convolutional operator for transformation of X to U.
- This Ftr can be the residual block or Inception block, which will be mentioned in more detail later.
- where V=[v1, v2, …, vc] is the learnt set of filter kernels.
1.1. Squeeze: Global Information Embedding
- The transformation output U can be interpreted as a collection of local descriptors whose statistics are expressive for the whole image.
- It is proposed to squeeze global spatial information into a channel descriptor.
- This is achieved by using global average pooling to generate channel-wise statistics.
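The squeeze step can be sketched in a few lines of NumPy (a minimal illustration; the (C, H, W) layout and the channel count of 64 are assumptions for the example, not values from the paper):

```python
import numpy as np

def squeeze(U):
    """Squeeze: global average pooling over the spatial dimensions.
    Maps a (C, H, W) transformation output U to a (C,) channel descriptor z."""
    return U.mean(axis=(1, 2))

U = np.random.rand(64, 7, 7)  # hypothetical feature map with C=64 channels
z = squeeze(U)
print(z.shape)  # (64,)
```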
1.2. Excitation: Adaptive Recalibration
- where δ is the ReLU function.
- A simple gating mechanism using sigmoid activation σ is used.
- An excitation operation is proposed to fully capture channel-wise dependencies, and to learn a nonlinear and non-mutually-exclusive relationship between channels.
- Since there are two weight matrices W1 and W2, and the input z is the channel descriptor obtained from global average pooling, the excitation consists of two fully connected (FC) layers.
- A bottleneck is formed by the two FC layers: the first reduces the dimensionality by the reduction ratio r, and the second restores it.
- The number of additional parameters introduced depends on r: it is (2/r) · Σ_s N_s · C_s², summed over the S stages (where each stage refers to the collection of blocks operating on feature maps of a common spatial dimension), where C_s denotes the dimension of the output channels and N_s denotes the number of repeated blocks for stage s.
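To make the count concrete: for SE-ResNet-50, using the standard ResNet-50 stage configuration (output channels 256, 512, 1024, 2048 with 3, 4, 6, 3 blocks) and r = 16, this works out to roughly 2.5 million additional parameters, consistent with the ~10% relative increase over ResNet-50 reported in the paper. A quick sanity check:

```python
# Extra SE parameters: (2 / r) * sum over stages of N_s * C_s^2
channels = [256, 512, 1024, 2048]  # C_s: output channels per ResNet-50 stage
blocks = [3, 4, 6, 3]              # N_s: repeated blocks per stage
r = 16                             # reduction ratio

extra = (2 / r) * sum(n * c * c for n, c in zip(blocks, channels))
print(f"{extra / 1e6:.2f} M additional parameters")  # 2.51 M
```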
- The final output of the block is obtained by rescaling the transformation output U with the activations, as shown above.
- The activations act as channel weights adapted to the input-specific descriptor z. In this regard, SE blocks intrinsically introduce dynamics conditioned on the input, helping to boost feature discriminability.
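Putting squeeze, excitation and rescaling together, the whole SE block forward pass can be sketched as follows (a minimal NumPy sketch; the channel count, random weights, and (C, H, W) layout are illustrative assumptions, and biases are omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(U, W1, W2):
    """SE block: squeeze (global average pooling), excitation
    (FC -> ReLU -> FC -> sigmoid), then channel-wise rescaling of U."""
    z = U.mean(axis=(1, 2))                    # squeeze: (C,)
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))  # excitation: (C,) values in (0, 1)
    return U * s[:, None, None]                # rescale each channel of U

C, r = 64, 16
rng = np.random.default_rng(0)
W1 = rng.standard_normal((C // r, C))  # reduction FC: C -> C/r
W2 = rng.standard_normal((C, C // r))  # expansion FC: C/r -> C
U = rng.standard_normal((C, 8, 8))
out = se_block(U, W1, W2)
print(out.shape)  # (64, 8, 8)
```

Because the sigmoid gate lies in (0, 1), each channel of U is attenuated by its learned weight rather than amplified, which is what the rescaling in the figure above expresses.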
2. SE-Inception & SE-ResNet
- As shown above, the SE block can easily be added to both the Inception and ResNet modules, giving SE-Inception and SE-ResNet.
- In particular, in SE-ResNet, squeeze and excitation both act on the residual branch before summation with the identity branch.
- More variants that integrate with ResNeXt, Inception-ResNet, MobileNetV1 and ShuffleNet V1 can be constructed by following similar schemes.
- More detailed architectures for SE-ResNet-50 and SE-ResNeXt-50 (32×4d) are shown below:
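The SE-ResNet scheme can be sketched as below, the key point being that the SE rescaling is applied to the residual branch before the summation with the identity shortcut (the toy residual branch, channel count, and random weights are placeholders, not the actual ResNet blocks):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_resnet_block(x, residual_fn, W1, W2):
    """SE-ResNet block: recalibrate the residual branch with SE,
    then sum with the identity shortcut."""
    U = residual_fn(x)                         # Ftr: the residual branch
    z = U.mean(axis=(1, 2))                    # squeeze
    s = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))  # excitation
    return x + U * s[:, None, None]            # rescale, then identity add

C, r = 32, 16
rng = np.random.default_rng(1)
W1 = rng.standard_normal((C // r, C))
W2 = rng.standard_normal((C, C // r))
x = rng.standard_normal((C, 4, 4))
y = se_resnet_block(x, lambda t: 0.1 * t, W1, W2)  # toy residual branch
print(y.shape)  # (32, 4, 4)
```

Note that the identity path is left untouched: only the residual branch is gated, so the block degrades gracefully to a plain shortcut when the gate closes.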
3. Comparison with State-of-the-art Approaches
3.1. Single-Crop Error Rates on ImageNet Validation Set
- SE blocks are added to ResNet, ResNeXt, VGGNet, BN-Inception, and Inception-ResNet-v2. For VGGNet, a Batch Normalization layer is added after each convolution for easier training.
- During training, with a mini-batch of 256 images, a single pass forwards and backwards through ResNet-50 takes 190 ms, compared to 209 ms for SE-ResNet-50 (both timings are performed on a server with 8 NVIDIA Titan X GPUs).
- During testing, CPU inference time for each model for a 224 × 224 pixel input image: ResNet-50 takes 164 ms, compared to 167 ms for SE-ResNet-50.
- Remarkably, SE-ResNet-50 achieves a single-crop top-5 validation error of 6.62%, exceeding ResNet-50 (7.48%) by 0.86% and approaching the performance achieved by the much deeper ResNet-101 network (6.52% top-5 error) with only half of the computational overhead (3.87 GFLOPs vs. 7.58 GFLOPs).
- And SE-ResNet-101 (6.07% top-5 error) not only matches, but outperforms the deeper ResNet-152 network (6.34% top-5 error) by 0.27%.
- Similarly, SE-ResNeXt-50 has a top-5 error of 5.49% which is superior to both its direct counterpart ResNeXt-50 (5.90% top-5 error) as well as the deeper ResNeXt-101 (5.57% top-5 error), a model which has almost double the number of parameters and computational overhead.
- SE-Inception-ResNet-v2 (4.79% top-5 error) outperforms the reimplemented Inception-ResNet-v2 (5.21% top-5 error) by 0.42% (a relative improvement of 8.1%).
- The performance improvements are consistent through training across a range of different depths, suggesting that the improvements induced by SE blocks can be used in combination with increasing the depth of the base architecture.
- For lightweight models such as MobileNetV1 and ShuffleNet V1, SE blocks consistently improve the accuracy by a large margin at minimal increase in computational cost.
3.2. ILSVRC 2017 Classification Competition
- Multi-scale, multi-crop evaluation and model ensembling are used.
- A 2.251% top-5 error on the test set is obtained.
- On the validation set, SENet-154 (SE blocks integrated with a modified ResNeXt) achieved a top-1 error of 18.68% and a top-5 error of 4.47% using a 224 × 224 centre crop evaluation.
- It outperforms ResNet, Inception-v3, Inception-v4, Inception-ResNet-v2, ResNeXt, DenseNet, Residual Attention Network, PolyNet, PyramidNet, and DPN.
3.3. Scene Classification
- SE-ResNet-152 (11.01% top-5 error) achieves a lower validation error than ResNet-152 (11.61% top-5 error), providing evidence that SE blocks can perform well on different datasets.
- And SENet surpasses the previous state-of-the-art model, Places365-CNN, which has a top-5 error of 11.48%.
3.4. Object Detection on COCO
- Faster R-CNN is used as the detection network.
- SE-ResNet-50 outperforms ResNet-50 by 1.3% (a relative 5.2% improvement) on COCO’s standard metric AP and 1.6% on AP@IoU=0.5.
- Importantly, SE blocks are capable of benefiting the deeper architecture ResNet-101 by 0.7% (a relative 2.6% improvement) on the AP metric.
4. Analysis and Interpretation
4.1. Reduction Ratio r
- r = 16 achieved a good tradeoff between accuracy and complexity and consequently, this value is used for all experiments.
4.2. The Role of Excitation
- For the above five classes, fifty samples are drawn for each class from the validation set, and the average activations of fifty uniformly sampled channels in the last SE block of each stage are computed.
- First, at lower layers, e.g. SE_2_3, the importance of feature channels is likely to be shared by different classes in the early stages of the network.
- Second, at greater depth, e.g. SE_4_6 and SE_5_1, the value of each channel becomes much more class-specific as different classes exhibit different preferences to the discriminative value of features.
- As a result, representation learning benefits from the recalibration induced by SE blocks which adaptively facilitates feature extraction and specialisation to the extent that it is needed.
- Finally, in the last stage, e.g. SE_5_2, the block exhibits an interesting tendency towards a saturated state in which most of the activations are close to 1 and the remainder are close to 0. A similar pattern is found in SE_5_3, with a slight change in scale.
- This suggests that SE_5_2 and SE_5_3 are less important than earlier blocks in providing recalibration to the network.
- The overall parameter count could be significantly reduced by removing the SE blocks for the last stage with only a marginal loss of performance.
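This saving can be estimated with the parameter-count formula: since the cost grows with C_s², the last stage dominates the total. A rough sketch, assuming the standard ResNet-50 stage configuration and r = 16:

```python
# Extra SE parameters per stage: (2 / r) * N_s * C_s^2
channels = [256, 512, 1024, 2048]  # C_s: output channels per ResNet-50 stage
blocks = [3, 4, 6, 3]              # N_s: repeated blocks per stage
r = 16

per_stage = [(2 / r) * n * c * c for n, c in zip(blocks, channels)]
total = sum(per_stage)
print(f"last stage: {per_stage[-1] / 1e6:.2f} M of {total / 1e6:.2f} M total")
# last stage: 1.57 M of 2.51 M total
```

So dropping the last-stage SE blocks removes well over half of the additional parameters, which is why the loss in performance is only marginal relative to the saving.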
SE blocks improve the representational capacity of a network by enabling it to perform dynamic channel-wise feature recalibration.
[2018 CVPR] [SENet]
My Previous Reviews
[LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [MSDNet] [ShuffleNet V1]
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]