

Blog: Review: SENet — Squeeze-and-Excitation Network, Winner of ILSVRC 2017 (Image Classification)

SENet took first place in the ILSVRC 2017 Classification Challenge

In this story, Squeeze-and-Excitation Network (SENet), by University of Oxford, is reviewed. SENet is constructed from the “Squeeze-and-Excitation” (SE) block, which adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. It won first place in the ILSVRC 2017 classification challenge with a top-5 error of 2.251%, roughly a 25% relative improvement over the winning entry of 2016. This is a 2018 CVPR paper with more than 600 citations. (Sik-Ho Tsang @ Medium)


  1. Squeeze-and-Excitation (SE) Block
  2. SE-Inception & SE-ResNet
  3. Comparison with State-of-the-art Approaches
  4. Analysis and Interpretation

1. Squeeze-and-Excitation (SE) Block

Squeeze-and-Excitation (SE) Block
  • where Ftr is the convolutional operator that transforms the input X into the feature map U.
  • This Ftr can be a residual block or an Inception block, which will be described in more detail later.
  • where V = [v1, v2, …, vc] is the learnt set of filter kernels.

1.1. Squeeze: Global Information Embedding

SE Path, Same as the Upper Path in the Figure Above
  • The transformation output U can be interpreted as a collection of local descriptors whose statistics are expressive for the whole image.
  • It is proposed to squeeze global spatial information into a channel descriptor.
  • This is achieved by using global average pooling to generate channel-wise statistics.
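The squeeze step is just global average pooling per channel. A minimal NumPy sketch (the array layout and function name here are my own illustration, not from the paper's code):

```python
import numpy as np

def squeeze(u):
    """Squeeze: global average pooling.
    Collapses a (C, H, W) feature map U into a (C,) channel descriptor z,
    where z[c] is the mean of channel c over all spatial positions."""
    c, h, w = u.shape
    return u.reshape(c, h * w).mean(axis=1)

# Toy feature map with 2 channels of size 2x2.
u = np.arange(8, dtype=float).reshape(2, 2, 2)
z = squeeze(u)  # z[0] = mean of [0,1,2,3] = 1.5, z[1] = mean of [4,5,6,7] = 5.5
```

Each entry of z summarizes one channel's global spatial information, which is what the excitation step below operates on.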

1.2. Excitation: Adaptive Recalibration

  • An excitation operation is proposed to fully capture channel-wise dependencies and to learn a nonlinear, non-mutually-exclusive relationship between channels.
  • A simple gating mechanism using a sigmoid activation σ is used: s = σ(W2 δ(W1 z)), where δ is the ReLU function.
  • There are two weight matrices W1 and W2, and the input z is the channel descriptor obtained from global average pooling, so the excitation consists of two fully connected (FC) layers.
  • The bottleneck formed by these two FC layers performs dimensionality reduction with a reduction ratio r: the first FC layer reduces the C channels to C/r, and the second restores them to C.
  • The number of additional parameters introduced depends on r, as above, where S refers to the number of stages (each stage being the collection of blocks operating on feature maps with a common spatial dimension), Cs denotes the dimension of the output channels, and Ns denotes the number of repeated blocks for stage s.
  • The final output of the block is obtained by rescaling the transformation output U with the activations, as shown above.
  • The activations act as channel weights adapted to the input-specific descriptor z. In this regard, SE blocks intrinsically introduce dynamics conditioned on the input, helping to boost feature discriminability.
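Putting squeeze, excitation, and scaling together, the whole SE block can be sketched in a few lines of NumPy (the weight shapes and the `se_block` name are my own illustration; bias terms are omitted for brevity):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(u, w1, w2):
    """Apply an SE block to a (C, H, W) feature map U.
    w1: (C//r, C) bottleneck reduction weights (first FC layer).
    w2: (C, C//r) expansion weights (second FC layer)."""
    c = u.shape[0]
    z = u.reshape(c, -1).mean(axis=1)   # squeeze: global average pooling -> (C,)
    s = sigmoid(w2 @ relu(w1 @ z))      # excitation: s = sigma(W2 . delta(W1 z))
    return u * s[:, None, None]         # scale: rescale each channel of U by s[c]
```

With all weights zero, the gate is sigmoid(0) = 0.5 for every channel, so the block simply halves the feature map; trained weights instead produce input-dependent, per-channel gates — the "dynamics conditioned on the input" described above.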

2. SE-Inception & SE-ResNet

Left: SE-Inception, Right: SE-ResNet
  • As shown above, the SE block can easily be added to both the Inception block and the ResNet block, giving SE-Inception and SE-ResNet.
  • In particular, in SE-ResNet, squeeze and excitation both act on the residual branch before the summation with the identity branch.
  • More variants that integrate with ResNeXt, Inception-ResNet, MobileNetV1 and ShuffleNet V1 can be constructed by following similar schemes.
  • More detailed architectures for SE-ResNet-50 and SE-ResNeXt-50 (32×4d) are shown below:
ResNet-50 (Left), SE-ResNet-50 (Middle), SE-ResNeXt-50 (32×4d) (Right)
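To make the placement concrete, here is a hedged sketch of an SE-ResNet unit in pure NumPy (`residual_fn` stands in for the unit's usual convolution stack, and biases are omitted; the names are illustrative, not from any reference implementation):

```python
import numpy as np

def se_scale(u, w1, w2):
    """SE recalibration of a (C, H, W) residual-branch output."""
    z = u.reshape(u.shape[0], -1).mean(axis=1)                 # squeeze
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0.0))))  # excitation
    return u * s[:, None, None]                                # scale

def se_resnet_unit(x, residual_fn, w1, w2):
    """SE-ResNet unit: the SE block acts on the residual branch
    *before* the summation with the identity branch."""
    return x + se_scale(residual_fn(x), w1, w2)
```

The key design point is visible in the last line: the identity shortcut is left untouched, and only the residual branch is recalibrated before the addition.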

3. Comparison with State-of-the-art Approaches

3.1. Single-Crop Error Rates on ImageNet Validation Set

Single-Crop Error Rates (%) on ImageNet Validation Set
  • SE blocks are added to ResNet, ResNeXt, VGGNet, BN-Inception, and Inception-ResNet-v2. For VGGNet, a Batch Normalization layer is added after each convolution for easier training.
  • During training, with a mini-batch of 256 images, a single pass forwards and backwards through ResNet-50 takes 190 ms, compared to 209 ms for SE-ResNet-50 (both timings are performed on a server with 8 NVIDIA Titan X GPUs).
  • During testing, CPU inference time for each model for a 224 × 224 pixel input image: ResNet-50 takes 164 ms, compared to 167 ms for SE-ResNet-50.
  • Remarkably, SE-ResNet-50 achieves a single-crop top-5 validation error of 6.62%, exceeding ResNet-50 (7.48%) by 0.86% and approaching the performance achieved by the much deeper ResNet-101 network (6.52% top-5 error) with only half of the computational overhead (3.87 GFLOPs vs. 7.58 GFLOPs).
  • And SE-ResNet-101 (6.07% top-5 error) not only matches, but outperforms the deeper ResNet-152 network (6.34% top-5 error) by 0.27%.
  • Similarly, SE-ResNeXt-50 has a top-5 error of 5.49% which is superior to both its direct counterpart ResNeXt-50 (5.90% top-5 error) as well as the deeper ResNeXt-101 (5.57% top-5 error), a model which has almost double the number of parameters and computational overhead.
  • SE-Inception-ResNet-v2 (4.79% top-5 error) outperforms the reimplemented Inception-ResNet-v2 (5.21% top-5 error) by 0.42% (a relative improvement of 8.1%).
  • The performance improvements are consistent through training across a range of different depths, suggesting that the improvements induced by SE blocks can be used in combination with increasing the depth of the base architecture.
Single-Crop Error Rates (%) on ImageNet Validation Set
  • For the lightweight models MobileNetV1 and ShuffleNet V1, SE blocks consistently improve accuracy by a large margin at a minimal increase in computational cost.

3.2. ILSVRC 2017 Classification Competition

Single-Crop Error Rates (%) on ImageNet Validation Set

3.3. Scene Classification

Single-crop error rates (%) on Places365 validation set
  • SE-ResNet-152 (11.01% top-5 error) achieves a lower validation error than ResNet-152 (11.61% top-5 error), providing evidence that SE blocks can perform well on different datasets.
  • And SENet surpasses the previous state-of-the-art model Places365-CNN which has a top-5 error of 11.48%.

3.4. Object Detection on COCO

Object detection results on the COCO 40k validation set by using the basic Faster R-CNN
  • Faster R-CNN is used as detection network.
  • SE-ResNet-50 outperforms ResNet-50 by 1.3% (a relative 5.2% improvement) on COCO’s standard metric AP and 1.6% on AP@IoU=0.5.
  • Importantly, SE blocks are capable of benefiting the deeper architecture ResNet-101 by 0.7% (a relative 2.6% improvement) on the AP metric.

4. Analysis and Interpretation

4.1. Reduction Ratio r

Single-Crop Error Rates (%) on ImageNet Validation Set
  • r = 16 achieves a good trade-off between accuracy and complexity; consequently, this value is used for all experiments.
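As a sanity check on the parameter formula from Section 1.2, the SE overhead for SE-ResNet-50 at r = 16 can be computed directly. The stage channel widths and block counts below are those of the standard ResNet-50 configuration:

```python
# Additional parameters from SE blocks: (2/r) * sum_s N_s * C_s^2
# (bias-free FC layers, as in the formula above).
r = 16
stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]  # (N_s, C_s) for ResNet-50
extra = sum(2 * n * c * c // r for n, c in stages)
print(extra)  # -> 2514944, i.e. ~2.5 million additional parameters
```

This matches the ~10% parameter increase over the ~25M parameters of plain ResNet-50, most of it coming from the wide final stage — which is why removing the last-stage SE blocks (Section 4.2) recovers most of the overhead.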

4.2. The Role of Excitation

Activations induced by Excitation in the different modules of SE-ResNet-50 on ImageNet.
  • For the above 5 classes, fifty samples are drawn for each class from the validation set, and the average activations of fifty uniformly sampled channels in the last SE block of each stage are computed.
  • First, at lower layers, e.g. SE_2_3, the importance of feature channels is likely to be shared by different classes in the early stages of the network.
  • Second, at greater depth, e.g. SE_4_6 and SE_5_1, the value of each channel becomes much more class-specific as different classes exhibit different preferences to the discriminative value of features.
  • As a result, representation learning benefits from the recalibration induced by SE blocks which adaptively facilitates feature extraction and specialisation to the extent that it is needed.
  • Finally, in the last stage, e.g. SE_5_2, the block exhibits an interesting tendency towards a saturated state in which most of the activations are close to 1 and the remainder are close to 0. A similar pattern is found in SE_5_3, with a slight change in scale.
  • This suggests that SE_5_2 and SE_5_3 are less important than the previous blocks in providing recalibration to the network.
  • The overall parameter count could be significantly reduced by removing the SE blocks for the last stage with only a marginal loss of performance.

SE blocks improve the representational capacity of a network by enabling it to perform dynamic channelwise feature recalibration.


[2018 CVPR] [SENet]
Squeeze-and-Excitation Networks

My Previous Reviews

Image Classification
[LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [MSDNet] [ShuffleNet V1]

Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3]

Biomedical Image Segmentation
[CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [3D U-Net+ResNet]

Instance Segmentation
[SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution

Human Pose Estimation
[DeepPose] [Tompson NIPS’14] [Tompson CVPR’15]

Source: Artificial Intelligence on Medium
