Blog: Review: Shake-Shake Regularization (Image Classification)

Concept of Adding Noise to Gradient During Training, Outperforms WRN, ResNeXt and DenseNet.

In this story, Shake-Shake Regularization (Shake-Shake), by Xavier Gastaldi from London Business School, is briefly reviewed. The motivation of the paper is that, since data augmentation is applied to the input image, it might also be possible to apply data augmentation techniques to internal representations.

Prior work found that adding noise to the gradient during training helps the training and generalization of complicated neural networks. Shake-Shake regularization can be seen as an extension of this concept in which gradient noise is replaced by a form of gradient augmentation. The paper appeared in the 2017 ICLR Workshop and has over 10 citations, and the long version on arXiv (2017) has over 100 citations. (Sik-Ho Tsang @ Medium)


  1. Shake-Shake Regularization
  2. Experimental Results
  3. Further Evaluations

1. Shake-Shake Regularization

Left: Forward training pass. Center: Backward training pass. Right: At test time.
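  • In particular, a 3-branch ResNet (a skip connection plus 2 residual branches) is studied in this paper, as in the figure above. In the paper's notation, a block computes x_{i+1} = x_i + F(x_i, W_i^{(1)}) + F(x_i, W_i^{(2)}), where x_i is the block input and W_i^{(1)}, W_i^{(2)} are the weights of the 2 residual branches.
  • With Shake-Shake Regularization, a random scaling coefficient α_i, drawn uniformly from [0, 1], is added: x_{i+1} = x_i + α_i F(x_i, W_i^{(1)}) + (1 − α_i) F(x_i, W_i^{(2)}).
  • At test time, α is set to its expected value of 0.5, following the same logic as Dropout.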
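
The training and test behaviour of a single block can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `shake_block` is hypothetical, scalars stand in for the branch output tensors, and α is assumed to be drawn uniformly from [0, 1), so that its expected value is the 0.5 used at test time.

```python
import random


def shake_block(x, branch1_out, branch2_out, training, rng=random):
    """Combine a skip connection with two residual branch outputs.

    Scalars stand in for tensors. During training alpha is random
    (assumed uniform in [0, 1)); at test time it is fixed to 0.5.
    """
    alpha = rng.random() if training else 0.5
    return x + alpha * branch1_out + (1.0 - alpha) * branch2_out


# Test time: both residual branches are weighted equally.
test_output = shake_block(1.0, 2.0, 4.0, training=False)

# Training: each call draws a new alpha, so the output moves between
# x + branch2_out and x + branch1_out and averages to the test-time value.
rng = random.Random(0)
train_outputs = [shake_block(1.0, 2.0, 4.0, training=True, rng=rng)
                 for _ in range(10000)]
train_mean = sum(train_outputs) / len(train_outputs)
```

In this toy setting the mean of the training outputs is close to the test-time output, which is the reason 0.5 is a natural choice once the randomness is switched off.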

2. Experimental Results

2.1. CIFAR-10

  • A 26 2×32d ResNet (i.e. the network has a depth of 26, 2 residual branches, and a first residual block of width 32) is used.
  • Shake: All scaling coefficients are overwritten with new random numbers before the pass.
  • Even: All scaling coefficients are set to 0.5 before the pass.
  • Keep: For the backward pass, keep the scaling coefficients used during the forward pass.
  • Batch: For each residual block i, the same scaling coefficient is applied for all the images in the mini-batch.
  • Image: For each residual block i, a different scaling coefficient is applied for each image in the mini-batch.
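
The options above can be sketched as a small sampling routine. This is only an illustration under stated assumptions: the helper names `sample_coeffs` and `branch_gradients` are hypothetical, random coefficients are assumed uniform in [0, 1), scalars stand in for per-image gradients, and the backward pass is modelled by scaling the upstream gradient with the backward coefficient rather than the forward one.

```python
import random


def sample_coeffs(mode, level, batch_size, rng, forward_coeffs=None):
    """Return one scaling coefficient per image for one residual block.

    mode:  "shake" (new random numbers), "even" (all 0.5) or
           "keep" (reuse the forward-pass coefficients, backward only).
    level: "batch" (one coefficient shared by the mini-batch) or
           "image" (a different coefficient per image).
    """
    if mode == "even":
        return [0.5] * batch_size
    if mode == "keep":
        if forward_coeffs is None:
            raise ValueError("keep needs the forward-pass coefficients")
        return list(forward_coeffs)
    if mode != "shake":
        raise ValueError(f"unknown mode: {mode}")
    if level == "batch":
        return [rng.random()] * batch_size
    if level == "image":
        return [rng.random() for _ in range(batch_size)]
    raise ValueError(f"unknown level: {level}")


def branch_gradients(grad_out, coeffs):
    """Split the upstream gradient between the two residual branches."""
    grad1 = [c * g for c, g in zip(coeffs, grad_out)]
    grad2 = [(1.0 - c) * g for c, g in zip(coeffs, grad_out)]
    return grad1, grad2


rng = random.Random(0)
alpha = sample_coeffs("shake", "image", 4, rng)            # forward pass
beta_ssi = sample_coeffs("shake", "image", 4, rng)         # Shake-Shake-Image
beta_ski = sample_coeffs("keep", "image", 4, rng, alpha)   # Shake-Keep-Image
beta_sei = sample_coeffs("even", "image", 4, rng)          # Shake-Even-Image
shared = sample_coeffs("shake", "batch", 4, rng)           # Batch level
```

With Keep, the backward coefficients equal the forward ones; with Shake in both passes, the gradient is scaled by numbers unrelated to those used in the forward pass, which is the "gradient augmentation" effect.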
Error Rates of CIFAR-10
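  • Model names follow the pattern Forward-Backward-Level, e.g. S-S-I means Shake in the forward pass, Shake in the backward pass, applied at the Image level.
  • Using Shake in the forward pass gives better performance.
  • Shake-Shake-Image (S-S-I) obtains the best result for the 26 2×64d ResNet and the 26 2×96d ResNet.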

2.2. CIFAR-100

Error Rates of CIFAR-100
  • Using Shake in the forward pass again improves the performance.
  • In particular, Shake-Even-Image (S-E-I) is the best.

2.3. Comparison with State-of-the-art Approaches

Test error (%) and Model Size on CIFAR

3. Further Evaluations

3.1. Correlation Between Residual Branches

  • To calculate the correlation, first forward a mini-batch through residual branch 1 and store the output tensor in y_i^{(1)}. Do the same for residual branch 2 and store the output in y_i^{(2)}.
  • Then flatten y_i^{(1)} and y_i^{(2)} into the vectors flat_i^{(1)} and flat_i^{(2)} respectively, and calculate the covariance between each corresponding item in the 2 vectors.
  • Calculate the variances of flat_i^{(1)} and flat_i^{(2)}.
  • Repeat until all images in the test set have been forwarded. Use the resulting covariance and variances to calculate the correlation.
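
The steps above can be sketched in plain Python. This is one possible reading of the procedure, not the author's code: the names `flatten` and `RunningCorrelation` are hypothetical, nested lists stand in for output tensors, and all elements of a block's outputs are pooled into a single covariance and a single pair of variances.

```python
import math


def flatten(tensor):
    """Flatten an arbitrarily nested list of numbers into a flat list."""
    if isinstance(tensor, (int, float)):
        return [float(tensor)]
    flat = []
    for item in tensor:
        flat.extend(flatten(item))
    return flat


class RunningCorrelation:
    """Accumulate sums over mini-batches, then compute a Pearson correlation."""

    def __init__(self):
        self.n = 0
        self.sum_a = self.sum_b = 0.0
        self.sum_aa = self.sum_bb = self.sum_ab = 0.0

    def update(self, y1, y2):
        """Add the outputs of branch 1 (y1) and branch 2 (y2) for one mini-batch."""
        flat1, flat2 = flatten(y1), flatten(y2)
        if len(flat1) != len(flat2):
            raise ValueError("branch outputs must have the same number of elements")
        for a, b in zip(flat1, flat2):
            self.n += 1
            self.sum_a += a
            self.sum_b += b
            self.sum_aa += a * a
            self.sum_bb += b * b
            self.sum_ab += a * b

    def correlation(self):
        """Correlation = covariance / sqrt(variance_1 * variance_2)."""
        mean_a = self.sum_a / self.n
        mean_b = self.sum_b / self.n
        cov = self.sum_ab / self.n - mean_a * mean_b
        var_a = self.sum_aa / self.n - mean_a ** 2
        var_b = self.sum_bb / self.n - mean_b ** 2
        return cov / math.sqrt(var_a * var_b)


# Two mini-batches whose branch outputs are perfectly aligned.
stats = RunningCorrelation()
stats.update([[1.0, 2.0], [3.0, 4.0]], [[2.0, 4.0], [6.0, 8.0]])
stats.update([[5.0, 6.0]], [[10.0, 12.0]])
corr = stats.correlation()
```

The running sums keep memory use constant across the test set. For large activations, a numerically more robust online update could be swapped in without changing the interface.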
Correlation results on E-E-B and S-S-I models
  • First of all, the correlation between the output tensors of the 2 residual branches seems to be reduced by the regularization. This would support the assumption that the regularization forces the branches to learn something different.
Layer-wise correlation between the first 3 layers of each residual block
  • The summation at the end of the residual blocks forces an alignment of the layers on the left and right residual branches.
  • The correlation is reduced by the regularization.

3.2. Regularization Strength

Update Rules for β
Left: Training curves (dark) and test curves (light) of models M1 to M5. Right: Illustration of the different methods in the above Table.
  • The further away β is from α, the stronger the regularization effect.

3.3. Removing Skip Connection / Batch Normalization

  • Architecture A is 26 2×32d but without skip connection.
  • Architecture B is the same as A but with only 1 convolutional layer per branch and twice the number of blocks.
  • Architecture C is the same as A but without batch normalization.
Error Rates of CIFAR-10
  • The results of architecture A clearly show that shake-shake regularization can work even without a skip connection.
  • The results of architecture B show that regularization no longer works.
  • With architecture C (no batch normalization), the model is difficult to converge and a lot more sensitive. It is also very easy to make the model diverge.

With a simple yet novel idea and, of course, positive results, the paper was published in the 2017 ICLR Workshop, which is very encouraging.


[2017 arXiv] [Shake-Shake]
Shake-Shake Regularization

[2017 ICLR Workshop] [Shake-Shake]
Shake-Shake Regularization of 3-Branch Residual Networks

My Previous Reviews

Image Classification
[LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [MSDNet] [ShuffleNet V1] [SENet]

Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3]

Biomedical Image Segmentation
[CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet]

Instance Segmentation
[SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution

Human Pose Estimation
[DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM]

Source: Artificial Intelligence on Medium
