Blog: Review: Maxout Network (Image Classification)
By Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio
In this story, Maxout Network, by Université de Montréal, is briefly reviewed. Ian J. Goodfellow, the first author, is also the inventor of the Generative Adversarial Network (GAN). Yoshua Bengio, the last author, recently received the Turing Award, the “Nobel Prize of computing”, in 2019. These two authors, together with the second-to-last author Aaron Courville, also published the book “Deep Learning” through MIT Press in 2016. The Maxout paper itself was published in 2013 ICML with over 1500 citations. (Sik-Ho Tsang @ Medium)
1. Maxout
- Given an input x, or a hidden layer’s state v, the affine pre-activations z are:

z_{ij} = x^T W_{\cdots ij} + b_{ij}, with learned parameters W ∈ ℝ^{d×m×k} and b ∈ ℝ^{m×k}.

- And a new type of activation function, the maxout unit, takes the maximum over the k affine pieces:

h_i(x) = \max_{j ∈ [1, k]} z_{ij}

- At last, the difference of two maxout units, g, is:

g(v) = h_1(v) − h_2(v)
- The philosophy behind this is that:
Any continuous PWL (piecewise linear) function can be expressed as a difference of two convex PWL functions.
- Also, any continuous function can be approximated arbitrarily well by a piecewise linear function.
- This can be achieved by a maxout network with two hidden units h1(v) and h2(v), with sufficiently large k.
- Hence, it is shown that a two-hidden-unit maxout network can approximate any continuous function f(v) arbitrarily well on a compact domain.
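The maxout unit defined above can be sketched in NumPy. This is a toy illustration; the shapes and the |v| example below are my own assumptions, not from the paper:

```python
import numpy as np

def maxout(x, W, b):
    """Maxout hidden layer: h_i(x) = max_j z_ij, with z_ij = x^T W[..., i, j] + b[i, j].

    x: (d,) input, W: (d, m, k) weights, b: (m, k) biases -> h: (m,)
    """
    z = np.einsum('d,dmk->mk', x, W) + b  # affine pre-activations z_ij
    return z.max(axis=-1)                 # maximize over the k pieces

# Toy example (hypothetical weights): with d = m = 1 and k = 2 pieces
# v and -v, a single maxout unit computes max(v, -v) = |v|.
W = np.array([[[1.0, -1.0]]])            # shape (d, m, k) = (1, 1, 2)
b = np.zeros((1, 2))
print(maxout(np.array([-3.0]), W, b))    # -> [3.]
```

With two such units, g(v) = h1(v) − h2(v) gives exactly the difference-of-convex-PWL form used in the universal approximation argument.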
2. Explanations of Maxout in NIN
- As NIN compares itself with Maxout extensively in its experimental results section, the NIN paper also explains the Maxout network a little bit.
- The number of feature maps is reduced by maximum pooling over affine feature maps (affine feature maps are the direct results from linear convolution without applying the activation function).
- Maximization over linear functions makes a piecewise linear approximator which is capable of approximating any convex functions.
- The maxout network is more potent as it can separate concepts that lie within convex sets.
- However, maxout network imposes the prior that instances of a latent concept lie within a convex set in the input space, which does not necessarily hold.
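The “piecewise linear approximator of convex functions” point can be made concrete with a small NumPy sketch. The target function f(v) = v² and the tangent-line pieces below are assumptions for illustration, not taken from either paper:

```python
import numpy as np

# Maximizing over k affine functions of the input yields a piecewise
# linear convex approximator. As a toy setup, approximate the convex
# target f(v) = v^2 on [-1, 1] by the max of k tangent lines 2a*v - a^2.
k = 8
anchors = np.linspace(-1.0, 1.0, k)          # tangency points a
slopes, intercepts = 2 * anchors, -anchors**2

v = np.linspace(-1.0, 1.0, 201)
approx = np.max(slopes[:, None] * v + intercepts[:, None], axis=0)
gap = float(np.max(v**2 - approx))           # worst-case error; shrinks as k grows
```

Because each tangent line lies below the convex target, the max of the lines approaches v² from below as k grows, which is the sense in which a maxout unit approximates any convex function.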
3. Results
3.1. MNIST
- The MNIST (LeCun et al., 1998) dataset consists of 28×28 pixel greyscale images of handwritten digits 0–9, with 60,000 training and 10,000 test examples.
- The last 10,000 training examples are used as the validation set.
- A model consisting of two densely connected maxout layers followed by a softmax layer is trained.
- 0.94% test error is obtained, which is the best result that does not use unsupervised pretraining.
- Three convolutional maxout hidden layers (with spatial max pooling on top of the maxout layers) followed by a densely connected softmax layer are used.
- A test set error rate of 0.45% is obtained, which is the best result.
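A forward pass for the densely connected variant of this kind of model (a stack of dense maxout layers feeding a softmax) can be sketched as follows. The layer sizes, weight scale, and batch are hypothetical, not the paper’s actual hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def maxout_dense(x, W, b):
    # x: (n, d) -> (n, m), maximizing over k affine pieces per unit
    return (np.einsum('nd,dmk->nmk', x, W) + b).max(axis=-1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilized softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes, not the paper's hyperparameters.
d, m1, m2, k, classes = 784, 240, 240, 5, 10
layers = [
    (rng.standard_normal((d, m1, k)) * 0.01, np.zeros((m1, k))),
    (rng.standard_normal((m1, m2, k)) * 0.01, np.zeros((m2, k))),
]
W_out, b_out = rng.standard_normal((m2, classes)) * 0.01, np.zeros(classes)

x = rng.standard_normal((4, d))          # a batch of 4 flattened "images"
h = x
for W, b in layers:
    h = maxout_dense(h, W, b)            # two maxout hidden layers
probs = softmax(h @ W_out + b_out)       # (4, 10) class probabilities
```

Note that the maxout layers need no separate activation function: the max over the k affine pieces is the nonlinearity.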
3.2. CIFAR-10
- The CIFAR-10 dataset (Krizhevsky & Hinton, 2009) consists of 32×32 color images drawn from 10 classes, split into 50,000 train and 10,000 test images.
- The model used consists of three convolutional maxout layers, a fully connected maxout layer, and a fully connected softmax layer.
- A test set error of 11.68% is obtained.
- With data augmentation, i.e. translations and horizontal reflection, a test set error of 9.38% is obtained.
- With dropout, a greater than 25% reduction is achieved in the validation set error on CIFAR-10.
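The dropout used with maxout can be sketched as inverted dropout. This is a common modern formulation and an assumption for illustration; the original dropout instead rescales at test time:

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(h, p=0.5, train=True):
    """Inverted dropout (an assumed formulation, not the paper's exact one):
    zero each unit with probability p at train time, rescale by 1/(1-p)
    so the expected activation is unchanged; identity at test time."""
    if not train:
        return h
    mask = rng.random(h.shape) >= p   # keep a unit with probability 1-p
    return h * mask / (1.0 - p)

h = np.ones((2, 8))
out = dropout(h, p=0.5)               # kept units are scaled to 2.0
```

Maxout was designed specifically to pair well with dropout’s approximate model averaging, which is where the >25% validation error reduction comes from.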
3.3. CIFAR-100
- The CIFAR-100 (Krizhevsky & Hinton, 2009) dataset is the same size and format as the CIFAR-10 dataset, but contains 100 classes, with only one tenth as many labeled examples per class.
- A test error of 38.57% is obtained.
3.4. Street View House Numbers (SVHN)
- Each image is of size 32×32 and the task is to classify the digit in the center of the image.
- There are 73,257 digits in the training set, 26,032 digits in the test set and 531,131 additional, somewhat less difficult examples, to use as an extra training set.
- 400 samples per class from the training set and 200 samples per class from the extra set are selected to build the validation set. The remaining digits of the train and extra sets are used for training.
- The model used consists of three convolutional maxout hidden layers and a densely connected maxout layer followed by a densely connected softmax layer.
- A test error of 2.47% is obtained.
It is interesting to read through the paper to see how the authors make use of neural networks to achieve the propositions above. There are also ablation studies of the Maxout network against other activation functions, such as tanh and ReLU, in the last part of the paper.
[2013 ICML] [Maxout]
My Previous Reviews
[LeNet] [AlexNet] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [MSDNet]
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]