Blog: Knowledge distillation in a nutshell
Since the emergence of AlexNet in 2012, deep convolutional neural networks have been state of the art in computer vision and natural language processing. Modern deep networks have grown so large that a single model can occupy hundreds of megabytes of memory, while at the same time there is constant demand for deploying deep models on small devices such as mobile phones and FPGAs. Researchers are therefore challenged to compress deep neural networks so they fit on small devices without losing accuracy, which has produced a family of network-compression methods. This article explains one of them, knowledge distillation, proposed by Geoffrey Hinton et al.
The goal of knowledge distillation is to transfer knowledge from a large network (or even an ensemble of networks) to a much smaller one. Let T and S be the teacher and student networks, and let a denote their logits. First, we compute the soft predictions of T and S as follows:
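Following Hinton et al. (2015), the soft predictions are temperature-scaled softmax outputs; for class i,

```latex
p_i = \frac{\exp(a_i / \tau)}{\sum_j \exp(a_j / \tau)},
```

computed from the logits a of T and of S respectively.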
The temperature parameter tau is used for output softening; it should be greater than one.
Soft predictions are used because they carry more information about the relative similarity between classes; this relative similarity is known as the dark knowledge of deep models. We then perform the distillation by minimizing the loss given by the formula below.
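In the formulation of Hinton et al. (2015), the distillation loss is a weighted sum of two cross-entropies (the tau-squared factor, from the same paper, keeps the gradient magnitudes of the soft term comparable to those of the hard term):

```latex
\mathcal{L} = \alpha \, \tau^{2} \, H\!\left(p^{(T)},\, p^{(S)}\right)
            + (1 - \alpha) \, H\!\left(y,\, \sigma\!\left(a^{(S)}\right)\right),
```

where y are the ground-truth labels and sigma is the ordinary softmax (tau = 1).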
The hyperparameter alpha is tunable and balances the two cross-entropy terms, denoted by H.
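As a concrete illustration, here is a minimal NumPy sketch of the softened softmax and the combined loss. The function names, the toy logits, and the default tau and alpha values are illustrative choices, not taken from the article's repo:

```python
import numpy as np

def softmax(logits, tau=1.0):
    # Temperature-scaled softmax; tau > 1 softens the distribution.
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, tau=2.0, alpha=0.5):
    # Cross-entropy against the teacher's softened predictions (soft targets).
    p_teacher = softmax(teacher_logits, tau)
    log_p_student_soft = np.log(softmax(student_logits, tau) + 1e-12)
    soft_ce = -np.sum(p_teacher * log_p_student_soft, axis=-1)
    # Cross-entropy against the ground-truth labels (hard targets, tau = 1).
    p_hard = softmax(student_logits, 1.0)
    hard_ce = -np.log(p_hard[np.arange(labels.shape[0]), labels] + 1e-12)
    # Weighted sum; tau**2 rescales the soft term as in Hinton et al. (2015).
    return np.mean(alpha * tau**2 * soft_ce + (1.0 - alpha) * hard_ce)

# Toy example: one sample, three classes, true class 0.
student_logits = np.array([[2.0, 0.5, -1.0]])
teacher_logits = np.array([[1.5, 0.2, -0.5]])
labels = np.array([0])
loss = distillation_loss(student_logits, teacher_logits, labels)
```

In a real training loop this loss would be minimized with respect to the student's parameters only; the teacher's logits are treated as fixed targets.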
Geoffrey Hinton and Adriana Romero advise making the student network deeper and narrower than the teacher network in order to achieve better results.
Initializing the student network's parameters
Adriana Romero et al. propose hint-based pretraining. First, we choose one hidden layer of network T (Ot) and one layer of network S (Os). We then define the loss function as the mean squared error between the outputs of these two layers. If the chosen layers do not have the same dimensions, we add a regressor layer on top of the student's layer to match them. The authors advise selecting layers from the middle of the networks.
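The hint loss above can be sketched in a few lines of NumPy. All dimensions and the random "hidden activations" below are hypothetical, and in practice the regressor's weights would be learned jointly with the student:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mid-network activations: the teacher's hint layer (Ot)
# is wider than the student's guided layer (Os).
teacher_hidden = rng.normal(size=(8, 64))   # Ot: batch of 8, 64 units
student_hidden = rng.normal(size=(8, 32))   # Os: batch of 8, 32 units

# A linear regressor maps the student's output to the teacher's width
# so the two layers are comparable (illustrative fixed weights here).
W_regressor = rng.normal(size=(32, 64)) * 0.1
projected = student_hidden @ W_regressor

# Hint loss: mean squared error between the projected student layer
# and the teacher's hint layer.
hint_loss = np.mean((projected - teacher_hidden) ** 2)
```

Minimizing this loss pretrains the lower part of the student before the full distillation step described earlier.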
Wang and Lan propose initializing the parameters of network S by pretraining it on the training set.
This GitHub repo contains a simple example of knowledge distillation between two neural networks, written with the TensorFlow low-level API. After knowledge distillation, the student network achieves 1% lower accuracy on the test set with 92% fewer parameters.
References
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network
Romero, A., Ballas, N., Ebrahimi Kahou, S., Chassang, A., Gatta, C., & Bengio, Y. (2014). FitNets: Hints for Thin Deep Nets
Wang, C., Lan, X., & Zhang, Y. (2017). Model Distillation with Knowledge Transfer from Face Classification to Alignment and Verification