ProjectBlog: Knowledge distillation in a nutshell


Since the emergence of AlexNet [1] in 2012, deep neural networks have become the state of the art in computer vision and natural language processing. Deep neural nets have since grown so large that they can occupy several hundred megabytes of memory. At the same time, there is a constant need to deploy deep models on small devices such as mobile phones and FPGAs. Researchers are therefore challenged to fit deep neural networks onto such devices without losing accuracy, which has led to a range of methods for compressing neural networks. This article explains one of them, called "knowledge distillation". The method was proposed by Geoffrey Hinton et al. [2] and can be used to compress deep neural networks.

Knowledge distillation

The goal of knowledge distillation is to transfer knowledge from a large network (or even from an ensemble of networks) to a much smaller network. Let T and S be the teacher and student networks, and let a_T and a_S be their logits (the pre-softmax outputs). First, we calculate the soft predictions of S and T by dividing the logits by a parameter tau and applying the softmax function.
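Written out, with a_T and a_S denoting the logits of T and S, the soft predictions can be expressed as below. The symbol names P_T^tau and P_S^tau are an assumption here (they follow the notation of [3]):

```latex
P_T^{\tau} = \mathrm{softmax}\!\left(\frac{a_T}{\tau}\right), \qquad
P_S^{\tau} = \mathrm{softmax}\!\left(\frac{a_S}{\tau}\right),
\qquad \text{where} \quad
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}} .
```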

The parameter tau (the temperature) is used to soften the outputs. It should be greater than one; with tau equal to one we get the ordinary softmax.
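To see what softening does in practice, here is a minimal NumPy sketch. The function name `soft_predictions` and the logit values are made up for illustration; only the "divide the logits by tau, then apply softmax" step comes from the text.

```python
import numpy as np

def soft_predictions(logits, tau=1.0):
    """Softmax of logits / tau along the last axis."""
    scaled = np.asarray(logits, dtype=np.float64) / tau
    # Subtracting the maximum does not change the result, it only avoids overflow.
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([8.0, 2.0, 0.5])            # hypothetical logits for three classes
hard = soft_predictions(logits, tau=1.0)      # ordinary softmax
soft = soft_predictions(logits, tau=4.0)      # softened output

print("tau = 1:", hard.round(3))
print("tau = 4:", soft.round(3))
```

With the larger tau, the top class dominates less and the remaining classes receive visibly more probability mass, while the ranking of the classes stays the same.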

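Soft predictions are used because they carry more information about the relative similarity between classes than hard labels do. This relative similarity is known as the dark knowledge of deep models. Distillation then trains the student by minimizing a loss that combines two cross-entropy terms: one between the true labels and the student's predictions, and one between the soft predictions of the teacher and the student.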
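One common way to write such a loss is sketched here. The exact placement of alpha is an assumption: some formulations weight only the second term, and Hinton et al. [2] additionally recommend scaling the soft term by tau squared. Here y_true denotes the one-hot labels, P_S the student's ordinary softmax output (tau = 1), and P_T^tau and P_S^tau the soft predictions of teacher and student at temperature tau.

```latex
\mathcal{L}_{KD} = \alpha \, H\!\left(y_{\text{true}},\, P_S\right)
  + (1 - \alpha)\, H\!\left(P_T^{\tau},\, P_S^{\tau}\right),
\qquad
H(p, q) = -\sum_i p_i \log q_i .
```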

The loss function for knowledge distillation

The hyperparameter alpha balances the two cross-entropy terms, denoted by H.
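A loss of this kind can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the code from the demo: the weighting alpha * hard term + (1 - alpha) * soft term is one common convention, and the function names, default values, and toy logits are all hypothetical.

```python
import numpy as np

def softmax(logits, tau=1.0):
    """Softmax of logits / tau along the last axis."""
    z = np.asarray(logits, dtype=np.float64) / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, q, eps=1e-12):
    """Batch mean of H(p, q) = -sum_i p_i * log(q_i)."""
    return float(np.mean(-np.sum(p * np.log(q + eps), axis=-1)))

def distillation_loss(student_logits, teacher_logits, y_true, alpha=0.5, tau=4.0):
    # Hard term: student predictions (tau = 1) against the true labels.
    hard_term = cross_entropy(y_true, softmax(student_logits))
    # Soft term: softened student predictions against softened teacher predictions.
    soft_term = cross_entropy(softmax(teacher_logits, tau), softmax(student_logits, tau))
    # Assumed convention: alpha weights the hard term, (1 - alpha) the soft term.
    return alpha * hard_term + (1.0 - alpha) * soft_term

# Toy batch of two examples and three classes (made-up numbers).
teacher_logits = np.array([[9.0, 3.0, 1.0], [0.5, 7.0, 2.0]])
student_logits = np.array([[4.0, 2.5, 0.5], [1.0, 3.0, 2.5]])
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

loss = distillation_loss(student_logits, teacher_logits, y_true)
print(loss)
```

Setting alpha to 1 reduces this to ordinary supervised training on the labels, and setting it to 0 trains the student on the teacher's soft predictions only.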

Geoffrey Hinton [2] and Adriana Romero [3] advise that the student network should be deeper and narrower than the teacher network in order to achieve better results.

Initializing the student network's parameters

Romero et al. [3] propose hint-based pretraining. First, we choose one hidden layer of the teacher network T, with output Ot, and one hidden layer of the student network S, with output Os. We then define the loss function as the mean squared error between the outputs of these two layers. If the chosen layers do not have the same dimensions, we can add one more layer on top of the student's layer so that its output matches the shape of the teacher's layer. The authors advise selecting layers from the middle of each network.
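In symbols, with O_t and O_s the outputs of the chosen teacher and student layers for an input x, and r a hypothetical extra layer that maps O_s to the shape of O_t, the pretraining objective can be sketched as follows. The factor 1/2 follows the convention used in [3]; averaging over elements instead of summing only rescales the loss.

```latex
\mathcal{L}_{HT} = \frac{1}{2}\,
  \bigl\lVert\, O_t(x) - r\bigl(O_s(x)\bigr) \bigr\rVert^{2}
```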

Parameter initialization using MSE between outputs of hidden layers
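The following NumPy sketch illustrates the idea on random stand-in activations. Everything concrete in it is an assumption: the layer widths, the names `O_t`, `O_s` and `W_r`, and the choice of a plain linear map as the extra layer. In real hint-based pretraining the student's own weights below the chosen layer are updated as well; here only the extra layer is trained, because no actual student network is defined.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up activations for a batch of 32 inputs. The teacher layer is wider
# (64 units) than the student layer (16 units), so the shapes do not match.
O_t = rng.normal(size=(32, 64))   # teacher hidden-layer output
O_s = rng.normal(size=(32, 16))   # student hidden-layer output

# Extra linear layer on top of the student layer: maps 16 units to 64 units.
W_r = rng.normal(scale=0.1, size=(16, 64))

def hint_loss(O_t, O_s, W_r):
    """Mean squared error between the teacher output and the mapped student output."""
    return float(np.mean((O_s @ W_r - O_t) ** 2))

def hint_grad_W(O_t, O_s, W_r):
    """Gradient of hint_loss with respect to W_r."""
    diff = O_s @ W_r - O_t
    return 2.0 * O_s.T @ diff / diff.size

loss_before = hint_loss(O_t, O_s, W_r)
W_r = W_r - 1.0 * hint_grad_W(O_t, O_s, W_r)   # one gradient-descent step
loss_after = hint_loss(O_t, O_s, W_r)

print(loss_before, loss_after)
```

After the step the mapped student output is slightly closer to the teacher output, which is the quantity this pretraining stage drives down before distillation starts.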

Wang et al. [4] propose initializing the parameters of network S by pretraining it on the training set with the usual cross-entropy loss.
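This amounts to ordinary supervised training of S before distillation starts. As a sketch, with y_true the one-hot labels and P_S the student's softmax output (both symbol names are assumptions), the objective is:

```latex
\mathcal{L}_{\text{init}} = H\!\left(y_{\text{true}},\, P_S\right)
  = -\sum_i y_{\text{true},\,i} \,\log P_{S,\,i}
```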

Parameter initialization using cross-entropy

Demo example

This GitHub repo contains a simple example of knowledge distillation between two neural networks, written with the low-level TensorFlow API. After distillation, the student network is 1% less accurate on the test set while having 92% fewer parameters.


[1] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks

[2] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network

[3] Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., & Bengio, Y. (2014). FitNets: Hints for Thin Deep Nets

[4] Wang, C., Lan, X., & Zhang, Y. (2017). Model Distillation with Knowledge Transfer from Face Classification to Alignment and Verification

Source: Artificial Intelligence on Medium
