This is a brief summary of the paper Distilling the Knowledge in a Neural Network (Hinton et al., NIPS Deep Learning and Representation Learning Workshop 2015), written to help me study and organize what I read.

They point out that training and deployment have conflicting constraints.

For example, in large-scale machine learning, very similar models are typically used at the training stage and the deployment stage, despite their very different requirements.

In most cases, a cumbersome model (or an ensemble of many models) is needed to extract useful structure from the dataset, but at deployment time, latency and computational resources matter more than extracting that structure.

So they propose that once the cumbersome model has been trained, its knowledge can be transferred to a small model.

They call this “distillation”. In their paper, an obvious way to transfer the generalization ability of the cumbersome model to a small model is to use the class probabilities produced by the cumbersome model as “soft targets” for training the small model.

The knowledge is transferred on a “transfer set”, which can be either the original training set or a separate dataset used only for transferring the knowledge with soft targets.

They argue that a softer softmax output is more informative than a harder one (for details, see the video linked below).

\[q_i=\frac{\exp(z_i/T)}{\sum_{j} \exp(z_j/T)}\]

In the softmax with temperature above, z_i are the logits and T is the temperature. When T=1, it reduces to the standard softmax function. As T grows, the probability distribution produced by the softmax becomes softer, providing more information about which classes the model found similar to the predicted class. This is the “dark knowledge” embedded in the model.
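
As a minimal sketch (my own code, not from the paper), the temperature-scaled softmax can be written as follows; the function name softmax_with_temperature and the example logits are just for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits z_i with temperature T: exp(z_i/T) / sum_j exp(z_j/T)."""
    scaled = np.asarray(logits, dtype=np.float64) / T
    scaled -= scaled.max()          # subtract the max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = np.array([5.0, 2.0, -1.0])
print(softmax_with_temperature(logits, T=1.0))  # sharp: almost all mass on the first class
print(softmax_with_temperature(logits, T=5.0))  # softer: relative similarities between classes become visible
```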

When training the small model on the transfer set, they use the same high temperature T as was used to generate the soft targets (after training, the distilled model uses a temperature of 1).

If the correct labels are known for all or some of the transfer set, the method can be improved by also training the distilled model to produce the correct labels.

There are two ways to do this: one is to use the correct labels to modify the soft targets, and the other is to simply use a weighted average of two different objective functions.

In their experiments they used the latter, as follows:

The first objective function is the cross entropy with the soft targets. This cross entropy is computed using the same high temperature in the softmax of the distilled model as was used to generate the soft targets from the cumbersome model.
The second objective function is the cross entropy with the correct labels, computed using exactly the same logits in the softmax of the distilled model but at a temperature of 1.
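
As a rough PyTorch sketch of this combined objective (my own code, not the paper's implementation), it might look like the code below. The function name distillation_loss and the values of T and alpha are assumptions; the paper does note that gradients from the soft targets scale as 1/T², so that term is multiplied by T² to keep the two objectives balanced.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted average of (1) cross entropy with the teacher's soft targets at temperature T
    and (2) cross entropy with the correct labels at temperature 1 (hypothetical sketch)."""
    # (1) soft-target term: cross entropy between teacher and student distributions, both at temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    soft_loss = -(soft_targets * log_student).sum(dim=1).mean()
    # (2) hard-label term: ordinary cross entropy on the same logits at temperature 1
    hard_loss = F.cross_entropy(student_logits, labels)
    # Multiply the soft term by T^2 since its gradients scale as 1/T^2 (as noted in the paper)
    return alpha * (T ** 2) * soft_loss + (1.0 - alpha) * hard_loss

# Example usage with random tensors: logits are (batch, num_classes), labels are (batch,)
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```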

Besides knowledge distillation, they also propose a new type of ensemble composed of one or more generalist models and many specialist models, where each specialist focuses on a subset of classes that are easily confused.

The detailed experimental analysis can be found in Distilling the Knowledge in a Neural Network (Hinton et al., NIPS Deep Learning and Representation Learning Workshop 2015).

Reference

Distilling the Knowledge in a Neural Network (Hinton et al., NIPS Deep Learning and Representation Learning Workshop 2015)

TTIC Distinguished Lecture Series - Geoffrey Hinton (video)