Knowledge
Response-based Knowledge
- Soft targets (the teacher's softened class probabilities, related to label smoothing) from teacher models
- Uses the neural response of the teacher's last output layer → the student mimics the teacher's final prediction layer (loss sketch below)
- Limited to supervised learning
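
A minimal sketch of a response-based KD loss (Hinton-style soft targets with a temperature), assuming PyTorch; the function name, temperature T, and weighting alpha are illustrative choices, not from the notes above.

```python
import torch
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: KL divergence between softened teacher and student
    # output distributions, scaled by T^2 to keep gradient magnitudes stable.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the ground-truth labels
    # (this is why response-based KD is tied to supervised learning).
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```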
Feature-based Knowledge
- Uses representation knowledge from intermediate layers (intermediate feature representations)
- Intermediate features serve as hints to improve training of the student model
- E.g.
- FitNets (Romero et al., 2015): hint-based training on intermediate feature representations
- Attention layer projection for KD
- Attention used to select which intermediate layers to learn from (hint-loss sketch below)
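
A minimal sketch of a feature-based hint loss in the spirit of FitNets, assuming PyTorch; the 1x1-convolution regressor and the channel arguments are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # Regressor maps the student's intermediate feature map to the
        # teacher's channel dimension so the two can be compared directly.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # L2 distance between the projected student feature and the
        # (detached) teacher hint feature.
        return F.mse_loss(self.regressor(student_feat), teacher_feat.detach())
```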
Relation-based Knowledge
- Captures relations between layers or between data samples (e.g., pairwise similarities), rather than individual outputs or features (sketch below)
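
A minimal sketch of a relation-based loss that matches pairwise sample similarities between teacher and student (similarity-preserving KD style), assuming PyTorch; the function name and normalization choice are illustrative.

```python
import torch
import torch.nn.functional as F

def similarity_loss(student_feat, teacher_feat):
    # Flatten features to (batch, dim) and build batch x batch similarity
    # (Gram) matrices for student and teacher.
    s = student_feat.flatten(1)
    t = teacher_feat.flatten(1)
    g_s = F.normalize(s @ s.t(), dim=1)
    g_t = F.normalize(t @ t.t(), dim=1)
    # Penalize differences in the relational structure between samples,
    # not in the raw feature values themselves.
    return F.mse_loss(g_s, g_t.detach())
```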
Distillation Schemes
Offline Distillation
- Requires a large pre-trained teacher whose knowledge is distilled into the student; the teacher stays frozen during student training (loop sketch below)
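
A minimal sketch of an offline distillation loop: the pre-trained teacher is kept in eval mode and never updated, only the student receives gradients. Assumes PyTorch and reuses the response_kd_loss sketch above; model, loader, and optimizer names are illustrative.

```python
import torch

def distill_offline(teacher, student, loader, optimizer, device="cpu"):
    teacher.eval()    # teacher is fixed; it only provides targets
    student.train()
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        with torch.no_grad():               # no gradients through the teacher
            teacher_logits = teacher(inputs)
        student_logits = student(inputs)
        loss = response_kd_loss(student_logits, teacher_logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```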