20. Distillation Startup and Performance Monitoring with TensorBoard
21. Instruction fine-tuning and dataset selection
22. Instruction FT Startup and Performance Monitoring with TensorBoard
23. Running inference to evaluate distillation performance
24. Teacher model performance: base SmolLM 135M
25. SmolLM Instruct model performance
26. Raw pruned model performance: layer-pruned 99M
27. Width + layer pruning performance: raw 99M
28. Distilled model performance before instruction tuning: 99M
29. Instruction tuning performance evaluation
30. SmolLM 135M Instruct performance
31. Instruction-tuned distilled model performance: 99M model
32. Final Tips: best pruning approach, learning rate, batch size, and model size effects
33. Video Resources
Description:
Dive into an extensive 1-hour 21-minute video tutorial on the distillation of transformer models. Explore various distillation techniques, including layer and width pruning, applied to models like Whisper, Flux, and Minitron. Learn how to initialize student models, compare pre-training and distillation approaches, and understand the differences between cross-entropy loss and KL-divergence. Follow along with a detailed code walk-through for pruning, distillation, and instruction fine-tuning of a SmolLM 135M model to a 99M version. Gain insights into multi-GPU setups, performance monitoring with TensorBoard, and dataset selection for instruction fine-tuning. Evaluate distillation performance through various model comparisons and receive valuable tips on pruning approaches, learning rates, and batch sizes. Access additional resources, including slides, research papers, and datasets, to further enhance your understanding of transformer model distillation.
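To make the cross-entropy versus KL-divergence point concrete, here is a minimal sketch of a common distillation objective: the student is trained on a weighted mix of a soft-target KL-divergence against the teacher's logits and ordinary hard-label cross-entropy. This is an illustration under standard assumptions, not the tutorial's actual code; the function name, temperature, and alpha below are placeholders chosen for the sketch.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature: float = 2.0, alpha: float = 0.5):
        # Soft targets: KL-divergence between the teacher's and student's
        # temperature-scaled token distributions (labels assumed already shifted).
        soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

        # Hard targets: standard next-token cross-entropy against the ground-truth labels.
        ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                             labels.view(-1), ignore_index=-100)

        # Blend the two objectives; alpha = 1.0 would be pure distillation.
        return alpha * kl + (1.0 - alpha) * ce

Scaling the KL term by the squared temperature keeps its gradient magnitude comparable to the hard-label term as the temperature changes; the exact weighting and temperature used in the video may differ.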
Distillation of Transformer Models - Tutorial and Code Walk-through