Chapters:
1. AI model distillation: Whisper, Flux, Minitron, gpt-4o-mini?
2. Video Overview - Distillation Tutorial and Code Walk-through
3. Distillation Examples: Diffusion - Flux Schnell / Dev, Transcription - Distil-Whisper, LLMs - Nvidia Minitron
4. How distillation works
5. Student model initialization
6. Layer / depth pruning (sketched in code after this list)
7. Width pruning
8. Pre-training versus distillation
9. Cross-entropy loss vs KL-divergence
10. Instruction fine-tuning
11. Distilling SmolLM 135M to a 99M model
12. Code walk-through setup
13. Pruning Notebook
14. Layer Pruning
15. Width Pruning
16. Why pruning works
17. Distillation Script - Multi-GPU Setup
18. Distillation Script Walk-through
19. Distillation Configuration File Walk-through
20. Distillation Startup and Performance Monitoring with TensorBoard
21. Instruction fine-tuning and dataset selection
22. Instruction Fine-tuning Startup and Performance Monitoring with TensorBoard
23. Running inference to evaluate distillation performance
24. Teacher model performance: base SmolLM 135M
25. SmolLM Instruct model performance
26. Raw pruned model performance: layer-pruned 99M
27. Width + layer pruning performance: raw 99M
28. Distilled model performance before instruction tuning: 99M
29. Instruction tuning performance evaluation
30. SmolLM 135M Instruct performance
31. Instruction-tuned distilled model performance: 99M model
32. Final Tips: best pruning approach, learning rate, batch size, and model size effects
33. Video Resources
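The chapters on student model initialization and layer / depth pruning come down to one idea: the student is initialized by copying the teacher and deleting a block of transformer layers. The sketch below illustrates this under stated assumptions; the checkpoint id, which layers are dropped, and the save path are illustrative, not the tutorial's exact recipe.

```python
# Minimal sketch of depth (layer) pruning to build a student initialization.
# Assumes a Llama-style causal LM loaded via Hugging Face transformers; the
# checkpoint name, dropped-layer range, and save path are assumptions.
import torch
from torch import nn
from transformers import AutoModelForCausalLM

teacher_id = "HuggingFaceTB/SmolLM-135M"  # assumed Hub id for the 135M teacher
model = AutoModelForCausalLM.from_pretrained(teacher_id, torch_dtype=torch.bfloat16)

# Drop the middle third of the decoder layers; middle layers are often the most
# redundant, so the first and last blocks are kept. Removing roughly a third of
# SmolLM 135M's layers lands near the 99M scale discussed in the video (an
# estimate, not the tutorial's exact configuration).
n_layers = model.config.num_hidden_layers
drop = set(range(n_layers // 3, 2 * n_layers // 3))

kept_layers = [layer for i, layer in enumerate(model.model.layers) if i not in drop]
model.model.layers = nn.ModuleList(kept_layers)
model.config.num_hidden_layers = len(kept_layers)

model.save_pretrained("smollm-layer-pruned-student")  # starting point for distillation
```

Width pruning, by contrast, shrinks dimensions inside each layer (attention heads, hidden size, MLP width), so it requires slicing weight matrices to smaller shapes rather than deleting whole modules.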
Description:
Dive into an extensive 1-hour 21-minute video tutorial on the distillation of transformer models. Explore various distillation techniques, including layer and width pruning, applied to models like Whisper, Flux, and Minitron. Learn how to initialize student models, compare pre-training and distillation approaches, and understand the differences between cross-entropy loss and KL-divergence. Follow along with a detailed code walk-through for pruning, distillation, and instruction fine-tuning of a SmolLM 135M model to a 99M version. Gain insights into multi-GPU setups, performance monitoring with TensorBoard, and dataset selection for instruction fine-tuning. Evaluate distillation performance through various model comparisons and receive valuable tips on pruning approaches, learning rates, and batch sizes. Access additional resources, including slides, research papers, and datasets, to further enhance your understanding of transformer model distillation.
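As an illustration of the cross-entropy vs KL-divergence comparison mentioned above: a distillation objective typically blends the ordinary next-token cross-entropy with a KL term that pushes the student toward the teacher's softened output distribution. The function below is a generic sketch under assumed names, temperature, and mixing weight; it is not the tutorial's exact script.

```python
# Generic sketch of a knowledge-distillation loss for a causal LM.
# temperature and alpha are illustrative hyperparameters, not the tutorial's values.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soften both distributions with a temperature before comparing them.
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1)

    # KL(teacher || student), scaled by T^2 to keep gradient magnitudes comparable.
    kl = F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth next tokens.
    ce = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)),
                         labels.reshape(-1), ignore_index=-100)

    return alpha * kl + (1.0 - alpha) * ce
```

In practice both logits tensors come from running the teacher and the student on the same batch, with the teacher evaluated under torch.no_grad().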

Distillation of Transformer Models - Tutorial and Code Walk-through

Trelis Research