20. Distillation Startup and Performance Monitoring with TensorBoard
21. Instruction fine-tuning and dataset selection
22. Instruction FT Startup and Performance Monitoring with TensorBoard
23. Running inference to evaluate distillation performance
24. Teacher model performance: base SmolLM 135M
25. SmolLM Instruct model performance
26. Raw pruned model performance: layer-pruned 99M
27. Width + layer pruning performance: raw 99M
28. Distilled model performance before instruction tuning: 99M
29. Instruction tuning performance evaluation
30. SmolLM 135M Instruct performance
31. Instruction-tuned distilled model performance: 99M model
32. Final Tips: best pruning approach, learning rate, batch size, and model size effects
33. Video Resources
Description:
Dive into an extensive 1-hour 21-minute video tutorial on the distillation of transformer models. Explore various distillation techniques, including layer and width pruning, applied to models like Whisper, Flux, and Minitron. Learn how to initialize student models, compare pre-training and distillation approaches, and understand the differences between cross-entropy loss and KL-divergence. Follow along with a detailed code walk-through for pruning, distillation, and instruction fine-tuning of a SmolLM 135M model to a 99M version. Gain insights into multi-GPU setups, performance monitoring with TensorBoard, and dataset selection for instruction fine-tuning. Evaluate distillation performance through various model comparisons and receive valuable tips on pruning approaches, learning rates, and batch sizes. Access additional resources, including slides, research papers, and datasets, to further enhance your understanding of transformer model distillation.
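To make the cross-entropy versus KL-divergence point concrete, here is a minimal sketch of a common distillation objective: the student is trained on a weighted mix of a soft-target KL-divergence against the teacher's logits and ordinary hard-label cross-entropy. This is an illustration under standard assumptions, not the tutorial's actual code; the function name, temperature, and alpha below are placeholders chosen for the sketch.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature: float = 2.0, alpha: float = 0.5):
        # Soft targets: KL-divergence between the teacher's and student's
        # temperature-scaled token distributions (labels assumed already shifted).
        soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

        # Hard targets: standard next-token cross-entropy against the ground-truth labels.
        ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                             labels.view(-1), ignore_index=-100)

        # Blend the two objectives; alpha = 1.0 would be pure distillation.
        return alpha * kl + (1.0 - alpha) * ce

Scaling the KL term by the squared temperature keeps its gradient magnitude comparable to the hard-label term as the temperature changes; the exact weighting and temperature used in the video may differ.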
Distillation of Transformer Models - Tutorial and Code Walk-through