Effect of activations, model context and batch size on VRAM
Tip for GPU setup: start with a small batch size
Reducing VRAM with LoRA and quantisation
Quality trade-offs with quantisation and LoRA
Choosing between MP, DDP and FSDP
Distributed Data Parallel (DDP)
Model Parallel (MP) and Fully Sharded Data Parallel (FSDP)
Trade-offs with DDP and FSDP
How does DeepSpeed compare to FSDP?
Using FSDP and DeepSpeed with Accelerate
Code examples for MP, DDP and FSDP
Using SSH with rented GPUs (Runpod)
Installation
Slight detour: setting a username and email for GitHub
Basic Model Parallel (MP) fine-tuning script
Fine-tuning script with Distributed Data Parallel (DDP)
Fine-tuning script with Fully Sharded Data Parallel (FSDP)
Running 'accelerate config' for FSDP
Saving a model after FSDP fine-tuning
Quick demo of a complete FSDP LoRA training script
Quick demo of an inference script after training
Wrap up
Description:
Dive into the world of multi-GPU fine-tuning with this comprehensive tutorial on Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP) techniques. Learn how to optimize VRAM usage, understand the intricacies of the Adam optimizer, and explore the trade-offs between various distributed training methods. Gain practical insights on choosing the right GPU setup, implementing LoRA and quantization for VRAM reduction, and utilizing tools like DeepSpeed and Accelerate. Follow along with code examples for Model Parallel, DDP, and FSDP implementations, and discover how to set up and use rented GPUs via SSH. By the end of this tutorial, you'll be equipped with the knowledge to efficiently fine-tune large language models across multiple GPUs.
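
To make the workflow concrete, below is a minimal sketch (not taken from the video itself) of a LoRA fine-tuning script driven by Hugging Face Accelerate, so the same code runs under DDP or FSDP depending on the answers given to 'accelerate config'. The model name, dataset and hyperparameters are illustrative placeholders only.

# Minimal LoRA fine-tuning sketch using Hugging Face Accelerate.
# Assumptions: model, dataset and hyperparameters are placeholders, not the video's own.
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the DDP/FSDP settings chosen in 'accelerate config'

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed small model for the demo
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Attach LoRA adapters so only a small fraction of the weights is trained.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Tiny public dataset purely for illustration.
dataset = load_dataset("Abirate/english_quotes", split="train[:200]")

def tokenize(batch):
    tokens = tokenizer(batch["quote"], truncation=True, padding="max_length", max_length=128)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
dataset.set_format("torch")
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)  # start with a small batch size

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for step, batch in enumerate(dataloader):
    loss = model(**batch).loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    if step % 10 == 0:
        accelerator.print(f"step {step} loss {loss.item():.4f}")

# Gather the (possibly FSDP-sharded) weights onto rank 0 before saving the adapter.
accelerator.wait_for_everyone()
state_dict = accelerator.get_state_dict(model)
if accelerator.is_main_process:
    accelerator.unwrap_model(model).save_pretrained("lora-out", state_dict=state_dict)

After answering 'accelerate config' (choosing DDP or FSDP and the number of GPUs), a script like this would be started on the rented multi-GPU machine with 'accelerate launch train.py'.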