Scaling ML Workloads with PyTorch
Microsoft Build 2022

Outline:
1. Introduction
2. Why is large model training needed?
3. Scaling creates training and model efficiency
4. Larger models = more efficient, less training, less data
5. Larger models can learn with few-shot learning
6. Democratizing large-scale language models with OPT-175B
7. Challenges of large model training
8. What is PyTorch Distributed?
9. Features Overview
10. DistributedDataParallel
11. FullyShardedDataParallel
12. FSDP Auto Wrapping
13. FSDP Auto Wrapping Example
14. FSDP CPU Offload & Backward Prefetch Policies
15. FSDP Mixed Precision Control
16. Pipeline
17. Example: Auto Partitioning
18. Pipeline + DDP (PDP)
19. Memory-Saving Features
20. Activation Checkpointing
21. Activation Offloading
22. Activation Checkpointing & Offloading
23. Parameter Offloading
24. Memory-Saving Features & Training Paradigms
25. Experiments & Insights
26. Model Implementation
27. Scaling Efficiency: Varying # GPUs
28. Scaling Efficiency: Varying World Size
29. Scaling Efficiency: Varying Batch Size
30. Model Scale Limit
31. Impact of Network Bandwidth
32. Best Practices
33. Best Practices: FSDP
34. Profiling & Troubleshooting
35. Profiling & Troubleshooting for Large-Scale Model Training
36. Uber Prof (Experimental) Profiling & Troubleshooting Tool
37. Demonstration
38. Combining DCGM + Profiling
39. Profiling for Large-Scale Model Training
40. Nvidia Nsight Multi-Node, Multi-GPU Profiling
41. PyTorch Profiler: Distributed Training Profiling (Single-Node, Multi-GPU)
42. Try It Now
43. Resources
44. Closing Notes
Description:
Explore best practices and techniques for scaling machine learning workloads to build large-scale models with PyTorch in this 38-minute conference talk from Microsoft Build 2022. Learn from experience training 175-billion- and 1-trillion-parameter models, covering the different training paradigms along with techniques for profiling and troubleshooting. Dive into PyTorch Distributed, DistributedDataParallel, FullyShardedDataParallel, pipeline parallelism, memory-saving features, and scaling efficiency. Gain insights on model implementation, model-scale limits, the impact of network bandwidth, and best practices for large-scale training. Discover profiling and troubleshooting tools such as Uber Prof, DCGM, Nvidia Nsight, and PyTorch Profiler for distributed training scenarios. By the end, you will have the knowledge needed to jumpstart your own efforts to scale ML workloads with PyTorch.
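
The minimal sketches below are not the talk's code; they only illustrate the PyTorch APIs named in the outline. First, DistributedDataParallel (item 10): each process holds a full model replica and gradients are all-reduced during backward. This assumes a launch via torchrun (which sets RANK, WORLD_SIZE, and LOCAL_RANK); the nn.Linear model is a toy stand-in.

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # One process per GPU; NCCL backend for GPU collectives.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = nn.Linear(1024, 1024).cuda(local_rank)  # toy stand-in model
        ddp_model = DDP(model, device_ids=[local_rank])

        opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-3)
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = ddp_model(x).square().mean()
        loss.backward()  # gradient all-reduce, overlapped with backward compute
        opt.step()
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Run with, for example, torchrun --nproc_per_node=8 ddp_demo.py (ddp_demo.py is a placeholder filename). Only gradients cross the wire; the full model must fit on every GPU.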
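
Items 11-15 cover FullyShardedDataParallel, which shards parameters, gradients, and optimizer state across ranks. A hedged sketch of how auto wrapping, CPU (parameter) offload, backward prefetch, and mixed precision compose on the FSDP constructor in recent PyTorch; the wrapping threshold and bfloat16 dtypes are illustrative assumptions, not the talk's settings:

    import functools
    import torch
    import torch.nn as nn
    from torch.distributed.fsdp import (
        FullyShardedDataParallel as FSDP,
        CPUOffload,
        MixedPrecision,
        BackwardPrefetch,
    )
    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

    # Assumes init_process_group/set_device already ran, as in the DDP sketch.
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)])  # toy model

    fsdp_model = FSDP(
        model,
        # Auto wrapping: shard any submodule above a parameter-count threshold.
        auto_wrap_policy=functools.partial(
            size_based_auto_wrap_policy, min_num_params=int(1e8)
        ),
        # Parameter offloading: keep sharded parameters in CPU memory.
        cpu_offload=CPUOffload(offload_params=True),
        # Prefetch the next shard's all-gather while backward is still running.
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
        # Mixed precision for parameters, gradient reduction, and buffers.
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        device_id=torch.cuda.current_device(),
    )

The training loop is unchanged from DDP; only the wrapper and its options differ.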
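
For the pipeline sections (items 16-18), the API current around this 2022 talk was torch.distributed.pipeline.sync.Pipe (since deprecated and removed in favor of newer pipelining APIs): it spreads a Sequential model across devices and splits each minibatch into micro-batches. A two-GPU sketch with arbitrary layer sizes:

    import torch
    import torch.nn as nn
    import torch.distributed.rpc as rpc
    from torch.distributed.pipeline.sync import Pipe

    # Pipe is built on the RPC framework, which must be initialized even for
    # a single-process, multi-GPU run.
    rpc.init_rpc("worker", rank=0, world_size=1)

    # Stage 1 on GPU 0, stage 2 on GPU 1; Pipe moves activations between them.
    stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
    stage2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")
    model = Pipe(nn.Sequential(stage1, stage2), chunks=8)  # 8 micro-batches

    x = torch.randn(64, 1024, device="cuda:0")
    out = model(x).local_value()  # forward returns an RRef
    print(out.shape)  # torch.Size([64, 1024]), resident on cuda:1

Pipeline + DDP (PDP, item 18) composes the two: each pipeline replica is data-parallel with its peers.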
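
Items 19-24 are the memory-saving features. Activation checkpointing drops intermediate activations during forward and recomputes them during backward, trading compute for memory. A generic sketch using the public torch.utils.checkpoint API (the residual-block model is invented for illustration):

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class Block(nn.Module):
        def __init__(self, d):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)
            )

        def forward(self, x):
            return x + self.net(x)

    class CheckpointedStack(nn.Module):
        def __init__(self, d=1024, depth=12):
            super().__init__()
            self.blocks = nn.ModuleList(Block(d) for _ in range(depth))

        def forward(self, x):
            for blk in self.blocks:
                # Activations inside blk are not stored; they are recomputed
                # when backward reaches this block.
                x = checkpoint(blk, x, use_reentrant=False)
            return x

    model = CheckpointedStack().cuda()
    x = torch.randn(8, 1024, device="cuda", requires_grad=True)
    model(x).sum().backward()

Activation offloading (item 21) instead parks saved activations in host memory and fetches them back for backward; the talk also covers combining the two (item 22) and parameter offloading (item 23, the CPUOffload option shown in the FSDP sketch above).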
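
For the profiling sections (items 34-41), the PyTorch Profiler piece (item 41) can be illustrated with the standard torch.profiler scheduling API; train_step() below is a hypothetical stand-in for one training iteration, and the trace directory is arbitrary:

    import torch
    from torch.profiler import (
        profile,
        schedule,
        tensorboard_trace_handler,
        ProfilerActivity,
    )

    def train_step():
        # Hypothetical stand-in for one forward/backward/optimizer iteration.
        a = torch.randn(1024, 1024, device="cuda", requires_grad=True)
        (a @ a).sum().backward()

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        # Skip 1 step, warm up for 1, record 3, then stop.
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=tensorboard_trace_handler("./traces"),
        record_shapes=True,
        with_stack=True,
    ) as prof:
        for _ in range(6):
            train_step()
            prof.step()  # advances the wait/warmup/active schedule

Nsight Systems and DCGM (items 38-40) complement this with node-level GPU counters and kernel timelines across machines, which the Python-level profiler does not capture.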
