Scaling ML Workloads with PyTorch
Microsoft Build 2022

Outline:
1. Introduction
2. Why is large model training needed?
3. Scaling creates training and model efficiency
4. Larger models = more efficient, less training, less data
5. Larger models can learn with few-shot learning
6. Democratizing large-scale language models with OPT-175B
7. Challenges of large model training
8. What is PyTorch Distributed?
9. Features Overview
10. DistributedDataParallel
11. FullyShardedDataParallel
12. FSDP Auto Wrapping
13. FSDP Auto Wrapping Example
14. FSDP CPU Offload & Backward Prefetch Policies
15. FSDP Mixed Precision Control
16. Pipeline
17. Example: Auto Partitioning
18. Pipeline + DDP (PDP)
19. Memory-Saving Features
20. Activation Checkpointing
21. Activation Offloading
22. Activation Checkpointing & Offloading
23. Parameter Offloading
24. Memory-Saving Features & Training Paradigms
25. Experiments & Insights
26. Model Implementation
27. Scaling Efficiency: Varying # GPUs
28. Scaling Efficiency: Varying World Size
29. Scaling Efficiency: Varying Batch Size
30. Model Scale Limit
31. Impact of Network Bandwidth
32. Best Practices
33. Best Practices: FSDP
34. Profiling & Troubleshooting
35. Profiling & Troubleshooting for Large-Scale Model Training
36. Uber Prof (Experimental) Profiling & Troubleshooting Tool
37. Demonstration
38. Combining DCGM + Profiling
39. Profiling for Large-Scale Model Training
40. Nvidia Nsight Multi-Node, Multi-GPU Profiling
41. PyTorch Profiler: Distributed Training Profiling (Single-Node, Multi-GPU)
42. Try It Now
43. Resources
44. Closing Notes
Description:
Explore best practices and techniques for scaling machine learning workloads to build large-scale models with PyTorch in this 38-minute conference talk from Microsoft Build 2022. Learn from experience training 175-billion- and 1-trillion-parameter models, covering the different training paradigms along with techniques for profiling and troubleshooting. Dive into PyTorch Distributed, DistributedDataParallel, FullyShardedDataParallel, pipeline parallelism, memory-saving features, and scaling efficiency. Gain insights on model implementation, model-scale limits, the impact of network bandwidth, and best practices for large-scale training. Discover profiling and troubleshooting tools such as Uber Prof, DCGM, Nvidia Nsight, and PyTorch Profiler for distributed training scenarios. By the end, you will have the knowledge needed to jumpstart your own efforts to scale ML workloads with PyTorch.
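
The minimal sketches below are not the talk's code; they only illustrate the PyTorch APIs named in the outline. First, DistributedDataParallel (item 10): each process holds a full model replica and gradients are all-reduced during backward. This assumes a launch via torchrun (which sets RANK, WORLD_SIZE, and LOCAL_RANK); the nn.Linear model is a toy stand-in.

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # One process per GPU; NCCL backend for GPU collectives.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = nn.Linear(1024, 1024).cuda(local_rank)  # toy stand-in model
        ddp_model = DDP(model, device_ids=[local_rank])

        opt = torch.optim.AdamW(ddp_model.parameters(), lr=1e-3)
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = ddp_model(x).square().mean()
        loss.backward()  # gradient all-reduce, overlapped with backward compute
        opt.step()
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Run with, for example, torchrun --nproc_per_node=8 ddp_demo.py (ddp_demo.py is a placeholder filename). Only gradients cross the wire; the full model must fit on every GPU.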
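
Items 11-15 cover FullyShardedDataParallel, which shards parameters, gradients, and optimizer state across ranks. A hedged sketch of how auto wrapping, CPU (parameter) offload, backward prefetch, and mixed precision compose on the FSDP constructor in recent PyTorch; the wrapping threshold and bfloat16 dtypes are illustrative assumptions, not the talk's settings:

    import functools
    import torch
    import torch.nn as nn
    from torch.distributed.fsdp import (
        FullyShardedDataParallel as FSDP,
        CPUOffload,
        MixedPrecision,
        BackwardPrefetch,
    )
    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

    # Assumes init_process_group/set_device already ran, as in the DDP sketch.
    model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)])  # toy model

    fsdp_model = FSDP(
        model,
        # Auto wrapping: shard any submodule above a parameter-count threshold.
        auto_wrap_policy=functools.partial(
            size_based_auto_wrap_policy, min_num_params=int(1e8)
        ),
        # Parameter offloading: keep sharded parameters in CPU memory.
        cpu_offload=CPUOffload(offload_params=True),
        # Prefetch the next shard's all-gather while backward is still running.
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
        # Mixed precision for parameters, gradient reduction, and buffers.
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        device_id=torch.cuda.current_device(),
    )

The training loop is unchanged from DDP; only the wrapper and its options differ.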
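
For the pipeline sections (items 16-18), the API current around this 2022 talk was torch.distributed.pipeline.sync.Pipe (since deprecated and removed in favor of newer pipelining APIs): it spreads a Sequential model across devices and splits each minibatch into micro-batches. A two-GPU sketch with arbitrary layer sizes:

    import torch
    import torch.nn as nn
    import torch.distributed.rpc as rpc
    from torch.distributed.pipeline.sync import Pipe

    # Pipe is built on the RPC framework, which must be initialized even for
    # a single-process, multi-GPU run.
    rpc.init_rpc("worker", rank=0, world_size=1)

    # Stage 1 on GPU 0, stage 2 on GPU 1; Pipe moves activations between them.
    stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
    stage2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")
    model = Pipe(nn.Sequential(stage1, stage2), chunks=8)  # 8 micro-batches

    x = torch.randn(64, 1024, device="cuda:0")
    out = model(x).local_value()  # forward returns an RRef
    print(out.shape)  # torch.Size([64, 1024]), resident on cuda:1

Pipeline + DDP (PDP, item 18) composes the two: each pipeline replica is data-parallel with its peers.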
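
Items 19-24 are the memory-saving features. Activation checkpointing drops intermediate activations during forward and recomputes them during backward, trading compute for memory. A generic sketch using the public torch.utils.checkpoint API (the residual-block model is invented for illustration):

    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class Block(nn.Module):
        def __init__(self, d):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d)
            )

        def forward(self, x):
            return x + self.net(x)

    class CheckpointedStack(nn.Module):
        def __init__(self, d=1024, depth=12):
            super().__init__()
            self.blocks = nn.ModuleList(Block(d) for _ in range(depth))

        def forward(self, x):
            for blk in self.blocks:
                # Activations inside blk are not stored; they are recomputed
                # when backward reaches this block.
                x = checkpoint(blk, x, use_reentrant=False)
            return x

    model = CheckpointedStack().cuda()
    x = torch.randn(8, 1024, device="cuda", requires_grad=True)
    model(x).sum().backward()

Activation offloading (item 21) instead parks saved activations in host memory and fetches them back for backward; the talk also covers combining the two (item 22) and parameter offloading (item 23, the CPUOffload option shown in the FSDP sketch above).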
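
For the profiling sections (items 34-41), the PyTorch Profiler piece (item 41) can be illustrated with the standard torch.profiler scheduling API; train_step() below is a hypothetical stand-in for one training iteration, and the trace directory is arbitrary:

    import torch
    from torch.profiler import (
        profile,
        schedule,
        tensorboard_trace_handler,
        ProfilerActivity,
    )

    def train_step():
        # Hypothetical stand-in for one forward/backward/optimizer iteration.
        a = torch.randn(1024, 1024, device="cuda", requires_grad=True)
        (a @ a).sum().backward()

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        # Skip 1 step, warm up for 1, record 3, then stop.
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=tensorboard_trace_handler("./traces"),
        record_shapes=True,
        with_stack=True,
    ) as prof:
        for _ in range(6):
            train_step()
            prof.step()  # advances the wait/warmup/active schedule

Nsight Systems and DCGM (items 38-40) complement this with node-level GPU counters and kernel timelines across machines, which the Python-level profiler does not capture.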
