PyTorch Profiler Distributed Training Profiling (single node, multi-GPU)
Try it now
Resources
Closing Notes
Description:
Explore best practices and techniques for scaling machine learning workloads to build large-scale models using PyTorch in this 38-minute conference talk from Microsoft Build 2022. Learn from experiences training 175-billion and 1-trillion parameter models, covering different training paradigms and techniques for profiling and troubleshooting. Dive into topics such as PyTorch Distributed, DistributedDataParallel, FullyShardedDataParallel, pipeline parallelism, memory-saving features, and scaling efficiency. Gain insights into model implementation, scaling limits, network bandwidth impact, and best practices for large-scale training. Discover profiling and troubleshooting tools such as Uber Prof, DCGM, Nvidia Nsight, and PyTorch Profiler for distributed training scenarios. By the end, you will have the knowledge to jump-start your own efforts in scaling ML workloads with PyTorch.
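
The talk's closing chapter covers PyTorch Profiler for single-node, multi-GPU training. As a minimal sketch of what that looks like in practice (not taken from the talk itself), the following wraps a DistributedDataParallel training loop in torch.profiler so that each rank writes its own trace. The model size, step counts, and the ./log output directory are illustrative assumptions; the script assumes a machine with two or more CUDA GPUs.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.profiler import profile, ProfilerActivity, schedule, tensorboard_trace_handler

def worker(rank: int, world_size: int):
    # Single-node rendezvous; address/port are illustrative defaults.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # A toy model, wrapped in DDP so gradients are all-reduced across ranks.
    model = DDP(torch.nn.Linear(1024, 1024).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # Profile CPU and CUDA activity; each rank emits a trace under ./log/rank<N>.
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3),
        on_trace_ready=tensorboard_trace_handler(f"./log/rank{rank}"),
    ) as prof:
        for _ in range(5):
            x = torch.randn(64, 1024, device=rank)
            loss = model(x).sum()
            optimizer.zero_grad()
            loss.backward()   # the gradient all-reduce shows up as NCCL kernels in the trace
            optimizer.step()
            prof.step()       # advance the profiler schedule each iteration

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

Opening the resulting traces in TensorBoard (tensorboard --logdir ./log, with the torch-tb-profiler plugin installed) shows per-rank communication kernels alongside compute, which is the kind of view the talk uses to diagnose scaling bottlenecks.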