Introduction

GPU Cluster

Model Training Graph

Training

Idle Periods

Pipelining

Pipeline Bubble

Tradeoffs

Interleave Schedule

Results

Hyperparameters

Domain-Specific Optimization

GPU Throughput

Implementation

Conclusion

Description:

Explore efficient large-scale language model training on GPU clusters in this 23-minute video from Databricks. Learn about the challenges of training massive models, including limited GPU memory and long computation times. Discover how to combine tensor, pipeline, and data parallelism to scale training to thousands of GPUs, enabling a hundredfold increase in model size capacity. Examine a novel pipeline parallelism schedule that improves throughput by more than 10% over existing approaches. Gain insights into the trade-offs between the parallelism techniques and into how to configure distributed training. See how the combined methods reach an aggregate 502 petaFLOP/s on a 1-trillion-parameter model using 3072 GPUs, which corresponds to 52% of peak per-GPU throughput. Access the open-source code and understand the implementation details behind the domain-specific optimizations and improved GPU utilization.
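The headline figures in the description can be sanity-checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the inputs are the three numbers quoted above (502 petaFLOP/s, 3072 GPUs, 52% of peak), and the "implied peak" is derived from those numbers rather than taken from a hardware datasheet.

```python
# Back-of-the-envelope check of the throughput figures quoted in the description.
# All names here are illustrative; only the three input numbers come from the text.

def per_gpu_tflops(aggregate_pflops: float, num_gpus: int) -> float:
    """Convert an aggregate petaFLOP/s figure into teraFLOP/s per GPU."""
    return aggregate_pflops * 1000.0 / num_gpus  # 1 petaFLOP/s = 1000 teraFLOP/s


def implied_peak_tflops(achieved_tflops: float, fraction_of_peak: float) -> float:
    """Peak per-GPU rate implied by an achieved rate and a utilization fraction."""
    return achieved_tflops / fraction_of_peak


achieved = per_gpu_tflops(aggregate_pflops=502.0, num_gpus=3072)
peak = implied_peak_tflops(achieved, fraction_of_peak=0.52)

print(f"Achieved per-GPU throughput: {achieved:.1f} teraFLOP/s")
print(f"Implied per-GPU peak:        {peak:.1f} teraFLOP/s")
```

Under these inputs, each GPU sustains roughly 163 teraFLOP/s, and the 52% figure implies a per-GPU peak of a little over 310 teraFLOP/s, so the quoted numbers are consistent with one another.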

Efficient Large-Scale Language Model Training on GPU Clusters

Databricks

#Computer Science #Machine Learning #High Performance Computing #Parallel Computing #GPU Computing #Distributed Computing #Computer Architecture #Parallel Processing #Model Training
