1. Intro
2. Deep Learning Training in Shared Clusters
3. Example Shared-Cluster DL Training Workflow
4. Pollux: Co-adaptive Cluster Scheduler for DL
5. Outline
6. Background: Distributed DL (Data Parallelism)
7. System Throughput and Impact of Batch Size
8. Statistical Efficiency and Impact of Batch Size
9. Illustration of Overall Training Performance
10. Implications for Cluster Scheduling
11. Pollux Cluster Scheduler
12. Key Idea: Goodput, not Throughput
13. Modeling System Throughput
14. Modeling Statistical Efficiency
15. Optimizing Cluster-Wide Allocations
16. Evaluation of Pollux
17. Cluster-Wide Statistical Efficiency
18. More Experiments in our Paper!
19. Conclusion
Description:
Explore a cutting-edge approach to deep learning cluster scheduling in this 14-minute conference talk from OSDI '21. Dive into Pollux, a co-adaptive cluster scheduler that optimizes goodput in deep learning environments. Learn how this system jointly considers per-job and cluster-wide factors to improve resource allocation and utilization. Discover the goodput metric, which combines system throughput with statistical efficiency, and understand how Pollux dynamically reassigns resources to improve overall cluster performance. Gain insights into the system's ability to reduce average job completion times, promote fairness, and potentially lower costs in cloud environments. Examine the background of distributed deep learning, the impact of batch size on system throughput and statistical efficiency, and the key components of Pollux's cluster scheduler. Delve into the evaluation results and broader implications of this approach to deep learning cluster management.
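
As a quick orienting sketch of the metric the talk centers on: goodput is throughput scaled by statistical efficiency, so it measures useful training progress per second rather than raw examples per second. The Python snippet below assumes the product form and a gradient-noise-scale efficiency model of the shape (phi + M0) / (phi + M), following my reading of the Pollux paper; the function names and the example numbers are illustrative, not Pollux's actual API.

def statistical_efficiency(batch_size, init_batch_size, grad_noise_scale):
    # EFFICIENCY(M) = (phi + M_0) / (phi + M): the fraction of ideal
    # per-example progress retained at total batch size M, relative to
    # the initial batch size M_0, given gradient noise scale phi.
    # (Model form assumed from the paper; not an official API.)
    return (grad_noise_scale + init_batch_size) / (grad_noise_scale + batch_size)

def goodput(throughput, batch_size, init_batch_size, grad_noise_scale):
    # GOODPUT = THROUGHPUT * EFFICIENCY: training progress per second,
    # rather than examples processed per second.
    return throughput * statistical_efficiency(
        batch_size, init_batch_size, grad_noise_scale)

# Hypothetical numbers: doubling GPUs doubles throughput and the total
# batch size, but goodput grows sublinearly because statistical
# efficiency drops as the batch outgrows the noise scale.
print(goodput(1000.0, 256, 256, grad_noise_scale=512.0))  # 1000.0 (efficiency = 1.0)
print(goodput(2000.0, 512, 256, grad_noise_scale=512.0))  # 1500.0 (efficiency = 0.75)

Under this model, adding GPUs to a job raises its throughput but also its total batch size M, which lowers efficiency once M grows past the noise scale phi; this is why a scheduler that maximizes the product across jobs can allocate resources better than one that maximizes throughput alone.
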

Pollux - Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning

USENIX