Outline:
Distributed Training Applications: Multi-GPU, Multi-Node
K8s Challenges & Outline
K8s Orchestration Flow
Sample PyTorch Job Launch
Array Jobs and MPI Operator
SR-IOV CNI for K8s Multi-Rail
Gang Scheduling Multi-Node Pods
PodGroup Queue and Manager
Demo
Sample Job Real-Time Telemetry
Sample BERT K8s Scaling
Shared K8s Cluster for Multi-Node
Scheduler Dashboard
Summary and Future Work
Description:
Explore production multi-node job execution with gang scheduling, Kubernetes, GPUs, and RDMA in this conference talk from KubeCon + CloudNativeCon. Dive into the challenges and solutions for running distributed deep learning and machine learning workloads in shared Kubernetes clusters. Learn about distributed TensorFlow, PyTorch, Horovod, and MPI implementations, as well as the use of GPU nodes with NCCL and RDMA for accelerated performance. Discover the end-to-end flow for multi-node jobs in Kubernetes, including gang scheduling, quotas, fairness, and backfilling implemented in a custom GPU scheduler. Gain insights into high-speed networking through RoCE and SR-IOV/Multus CNI, and understand design choices, learnings, and operational experiences, including failure handling, performance optimization, and telemetry in large-scale distributed computing environments.
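The gang-scheduling idea the talk describes (admit a multi-node job only when every worker pod can be placed at once; otherwise keep the whole PodGroup queued rather than binding a partial set of pods) can be sketched roughly as below. The `PodGroup` fields and the greedy first-fit placement are illustrative assumptions, not the talk's actual custom GPU scheduler:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class PodGroup:
    """A gang of pods that must start together (fields are illustrative)."""
    name: str
    min_member: int    # number of worker pods that must be schedulable at once
    gpus_per_pod: int  # GPUs each worker pod requests

def gang_admit(group: PodGroup,
               node_free_gpus: Dict[str, int]) -> Optional[Dict[str, int]]:
    """Return a pods-per-node placement if the whole gang fits now,
    else None (the group stays queued; nothing is partially bound)."""
    free = dict(node_free_gpus)  # work on a copy: no node state changes on failure
    placement: Dict[str, int] = {}
    placed = 0
    # Greedy first-fit over the emptiest nodes; a production scheduler would
    # also apply quotas, fairness, and backfilling as described in the talk.
    for node in sorted(free, key=free.get, reverse=True):
        while free[node] >= group.gpus_per_pod and placed < group.min_member:
            free[node] -= group.gpus_per_pod
            placement[node] = placement.get(node, 0) + 1
            placed += 1
    if placed < group.min_member:
        return None  # gang does not fit in its entirety: admit nothing
    return placement

# Usage: a 4-worker job at 8 GPUs/pod fits on two idle 16-GPU nodes,
# but is held back entirely when only 2 of its 4 pods could be placed.
job = PodGroup("bert-multinode", min_member=4, gpus_per_pod=8)
print(gang_admit(job, {"node-a": 16, "node-b": 16}))  # 2 pods on each node
print(gang_admit(job, {"node-a": 8, "node-b": 8}))    # None: admit nothing
```

The all-or-nothing check is the core of gang scheduling: without it, a large job could bind a few pods, hold their GPUs idle, and deadlock against other partially placed jobs in a shared cluster.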
Production Multi-node Jobs with Gang Scheduling, K8s, GPUs and RDMA