Outline:
Distributed Training Applications: Multi-GPU, Multi-Node
K8s Challenges & Outline
K8s Orchestration Flow
Sample PyTorch Job Launch
Array Jobs and MPI Operator
SR-IOV CNI for K8s Multi-Rail
Gang Scheduling Multi-Node Pods
PodGroup Queue and Manager
Demo
Sample Job Real-Time Telemetry
Sample BERT K8s Scaling
Shared K8s Cluster for Multi-Node
Scheduler Dashboard
Summary and Future Work
Description:
Explore production multi-node job execution with gang scheduling, Kubernetes, GPUs, and RDMA in this conference talk from KubeCon + CloudNativeCon. Dive into the challenges and solutions for running distributed deep learning and machine learning workloads in shared Kubernetes clusters. Learn about distributed TensorFlow, PyTorch, Horovod, and MPI implementations, as well as the use of GPU nodes with NCCL and RDMA for accelerated performance. Discover the end-to-end flow for multi-node jobs in Kubernetes, including gang scheduling, quotas, fairness, and backfilling implemented in a custom GPU scheduler. Gain insights into high-speed networking through RoCE and SR-IOV/Multus CNI, and understand design choices, learnings, and operational experiences, including failure handling, performance optimization, and telemetry in large-scale distributed computing environments.
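The gang-scheduling idea the talk describes (admit a multi-node job only when every worker pod can be placed at once; otherwise keep the whole PodGroup queued rather than binding a partial set of pods) can be sketched roughly as below. The `PodGroup` fields and the greedy first-fit placement are illustrative assumptions, not the talk's actual custom GPU scheduler:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class PodGroup:
    """A gang of pods that must start together (fields are illustrative)."""
    name: str
    min_member: int    # number of worker pods that must be schedulable at once
    gpus_per_pod: int  # GPUs each worker pod requests

def gang_admit(group: PodGroup,
               node_free_gpus: Dict[str, int]) -> Optional[Dict[str, int]]:
    """Return a pods-per-node placement if the whole gang fits now,
    else None (the group stays queued; nothing is partially bound)."""
    free = dict(node_free_gpus)  # work on a copy: no node state changes on failure
    placement: Dict[str, int] = {}
    placed = 0
    # Greedy first-fit over the emptiest nodes; a production scheduler would
    # also apply quotas, fairness, and backfilling as described in the talk.
    for node in sorted(free, key=free.get, reverse=True):
        while free[node] >= group.gpus_per_pod and placed < group.min_member:
            free[node] -= group.gpus_per_pod
            placement[node] = placement.get(node, 0) + 1
            placed += 1
    if placed < group.min_member:
        return None  # gang does not fit in its entirety: admit nothing
    return placement

# Usage: a 4-worker job at 8 GPUs/pod fits on two idle 16-GPU nodes,
# but is held back entirely when only 2 of its 4 pods could be placed.
job = PodGroup("bert-multinode", min_member=4, gpus_per_pod=8)
print(gang_admit(job, {"node-a": 16, "node-b": 16}))  # 2 pods on each node
print(gang_admit(job, {"node-a": 8, "node-b": 8}))    # None: admit nothing
```

The all-or-nothing check is the core of gang scheduling: without it, a large job could bind a few pods, hold their GPUs idle, and deadlock against other partially placed jobs in a shared cluster.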
Production Multi-node Jobs with Gang Scheduling, K8s, GPUs and RDMA