Load-Aware GPU Fractioning for LLM Inference on Kubernetes - Olivier Tardieu & Yue Zhu, IBM
Description:
Learn about optimizing GPU resource allocation for Large Language Model (LLM) inference on Kubernetes in this technical conference talk from IBM researchers. Explore the challenges of efficient GPU utilization and discover an analytical approach to understanding the relationship between request loads and resource requirements. Examine how GPU compute and memory requirements for LLM inference servers like vLLM correlate with configuration parameters and key performance metrics. Master the implementation of optimal GPU fractioning at deployment time based on model characteristics and estimated workloads. Watch a demonstration of an open-source controller that automatically converts whole GPU requests into fractional requests using MIG (Multi-Instance GPU) slices, enabling improved resource density and sustainability while maintaining service level objectives.
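The core sizing idea in the description is to estimate a server's GPU memory needs from model characteristics and expected load, then map that estimate onto a MIG slice. The talk demonstrates a real open-source controller; purely as a hypothetical illustration (this is not the speakers' code), a minimal Python sketch of that sizing step might look like the following, assuming A100-80GB MIG profiles and the standard weights-plus-KV-cache memory model used by inference servers like vLLM. The load model and the headroom factor are assumptions.

```python
# Hypothetical sketch: pick the smallest A100-80GB MIG profile whose memory
# fits a vLLM-style server for a given model and estimated request load.
# The weights + KV-cache formula and the MIG profile list are standard;
# the concurrency estimate and headroom factor are illustrative assumptions.

# A100-80GB MIG profiles and their memory in GiB, named as the NVIDIA
# device plugin exposes them (extended resources like nvidia.com/mig-1g.10gb).
MIG_PROFILES_GIB = {
    "nvidia.com/mig-1g.10gb": 10,
    "nvidia.com/mig-2g.20gb": 20,
    "nvidia.com/mig-3g.40gb": 40,
    "nvidia.com/mig-4g.40gb": 40,
    "nvidia.com/mig-7g.80gb": 80,
}

GIB = 1024 ** 3


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Per-token KV cache: 2 (K and V) x layers x KV heads x head dim x dtype size."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def required_gpu_bytes(weight_bytes: int, tokens_in_flight: int,
                       per_token_kv: int, headroom: float = 0.9) -> int:
    """Weights plus KV cache for the estimated concurrent token count.

    Sized against a fraction of the slice, since vLLM's
    gpu_memory_utilization defaults to 0.9.
    """
    return int((weight_bytes + tokens_in_flight * per_token_kv) / headroom)


def smallest_fitting_profile(required_bytes: int) -> str | None:
    """Return the smallest MIG profile whose memory covers the requirement."""
    for name, gib in sorted(MIG_PROFILES_GIB.items(), key=lambda kv: kv[1]):
        if gib * GIB >= required_bytes:
            return name
    return None  # load estimate needs a whole GPU (or more)


if __name__ == "__main__":
    # Example: a Llama-3-8B-like model in fp16 (~16 GiB of weights),
    # with 8 concurrent requests x 4096 tokens of KV cache in flight.
    per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    need = required_gpu_bytes(weight_bytes=16 * GIB,
                              tokens_in_flight=8 * 4096, per_token_kv=per_token)
    print(f"~{need / GIB:.1f} GiB needed -> {smallest_fitting_profile(need)}")
```

A controller built around this idea could then rewrite a pod's whole-GPU request (nvidia.com/gpu: 1) into a request for the selected mig-* extended resource, which is how the NVIDIA device plugin exposes MIG slices to the Kubernetes scheduler in its mixed strategy.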

CNCF [Cloud Native Computing Foundation]