Outline:
1. Intro
2. Inference Stack Evolution: PyTorch
3. Model explanation, model pre/post transformers
4. GPU Autoscaling: the challenge
5. Challenge: Increase GPU utilization
6. Use Case: Personalized News Monitoring
7. Challenge: Deploy many models
8. Proposed Solution: Multi-model Inference Service
9. Experience from running a serverless inference platform
10. Reduce tail latency caused by CPU throttling
11. Reduce cold start latency
12. Monitoring and Alerting: Control Plane
13. Monitoring and Alerting: Access logs
14. Monitoring and Alerting: Inference Service metrics
15. KFServing Roadmap 2020
16. Our Working Group is Open
Description:
Explore serverless machine learning inference with KFServing in this conference talk. Learn how KFServing leverages KNative to simplify deployment of standard and custom ML models, enabling auto-scaling and scale-to-zero functionality. Discover the challenges of integrating serverless technology with ML inference, particularly in latency-critical environments. Gain insights into the evolution of inference stacks, GPU autoscaling challenges, and strategies to improve GPU utilization. Examine a real-world use case of personalized news monitoring and the proposed multi-model inference service solution. Delve into practical experiences of running a serverless inference platform, including techniques to reduce tail latency and cold start latency. Understand the importance of monitoring and alerting for control plane, access logs, and inference service metrics. Get a glimpse of the KFServing roadmap for 2020 and learn about the open working group driving its development.
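The description above notes that KFServing reduces model deployment to a single Kubernetes resource, with KNative providing autoscaling and scale-to-zero underneath. As an illustration only (not code from the talk), the sketch below creates an InferenceService for the public sklearn-iris sample model using the Kubernetes Python client; the v1alpha2 API group reflects KFServing as of 2020, and the service name and namespace are illustrative assumptions.

from kubernetes import client, config

# Assumes a cluster with KFServing installed and local kubeconfig credentials.
config.load_kube_config()

# The whole deployment is one custom resource; KNative handles routing,
# autoscaling, and scale-to-zero for the predictor underneath.
inference_service = {
    "apiVersion": "serving.kubeflow.org/v1alpha2",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris", "namespace": "default"},
    "spec": {
        "default": {
            "predictor": {
                "sklearn": {
                    "storageUri": "gs://kfserving-samples/models/sklearn/iris"
                }
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kubeflow.org",
    version="v1alpha2",
    namespace="default",
    plural="inferenceservices",
    body=inference_service,
)

Once the resource reports ready, requests to its endpoint are routed to the predictor pods, which scale with traffic and can scale down to zero when the endpoint is idle.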

Serverless Machine Learning Inference with KFServing

CNCF [Cloud Native Computing Foundation]