Intro

4 things you can do for more reliable ML

ML on one machine

ML in production

What makes ML in prod interesting

What goes wrong?

4 things for more reliable ML

ML outages from the outside

Where changes happen: binaries

Where changes happen: configuration

Validating binary and config changes

Where changes happen: data

Validating data updates

Improving data integrity

Handling pipeline backlogs

Description:

Explore a comprehensive talk on enhancing machine learning reliability in production environments. Learn about common failure modes in large-scale ML systems and discover best practices for productionization. Gain insights into monitoring systems, protecting against human error, ensuring data integrity, and managing pipeline workloads efficiently. Understand the challenges of ML in production, including binary and configuration changes, data updates, and pipeline backlogs. Apply an outside-in approach to ML reliability, drawing from experiences with a large-scale ML production platform at Google.

Demystifying Machine Learning in Production - Reasoning about a Large-Scale ML Platform

USENIX

#Conference Talks #SREcon #Computer Science #Machine Learning #Information Technology #Data Management #Data Integrity