• Performance will not usually mean evaluation metrics
• Optimization also does not mean optimization algorithms such as Adam, Adagrad, NAdam...
• Bias and generalization will also not be covered
• Performance here means latency and throughput (examples per second)
• Software engineers dealing with machine learning models
• Data scientists needing to know how to train more performant models
• Developers generally curious about the harder problems of deploying larger-scale machine learning models
• Labeling and data quality
• Deploying models: setting up a REST API
• Packaging: how to deploy your ML pipeline
• Experiment tracking: metrics, sharing results
Computer vision on:
• Mobile devices
• Single-board computers (Raspberry Pis, Jetson Nano...)
• Big servers with GPUs
NLP on:
• Big servers with GPUs
• Large CPU models
Data needs to be transformed before it can be used. Fast transforms are usually an afterthought.
ETL/Data Pipelines Primer
• Raw data needs to be converted to arrays (think pandas DataFrame to NumPy array; see the sketch below)
• Data can come from anywhere: databases, the web (REST), streams (Kafka, Spark, Flink...) …
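A minimal sketch of that first step, assuming tabular data already loaded with pandas (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical tabular data; a real pipeline would read from a database,
# a REST endpoint, or a stream instead.
df = pd.DataFrame({"feature_a": [1.0, 2.0], "feature_b": [3.0, 4.0], "label": [0, 1]})

# ETL step: pandas DataFrame -> NumPy arrays the model can consume.
X = df[["feature_a", "feature_b"]].to_numpy(dtype=np.float32)
y = df["label"].to_numpy(dtype=np.int64)
```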
Models Primer
• Models are stored in various formats: HDF5 (Keras), protobuf (TensorFlow, ONNX), pickle (PyTorch)
• Model files are a mix of configuration and parameters (ndarrays that represent the weights)
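A minimal sketch of two of those formats side by side (the tiny models are placeholders):

```python
import tensorflow as tf
import torch

# Keras: configuration + weight ndarrays saved together in one HDF5 file.
keras_model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(10),
])
keras_model.save("model.h5")  # HDF5 format

# PyTorch: the weights (a dict of tensors) are pickled via torch.save;
# the configuration lives in the Python code that rebuilds the module.
torch_model = torch.nn.Linear(4, 10)
torch.save(torch_model.state_dict(), "model.pt")  # pickle under the hood
```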
ML Pipelines are not just models
• ETL varies and can be represented in JSON, code, or even within the model via something like tf.data (sketched below)
• Metrics and experiments (evaluation results) may also be stored…
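A minimal sketch of the tf.data case, where preprocessing travels with the model's input pipeline rather than living in a separate script (the data and normalization constants are illustrative):

```python
import tensorflow as tf

# Toy dataset of (features, label) pairs.
ds = tf.data.Dataset.from_tensor_slices(([[1.0, 2.0], [3.0, 4.0]], [0, 1]))

# The ETL step (here, normalization) is expressed inside the pipeline itself.
ds = ds.map(lambda x, y: ((x - 2.0) / 2.0, y))
ds = ds.batch(2).prefetch(tf.data.AUTOTUNE)
```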
Better in-memory file formats for data interchange
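The slide doesn't name a specific format; Apache Arrow is one widely used example of this idea, so the sketch below assumes it. Arrow keeps data in a language-agnostic columnar layout, which lets tools hand data to each other without a serialize/deserialize round trip:

```python
import pyarrow as pa

# An Arrow array of primitives with no nulls can be viewed as a NumPy
# array without copying the underlying buffer.
arrow_array = pa.array([1.0, 2.0, 3.0])
np_view = arrow_array.to_numpy(zero_copy_only=True)  # no data copied
```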
• Removing redundancy matters: identity ops, redundant layers... (see the folding sketch below)
• Model size matters: fewer parameters and less compute mean faster inference and less storage
• Format matters: some execution engines (TF Lite vs TensorFlow, To…)
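A toy sketch of the redundancy point: two adjacent linear transforms with no nonlinearity between them collapse into one, which is also the algebra behind the batch norm folding mentioned in the description:

```python
import numpy as np

# y = W2 @ (W1 @ x) can be folded into y = (W2 @ W1) @ x,
# replacing two layers (and two matrix multiplies) with one.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(2, 8))
x = rng.normal(size=(4,))

W_folded = W2 @ W1  # one smaller matrix instead of two layers
assert np.allclose(W2 @ (W1 @ x), W_folded @ x)
```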
• Quantization: change the model data type from float to int (reduces memory and computation; sketched below)
• Knowledge distillation: train a smaller model on the outputs of a bigger model (student/teacher)
• Pruning: remove weights or connections that contribute little to the model's output
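A minimal sketch of post-training quantization using TF Lite's converter (the tiny model is illustrative; without a representative dataset this applies dynamic-range quantization of the float32 weights):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(10),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable weight quantization
quantized_bytes = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(quantized_bytes)
```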
Deep Learning Compilers: TVM, Glow, MLIR
• Compile models to executable binaries
• Handle finding an optimal graph for a given hardware configuration
• Note: not ready for production use. Very early days yet…
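As a sketch of the compiler workflow, assuming TVM's Relay API and an ONNX file named model.onnx with a known input name and shape (all assumptions):

```python
import onnx
import tvm
from tvm import relay

# Import the model graph, then compile it to machine code for a target.
onnx_model = onnx.load("model.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, shape={"input": (1, 3, 224, 224)})

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)  # optimized CPU binary

lib.export_library("model.so")  # shared library ready for deployment
```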
Description:
Explore a comprehensive 45-minute conference talk surveying various techniques for optimizing deep learning pipelines. Dive into advanced optimization methods like quantization, model distillation, and efficient math library selection, while examining their trade-offs in deployment scenarios. Gain insights into performance optimization focusing on latency and examples per second, tailored for software engineers, data scientists, and developers working with large-scale machine learning models. Cover crucial aspects of ML pipeline deployment, including data transformation, ETL processes, model storage formats, and emerging deep learning compilers. Learn about strategies to reduce model size, improve memory efficiency, and enhance computational speed through techniques such as pruning, batch norm folding, and knowledge distillation.