Explore distributed deep learning on Apache Mesos with GPU support and gang scheduling in this 37-minute conference talk from Uber engineers. Learn how to speed up complex model training, scale to hundreds of GPUs, and shard models that don't fit on a single machine. Discover the design and implementation of running distributed TensorFlow on Mesos clusters with hundreds of GPUs, leveraging key features like GPU isolation and nested containers. Gain insights into GPU and gang scheduling, task discovery, and dynamic port allocation. See real-world examples of distributed training speed-ups using a TensorFlow model for image classification. Delve into Uber's deep learning applications in self-driving vehicles, trip forecasting, and fraud detection. Understand the architecture of Peloton, Uber's cluster management system, and its features for elastic GPU resource management, resource pools, and placement strategies. Compare distributed TensorFlow and Horovod architectures on Mesos, and examine their performance benefits for large-scale deep learning tasks.
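To make the distributed TensorFlow versus Horovod comparison concrete, the following is a minimal sketch (not from the talk) of Horovod-style data-parallel training with the TensorFlow 1.x API, where each Mesos task runs one ranked worker pinned to one GPU; the model and dataset are placeholder assumptions, not Uber's actual image-classification network.

```python
# Hypothetical sketch of Horovod data-parallel training (TensorFlow 1.x).
# The model, dataset, and hyperparameters are placeholders.
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod; each launched process becomes one ranked worker.
hvd.init()

# Pin each worker to a single GPU based on its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder model: stand-in for the image-classification network.
images = tf.placeholder(tf.float32, [None, 224, 224, 3])
labels = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(tf.layers.flatten(images), 1000)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# Scale the learning rate by worker count and wrap the optimizer so
# gradients are averaged across workers via ring allreduce.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variables from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    pass  # feed input batches and run train_op in a loop here
```

Under a gang-scheduled Mesos setup, one would presumably launch one such process per GPU (for example with horovodrun or mpirun), in contrast to distributed TensorFlow's parameter-server architecture, which additionally requires a cluster spec listing parameter-server and worker addresses.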
Distributed Deep Learning on Apache Mesos with GPUs and Gang Scheduling