Intro

Speakers

TensorFlow 2.0 Workflow

Orchestration for DL

Parameter Server

Reduce

Kubernetes Operators

Mirror Strategy in TF

TensorFlow + Horovod

PyTorch + Horovod

Recall: TFJob vs. MPIJob

Shared API and Best Practices

Description:

Explore large-scale distributed deep learning deployments on Kubernetes clusters in this conference talk. Delve into the use of operators for managing and automating machine learning training processes, comparing the open-source tf-operator and mpi-operator. Examine different distribution strategies and their impact on performance, particularly regarding CPU, GPU, and network utilization. Gain insights into optimizing orchestration for deep learning tasks, which are both network and GPU intensive, to achieve better economics and prevent idle compute capacity. Learn from shared experiences and best practices for TensorFlow 2.0 workflow, parameter servers, Kubernetes operators, mirror strategy in TensorFlow, and integrations with Horovod for both TensorFlow and PyTorch.

Large Scale Distributed Deep Learning on Kubernetes Clusters

Linux Foundation

#Computer Science #DevOps #Kubernetes #Machine Learning #TensorFlow #Deep Learning #PyTorch #Software Engineering #Scalability #Orchestration #Distributed Deep Learning #Distributed Computing #Horovod