Chapters:
1. Introduction
2. Industry Trends
3. AI by Enterprises
4. Storage and Compute
5. Ecosystem
6. Training and Deployment
7. Network Interfaces
8. Middleware Stack
9. Software
10. Preprocessing
11. Summary
12. Questions
13. What is GDR
14. Dual Approach
15. GPU Direct RDMA
16. Storage Needs
17. Training Methods
18. Network Usage
19. Collective Operations
20. Optane Persistent Memory
Description:
Learn about training deep learning models in cloud environments through this 56-minute webcast presented by experts from Habana (Intel) and IBM. Explore industry predictions showing deep learning's dominance in future cloud workloads, with a focus on foundation models trained with billions of parameters. Gain insights into AI adoption benefits across industries, infrastructure selection considerations for both on-premises and cloud deployments, and solution approaches for enterprise AI implementation. Discover how organizations leverage cloud-native AI software stacks like Kubernetes to manage complexity with evolving frameworks like TensorFlow and PyTorch. Examine critical aspects of operationalizing deep learning infrastructure, including scaling solutions, cost optimization, training time reduction, data storage capacity, bandwidth requirements, and additional key infrastructure selection criteria. Dive into technical topics like GPU Direct RDMA, storage needs, training methods, network usage, collective operations, and Optane Persistent Memory. Master the essentials of deep learning infrastructure design while understanding the tradeoffs between cost, performance, and flexibility in modern AI deployments.
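The "Collective Operations" chapter covers the communication primitives used to synchronize gradients across accelerators during distributed training, of which all-reduce is the most common. As a rough illustration only (not taken from the webcast), the sketch below simulates the ring all-reduce pattern in pure Python: each of P workers contributes a vector, and after a reduce-scatter phase followed by an all-gather phase every worker holds the elementwise sum while each link in the ring carries only about 2(P-1)/P of the data.

```python
def ring_allreduce(vectors):
    """Simulate ring all-reduce over P in-process 'workers'.

    vectors: list of P equal-length lists of numbers; length must be
    divisible by P (one chunk per worker). Returns P result vectors,
    all equal to the elementwise sum of the inputs.
    """
    p = len(vectors)
    n = len(vectors[0])
    assert n % p == 0, "vector length must split into one chunk per worker"
    chunk = n // p
    # Each worker's buffer, split into P chunks.
    bufs = [[list(v[i * chunk:(i + 1) * chunk]) for i in range(p)]
            for v in vectors]

    # Phase 1: reduce-scatter. At step s, worker r sends chunk (r - s) mod p
    # to its ring neighbor (r + 1) mod p, which accumulates it. After p - 1
    # steps, worker r owns the fully reduced chunk (r + 1) mod p.
    for s in range(p - 1):
        for r in range(p):
            c = (r - s) % p
            dst = (r + 1) % p
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], bufs[r][c])]

    # Phase 2: all-gather. Each completed chunk circulates around the ring,
    # overwriting the stale copy at each hop, until every worker has all of
    # the reduced chunks.
    for s in range(p - 1):
        for r in range(p):
            c = (r + 1 - s) % p
            dst = (r + 1) % p
            bufs[dst][c] = list(bufs[r][c])

    # Flatten each worker's chunks back into one vector.
    return [[x for ch in b for x in ch] for b in bufs]


# Two workers, vector length 4: both end up with the elementwise sum.
print(ring_allreduce([[1, 2, 3, 4], [5, 6, 7, 8]]))
# → [[6, 8, 10, 12], [6, 8, 10, 12]]
```

In practice this is what libraries such as NCCL implement on real network links; frameworks like PyTorch and TensorFlow invoke it through their distributed backends rather than exposing the ring directly.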

Training Deep Learning Models in the Cloud - Infrastructure Considerations and Best Practices

SNIA