[] 17. Cloud virtual machines have pre-installed monitoring
[] 18. Fine-tuning
[] 19. Storage, networking, and complexity in network design
[] 20. Start simple before advanced; consider model needs.
Description:
Explore the intricacies of handling multi-terabyte LLM checkpoints in this podcast episode featuring Simon Karasik, Machine Learning Engineer at Nebius AI. The conversation digs into the challenges of LLM checkpointing, including checkpoint sizes and techniques for saving and loading massive checkpoints, along with practical guidance on choosing the right cloud storage for them.

Simon also shares his varied background in machine learning, spanning ads, speech, and tax, and touches on topics such as zombie model garbage collection, the evolution of LLMs, and the importance of confidence in AI training.

The episode then examines the differences between Slurm and Kubernetes, lessons learned from storage choices, and the essential components for setting up LLM infrastructure, including Argo Workflows and Kubernetes node troubleshooting. It closes with the complexities of fine-tuning, storage, and networking in LLM development, and practical advice: start simple before advancing to more complex setups, and understand model-specific needs in the rapidly evolving field of large language models.
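To make the checkpointing discussion a bit more concrete, here is a minimal sketch of per-rank sharded checkpointing to object storage. It is not taken from the episode: it assumes an already-initialized PyTorch distributed (DDP/FSDP-style) training job and an S3-compatible bucket via boto3, and the function name, bucket, and paths are hypothetical placeholders.

```python
# Minimal sketch (illustrative only): each rank writes its own shard locally,
# then uploads it to S3-compatible object storage, so no single node has to
# hold a multi-terabyte checkpoint by itself.
import os

import boto3
import torch
import torch.distributed as dist


def save_sharded_checkpoint(model, optimizer, step, bucket="my-llm-checkpoints"):
    """Save this rank's shard of the training state and upload it. Assumes
    dist.init_process_group() has already been called by the training script."""
    rank = dist.get_rank()
    local_path = f"/tmp/ckpt_step{step}_rank{rank}.pt"

    # 1. Serialize this rank's shard to fast local disk first.
    torch.save(
        {
            "model": model.state_dict(),        # sharded under FSDP
            "optimizer": optimizer.state_dict(),
            "step": step,
        },
        local_path,
    )

    # 2. Upload the shard; bucket/key layout here is a made-up convention.
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, f"step_{step}/rank_{rank}.pt")
    os.remove(local_path)

    # 3. Wait until every rank has finished before training continues.
    dist.barrier()
```

In practice, the choice of storage backend and sharding layout depends on the cluster and model size, which is exactly the kind of trade-off the episode explores.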