[] 17. Cloud virtual machines have pre-installed monitoring
[] 18. Fine-tuning
[] 19. Storage, networking, and complexity in network design
[] 20. Start simple before advanced; consider model needs.
Description:
Explore the intricacies of handling multi-terabyte LLM checkpoints in this podcast episode featuring Simon Karasik, Machine Learning Engineer at Nebius AI. The conversation digs into the challenges of LLM checkpointing, including checkpoint sizes and techniques for saving and loading massive checkpoints, along with practical guidance on choosing the right cloud storage for them.

Simon also shares his varied background in machine learning, spanning ads, speech, and tax, and touches on topics such as zombie model garbage collection, the evolution of LLMs, and the importance of confidence in AI training.

The episode then examines the differences between Slurm and Kubernetes, lessons learned from storage choices, and the essential components for setting up LLM infrastructure, including Argo Workflows and Kubernetes node troubleshooting. It closes with the complexities of fine-tuning, storage, and networking in LLM development, and practical advice: start simple before advancing to more complex setups, and understand model-specific needs in the rapidly evolving field of large language models.
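To make the checkpointing discussion a bit more concrete, here is a minimal sketch of per-rank sharded checkpointing to object storage. It is not taken from the episode: it assumes an already-initialized PyTorch distributed (DDP/FSDP-style) training job and an S3-compatible bucket via boto3, and the function name, bucket, and paths are hypothetical placeholders.

```python
# Minimal sketch (illustrative only): each rank writes its own shard locally,
# then uploads it to S3-compatible object storage, so no single node has to
# hold a multi-terabyte checkpoint by itself.
import os

import boto3
import torch
import torch.distributed as dist


def save_sharded_checkpoint(model, optimizer, step, bucket="my-llm-checkpoints"):
    """Save this rank's shard of the training state and upload it. Assumes
    dist.init_process_group() has already been called by the training script."""
    rank = dist.get_rank()
    local_path = f"/tmp/ckpt_step{step}_rank{rank}.pt"

    # 1. Serialize this rank's shard to fast local disk first.
    torch.save(
        {
            "model": model.state_dict(),        # sharded under FSDP
            "optimizer": optimizer.state_dict(),
            "step": step,
        },
        local_path,
    )

    # 2. Upload the shard; bucket/key layout here is a made-up convention.
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, f"step_{step}/rank_{rank}.pt")
    os.remove(local_path)

    # 3. Wait until every rank has finished before training continues.
    dist.barrier()
```

In practice, the choice of storage backend and sharding layout depends on the cluster and model size, which is exactly the kind of trade-off the episode explores.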