Leveraging open technologies to monitor packet drops in AI cluster fabrics
Description:
Learn how to monitor and troubleshoot packet drops in AI cluster networks in this technical talk by eBay's Director of Site Network Engineering. The talk explains why lossless networks matter for AI cluster performance and job completion times, and describes the development of a Telemetry and Monitoring (TAM) solution built on the Open Compute Project's SAI and open sFlow drop notification technologies. It covers how to implement monitoring tools that capture packet drops, generate notifications, identify drop reasons, and locate congestion points. It also shares eBay's experience with open networking hardware and the community SONiC implementation in its data centers, along with best practices for tuning infrastructure components, including switches, NICs, and GPU servers, to maintain network performance.

Open Compute Project