FAST '23 - Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems
Description:
This award-winning FAST '23 conference talk presents Perseus, a practical fail-slow detection framework for cloud storage systems. Fail-slow failures, in which a software or hardware component keeps running but at degraded performance, are an emerging challenge for storage operators. Perseus uses a lightweight regression-based model to quickly pinpoint and analyze performance degradation at the granularity of individual drives. Over a 10-month monitoring period covering 248,000 drives, it identified 304 fail-slow cases; isolating those drives reduced node-level 99.99th-percentile tail latency by 48%. The talk also covers a large fail-slow dataset assembled from production traces, comprising 41,000 normal drives and 315 verified fail-slow drives, and the root causes behind fail-slow drives, including poorly implemented scheduling, hardware defects, and environmental factors. The 16-minute presentation is given by researchers from Shanghai Jiao Tong University, Alibaba Inc., Xiamen University, and Zhejiang Normal University.
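To give a feel for what regression-based detection at the drive level can look like, here is a minimal sketch. It is not Perseus's actual model, which the description does not specify. It assumes each sample is a (throughput, latency) pair, fits an ordinary least-squares line to samples from peer drives, and reports how often a given drive's latency exceeds the fitted line by more than `k` residual standard deviations. The function names, the sample data, and the choice of `k` are all hypothetical.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def slow_fraction(peer_samples, drive_samples, k=3.0):
    """Fraction of a drive's samples whose latency lies above the
    peer-fitted line by more than k residual standard deviations."""
    xs = [x for x, _ in peer_samples]
    ys = [y for _, y in peer_samples]
    a, b = fit_line(xs, ys)
    # Spread of the peer samples around the fitted line.
    sigma = pstdev([y - (a + b * x) for x, y in peer_samples])
    slow = sum(1 for x, y in drive_samples if y > a + b * x + k * sigma)
    return slow / len(drive_samples)


# Made-up (throughput, latency) samples, for illustration only.
peers = [(10, 1.5), (20, 2.1), (30, 2.4), (40, 3.1), (50, 3.5),
         (60, 3.9), (70, 4.6), (80, 5.0), (90, 5.4), (100, 6.1)]
healthy = [(25, 2.2), (55, 3.7), (85, 5.3)]
degraded = [(25, 4.5), (55, 7.0), (85, 9.9)]

healthy_score = slow_fraction(peers, healthy)
degraded_score = slow_fraction(peers, degraded)
```

A drive whose score stays high across many time windows would be a candidate for closer inspection or isolation; the threshold for "high" is a tuning decision this sketch leaves open.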
Perseus - A Fail-Slow Detection Framework for Cloud Storage Systems