Explore how Facebook handles partial data center failures in this LISA19 conference talk. Learn about the Sub-Region Disaster Recovery initiative, which aims to keep data centers online during localized physical failures. Discover how an "auditor" that simulates power outages was developed, and understand the challenges of managing stateless, stateful, and storage systems during partial failures. Gain insights into testing methodologies, including intentional machine disconnections, and hear real-world stories about accidental power disruptions. Examine the impact of various failure types, from submarine cable disconnections to localized issues such as power breaker failures and cooling system malfunctions. Understand the complexities of maintaining service availability in large-scale, geo-distributed data center environments and the strategies used to limit the impact of partial failures on overall operations.
Sub-Region Failure - How to Handle the Partial Loss of a Data Center