1. Introduction
2. Hurricane Sandy
3. The wakeup call
4. The life of the request
5. Edge points of presence
6. Origin regions
7. Draining regions
8. Data center failures
9. Subregion failures
10. A few thousand servers lost power
11. The switchboard
12. Power panels
13. Fault domain
14. Drain region
15. What did it take out
16. The core problem
17. Which services will be impacted
18. Types of subregion failures
19. Single fault domain
20. Problem statement
21. Easy services
22. Constraints
23. Not everything is gravy
24. What happened next
25. Power Loss Siren
26. Power Failure
27. What I learned
28. Acknowledgement
Description:
Explore how Facebook handles partial data center failures in this LISA19 conference talk. Learn about the Sub-Region Disaster Recovery initiative, which aims to keep data centers online during localized physical failures. Discover the development of an "auditor" that simulates power outages and understand the challenges of managing stateless, stateful, and storage systems during partial failures. Gain insights into testing methodologies, including intentional machine disconnections, and hear real-world stories about accidental power disruptions. Examine the impact of various failure types, from submarine cable disconnections to localized issues like power breaker failures and cooling system malfunctions. Understand the complexities of maintaining service availability in large-scale, geo-distributed data center environments and the strategies employed to minimize the impact of partial failures on overall operations.

Sub-Region Failure - How to Handle the Partial Loss of a Data Center

USENIX