Intro
Culture
1. Reliability can't be taken for granted
2. Cattle vs. Pets
3. Blamelessness
4. Measure what matters
A word on Ops
5. Failure modes
6. No heroes
7. Automation
Change is constant
8. Change is No. 1 reason for outages
9. Outages are inevitable
10. No haunted graveyards
What did we learn?
Outro
Description:
Explore key insights from Google's Site Reliability Engineering (SRE) practices in this 39-minute conference talk. Discover ten fundamental organizational principles learned from managing one of the world's most complex production infrastructures. Learn about the importance of reliability, the "cattle vs. pets" approach, blameless culture, effective measurement, failure modes, and automation. Understand why change is constant and the leading cause of outages, why outages are inevitable, and the concept of avoiding "haunted graveyards" in systems. Gain valuable knowledge on maintaining reliable, scalable, efficient, and agile production environments from Google's extensive experience in SRE.

Ten Things We've Learned From Running Production Infrastructure at Google

GOTO Conferences