1. Introduction
2. Anatomy of a System
3. System Definition
4. Failure Modes
5. Failure Walkthrough
6. Signs of Trouble
7. Shutting It Down
8. What Changed
9. Roll It Back
10. What We Missed
11. A sensible default
12. What we learned
13. Metrics need context
14. Centralized logging
15. Losing alerts
16. Recap
17. Questions
Description:
Explore a real-world failure in a distributed system and the troubleshooting process involved in this 42-minute conference talk from GOTO Chicago 2017. Follow Jeff Smith, Manager of Production Operations at Centro, as he dissects the anatomy of a system, defines failure modes, and walks through the signs of trouble. Learn about shutting down systems, identifying changes, rolling back, and uncovering missed issues. Gain insights on the importance of sensible defaults, contextual metrics, centralized logging, and alert management. Conclude with a recap and Q&A session to deepen your understanding of handling complex system failures in DevOps environments.

Troubleshooting Tiered Tragedy - A Peek Into Failure

GOTO Conferences