Degraded Service Modes, or An Imperfect Experience Usually Beats a Nonexistent One
7
Use Functional Gates Pre-, Post- and During Releases
8
Design to Meet SLAs and Mitigate Incidents Quickly
9
Regularly Exercise All Processes and Tools
10
Enforce Processes with Technology
11
Redirect or Drop Traffic Aggressively During Incidents
12
Production Quality Tools
13
Sanitize and Verify Inputs
14
Understand All of the Scenarios You Support
15
Transition Service Responsibilities Carefully
Description:
Explore a 39-minute conference talk from SREcon20 Americas in which David Argent, an Amazon systems engineer, shares lessons learned from more than two decades of failures in running large-scale online services. Gain insight into best practices for designing and operating complex systems, including minimizing the impact of changes, implementing thorough monitoring, automating mitigations, and designing for quick incident resolution. Learn why it matters to exercise processes and tools regularly, enforce procedures with technology, and transition service responsibilities carefully. Discover practical advice on creating degraded service modes, using functional gates before, during, and after releases, and aggressively redirecting or dropping traffic during incidents. The talk also covers building production-quality tools, sanitizing and verifying inputs, and understanding every scenario a service supports.
Confessions of a Systems Engineer - Learning from My 20+ Years of Failure