Degraded Service Modes, or An Imperfect Experience Usually Beats a Nonexistent One
7
Use Functional Gates Pre-, Post- and During Releases
8
Design to Meet SLAs and Mitigate Incidents Quickly
9
Regularly Exercise All Processes and Tools
10
Enforce Processes with Technology
11
Redirect or Drop Traffic Aggressively During Incidents
12
Production Quality Tools
13
Sanitize and Verify Inputs
14
Understand All of the Scenarios You Support
15
Transition Service Responsibilities Carefully
Description:
Explore insights from more than two decades of systems engineering experience in this 39-minute SREcon conversation with David Argent from Amazon. Learn from failures in designing and running large-scale online services, and discover key concepts such as minimizing change impact, implementing thorough monitoring, automating mitigations, and designing for quick incident resolution. The session also covers the importance of exercising processes regularly, enforcing processes with technology, and understanding every supported scenario. Argent draws on a background that spans roles such as Technical Writer, Systems Engineer, and Lead Problem Engineer at companies including Microsoft and Amazon.
Confessions of a Systems Engineer - Learning from My 20+ Years of Failure