Intro

There Are No Safe Changes

Minimize the Blast Radius on Changes

Monitor Accurately and Measure Thoroughly

Automate Mitigations

Degraded Service Modes, or An Imperfect Experience Usually Beats a Nonexistent One

Use Functional Gates Pre-, Post- and During Releases

Design to Meet SLAs and Mitigate Incidents Quickly

Regularly Exercise All Processes and Tools

Enforce Processes with Technology

Redirect or Drop Traffic Aggressively During Incidents

Production Quality Tools

Sanitize and Verify Inputs

Understand All of the Scenarios You Support

Transition Service Responsibilities Carefully

Description:

A 39-minute talk from SREcon20 Americas in which David Argent, an Amazon systems engineer, shares lessons learned from more than two decades of failures in running large-scale online services. The talk covers practices for designing and operating complex systems: minimizing the impact of changes, monitoring accurately, automating mitigations, and designing for quick incident resolution. It also addresses exercising processes regularly, enforcing procedures with technology, and transitioning service responsibilities carefully. Further topics include degraded service modes, functional gates before, during, and after releases, and aggressively redirecting or dropping traffic during incidents. The remaining advice covers building production-quality tools, sanitizing and verifying inputs, and understanding every scenario a service supports.

Confessions of a Systems Engineer - Learning from My 20+ Years of Failure

USENIX

#Conference Talks #SREcon #Business #Management & Leadership #Change Management #Business Management #Process Improvement #Engineering #Systems Engineering