Intro

There Are No Safe Changes

Minimize the Blast Radius on Changes

Monitor Accurately and Measure Thoroughly

Automate Mitigations

Degraded Service Modes, or An Imperfect Experience Usually Beats a Nonexistent One

Use Functional Gates Pre-, Post- and During Releases

Design to Meet SLAs and Mitigate Incidents Quickly

Regularly Exercise All Processes and Tools

Enforce Processes with Technology

Redirect or Drop Traffic Aggressively During Incidents

Production Quality Tools

Sanitize and Verify Inputs

Understand All of the Scenarios You Support

Transition Service Responsibilities Carefully

Description:

Explore insights from over two decades of systems engineering experience in this 39-minute SREcon talk by David Argent of Amazon. Learn from his failures in designing and running large-scale online services. Discover key concepts such as minimizing the blast radius of changes, monitoring accurately, automating mitigations, and designing for quick incident resolution. Learn why it matters to exercise processes and tools regularly, enforce processes with technology, and understand all of the scenarios you support. Benefit from Argent's diverse background, spanning roles such as Technical Writer, Systems Engineer, and Lead Problem Engineer at companies including Microsoft and Amazon.

Confessions of a Systems Engineer - Learning from My 20+ Years of Failure

USENIX

#Conference Talks #SREcon #Computer Science #Information Technology #Service Level Agreements