incident response can learn from safety engineers in other domains
a definition...
catastrophe is always around the corner
incident response isn't easy
an overreliance on dashboards and runbooks
guesswork
spending a long time on the wrong hypothesis
fear of failure
'history doesn't repeat itself but it often rhymes'
'it seems easy to look back at an incident and determine what went wrong ...'
normative language
mechanistic reasoning
above the line, below the line
change introduces new forms of failure
experienced troubleshooters rely more on case-based strategies
science - definition
the theory of falsifiability
'a more scientific, hypothesis-driven approach to how humans perform ... can improve reliability'
why bother?
3 steps
all practitioner acts are a gamble
thank you
Description:
Explore a scientific approach to incident response in this 37-minute conference talk from Conf42 Platform Engineering 2023. Learn how safety engineering principles from other domains can be applied to improve reliability in tech systems. Discover the limitations of traditional incident response methods, including overreliance on dashboards and runbooks, guesswork, and fear of failure. Examine concepts such as normative language, mechanistic reasoning, and the theory of falsifiability. Understand why experienced troubleshooters rely on case-based strategies and how change introduces new forms of failure. Gain insights into a more hypothesis-driven approach to incident management and learn practical steps to implement this scientific methodology in your organization.
Incident Response: A Scientific Approach to Improving System Reliability