1. Intro
2. A PHENOMENAL EVENING
3. ELK @ SQUARESPACE
4. SERVICE RELIABILITY PRINCIPLES
5. THE RELIABILITY STACK
6. SERVICE LEVEL INDICATORS
7. SERVICE LEVEL OBJECTIVES
8. ERROR BUDGETS ARE AWESOME
9. THIS RELIABILITY STUFF ISN'T NEW
10. THE INCIDENT COMMAND SYSTEM
11. PROBLEMS THE ICS ADDRESSES
12. OPERATIONS LEAD
13. INCIDENT COMMANDER 1
14. TIMELINE OF A 37-HOUR INCIDENT
15. SEE THE FOREST FOR THE TREES
16. THE UNSHARDENING
17. KEY COMPONENTS
18. DATA COLLECTION
19. LESSONS LEARNED
20. REPAIR ITEMS
21. PROGRESS IS INCREMENTAL
22. ALERT ON WHAT MATTERS Put your users first
Description:
Discover how to implement Site Reliability Engineering (SRE) practices in a challenging environment through this 40-minute conference talk from SREcon19 Europe/Middle East/Africa. Follow Squarespace engineers Alex Hidalgo and Alex Lee as they share their journey of transforming a struggling centralized logging platform from 85% reliability to a documented 99.9% uptime. Learn about key SRE concepts such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets. Explore the implementation of the Incident Command System (ICS) and its role in addressing operational challenges. Gain insights into data collection strategies, lessons learned, and the importance of incremental progress in improving system reliability. Understand how to prioritize user-focused alerting and apply SRE principles to resolve long-standing incidents, even when starting from a critical state.

How to SRE When Everything's Already on Fire

USENIX