Intro

A PHENOMENAL EVENING

ELK @ SQUARESPACE

SERVICE RELIABILITY PRINCIPLES

THE RELIABILITY STACK

SERVICE LEVEL INDICATORS

SERVICE LEVEL OBJECTIVES

ERROR BUDGETS ARE AWESOME

THIS RELIABILITY STUFF ISN'T NEW

THE INCIDENT COMMAND SYSTEM

PROBLEMS THE ICS ADDRESSES

OPERATIONS LEAD

INCIDENT COMMANDER 1

TIMELINE OF A 37-HOUR INCIDENT

SEE THE FOREST FOR THE TREES

THE UNSHARDENING

KEY COMPONENTS

DATA COLLECTION

LESSONS LEARNED

REPAIR ITEMS

PROGRESS IS INCREMENTAL

ALERT ON WHAT MATTERS: PUT YOUR USERS FIRST

Description:

Discover how to implement Site Reliability Engineering (SRE) practices in a challenging environment through this 40-minute conference talk from SREcon19 Europe/Middle East/Africa. Follow Squarespace engineers Alex Hidalgo and Alex Lee as they share their journey of transforming a struggling centralized logging platform from 85% reliability to a documented 99.9% uptime. Learn about key SRE concepts such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets. Explore the implementation of the Incident Command System (ICS) and its role in addressing operational challenges. Gain insights into data collection strategies, lessons learned, and the importance of incremental progress in improving system reliability. Understand how to prioritize user-focused alerting and apply SRE principles to resolve long-standing incidents, even when starting from a critical state.
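As a rough illustration of the error-budget idea named above, the following sketch converts an availability SLO into allowed downtime. It is not taken from the talk: the 30-day window, the function names, and the reading of the 85% and 99.9% figures as availability targets are all assumptions made for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO.

    slo_target is a fraction such as 0.999; window_days is an assumed
    30-day rolling window, not a value stated in the talk description.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget left, with the SLI measured as good/total events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    for target in (0.85, 0.999):
        print(f"SLO {target:.1%}: {error_budget_minutes(target):.1f} minutes of budget per 30 days")
```

Under these assumptions, a 99.9% target leaves about 43.2 minutes of budget in 30 days, while 85% corresponds to 6,480 minutes (108 hours), which gives a sense of the gap the description refers to.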

How to SRE When Everything's Already on Fire

USENIX

#Conference Talks #SREcon #Data Science #Data Collection #Computer Science #Information Technology #Incident Management #DevOps #Service Level Indicators