Discover how to implement Site Reliability Engineering (SRE) practices in a challenging environment in this 40-minute conference talk from SREcon19 Europe/Middle East/Africa. Follow Squarespace engineers Alex Hidalgo and Alex Lee as they describe how they took a struggling centralized logging platform from 85% reliability to a documented 99.9%. Learn about key SRE concepts such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets. Explore how the team applied the Incident Command System (ICS) to its operational challenges. Gain insights into data collection strategies, lessons learned, and the value of incremental progress in improving system reliability. Understand how to prioritize user-focused alerting and apply SRE principles to resolve long-standing incidents, even when the system starts out in a critical state.