Explore how Honeycomb improved the reliability of its ZooKeeper, Kafka, and stateful storage systems by intentionally terminating nodes in this SREcon20 Americas talk. Trace the journey from manual experiments to automated node recycling, including the bugs in replacement tooling uncovered along the way. Learn about the importance of resilience engineering, continuous delivery, and maintaining operational continuity. Understand how to quantify reliability, identify potential risks, and design experiments to probe those risks. Delve into Service Level Objectives (SLOs) as a common language for defining success and managing error budgets. Gain insights on handling data persistence, monitoring the effect of changes using Service Level Indicators (SLIs), and leveraging observability for debugging. Follow Honeycomb's progression towards continuously running experiments, resulting in no node living longer than 12 months and weekly automated node recycling. Acquire practical knowledge on improving system reliability and scalability, applicable even without advanced automation or a Kubernetes deployment.