1. Intro
2. Background
3. Postmodern Database
4. Automation
5. User escalation
6. Initial investigation
7. Restoring service objects
8. Collecting service definitions
9. The impact of the incident
10. The reason for the failure
11. Fixing the webhooks
12. Why the operator went rogue
13. Kubernetes label selector package
14. Test engineer accidentally created app load balancer
15. What can we learn
16. Paradoxical Finalizer
17. Paging Storm
18. Mitigation
19. Kubernetes Platform
20. Manual Operations
21. Lessons Learned
22. User Complaints
23. Monitoring Dashboard
24. Victim Cluster
25. Security Context Change
26. Learnings
27. Recap
28. Key takeaways
Description:
Explore real-world production incident stories from managing hundreds of Kubernetes clusters, with a focus on clusters scaling to 10K+ nodes. Learn how seemingly simple operations like adding a single node or modifying a configmap can trigger chain reactions that disrupt entire clusters. Discover best practices for maintaining high cluster availability through lessons learned from failures involving postmodern databases, automation, user escalation, and paradoxical finalizers. Gain insights into mitigating paging storms, handling manual operations, and improving monitoring dashboards. Understand the importance of security context changes and key takeaways for effectively managing large-scale Kubernetes environments.
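
The outline above mentions the Kubernetes label selector package and how a small change can ripple across an entire cluster. As a rough, hypothetical sketch (not code from the talk), the snippet below uses k8s.io/apimachinery/pkg/labels to show one such footgun: an empty selector string parses cleanly and matches every object, while labels.Nothing() is the explicit way to match none.

    package main

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/labels"
    )

    func main() {
        // Illustrative only: an empty selector string parses without error
        // and matches EVERY labeled object, which is one way a "harmless"
        // change can suddenly target a whole cluster.
        sel, err := labels.Parse("")
        if err != nil {
            panic(err)
        }

        pod := labels.Set{"app": "payments", "tier": "prod"} // hypothetical labels
        fmt.Println(sel.Matches(pod)) // true: empty selector selects everything

        // A scoped selector only matches the objects it names.
        scoped, err := labels.Parse("app=payments,tier=canary")
        if err != nil {
            panic(err)
        }
        fmt.Println(scoped.Matches(pod)) // false: tier does not match

        // labels.Nothing() is the explicit "match no objects" selector.
        fmt.Println(labels.Nothing().Matches(pod)) // false
    }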

How to Not Destroy Your Production Kubernetes Clusters

USENIX