1. Intro
2. ZALANDO AT A GLANCE
3. 2019: DEVELOPERS USING KUBERNETES
4. INGRESS ERRORS
5. COREDNS OOMKILL
6. STOP THE BLEEDING: INCREASE MEMORY LIMIT
7. INCREASE IN MEMORY USAGE
8. CONTRIBUTING FACTORS
9. CUSTOMER IMPACT
10. IAM RETURNING 404
11. NUMBER OF PODS
12. ROUTES FROM API SERVER
13. API SERVER DOWN
14. INNOCENT MANIFEST
15. INCIDENT #2: LESSONS LEARNED
16. CLUSTER DOWN?
17. THE TRIGGER
18. CLUSTER LIFECYCLE MANAGER (CLM)
19. CLUSTER CHANNELS
20. FLANNEL ERRORS
21. RBAC CHANGES
22. NETWORK SPLIT
23. CREDENTIALS QUEUE
24. WHAT HAPPENED
25. SLACK
26. DISABLING CPU THROTTLING
27. RACE CONDITIONS..
28. COMMON PITFALLS
29. READINESS & LIVENESS PROBES
30. RESOURCE REQUESTS & LIMITS
31. AWS EKS IN PRODUCTION
32. AUTOMATED E2E TESTS
33. MONITORING
34. OPENTRACING
35. UPGRADE TO KUBERNETES 1.14
36. EMERGENCY ACCESS SERVICE
37. KUBERNETES FAILURE STORIES
38. INTERNAL TICKETS BASED ON FAILURE STORIES
39. FACTFULNESS
40. WHY KUBERNETES?
41. COMPLEXITY FOR GOOGLE-SCALE INFRA?
42. OPEN SOURCE & MORE
Description:
Explore a senior principal engineer's insights on Kubernetes failure stories in this 33-minute conference talk from GOTO Berlin 2019. Dive into real-world experiences of operating over 100 clusters, uncovering valuable lessons from incidents, failures, and user reports. Learn why Kubernetes remains a sensible choice despite its perceived complexity, and gain practical knowledge on common pitfalls, best practices, and improvements in areas such as ingress errors, CoreDNS OOMKills, and API server issues. Discover the importance of proper resource management, monitoring, and automated testing in maintaining robust Kubernetes environments. Understand the benefits of sharing failure stories for continuous improvement and fostering collaboration across organizations in the Kubernetes ecosystem.

Why I Love Kubernetes Failure Stories and You Should Too

GOTO Conferences