Intro

ZALANDO AT A GLANCE

2019: DEVELOPERS USING KUBERNETES

INGRESS ERRORS

COREDNS OOMKILL

STOP THE BLEEDING: INCREASE MEMORY LIMIT

INCREASE IN MEMORY USAGE

CONTRIBUTING FACTORS

CUSTOMER IMPACT

IAM RETURNING 404

NUMBER OF PODS

ROUTES FROM API SERVER

API SERVER DOWN

INNOCENT MANIFEST

INCIDENT #2: LESSONS LEARNED

CLUSTER DOWN?

THE TRIGGER

CLUSTER LIFECYCLE MANAGER (CLM)

CLUSTER CHANNELS

FLANNEL ERRORS

RBAC CHANGES

NETWORK SPLIT

CREDENTIALS QUEUE

WHAT HAPPENED

SLACK

DISABLING CPU THROTTLING

RACE CONDITIONS...

COMMON PITFALLS

READINESS & LIVENESS PROBES

RESOURCE REQUESTS & LIMITS

AWS EKS IN PRODUCTION

AUTOMATED E2E TESTS

MONITORING

OPENTRACING

UPGRADE TO KUBERNETES 1.14

EMERGENCY ACCESS SERVICE

KUBERNETES FAILURE STORIES

INTERNAL TICKETS BASED ON FAILURE STORIES

FACTFULNESS

WHY KUBERNETES?

COMPLEXITY FOR GOOGLE-SCALE INFRA?

OPEN SOURCE & MORE

Description:

Explore a senior principal engineer's insights on Kubernetes failure stories in this 33-minute conference talk from GOTO Berlin 2019. Dive into real-world experiences of operating over 100 clusters, uncovering valuable lessons from incidents, failures, and user reports. Learn why Kubernetes remains a sensible choice despite its perceived complexity, and gain practical knowledge on common pitfalls, best practices, and improvements in areas such as ingress errors, CoreDNS OOMKills, and API server issues. Discover the importance of proper resource management, monitoring, and automated testing in maintaining robust Kubernetes environments. Understand the benefits of sharing failure stories for continuous improvement and fostering collaboration across organizations in the Kubernetes ecosystem.

Why I Love Kubernetes Failure Stories and You Should Too

GOTO Conferences

#Conference Talks #GOTO Conferences #Computer Science #DevOps #Kubernetes #Business #Business Management #Continuous Improvement #Programming #Cloud Computing #Cloud Infrastructure #Information Technology #Incident Management
