1. Intro
2. What is a "Data Dog"?
3. Overview
4. Kafka 101
5. Kafka at Datadog
6. Baby's First Keyfunc
7. Node Failure
8. AZ Failure
9. Kube Cluster Failure
10. Cloud Vendor Failure
11. Remember This
12. Partitioning (not the Kafka kind)
13. Partitioning: Before
14. Partitioning: After
15. Balanced Topic
16. Consumer Shards
17. Big Customers
18. Partition Imbalance
19. Slicer
20. Rebalancing
Description:
Explore the inner workings of Datadog's metrics backend in this SREcon22 Americas conference talk. Follow the evolution of Datadog's distributed system from its small beginnings to its current large-scale operation across major cloud providers, and learn about the scaling and reliability challenges the team faced, the solutions they found, and the key lessons and strategies that emerged. Topics include Kafka's role at Datadog, partitioning techniques, and how the system handles node, availability zone, Kubernetes cluster, and cloud vendor failures, along with balanced topics, consumer shards, and partition imbalance. The talk also gives a glimpse of unsolved problems and future plans for the metrics backend. Presented by Adam Mckaig, Staff Engineer, and Tahia Khan, SRE at Datadog, it is aimed at anyone interested in large-scale distributed systems and cloud monitoring.

How the Metrics Backend Works at Datadog

USENIX