1. Intro
2. Agenda
3. Big Data Platform
4. Old vs New cluster
5. Old Cluster: Performance Bottleneck
6. A Simple Aggregation Query
7. 9k Mappers * 9k Reducers
8. New Cluster: Choose the right EC2 instance
9. Key Takeaways
10. Read after write consistency
11. How often does this happen
12. Solution. Considerations
13. Our Approach
14. Performance Comparison: S3 vs HDFS
15. Dealing with Metadata Operation
16. Reduce Move Operations
17. Multipart Upload API
18. The Last Move Operation
19. Fix Bucket Rate Limit Issue (503)
20. Improving S3Committer
21. S3 Benefit Compare to HDFS
22. Things We Miss in Mesos
23. Cost Saving
24. Spark at Pinterest
Description:
Explore the migration process of Pinterest's critical Apache Spark clusters from HDFS to S3 in this 30-minute presentation. Dive into the motivations behind the transition, including the shift from Mesos to YARN as the resource scheduler. Learn about the technical challenges faced, such as S3 performance, consistency, and access control, and how they were addressed to match HDFS capabilities. Discover the changes made to job submission processes to accommodate differences between Mesos and YARN. Gain insights into Spark performance optimization through profiling and EC2 instance type selection. Examine the performance results and smooth migration process achieved by Pinterest. Understand key takeaways, including read-after-write consistency solutions, performance comparisons between S3 and HDFS, strategies for dealing with metadata operations, and improvements to S3Committer. Explore the benefits of S3 over HDFS, cost savings, and the current state of Spark at Pinterest.

Migrating Pinterest Apache Spark Clusters from HDFS to S3

Databricks