Explore how Pinterest migrated its critical Apache Spark clusters from HDFS to S3 in this 30-minute presentation. Dive into the motivations behind the transition, including the shift from Mesos to YARN as the resource scheduler. Learn about the technical challenges faced, such as S3 performance, consistency, and access control, and how each was addressed to match HDFS capabilities. Discover the changes made to the job submission process to accommodate differences between Mesos and YARN. Gain insights into Spark performance optimization through profiling and EC2 instance type selection. Examine the performance results and the smooth migration Pinterest achieved. Understand key takeaways, including read-after-write consistency solutions, performance comparisons between S3 and HDFS, strategies for dealing with metadata operations, and improvements to S3Committer. The presentation closes with the benefits of S3 over HDFS, the resulting cost savings, and the current state of Spark at Pinterest.
Migrating Pinterest Apache Spark Clusters from HDFS to S3