1. Intro
2. We Are on Cloud
3. Spark Clusters
4. Spark Versions and Use Cases
5. Migration Plan
6. Migration Path
7. Spark API
8. Approach
9. Translate Cascading
10. UDF Translation
11. Translate Scalding
12. Secondary Sort
13. Accumulators
14. Accumulator Continue
15. Accumulator Tab in Spark UI
16. Profiling
17. Automatic Migration Service (AMS)
18. Data Validation
19. Source of Uncertainty
20. Performance Tuning
21. Balancing Performance
22. Automatic Migration & Failure Handling
23. Future Plan
Description:
This 25-minute conference talk from Databricks covers Pinterest's migration of its batch processing to Apache Spark, including the challenges and solutions encountered while moving legacy ETL workflows written in Cascading/Scalding. It explains the motivation for the migration, how the team bridged semantic gaps between the engines, how they handled Thrift objects and improved Spark accumulators, and how they tuned performance with a Spark profiler. The talk also reports the performance improvements and cost savings achieved after the migration. Topics include Spark clusters, API approaches, translating Cascading and Scalding, secondary sort, accumulator enhancements, profiling techniques, the Automatic Migration Service (AMS), data validation, and balancing performance. It closes with the complexities of large-scale ETL workflow migration and the future plans for Pinterest's data processing infrastructure.

Migrating ETL Workflows to Apache Spark at Scale - Pinterest's Experience

Databricks