Intro

We Are on Cloud

Spark Clusters

Spark Versions and Use Cases

Migration Plan

Migration Path

Spark API

Approach

Translate Cascading

UDF Translation

Translate Scalding

Secondary Sort

Accumulators

Accumulators (Continued)

Accumulator Tab in Spark UI

Profiling

Automatic Migration Service (AMS)

Data Validation

Source of Uncertainty

Performance Tuning

Balancing Performance

Automatic Migration & Failure Handling

Future Plan

Description:

This 25-minute conference talk from Databricks covers Pinterest's migration of its batch processing to Apache Spark, including the challenges and solutions encountered in moving legacy ETL workflows written in Cascading and Scalding. It explains the motivation for the migration, how the team bridged semantic gaps between the engines, handled Thrift objects, improved Spark accumulators, and tuned performance with a Spark profiler, and it reports the performance improvements and cost savings achieved after migration. Other topics include Spark clusters, API approaches, translating Cascading and Scalding jobs, secondary sort, the Automatic Migration Service (AMS), data validation, balancing performance, and Pinterest's future plans for its data processing infrastructure.

Migrating ETL Workflows to Apache Spark at Scale - Pinterest's Experience

Databricks

#Data Science #Big Data #Apache Spark #Data Processing #Computer Science #Software Engineering #Performance Tuning #Database Management #Data Migration #Business #Marketing #Digital Marketing #Social Media #Pinterest #Data Engineering #ETL
