Intro

Optimization Tricks

What are Pandas UDFs?

Development tips and tricks

Modeling at Quantcast

Example Problem

Naive approach: Use Spark SQL

Optimization: Use Pandas UDFs for Looping

Optimization: Aggregate Keys in Batches

Optimization: Inverted Indexes

Optimization: Use Python libraries

Optimization: Summary

Description:

Discover optimization techniques for Spark SQL data processing using Pandas UDFs in this 27-minute video from Databricks. Learn how to accelerate query performance by over an order of magnitude through specialized batch processing jobs. Explore what Spark SQL excels at and where it falls short, and gain insights into implementing custom UDFs for significant performance gains. Understand how to profile Spark SQL jobs efficiently to validate optimization strategies. Follow along as the speaker shares experiences from developing a model training pipeline at Quantcast, processing petabytes of data for thousands of models. Dive into practical examples, including naive approaches and various optimization techniques such as looping with Pandas UDFs, aggregating keys in batches, using inverted indexes, and leveraging Python libraries. Equip yourself with valuable knowledge to enhance your data processing workflows in Spark SQL.

Accelerating Data Processing in Spark SQL with Pandas UDFs - Optimization Techniques

Databricks

#Data Science #Big Data #Apache Spark #Spark SQL #Programming #Programming Languages #Python #Computer Science #Software Engineering #Performance Tuning #Data Analytics #Data Processing #Batch Processing
