Intro

Short Intro

Outline

Introduction on Apache Parquet

Parquet: Glossary

Parquet: Data Page

Background

Non-Vectorized Parquet Reader

Advantages of Vectorized Approach

High Level Idea

Parquet Schema Conversion

SPARK-34863: Complex type support

Complex Type - Performance

Perf: vectorized vs non-vectorized

Parquet Predicate Pushdown

Column Index Filtering

Column Index Support in Spark

Column Index - Performance

Future Work

Description:

This 37-minute talk from Databricks covers recent improvements to Apache Parquet read performance in Apache Spark. It introduces vectorized read support for complex types, which can deliver a 10x+ improvement when reading Parquet data with complex structures, and explains how the vectorized and non-vectorized Parquet readers differ. It then shows how Parquet column index support strengthens predicate pushdown, letting Spark filter data more efficiently and improve scan performance. Along the way the talk covers Parquet schema conversion, complex type support, and column index filtering, and it closes with future work items aimed at further improving Parquet read performance in Spark.
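To make the idea behind column index filtering concrete, here is a minimal sketch in plain Python. It is a toy model and not Parquet's or Spark's actual implementation: the `PageStats` class and the `pages_to_read` and `row_ranges` helpers are hypothetical names introduced only for illustration. It assumes each data page records the minimum and maximum value it contains, so a reader can skip any page whose range cannot match the filter.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PageStats:
    """Hypothetical per-page summary: value bounds plus the rows the page covers."""
    min_value: int
    max_value: int
    first_row: int
    row_count: int


def pages_to_read(pages: List[PageStats], lower: int, upper: int) -> List[PageStats]:
    """Keep only pages whose [min, max] range overlaps the filter range [lower, upper]."""
    return [p for p in pages if p.max_value >= lower and p.min_value <= upper]


def row_ranges(pages: List[PageStats]) -> List[Tuple[int, int]]:
    """Return the inclusive (first_row, last_row) span covered by each page."""
    return [(p.first_row, p.first_row + p.row_count - 1) for p in pages]


# Four pages of a sorted column, 100 rows each (made-up illustrative data).
pages = [
    PageStats(min_value=0, max_value=99, first_row=0, row_count=100),
    PageStats(min_value=100, max_value=199, first_row=100, row_count=100),
    PageStats(min_value=200, max_value=299, first_row=200, row_count=100),
    PageStats(min_value=300, max_value=399, first_row=300, row_count=100),
]

# Filter: 150 <= value <= 210. Only the second and third pages can contain matches.
selected = pages_to_read(pages, lower=150, upper=210)
print(row_ranges(selected))  # [(100, 199), (200, 299)]
```

In this toy example the filter touches two of the four pages, so half of the column data never needs to be decoded. The benefit depends on how well the data is clustered: on unsorted data most page ranges overlap most filters and few pages can be skipped.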

Recent Parquet Improvements in Apache Spark - Vectorized Complex Types and Column Index Support

Databricks

#Data Science #Big Data #Apache Spark #Parquet
