Intro

Data processing and analytics

Overview

Data sources and formats

Physical storage layout models

Different workloads

Row-wise vs Columnar

Parquet: data organization

Parquet: encoding schemes

Optimization: dictionary encoding

Optimization: predicate pushdown

Optimization: partitioning (embed predicates in directory structure)

Optimization: avoid many small files

Optimization: avoid few huge files

Optimization: Delta Lake (open-source storage layer on top of Parquet in Spark)

Conclusion

Description:

Dive into the intricacies of the Parquet format and explore performance optimization opportunities in this 41-minute conference talk by Boudewijn Braams from Databricks. Begin with an introduction to structured data formats and physical data storage models, including row-wise, columnar, and hybrid approaches. Delve deeper into the specifics of the Parquet format, examining its disk representation, physical data organization, and encoding schemes. Learn about various performance optimization techniques such as dictionary encoding, page compression, predicate pushdown, dictionary filtering, and partitioning schemes. Discover strategies to combat the issue of 'many small files' and gain insights into the open-source Delta Lake format in relation to Parquet. Suitable for both newcomers seeking an approachable refresher on columnar storage and experienced professionals looking to optimize analytical workloads in Spark, this talk provides tangible tips and tricks to leverage the Parquet format for improved performance.

The Parquet Format and Performance Optimization Opportunities

Databricks

#Data Science #Big Data #Parquet #Apache Spark #Computer Science #Operating Systems #File Management #Data Analytics #Delta Lake #Database Management #Columnar Storage