1. Intro
2. Data processing and analytics
3. Overview
4. Data sources and formats
5. Physical storage layout models
6. Different workloads
7. Row-wise vs. columnar
8. Parquet: data organization
9. Parquet: encoding schemes
10. Optimization: dictionary encoding
11. Optimization: predicate pushdown
12. Optimization: partitioning (embed predicates in the directory structure; see the sketch after this list)
13. Optimization: avoid many small files
14. Optimization: avoid few huge files
15. Optimization: Delta Lake (open-source storage layer on top of Parquet in Spark)
16. Conclusion
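
As a rough companion to the partitioning item above, the sketch below shows how partitionBy embeds a predicate column in the Parquet directory layout. This is a minimal PySpark illustration, not material from the talk; the events DataFrame, the event_date column, and the paths are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-partitioning").getOrCreate()

    # Hypothetical input: an 'events' DataFrame with an 'event_date' column.
    events = spark.read.json("/data/raw/events")

    # partitionBy embeds the column value in the directory structure, e.g.
    # /data/events/event_date=2020-01-01/part-*.parquet, so queries that
    # filter on event_date can skip entire directories.
    (events
        .write
        .mode("overwrite")
        .partitionBy("event_date")
        .parquet("/data/events"))

Partitioning works best on low-to-moderate-cardinality columns that appear in common filters; partitioning on a high-cardinality column tends to produce the "many small files" problem covered later in the talk.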
Description:
Dive into the intricacies of the Parquet format and explore performance optimization opportunities in this 41-minute conference talk by Boudewijn Braams from Databricks. Begin with an introduction to structured data formats and physical data storage models, including row-wise, columnar, and hybrid approaches. Delve deeper into the specifics of the Parquet format, examining its disk representation, physical data organization, and encoding schemes. Learn about various performance optimization techniques such as dictionary encoding, page compression, predicate pushdown, dictionary filtering, and partitioning schemes. Discover strategies to combat the issue of 'many small files' and gain insights into the open-source Delta Lake format in relation to Parquet. Suitable for both newcomers seeking an approachable refresher on columnar storage and experienced professionals looking to optimize analytical workloads in Spark, this talk provides tangible tips and tricks to leverage the Parquet format for improved performance.
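
The description mentions predicate pushdown and the 'many small files' issue; the hedged sketch below shows what those might look like on the read and write side, continuing from the partitioned dataset written above (column names and paths are again hypothetical).

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("parquet-pushdown").getOrCreate()

    # Filters on the partition column prune whole directories, while filters
    # on ordinary columns can be pushed down into the Parquet scan so that
    # row groups and pages whose statistics cannot match are skipped.
    df = (spark.read.parquet("/data/events")
          .filter(F.col("event_date") == "2020-01-01")
          .filter(F.col("country") == "NL"))

    # The physical plan lists PartitionFilters and PushedFilters for the scan.
    df.explain()

    # Before writing, reduce the number of output partitions to avoid emitting
    # many tiny Parquet files (aim for files in the tens to hundreds of MB).
    df.coalesce(1).write.mode("overwrite").parquet("/data/events_nl")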

The Parquet Format and Performance Optimization Opportunities

Databricks