1. Intro
2. Outline
3. Business Model
4. Data Flow
5. Conclusion
6. Why do I care
7. Other technologies
8. Blob storage
9. Data sharing
10. Pocky
11. Why Parquet
12. Python implementations
13. Parquet file structure
14. Predicate pushdown
15. Dictionary encoding
16. Compression
17. Partitioning
18. Storage
19. ODBC
20. Azure Blob Storage
21. Questions
Description:
Explore efficient techniques for handling large columnar datasets in Apache Parquet using Pandas and Dask in this EuroPython 2018 conference talk. Dive into the Apache Parquet data format, understanding its binary and columnar structure, as well as its CPU and I/O optimization techniques. Learn how to leverage row groups, compression, and dictionary encoding to enhance data storage and retrieval. Discover methods for reading Parquet files into Pandas DataFrames using fastparquet and Apache Arrow libraries. Gain insights into working with data larger than memory or local disk space using Apache Dask, including partitioning and cloud object storage systems like Amazon S3 and Azure Storage. Master techniques such as metadata utilization, partition filenames, column statistics, and dictionary filtering to boost query performance on extensive datasets. Understand the benefits of partitioning, row group skipping, and optimal data layout for accelerating queries on large-scale data.
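The techniques listed in the description can be illustrated with a short, hedged sketch. The snippet below assumes pandas, pyarrow, and dask are installed; the file name events.parquet, the S3 path, and the column names country and amount are hypothetical placeholders, not examples taken from the talk. It shows inspecting footer metadata and row-group statistics, reading a column subset with a filter (predicate pushdown and row-group skipping), writing a dataset partitioned by a column, and reading a partitioned dataset from object storage with Dask.

```python
# Minimal sketch; paths and column names are hypothetical placeholders.
import pandas as pd
import pyarrow.parquet as pq
import dask.dataframe as dd

# Inspect the Parquet footer: row groups and per-column min/max statistics.
meta = pq.ParquetFile("events.parquet").metadata
print(meta.num_row_groups, meta.num_rows)
print(meta.row_group(0).column(0).statistics)  # statistics used for row-group skipping

# Read only the needed columns; pyarrow can skip row groups whose
# statistics cannot match the filter (predicate pushdown).
df = pq.read_table(
    "events.parquet",
    columns=["country", "amount"],
    filters=[("country", "=", "DE")],
).to_pandas()

# Write a dataset partitioned by a column: each distinct value becomes a
# directory, so later queries on that column only open the matching files.
full = pd.read_parquet("events.parquet", engine="pyarrow")
full.to_parquet("events_by_country/", engine="pyarrow", partition_cols=["country"])

# Scale past local memory with Dask: lazy, partition-per-file reads that
# also work against S3 or Azure object storage via fsspec-style URLs.
ddf = dd.read_parquet(
    "s3://my-bucket/events/",          # hypothetical bucket
    columns=["country", "amount"],
    filters=[("country", "=", "DE")],
    storage_options={"anon": True},
)
print(ddf["amount"].sum().compute())
```

Because Parquet stores each column contiguously within a row group, column selection plus min/max statistics lets a reader avoid most of the I/O a row-oriented format would require; partition directories and filters extend the same idea from row groups to whole files.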

Using Pandas and Dask to Work with Large Columnar Datasets in Apache Parquet

EuroPython Conference