Intro

The Quest for Complete DataFrame Serialization

NumPy Enhancement Proposal (NEP) 1

Promising Performance of NPZ versus Parquet

Overview

Components of a DataFrame

Block-Consolidation Strategies: Unconsolidated Blocks

Block Consolidation & Complexity

The NPY Format

Converting Contiguous Bytes to an Array

NPY & Object Arrays

NPY Versions

The NPZ Format

Encoding a DataFrame as an NPZ

JSON Metadata

NPY Performance in NumPy

Lies, Damned Lies, and Benchmarks

Nine DataFrame Fixtures

Memory Maps

Memory Mapping an Array

Memory Mapping a DataFrame

Current State

Future Work

Conclusions

Description:

Explore the potential of NumPy's NPY format as a faster alternative to Parquet for DataFrame storage in this PyCon US talk. Dive into the challenges of serializing DataFrames and learn how a custom NPZ file format with JSON metadata can offer significant performance and compatibility advantages. Examine detailed read/write performance comparisons between Parquet and NPZ across various DataFrame shapes and dtype compositions. Discover techniques for optimizing Python routines for NPY file operations and explore applications for memory-mapping complete DataFrames using NPY representation. Gain insights into improving data science workflows and reducing compute costs through this innovative approach to DataFrame storage.
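As a rough illustration of the idea described above, the sketch below stores a DataFrame's parts (index labels, column labels, and typed value blocks) as NPY members of an uncompressed ZIP archive next to a small JSON metadata member, then memory-maps a single NPY file. The member names, the metadata layout, and the helper names `write_frame` and `read_frame` are hypothetical, chosen for this example only; they are not the format presented in the talk. Only NumPy and the standard library are used.

```python
import io
import json
import os
import tempfile
import zipfile

import numpy as np

META_NAME = "__meta__.json"  # hypothetical member name, not the talk's actual layout


def _npy_bytes(array):
    """Serialize one array to NPY bytes, refusing pickled object arrays."""
    buffer = io.BytesIO()
    np.save(buffer, array, allow_pickle=False)
    return buffer.getvalue()


def write_frame(path, index, columns, blocks):
    """Write labels and value blocks as NPY members plus JSON metadata."""
    metadata = {"block_count": len(blocks)}  # assumed minimal metadata
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_STORED) as archive:
        archive.writestr(META_NAME, json.dumps(metadata))
        archive.writestr("index.npy", _npy_bytes(index))
        archive.writestr("columns.npy", _npy_bytes(columns))
        for i, block in enumerate(blocks):
            archive.writestr(f"block_{i}.npy", _npy_bytes(block))


def read_frame(path):
    """Read back the labels and blocks written by write_frame."""
    with zipfile.ZipFile(path) as archive:
        metadata = json.loads(archive.read(META_NAME))

        def load(name):
            return np.load(io.BytesIO(archive.read(name)), allow_pickle=False)

        index = load("index.npy")
        columns = load("columns.npy")
        blocks = [load(f"block_{i}.npy") for i in range(metadata["block_count"])]
    return index, columns, blocks


# Round-trip a tiny frame-like structure: one integer block, one float block.
workdir = tempfile.mkdtemp()
npz_path = os.path.join(workdir, "frame.npz")
index = np.arange(3)
columns = np.array(["a", "b", "c"])
blocks = [
    np.arange(6, dtype=np.int64).reshape(3, 2),
    np.array([[0.5], [1.5], [2.5]]),
]
write_frame(npz_path, index, columns, blocks)
index_out, columns_out, blocks_out = read_frame(npz_path)

# Memory-map a standalone NPY file: values are read from disk on access.
npy_path = os.path.join(workdir, "block.npy")
np.save(npy_path, blocks[0])
mapped = np.load(npy_path, mmap_mode="r")
```

Keeping each block as its own NPY member preserves its dtype without a conversion step, and storing the archive uncompressed keeps the array bytes contiguous. Memory mapping is shown here on a standalone NPY file rather than on a member inside the archive.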

Employing NumPy's NPY Format for Faster Than Parquet DataFrame Storage

PyCon US

#Conference Talks #PyCon US #Programming #Programming Languages #Python #NumPy #Javascript #JSON #Data Science #Data Analysis #DataFrames #Computer Science #Data Structures #Serialization