Career Theme: Programming interfaces for data preparation, analytics, and feature engineering
What exactly is a data frame?
A data frame is ... a programming interface ... for expressing data manipulations
Data frames address many analytical workloads that are either not possible or not well-served by traditional SQL-based systems
• In R, the "data frame" data structure is part of the language
• Other projects implement their own (e.g. pandas)
• Some projects may not use any data structures (e.g. compiling operations to SQL)
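The last case above, a data frame interface with no data structure behind it, can be illustrated with a minimal sketch. `LazyFrame`, its methods, and the `trips` table are hypothetical names invented for this example and do not come from the talk or from any real library; the sketch only assumes Python's standard `sqlite3` module as the SQL backend.

```python
import sqlite3


class LazyFrame:
    """Hypothetical data-frame-style interface that stores no data itself.

    Each method returns a new LazyFrame describing a slightly larger query;
    nothing runs until collect() hands the generated SQL to a database.
    """

    def __init__(self, table, columns=("*",), predicates=()):
        self.table = table
        self.columns = tuple(columns)
        self.predicates = tuple(predicates)

    def select(self, *columns):
        return LazyFrame(self.table, columns, self.predicates)

    def filter(self, predicate):
        # The predicate is a raw SQL fragment to keep the sketch short;
        # a real library would build it from expression objects.
        return LazyFrame(self.table, self.columns, self.predicates + (predicate,))

    def to_sql(self):
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        return sql

    def collect(self, connection):
        return connection.execute(self.to_sql()).fetchall()


# Illustrative data, made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, distance_km REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Austin", 12.5), ("Boston", 3.0), ("Austin", 7.25)],
)

query = LazyFrame("trips").filter("city = 'Austin'").select("distance_km")
generated_sql = query.to_sql()
rows = query.collect(conn)
```

Here the "data frame" is only a description of work to be done: the object holds a table name, a column list, and filter fragments, and the database owns the data.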
• Most data frames are effectively "Islands" with a hard serialization barrier
• Many non-reusable implementations of the same algorithms
• Limited collaboration across projects and programming languag…
Apache Arrow
• Open source community project launched in 2016
• Intersection of database systems, big data, and data science tools
• Purpose: Language independent open standards and libraries to accele…
• Improve interoperability with other data processing systems
• Standardize data structures used in data frame implementations
• Promote collaboration and code reuse across libraries and progr…
• Limited data types
• Excessive memory consumption
• Poor processing efficiency for non-numeric types
• Accommodate larger-than-memory datasets
Apache Arrow Project Overview
• Language-agnostic in-memory columnar format for analytical query engines, data frames
• Binary protocol for IPC/RPC
• "Batteries included" development platform for build…
Arrow and the Future of Data Frames
• As more data sources offer Arrow-based data access, it will make sense to process Arrow in situ rather than converting to some other data structure
• Analytical …
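The difference between processing data in situ and converting it first can be shown with a small standard-library analogy. This is not Arrow code: `array` and `memoryview` merely stand in for a buffer received from another system, and all variable names are made up for this sketch.

```python
from array import array

# A producer hands over a column as one contiguous buffer of 64-bit integers.
# (array("q") stands in for a buffer received from another system.)
column = array("q", [10, 20, 30, 40])

# Converting: build a separate Python list, copying every value.
converted = list(column)

# In situ: a memoryview reads the producer's buffer directly, with no copy.
view = memoryview(column)
total_in_situ = sum(view)

# The view shares memory with the buffer, so an update made by the producer
# is visible through the view, while the converted copy goes stale.
column[0] = 11
view_sees_update = view[0] == 11
copy_is_stale = converted[0] == 10
```

The converted list costs extra memory and time proportional to the column length, and it must be rebuilt whenever the source changes. The in situ view avoids both costs, which is the motivation behind the slide's point.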
• Runtime memory format for analytical query processing
• Ideal companion to columnar storage like Apache Parquet
• Fully shredded columnar, supports flat and nested schemas
• Organized for cache-efficient…
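What "fully shredded columnar" means for flat and nested fields can be sketched in plain Python. The layout below is loosely modeled on the validity bitmaps and offsets buffers of the Arrow columnar format, but it is a simplified illustration rather than a spec-compliant implementation: there is no padding or alignment, and the child values sit in an ordinary Python list. The records and all names are invented for the example.

```python
from array import array

# Row-oriented input: each record has an optional number and a list of tags.
records = [
    {"score": 1.5, "tags": ["a", "b"]},
    {"score": None, "tags": []},
    {"score": 4.0, "tags": ["c"]},
]

# Flat column: one contiguous values buffer plus a validity bitmap
# (bit i set means row i is non-null; least-significant bit first).
score_values = array("d")
score_validity = bytearray((len(records) + 7) // 8)
for i, record in enumerate(records):
    value = record["score"]
    if value is None:
        score_values.append(0.0)  # placeholder slot; the bitmap marks it null
    else:
        score_values.append(value)
        score_validity[i // 8] |= 1 << (i % 8)

# Nested column, "shredded": child values are stored contiguously and an
# offsets buffer records where each row's list starts and ends.
tag_offsets = array("i", [0])
tag_values = []
for record in records:
    tag_values.extend(record["tags"])
    tag_offsets.append(len(tag_values))


def is_valid(bitmap, i):
    """Return True if bit i of the validity bitmap is set."""
    return bool(bitmap[i // 8] >> (i % 8) & 1)


def tags_for_row(i):
    """Slice the shared child buffer using the row's start and end offsets."""
    return tag_values[tag_offsets[i]:tag_offsets[i + 1]]
```

Because each column lives in its own contiguous buffer, a scan over `score_values` touches only the bytes it needs, and a row's nested list is recovered by two offset lookups instead of by following per-row pointers.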
Description:
Explore the future of data frames and Apache Arrow in this insightful conference talk by Wes McKinney, creator of Python pandas and co-creator of Apache Arrow. Delve into the background and motivation behind the Apache Arrow project, examining its columnar in-memory data standard and expanding library support across programming languages. Investigate the relationship between data frame libraries and database systems, and discover how analytics systems are likely to evolve towards "Arrow-native" implementations. Learn about the challenges faced by traditional data frame implementations and how Apache Arrow addresses these issues through standardization, improved interoperability, and efficient in-memory processing. Gain valuable insights into the potential impact of Arrow on data science tools, analytical query engines, and the future of data processing applications.
Apache Arrow and the Future of Data Frames with Wes McKinney