1. Introduction
We’re all used to working with CSVs, JSON files, and the like. But with the traditional libraries and large datasets, these can be extremely slow to read, write, and operate on, leading to real performance bottlenecks (I’ve been there). It’s precisely with large amounts of data that handling it efficiently becomes crucial for a data science/analytics workflow, and this is exactly where Apache Arrow comes into play.
Why? The main reason lies in how the data is stored in memory. While JSON and CSV, for example, are text-based formats, Arrow is a columnar in-memory data format, which allows fast data interchange between different data processing tools. Arrow is therefore designed to optimize performance by enabling zero-copy reads, reducing memory usage, and supporting efficient compression.
Moreover, Apache Arrow is open-source and optimized for analytics. It is designed to accelerate big data processing while maintaining interoperability with various data tools, such as Pandas, Spark, and Dask. By storing data in a columnar format, Arrow enables faster read/write operations and efficient memory usage, making it ideal for analytical workloads.
Sounds great, right? Best of all, that’s all the introduction to Arrow I’ll provide. Enough theory; we want to see it in action. So, in this post, we’ll explore how to use Arrow in Python and how to make the most of it.
2. Arrow in Python
To get started, you need to install the necessary libraries: pandas and pyarrow.
pip install pyarrow pandas
Then, as always, import them in your Python script:
import pyarrow as pa
import pandas as pd
Nothing new yet, just necessary steps to do what follows. Let’s start by performing some simple operations.
2.1. Creating and Storing a Table
The simplest thing we can do is hard-code our table’s data. Let’s create a two-column table with football data:
teams = pa.array(['Barcelona', 'Real Madrid', 'Rayo Vallecano', 'Athletic Club', 'Real Betis'], type=pa.string())
goals = pa.array([30, 23, 9, 24, 12], type=pa.int8())
team_goals_table = pa.table([teams, goals], names=['Team', 'Goals'])
The resulting object is a pyarrow.Table, but we can easily convert it to pandas if we want:
df = team_goals_table.to_pandas()
And convert it back to Arrow using:
team_goals_table = pa.Table.from_pandas(df)
Finally, we’ll store the table in a file. We could use different formats, like Feather or Parquet; I’ll use the latter because it’s fast and space-efficient on disk:
import pyarrow.parquet as pq
pq.write_table(team_goals_table, 'data.parquet')
Reading a Parquet file back is just as simple: pq.read_table('data.parquet').
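As a quick illustration (a minimal sketch reusing the pq import from above; the column selection is just an example), pq.read_table can also load only the columns we need:
# Read the whole file back into an Arrow table
table = pq.read_table('data.parquet')
# Or read only the columns we care about
teams_only = pq.read_table('data.parquet', columns=['Team'])
df = teams_only.to_pandas()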
2.2. Compute Functions
Arrow has its own compute module for the usual operations. Let’s start by comparing two arrays element-wise:
import pyarrow.compute as pc
>>> a = pa.array([1, 2, 3, 4, 5, 6])
>>> b = pa.array([2, 2, 4, 4, 6, 6])
>>> pc.equal(a,b)
[
false,
true,
false,
true,
false,
true
]
That was easy. Similarly, we can sum all the elements in an array with:
>>> pc.sum(a)
<pyarrow.Int64Scalar: 21>
And from this we could easily guess how to compute a count, a floor, an exp, a mean, a max, a multiplication… so there’s no need to go over them one by one.
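Just to confirm the pattern, here are a couple of them in action (a quick sketch reusing the array a from above):
>>> pc.mean(a)
<pyarrow.DoubleScalar: 3.5>
>>> pc.max(a)
<pyarrow.Int64Scalar: 6>
With that out of the way, let’s move on to tabular operations.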
We’ll start by showing how to sort a table:
>>> table = pa.table({'i': ['a','b','a'], 'x': [1,2,3], 'y': [4,5,6]})
>>> pc.sort_indices(table, sort_keys=[('y', 'descending')])
<pyarrow.lib.UInt64Array object at 0x1291643a0>
[
2,
1,
0
]
Just like in pandas, we can group values and aggregate the data. Let’s, for example, group by “i” and compute the sum on “x” and the mean on “y”:
>>> table.group_by('i').aggregate([('x', 'sum'), ('y', 'mean')])
pyarrow.Table
i: string
x_sum: int64
y_mean: double
----
i: [["a","b"]]
x_sum: [[4,2]]
y_mean: [[5,5]]
Or we can join two tables:
>>> t1 = pa.table({'i': ['a','b','c'], 'x': [1,2,3]})
>>> t2 = pa.table({'i': ['a','b','c'], 'y': [4,5,6]})
>>> t1.join(t2, keys="i")
pyarrow.Table
i: string
x: int64
y: int64
----
i: [["a","b","c"]]
x: [[1,2,3]]
y: [[4,5,6]]
By default, this performs a left outer join, but we can change that with the join_type parameter.
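For instance, an inner join over the same tables would look like this (a quick sketch; 'inner' is just one of the accepted values, alongside options such as 'full outer'):
>>> t1.join(t2, keys="i", join_type="inner")
With these particular tables it yields the same result as above, since every key is present in both.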
There are many more useful operations, but let’s see just one more to avoid making this too long: appending a new column to a table.
>>> t1.append_column("z", pa.array([22, 44, 99]))
pyarrow.Table
i: string
x: int64
z: int64
----
i: [["a","b","c"]]
x: [[1,2,3]]
z: [[22,44,99]]
Before ending this section, let’s see one last thing: how to filter a table or array:
>>> t1.filter((pc.field('x') > 0) & (pc.field('x') < 3))
pyarrow.Table
i: string
x: int64
----
i: [["a","b"]]
x: [[1,2]]
Easy, right? Especially if you’ve been using pandas and numpy for years!
3. Working with Files
We’ve already seen how we can read and write Parquet files. But let’s check some other popular file types so that we have several options available.
3.1. Apache ORC
Informally, Apache ORC can be understood as the equivalent of Arrow in the realm of file formats (even though its origins have nothing to do with Arrow). More precisely, it’s an open-source, columnar storage format.
Reading and writing it is as follows:
from pyarrow import orc
# Write table
orc.write_table(t1, 't1.orc')
# Read table
t1 = orc.read_table('t1.orc')
As a side note, we could decide to compress the file while writing by using the “compression” parameter.
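For example, a minimal sketch (zstd is just one of the codecs the parameter accepts):
# Write the table compressed with zstd
orc.write_table(t1, 't1_zstd.orc', compression='zstd')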
3.2. CSV
No secret here, pyarrow has the CSV module:
from pyarrow import csv
# Write CSV
csv.write_csv(t1, "t1.csv")
# Read CSV
t1 = csv.read_csv("t1.csv")
# Write CSV compressed and without header
options = csv.WriteOptions(include_header=False)
with pa.CompressedOutputStream("t1.csv.gz", "gzip") as out:
    csv.write_csv(t1, out, write_options=options)
# Read compressed CSV and add a custom header
t1 = csv.read_csv("t1.csv.gz", read_options=csv.ReadOptions(column_names=["i", "x"]))
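Parsing can be tuned as well. As a small sketch (forcing the column type here is purely for illustration), ConvertOptions lets us control how columns are typed at read time:
# Force 'x' to be parsed as int64 instead of relying on type inference
convert_opts = csv.ConvertOptions(column_types={"x": pa.int64()})
t1 = csv.read_csv("t1.csv", convert_options=convert_opts)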
3.3. JSON
PyArrow allows JSON reading but not writing. It’s pretty straightforward; let’s see an example, assuming we have our JSON data in “data.json”:
from pyarrow import json
# Read json
fn = "data.json"
table = json.read_json(fn)
# We can now convert it to pandas if we want to
df = table.to_pandas()
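One detail worth noting: read_json expects line-delimited JSON, i.e. one object per line. A minimal, self-contained sketch (reusing the football data from earlier):
# Create a small line-delimited JSON file first
with open("data.json", "w") as f:
    f.write('{"Team": "Barcelona", "Goals": 30}\n')
    f.write('{"Team": "Real Madrid", "Goals": 23}\n')

# Read it into an Arrow table and convert to pandas
table = json.read_json("data.json")
df = table.to_pandas()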
3.4. Feather
Feather is a portable file format for storing Arrow tables or data frames (from languages such as Python or R) that uses the Arrow IPC format internally. So, contrary to Apache ORC, this one was indeed created early on in the Arrow project.
from pyarrow import feather
# Write feather from pandas DF
feather.write_feather(df, "t1.feather")
# Write feather from table, and compressed
feather.write_feather(t1, "t1.feather.lz4", compression="lz4")
# Read feather into table
t1 = feather.read_table("t1.feather")
# Read feather into df
df = feather.read_feather("t1.feather")
4. Advanced Features
We’ve only touched on the most basic features, which is what most people will need while working with Arrow. However, Arrow’s usefulness doesn’t end here; this is right where it starts.
As this gets quite domain-specific and won’t be useful for everyone (nor is it really introductory), I’ll just mention some of these features, leaving only a tiny sketch after the list:
- We can handle memory management through the Buffer type (built on top of the C++ Buffer object). Creating a buffer from our data does not allocate any memory; it is a zero-copy view of the memory exported from the data bytes object. Along the same lines, an instance of MemoryPool tracks all allocations and deallocations (like malloc and free in C), which lets us monitor how much memory is being allocated.
- Similarly, there are different ways to work with input/output streams in batches.
- PyArrow comes with an abstract filesystem interface, as well as concrete implementations for various storage types. For example, we can write and read Parquet files from an S3 bucket using S3FileSystem. Google Cloud Storage and the Hadoop Distributed File System (HDFS) are also supported.
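To give just a taste of the first point, here is a tiny sketch of the Buffer and MemoryPool basics mentioned above:
import pyarrow as pa

data = b"hello arrow"
# Zero-copy view: no memory is allocated, the buffer simply wraps the bytes object
buf = pa.py_buffer(data)
print(buf.size)  # 11

# The default memory pool tracks every allocation/deallocation made by pyarrow
pool = pa.default_memory_pool()
print(pool.bytes_allocated())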
5. Conclusion and Key Takeaways
Apache Arrow is a powerful tool for efficient data handling in Python. Its columnar storage format, zero-copy reads, and interoperability with popular data processing libraries make it ideal for data science workflows. By integrating Arrow into your pipeline, you can significantly boost performance and optimize memory usage.