PySpark Explained: Delta Tables

Learn how to use the building blocks of Delta Lakes.

Delta tables are the key components of a Delta Lake, an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to big data workloads.

Delta tables (and, by extension, Delta Lakes) were conceived and implemented by the team at Databricks, the company founded by the original creators of Apache Spark.

Databricks is now a cloud-based platform for data engineering, machine learning, and analytics. Built around Apache Spark, it provides a unified environment for working with big data workloads, and Delta tables are a key component of that environment.
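
To make this concrete, here is a minimal sketch of creating and reading a Delta table from PySpark on a local machine. It assumes the delta-spark package is installed (pip install delta-spark), and the table path /tmp/delta/people and the column names are purely illustrative choices, not anything prescribed by Delta Lake itself.

```python
# Minimal sketch: write a DataFrame as a Delta table and read it back.
# Assumes delta-spark is installed; the path below is purely illustrative.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Build a SparkSession with the Delta Lake extensions enabled.
builder = (
    SparkSession.builder.appName("delta-tables-intro")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a small DataFrame in Delta format.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/people")

# Read it back like any other Spark data source.
spark.read.format("delta").load("/tmp/delta/people").show()
```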

Coming from an AWS background, I find that Delta tables are somewhat reminiscent of AWS's Athena service, which lets you run SQL SELECT queries against files held in S3, AWS's object storage service.

There is one key difference, though. Athena is designed to be a query-only tool, whereas Delta tables also let you UPDATE, DELETE, and INSERT records as easily as you can query them. In this respect, Delta tables behave more like Apache Iceberg-formatted tables, but they have the advantage of being more tightly integrated with the Spark ecosystem.
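
As a rough illustration of those DML capabilities, the sketch below continues from the table created in the earlier example and uses the DeltaTable API from the delta-spark package; the path and column names are assumptions carried over from that sketch rather than anything fixed.

```python
# Sketch: UPDATE, DELETE, INSERT, and query on an existing Delta table.
# Continues from the /tmp/delta/people table created in the earlier sketch.
from delta.tables import DeltaTable
from pyspark.sql.functions import lit

people = DeltaTable.forPath(spark, "/tmp/delta/people")

# UPDATE rows in place.
people.update(condition="name = 'bob'", set={"name": lit("robert")})

# DELETE rows matching a predicate.
people.delete("id = 1")

# INSERT new records by appending in Delta format.
new_rows = spark.createDataFrame([(3, "carol")], ["id", "name"])
new_rows.write.format("delta").mode("append").save("/tmp/delta/people")

# Query the table as a normal DataFrame.
people.toDF().show()
```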