Demystifying Delta Lake: What It Is and Why It Matters

Delta Lake is not a database. It's a transaction log over Parquet that gives you ACID, time travel, and schema evolution — without a server. Here's what it does, in plain English.

TL;DR. Delta Lake is not a database. It’s a storage format — Parquet files plus a transaction log — that gives you ACID transactions, versioning, time travel, and schema enforcement on top of plain object storage. You can write to it from any language with a Delta library (Python, Rust, Spark, Polars, DuckDB), no server required. If you’ve ever wished your folder of Parquet files behaved like a real table, Delta Lake is what you’ve been looking for. This post explains what it actually does, why it works, and when to reach for it.

The problem Delta Lake solves

Imagine you run a nightly job that writes a Parquet file to a folder. Tonight’s job is half-done when the laptop goes to sleep.

What state is the file in? Corrupted, probably — partially written. Tomorrow’s reader hits an “Unexpected end of file” error and the dashboard goes red. There is no transaction. There is no rollback. The previous version is gone.

Now imagine two processes try to write to the same folder at the same time. One overwrites the other. There is no concept of a commit. The “table” is whichever file won the race.

These are the problems database engines solved 50 years ago. The catch is that databases are servers — they run as processes, they manage their own files, they enforce order through locks and write-ahead logs. The lakehouse world (data living in object stores like S3 or local disk, queried by many engines) doesn’t have a server in the middle. Until Delta Lake came along, it didn’t have transactions either.

What Delta Lake actually is

A Delta Lake “table” is two things on disk:

  1. A folder of Parquet files — your actual data, partitioned the way you’d expect.
  2. A _delta_log/ subfolder — a sequence of JSON commit files (and periodic Parquet checkpoints) that record every change to the table.

That’s it. There’s no daemon, no schema service, no central metastore required. To read a Delta table, you read the latest commit in _delta_log/ to find which Parquet files are currently part of the table, then read those files. To write a Delta table, you write your new Parquet files first, then atomically append a new commit to _delta_log/ listing the changes.

The atomicity of that final commit step is what gives you ACID. If your write half-fails, the commit never lands and the previous version of the table is unchanged. Readers always see a consistent snapshot.

Here is what a _delta_log/ directory typically looks like:

_delta_log/
  00000000000000000000.json    # initial CREATE
  00000000000000000001.json    # first INSERT
  00000000000000000002.json    # second INSERT
  00000000000000000003.json    # UPDATE
  00000000000000000004.json    # MERGE
  00000000000000000010.checkpoint.parquet    # rollup of commits 0-10
  ...

Each JSON file is a tiny ordered list of actions: added these files, removed those files, changed schema this way. The checkpoint files exist so that readers don’t have to replay every commit from the beginning of time — they read the latest checkpoint, then replay only commits after it.

What you actually get from this

Five concrete capabilities, all flowing from the transaction log:

1. ACID transactions

Every write — append, update, merge, delete — is one atomic commit. Either it fully succeeds and is visible to readers, or it fully doesn’t and the previous version is preserved. No half-written tables. No corrupt reads. No “wait for the job to finish before querying” gymnastics.

2. Time travel

Because every version is preserved (until you VACUUM), you can query the table as it existed at any prior version or timestamp:

import polars as pl

# Today's data
current = pl.read_delta("./sales")

# What did the table look like at version 5?
v5 = pl.read_delta("./sales", version=5)

# What did it look like yesterday at noon?
yesterday = pl.read_delta("./sales", version="2026-04-15T12:00:00Z")

This is genuinely magical the first time you use it to debug a downstream report that broke. You can scroll through history and see exactly which write changed the row that triggered the alert.

3. Schema enforcement and evolution

By default, Delta refuses writes that don’t match the table’s schema. No more silently-cast int columns appearing as string in tomorrow’s read. When you do want to evolve the schema (add a column, change a type), you opt in explicitly with a mergeSchema flag. The history of every schema change is logged.

4. Concurrent writers

Multiple processes can write to the same Delta table at the same time. Delta uses optimistic concurrency: each writer prepares its commit, then attempts to land it. If two writers conflict (e.g., they both touched the same files), the second one detects the conflict at commit time and either retries or fails cleanly. You don’t get the silent-overwrite problem of plain Parquet.
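The commit race can be sketched with nothing but the filesystem. This is an illustration of the idea, not the real protocol: each writer tries to create the next numbered log file exclusively, and whoever creates it first wins.

```python
import json
import os

def try_commit(log_dir: str, version: int, actions: list[dict]) -> bool:
    """Attempt to land commit number `version`; False means we lost the race."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation atomic: if another writer already landed
        # this version, open() fails instead of overwriting their commit.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # conflict: re-read the log, rebase, retry as version + 1
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    return True
```

On real object stores the exclusivity comes from a put-if-absent operation rather than O_EXCL, but the shape of the protocol is the same: prepare everything, then attempt one atomic "create commit N" step.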

5. MERGE, UPDATE, DELETE

Plain Parquet is append-only — you write a new file, you can’t edit existing rows. Delta gives you full row-level operations: UPDATE to change values, DELETE to remove rows, MERGE INTO for upserts. Internally these are implemented as “rewrite the affected files, append a new commit”; from your perspective they look like SQL.

OPTIMIZE, Z-ORDER, and VACUUM in plain English

Three operations you’ll meet quickly. All three are housekeeping.

OPTIMIZE rewrites many small Parquet files into fewer larger ones. After a lot of small writes (say, an hourly streaming job), a Delta table can accumulate hundreds of tiny files, which kills read performance. OPTIMIZE consolidates them into chunks of (typically) 1 GB. You run it on a schedule — daily is common.

Z-ORDER is an optional flag on OPTIMIZE that physically sorts the data by one or two chosen columns within each file. If you frequently filter by customer_id, Z-ORDER on customer_id makes those filters dramatically faster because Delta can skip entire files that can’t contain matching rows.

VACUUM deletes old Parquet files that are no longer referenced by the current table version (or any version newer than your retention threshold). Without VACUUM, the table grows forever — you keep paying storage for data you can no longer reach because it predates your time-travel horizon. The default retention is 7 days, which is sensible.

You don’t need to configure these on day one. You’ll know it’s time when reads slow down (run OPTIMIZE), when a particular column dominates filters (add Z-ORDER), or when your storage bill is bigger than your data (run VACUUM).

When Delta beats plain Parquet

You’re doing this…                    Plain Parquet               Delta Lake
One-shot dump of static data          Fine                        Overkill
A table read by many consumers        Fine                        Better: schema enforced
A table written more than once        Risky                       Right answer
Multiple writers (concurrent jobs)    Broken                      Right answer
Audit / regulatory history            Bring your own snapshots    Built-in (time travel)
Update / delete individual rows       Not possible                UPDATE / DELETE / MERGE
Schema needs to evolve                Manual coordination         Logged + enforced
Production analytics tables           Risky                       Right answer

Rule of thumb: the moment a Parquet table will be written to more than once, you want Delta.

How Delta compares to Iceberg and Hudi

Three open table formats compete in this space: Delta Lake, Apache Iceberg, and Apache Hudi. They solve overlapping problems with different trade-offs.

  • Delta Lake has the most mature non-JVM ecosystem (the Rust-based deltalake package is excellent), the simplest mental model, and the strongest Spark integration. It is now governed by the Linux Foundation.
  • Apache Iceberg has the most flexible metadata layer (catalog-agnostic, multi-snapshot branching), strong adoption in Snowflake and AWS Athena, and a growing engine ecosystem.
  • Apache Hudi is the oldest of the three, with strong streaming/incremental story but a steeper learning curve.

For local-first work, Delta is the easiest place to start because the deltalake Python library is mature and dependency-light. For warehouse interop, Iceberg often wins because more managed services support it natively. The good news: all three solve the same fundamental problem, and the storage layouts are similar enough that migration tools exist.

How Flowfile uses Delta Lake

Flowfile’s data catalog stores every catalog table as a Delta table on local disk (or, optionally, on S3 / ADLS / GCS). The practical implications:

  • Every flow run that writes to a catalog table creates a new Delta commit. No filename juggling. No version numbers in folder names.
  • The catalog UI lets you time-travel. Pick a table, pick a historical version, preview the data exactly as it was then.
  • Scheduled flows + Delta = cheap incremental processing. A daily flow can MERGE new rows into a Delta table without rewriting the whole thing.
  • Lineage is automatic. Because the catalog knows which flow wrote which Delta commit, it can show you the producer relationship without you wiring anything up.
  • Schema changes are tracked. Add a column to your output? Delta records the schema evolution; the catalog UI surfaces it.

You don’t have to know anything about Delta to get this. You write to a catalog table from a flow; Delta does the rest. But understanding what’s underneath helps when you want to do something more advanced — like consume the same Delta table from a Polars script outside Flowfile, or query it with DuckDB, or hand it off to a teammate who’s writing PySpark.

Try it in two lines

If you have Python installed, the lowest-friction Delta intro is:

import polars as pl

df = pl.DataFrame({"name": ["Alice", "Bob"], "spend": [100, 200]})
df.write_delta("./my_first_delta_table")

# Read it back
print(pl.read_delta("./my_first_delta_table"))

# Write a new version
pl.DataFrame({"name": ["Carol"], "spend": [300]}).write_delta(
    "./my_first_delta_table", mode="append"
)

# Time travel to the original
print(pl.read_delta("./my_first_delta_table", version=0))

That’s a complete Delta workflow. Look at the _delta_log/ folder it created — you’ll see your two commits as JSON files. Everything else Delta does is a more elaborate version of these two commits.

When you want to use Delta inside a real pipeline with a catalog, scheduling, and lineage on top, install Flowfile — the catalog will give you all of that with no extra setup.


Related reads: Why Your Data Should Stay on Your Laptop for the local-first catalog story, Polars vs Pandas in 2026 for the engine that pairs naturally with Delta, and Open-Source Alternatives to Alteryx for how Delta-backed catalogs change the visual ETL landscape.

Frequently asked questions

Is Delta Lake a database?
No. Delta Lake is a storage format — a layer on top of Parquet files plus a transaction log. It gives you database-like guarantees (ACID transactions, versioning, schema enforcement) without running a database server.
What's the difference between Parquet and Delta Lake?
Parquet is a single-file columnar format. Delta Lake is a table format: a collection of Parquet files plus a transaction log (_delta_log/) that records every change. With Parquet alone, overwriting a file loses the previous version. With Delta, every write is a new versioned commit you can roll back.
Do I need Spark to use Delta Lake?
No. The deltalake Python package (Rust-based, no JVM) lets you read and write Delta tables directly from Python. Polars supports Delta natively. DuckDB has a delta extension. Spark works too, but it's no longer required.
What is time travel in Delta Lake?
Time travel means you can query the table as it existed at any prior version or timestamp. Useful for debugging ('which write broke my report?'), auditing ('what did the table look like on Jan 1?'), and rollbacks ('restore to last Tuesday's version').
What does OPTIMIZE do in Delta Lake?
OPTIMIZE rewrites many small Parquet files into fewer larger files, which dramatically improves read performance. Z-ORDER is an optional flag that physically sorts data by chosen columns to make filtered reads even faster.
When should I use Delta Lake instead of plain Parquet?
Use Delta whenever you'll write to a table more than once, when multiple processes might write concurrently, when you need an audit history, or when schema will evolve over time. Plain Parquet is fine for a one-shot dump that nobody will ever rewrite.
How does Flowfile use Delta Lake?
Flowfile's data catalog stores tables as Delta. Every flow run that writes to a catalog table creates a new Delta version. The catalog UI lets you preview historical versions, schema changes, and lineage — all backed by the Delta transaction log.