Logic Is a Table, Observed from the Other Side
Logic produces a table, but the logic is the table — observed from the consumer side. What that flip changes about catalog design, and what I had to delete after I noticed it.
This is one of those ideas that’s obvious in retrospect, and I think every catalog-shaped tool eventually arrives at it. I want to write down how I got there, because the path matters more than the conclusion.
The starting framing
Logic produces a table. You build a pipeline visually, you wire up a write node at the end, you run it, and a table shows up in your catalog. That’s how I’d describe it to a new user. It’s how the docs describe it. It’s how every ETL tool describes it.
The framing has a clean producer/consumer split. Logic produces. Tables sit. Consumers read. Three roles, three concepts, easy to teach.
The flip
But sit with it for a second.
When you ask “what’s in this table?”, the honest answer isn’t “rows of data.” The honest answer is “whatever this logic produces when you run it.” The data is downstream of the logic. The logic is what defines what the table contains. The table is just the result of executing the logic at a specific time, frozen on disk.
So really, the logic is the table’s definition. The materialized parquet file is a snapshot. The logic is the canonical thing. The snapshot is a derived artifact.
This isn’t a deep insight on its own. SQL has thought about views vs. tables for fifty years. dbt has materialized vs. ephemeral models. The split between “definition” and “result” is well-trodden.
What’s new — at least new for me — is realizing the visual graph of transformations is exactly the same kind of object as a SQL view definition. It’s a query, expressed in a different syntax. The catalog had been treating logic and tables as separate things. They’re not separate things. They’re the same thing, observed from either side.
Reading one output from a multi-output graph
Here’s the example that made it click for me. Imagine a graph shaped like this:
Input → Transform_1 → Output_1
      ↘ Transform_2 → Output_2
One source, two independent branches, each with its own transform and its own output. Materialized, that’s two tables on disk, both kept up to date by re-running the whole graph every time anything changes.
But if you treat the logic as a query, you don’t have to. If a consumer reads Output_1, the LazyFrame behind it only references its own ancestors, so the plan only contains the path Input → Transform_1 → Output_1. Transform_2 is never planned, never executed, never touched. Within that path, Polars’s optimizer also prunes the columns and rows that don’t contribute to the requested output before it runs anything.
This is just normal lazy-evaluation behavior. But applied across what would otherwise be “two separate tables in the catalog” — across the boundary between definitions — it stops being a table-and-table relationship and becomes a single graph that gets sliced different ways depending on what you ask for.
The branch you don’t read costs nothing.
You don’t need a cache layer
This is the part where I made things harder than they needed to be, and then realized I had.
The first version of virtual tables in Flowfile had two modes: an “optimized” mode that stored a serialized LazyFrame and used it on read, and a “standard” mode that fell back to executing the producer logic. To keep the optimized mode honest, I built source-version tracking — snapshot the upstream Delta versions when you optimize, check on every read, fall back if anything moved. It worked. It was also unnecessary.
The reason it was unnecessary is that there’s no actual cache to keep coherent. A Polars LazyFrame is a plan, not a result. Running the logic doesn’t produce data — it produces a plan. The data only appears when something calls .collect(). So the question “is the cached plan stale?” is the wrong question. There’s no cache. Every read of a virtual table just executes the plan from scratch, against whatever the upstream sources currently look like.
You don’t need to track source versions because you’re not caching anything. You don’t need an “optimized vs. standard” split because there’s no fallback path. There’s just the plan, executed lazily, every time. Polars’s optimizer does the optimization work on each call — it’s fast, it’s correct, and it doesn’t care whether the upstream has advanced since last time.
So the whole machinery I built around “keeping the optimized plan honest” goes away. The catalog stores the logic. A read asks for the logic’s LazyFrame. The consumer composes filters and projections on top. Polars optimizes the whole thing as one query and executes it. That’s the entire mechanism.
The Polars piece
None of this works without the right execution layer.
A Polars LazyFrame is the right serialization unit because it has three properties at once: it’s serializable, it composes with other LazyFrames, and Polars’s query optimizer treats the composition as a single query. That last one is the trick. Most “deferred query” systems break down at composition: you can defer one query, but stacking a filter on top of a deferred query loses the optimization opportunity. Polars doesn’t lose it. The optimizer sees through the composition and plans across it.
So when Flowfile hands the catalog a LazyFrame for a piece of logic and a consumer composes .filter(col("region") == "EU").select("id", "amount") on top, the resulting plan pushes the filter and projection down through the logic’s joins and aggregations, wherever that is safe, the same way it would in a single LazyFrame. There’s no boundary in the plan. There’s just one plan.
I didn’t build that. Polars built that. I just noticed it would work.
What about lazy blockers
Some nodes can’t be expressed as pure Polars lazy operations — a custom Python script, a Docker kernel, a node that depends on an artifact that needs eager execution. The natural worry is that these break the abstraction. They don’t. They just add a checkpoint in the middle of the plan.
When the plan reaches a non-lazy node, that node executes eagerly, writes its result to an IPC file, and the next node in the plan reads from that IPC file as a fresh LazyFrame. The plan is no longer a single uninterrupted Polars query — it’s a Polars query, an eager Python step, another Polars query, glued together with IPC. But the type of the end result is still a LazyFrame. The composition still works. A consumer can still filter and project on top, and at least the post-checkpoint portion of the plan optimizes properly.
The downside is real: lazy blockers slow logic down, because the optimizer can’t push filters back across the eager step. The upside is that they don’t change the shape of the system. Every piece of logic still produces a LazyFrame. Every catalog read still asks for that LazyFrame. The interface doesn’t fork into two cases.
What this taught me about catalog design
The thing I keep coming back to is that I didn’t add virtual tables as a feature. I noticed the catalog already contained the information needed for virtual tables — logic definitions, source lineage, run history — and the only piece missing was the rule that said “you can read logic as if it were a table.” Once that rule is in place, virtual tables aren’t a new pillar. They’re the catalog seeing what was already in the graph.
I also learned that the simplest version of an idea often sits underneath a more complicated version you built first. The version-tracking layer wasn’t wrong (it produced correct results), but it was solving a problem that didn’t exist. I built it to keep an optimized plan in sync with its sources. The trouble is, a LazyFrame isn’t a cache; it’s a plan. Plans can’t go stale, because they don’t hold data; they describe what to do. Only results can drift, and there are no results until someone calls .collect(). So every read was already going to be fresh. The primitive I was working with already had the property I was building machinery to enforce.
The lesson, I think, is that catalog design rewards noticing more than building. The interesting features come from observing what your existing data already implies, not from adding new tables. Lineage was there. Logic definitions were there. The lazy-evaluation semantics were there. All I had to do was connect them in a way that read like one feature instead of four — and then delete the parts I didn’t need.
That’s not a deep architectural principle. It’s just the observation that if your data model is honest, the features it supports are mostly already there, waiting. And sometimes the work is removing the scaffolding you put up before you realized you didn’t need it.
Related reads: Virtual Flow Tables: When a Catalog Entry Is a Pipeline for the implementation reference behind this idea, Why Flowfile Is the Way It Is for the broader architectural arc this flip sits inside, Your Lineage Graph Should Run Your Pipelines for the catalog-as-runtime side of the same coin, and Demystifying Delta Lake for the storage layer that the upstream-version tracking turned out not to need.
Frequently asked questions
- How is this different from the existing virtual tables post?
- The [virtual tables post](/blog/virtual-flow-tables) is the mechanism reference: what they are, how they're stored, the API, when to use which. This post is the realization story behind them: how the framing flipped from 'logic produces table' to 'logic *is* table', and why the over-engineered first version had to be deleted once I saw it.
- Why doesn't a virtual table need a cache layer?
- Because there's nothing to cache. A Polars LazyFrame is a plan, not a result. Running the logic doesn't produce data — it produces a plan. The data only appears when something calls `.collect()`. Every read of a virtual table just executes the plan from scratch against whatever the upstream sources currently look like. Plans don't go stale; only results do, and there are no results until someone collects.
- What about Python script nodes that can't be lazy?
- They become a checkpoint in the middle of the plan. When the plan reaches a non-lazy node, that node executes eagerly, writes its result to an IPC file, and the next node reads from that IPC file as a fresh LazyFrame. The plan is no longer one uninterrupted Polars query, but the *type* is still a LazyFrame, the composition still works, and at least the post-checkpoint portion still optimizes properly. Lazy blockers slow logic down; they don't fork the interface.
- Why does Polars's optimizer make this work where other deferred-query systems don't?
- Composition. Most 'deferred query' systems break down when you stack new operations on top of a deferred query — you can defer one query, but the optimizer can't see across the boundary. Polars's optimizer treats the whole composition as a single query and pushes filters and projections through the logic's joins and aggregations the same way it would in a single LazyFrame. There's no boundary in the plan.
- If I read Output_1 from a multi-output graph, does Transform_2 still run?
- No. The LazyFrame behind Output_1 only references its own ancestors, so the plan only contains the path Input → Transform_1 → Output_1. Transform_2 is never planned, never executed, never touched. Within that path, Polars's optimizer also prunes columns and rows that don't contribute to the requested output before it runs anything. The branch you don't read costs nothing.
- What's the catalog-design lesson here?
- Catalog design rewards noticing more than building. The interesting features come from observing what your existing data already implies, not from adding new tables. Virtual tables weren't a new pillar — they were the rule 'you can read logic as if it were a table' applied on top of information the catalog already had: logic definitions, source lineage, run history, lazy-evaluation semantics.