
Abstraction Is a Zoom Level on a DAG You Already Have

Most dataflow tools sit on a DAG. What separates code that feels easy from code buried in YAML isn't structure — it's how many nodes you can see.

TL;DR. Most of what we argue about in dataflow tooling (visual vs code, low-code vs full-code, notebook vs pipeline) sits on a DAG of data dependencies. Spreadsheet recalc, SQL planners, Polars LazyFrames, build systems, container layers, CI pipelines, Git: the dependency graph underneath is acyclic by design or by convention. What separates “easy” code from “buried in YAML” code isn’t the structure. It’s how many of those nodes you’re allowed to see at once. Abstraction isn’t a transformation of one thing into another. It’s a zoom level on a graph that already exists. (Outside dataflow, in general programs with mutable state, event loops, or reactive systems, the framing has known edges. More on that below.)

The DAG sits underneath

Pick almost any dataflow tool and the dependency graph is acyclic by design. A spreadsheet recalculates by topological sort and refuses to evaluate cyclic references. A SQL optimiser turns your query into a logical plan, then a physical plan, both of which are DAGs of operators. Polars hands you a LazyFrame and lets you inspect the plan with .explain() before any data moves. make and Bazel define their work as a DAG of build targets, where a cycle is a build error, not a feature. Container layers are a DAG by construction. A CI run is mostly a DAG of jobs (matrix builds and conditional reruns muddy the picture, but the core shape is acyclic). Git is, definitionally, a DAG of commits. Even out-of-order CPU execution rebuilds a tiny dependency DAG inside each instruction window.
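The spreadsheet case is small enough to sketch. What follows is a minimal illustration, not how any particular spreadsheet engine is implemented: the cell names and the `sheet` dictionary are made up, and Python's standard-library `graphlib` stands in for the recalculation engine.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical sheet: each cell maps to the set of cells its formula reads.
sheet = {
    "A1": set(),          # literal value
    "A2": set(),          # literal value
    "B1": {"A1", "A2"},   # =A1+A2
    "C1": {"B1"},         # =B1*2
}

# static_order() yields every cell after the cells it depends on,
# which is exactly the order a recalculation has to follow.
recalc_order = list(TopologicalSorter(sheet).static_order())

# Introduce a circular reference: A1 now reads C1, which reads B1,
# which reads A1 again.
cyclic_sheet = {**sheet, "A1": {"C1"}}
try:
    list(TopologicalSorter(cyclic_sheet).static_order())
    cycle_rejected = False
except CycleError:
    # The sorter refuses the cyclic case instead of guessing an order.
    cycle_rejected = True

print(recalc_order)
print(cycle_rejected)
```

The dependency relation plus the refusal of cycles is the whole trick; everything else a real engine does sits on top of that ordering.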

This isn’t a clever observation. It’s what you get when the relation “X needs Y” is formal enough to write down and the tool refuses the cyclic case. Once that relation is recorded explicitly and cycles are disallowed, what you have is a DAG. The interesting part is that the DAG is the same whether the user sees one node or thirty.

It also isn’t everything. The static call graph of an arbitrary program isn’t a DAG — recursion and mutual recursion put cycles in. An event loop dispatching to itself isn’t a DAG. A program whose control flow depends on mutable shared state isn’t easily reducible to one. The framing here is narrower: the work that resolves to data dependencies — and a lot of the tooling people compare resolves to exactly that — has a DAG underneath it.

The same job, four button counts

Take a single outcome — Kafka topic to a warehouse table — and look at how it shows up at different layers of tooling.

One button. A managed connector: Fivetran, Airbyte Cloud. Pick the topic, pick the destination, click sync. The DAG you see is one node. Underneath it there are roughly thirty: schema discovery, deserialiser, dead-letter routing, watermark tracking, retry policy, idempotent write, lineage emit. None of them are yours.

Five buttons. A configured connector — same product class, but you’re picking the deserialiser, choosing how to handle nulls, mapping fields. Five visible nodes; the other twenty-five are still hidden, still maintained for you.

Ten buttons. A self-hosted consumer with a transform step in front of the warehouse write. Now you can see the consumer group offsets, the schema-registry lookup, the transformation, the upsert. You’re maintaining the runtime and the topology. The DAG has roughly ten nodes you can point at.

Twenty buttons. You’re writing the consumer. Argo schedules it, OpenTelemetry traces it, PagerDuty wakes you up when it falls over, Airflow re-runs the daily backfill. Every node in the original thirty-node graph is now a thing you can debug, and a thing you have to debug. (Those four tools are illustrative, not prescriptive — the point is the count of visible nodes, not the brand of each one.)

The DAG is the same in all four cases. What differs is which nodes are atomic to the person doing the work.

Abstraction is choosing which nodes are atomic

That’s the move worth naming. Raising the abstraction level means collapsing a subgraph into one node. Dropping the level means expanding it back. The work doesn’t disappear when it gets collapsed; it just moves out of the user’s seat into someone else’s.
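Collapsing a subgraph can be written down directly. This is a sketch under assumptions: the `collapse` helper and the node names are hypothetical, the graph is a plain dictionary of dependencies, and the group being collapsed is assumed to be convex (no path leaves the group and comes back in), otherwise the merge could introduce a cycle.

```python
def collapse(graph, members, name):
    """Return a new graph with `members` merged into a single node `name`.

    `graph` maps each node to the set of nodes it depends on.
    Assumes the group is convex, so the result is still acyclic.
    """
    members = set(members)
    collapsed = {name: set()}
    for node, deps in graph.items():
        # Any edge pointing into the group is re-pointed at the merged node.
        remapped = {name if dep in members else dep for dep in deps}
        if node in members:
            # Edges inside the group vanish; edges to the outside are kept.
            collapsed[name] |= remapped - {name}
        else:
            collapsed[node] = remapped
    return collapsed


# Hypothetical slice of the Kafka-to-warehouse pipeline.
pipeline = {
    "consume": set(),
    "deserialise": {"consume"},
    "dead_letter": {"deserialise"},
    "transform": {"deserialise"},
    "upsert": {"transform"},
    "lineage": {"upsert"},
}

# One-button zoom: the whole pipeline is a single atomic node.
one_button = collapse(pipeline, pipeline.keys(), "sync")

# A middle zoom: only the connector internals are hidden.
zoomed = collapse(pipeline, {"consume", "deserialise", "dead_letter"}, "connector")

print(one_button)
print(zoomed)
```

Nothing is deleted in either call. The hidden nodes still exist in `pipeline`; the collapsed view only changes which of them the reader is asked to look at.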

This reframes when abstractions fail. The SaaS connector is fine — until the upstream schema changes, the deserialiser starts emitting nulls in a column nobody told you about, and the one node you can see has no surface for the thing that’s actually wrong. The collapsed subgraph turned into a lie. In the other direction, the hand-rolled twenty-button consumer is fine — until you need five pipelines a week, at which point the visible-everything zoom becomes a five-times-the-work liability. Same tooling, different demands, different cost.

There’s a useful distinction sitting next to this one: Stewart Brand’s pace layers idea, that in any working system different layers change at different speeds. A semantic node like “sync sales orders” changes once a year. Offsets, retries, deserialisers change every release of the upstream library. A fixed-zoom tool forces both to move at the same speed, which is why the slow-moving layer feels rigid and the fast-moving layer feels exposed. The right zoom for the semantic node is not the right zoom for the implementation node.

Tools that get the zoom right

The tools that age well let you change the zoom. make has shipped this since the seventies: make -d prints a debugging trace that shows why the build decided to rebuild the thing it rebuilt. Bazel goes further: every artifact, including individual object files, is a node, and the graph is queryable. You can ask “what depends on this proto file?” and get an answer that’s true at the byte level.
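The “what depends on this?” question is a reverse walk over the same graph. Bazel answers it through its own query language (the `rdeps` function); the sketch below is not Bazel, only the same question asked of a plain dictionary. The target labels and the `reverse_deps` helper are hypothetical.

```python
from collections import deque

# Hypothetical build graph: target -> its direct dependencies.
build_deps = {
    "//api:orders.proto": set(),
    "//api:orders_py_proto": {"//api:orders.proto"},
    "//svc:ingest": {"//api:orders_py_proto"},
    "//svc:report": {"//svc:ingest"},
    "//docs:site": set(),
}


def reverse_deps(graph, target):
    """Every node that depends on `target`, directly or transitively."""
    # Invert the edges once: dependency -> direct dependents.
    dependents = {node: set() for node in graph}
    for node, deps in graph.items():
        for dep in deps:
            dependents[dep].add(node)

    # Breadth-first walk along the inverted edges.
    found, queue = set(), deque([target])
    while queue:
        for parent in dependents[queue.popleft()]:
            if parent not in found:
                found.add(parent)
                queue.append(parent)
    return found


affected = reverse_deps(build_deps, "//api:orders.proto")
print(sorted(affected))
```

A query like this is only possible because the fine-grained nodes are kept around rather than thrown away once the coarse view has been drawn.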

Polars does the same trick at three levels on the same LazyFrame. .explain(optimized=False) shows the logical plan as you wrote it. .explain(), which optimises by default (optimized=True), shows what the optimiser is actually going to run: pushed-down filters, pruned projections, reordered joins. .collect() runs it, and .profile() runs it while reporting per-node timings. Three zooms, same DAG. You don’t have to pick.

Flowfile lands the same way. The canvas shows one node per logical step — Read Parquet, Filter, Group By. Each node binds to a Pydantic model whose body is a Polars expression DAG you can open and inspect. Underneath that, flow.to_polars() emits the script and pl.LazyFrame.explain(optimized=True) shows the physical plan after the optimiser has had its way. Three zooms, one flow. Nothing collapses permanently; nothing expands forever.

Where this breaks

Not every runtime is a static DAG, and pretending otherwise is the kind of overclaim that makes a framing useless. Reactive systems with backpressure rewire the graph per event — the same source can be a different shape of consumer next tick, depending on whether the buffer is full. Watermarked stream processors carry windowed state across what looks like a DAG but isn’t, because the dependency relation now includes time. Anything with mutual recursion at runtime — coroutines waiting on each other, an event loop dispatching to itself — has a cycle that doesn’t unroll.

The honest claim is narrower: deferred-but-finite work has a DAG underneath it, and the zoom level is a choice. Perpetual-motion event loops live somewhere else. If the framing breaks for the workload in front of you, the framing is the thing to drop, not the workload.

The zoom is the thing

Most fights about good tooling are actually fights about zoom level. Visual versus code: a zoom argument. Low-code versus full-code: a zoom argument. Notebook versus pipeline: a zoom argument. The honest question isn’t “which abstraction is right?” but “which nodes should be atomic for this job, this week, this person?”

The tools that respect that are the ones that let you change the answer without changing tools.

The DAG is the unit of work. The zoom is taste. Don’t let the tool pick the zoom for you.


Related reads: Direction Stopped Mattering: Code and Graph in One Loop for two zoom levels of the same Flowfile graph, Your Lineage Graph Should Run Your Pipelines for what happens when the DAG itself becomes the runtime, and Virtual Flow Tables: When a Catalog Entry Is a Pipeline for the zoom-collapse mechanic in code.

Frequently asked questions

Are you really claiming everything is a DAG?
No. The claim is narrower: most of the dataflow tools people argue about (ETL tools, query planners, build systems, schedulers, container build chains, version control) sit on a DAG of data dependencies. Outside dataflow (event loops, reactive systems, programs whose control flow depends on mutable shared state), the framing has known edges. See *where this breaks*.

What about cycles? Loops? Recursion?
Some unroll, some don't. A loop over independent items expands into N independent nodes; a loop accumulating into a single state becomes a chain. But a `while` loop whose termination depends on runtime state doesn't unroll without running it, and direct or mutual recursion stays cyclic in the static call graph. The post is about data-dependency DAGs, not function-call graphs of arbitrary programs.

Aren't function call graphs DAGs anyway?
Generally no. Recursion (direct or mutual) puts cycles in a static call graph. The data-dependency graph in a dataflow program (*the value of node B depends on the value of node A*) is what's reliably a DAG. The function-level call graph of an arbitrary program isn't, and the post isn't about that.

How does Flowfile actually expose multiple zoom levels?
Three of them, on the same flow. The canvas shows one node per logical step. Each node has an inspectable expression DAG underneath: the Polars expressions the form is bound to. And the physical plan you see in `.explain(optimized=True)` is a third zoom: filter pushdown, projection pruning, the optimised operator order. Same flow, three views.

Where does this framing break?
Reactive systems with backpressure, watermarked stream processors, and any runtime where the topology rewires itself per event don't unroll cleanly into a static DAG. The framing is about deferred-but-finite work, not perpetual-motion event loops. Treat it as a useful default that has known edges, not a universal law.