Why Flowfile Is the Way It Is
The honest answer for why a 'visual ETL tool' has scheduling, a catalog, a SQL editor, dashboards, and ML nodes — none of it planned, none of it accidental.
People sometimes ask why a “visual ETL tool” has scheduling and a catalog and a SQL editor and saved visualizations and dashboards and ML nodes. I want to give the honest answer here, not the cleaned-up roadmap version. The honest answer is that the chain wasn’t planned, but it also wasn’t a series of accidents. It’s two things happening at once: user-shaped pressure showing up at specific points, and ideas I’d been chewing on in the background being ready when the pressure arrived.
If you walk back the chain, you end up at a conversation about kernels.
It started with a conversation about kernels
I’d been thinking about Flowfile as a visual ETL tool. Visual nodes for the common transformations. Code generation underneath. Clean. But someone made a point I hadn’t internalized: no matter how complete the visual surface gets, there’s always a custom case. An ML model that doesn’t fit any node. A weird API call. A domain-specific transformation that needs four lines of Python nobody else will ever write again. If users have to leave Flowfile for those cases, the whole “stay in one tool” pitch breaks. The tool needs an escape hatch into arbitrary code, treated as a first-class citizen, not as a hack.
That conversation changed the scope. I added kernels — execution nodes that run user-supplied Python (and later Polars-aware Python) inside a flow, the same way Jupyter runs cells. Custom code, anywhere in the graph, with the same input/output contract as any other node.
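To make "the same input/output contract" concrete, here is a minimal sketch. It assumes a node is anything with a `run` method that takes rows and returns rows. The names `Node`, `FilterNode`, `KernelNode`, and `run_flow` are hypothetical illustrations, not Flowfile's actual API, and a list of dicts stands in for a real frame.

```python
from typing import Callable, Protocol

Rows = list[dict]  # stand-in for a data frame


class Node(Protocol):
    def run(self, rows: Rows) -> Rows: ...


class FilterNode:
    """A 'visual' node: fixed behavior, configured by parameters."""

    def __init__(self, column: str, minimum: float) -> None:
        self.column, self.minimum = column, minimum

    def run(self, rows: Rows) -> Rows:
        return [r for r in rows if r[self.column] >= self.minimum]


class KernelNode:
    """Runs user-supplied Python source that defines transform(rows)."""

    def __init__(self, source: str) -> None:
        namespace: dict = {}
        # Assumption: a real system would isolate user code; this sketch does not.
        exec(source, namespace)
        self.transform: Callable[[Rows], Rows] = namespace["transform"]

    def run(self, rows: Rows) -> Rows:
        return self.transform(rows)


def run_flow(nodes: list[Node], rows: Rows) -> Rows:
    """Pass rows through each node in order; kernels are not special-cased."""
    for node in nodes:
        rows = node.run(rows)
    return rows


user_code = '''
def transform(rows):
    return [{**r, "total": r["price"] * r["qty"]} for r in rows]
'''

flow = [FilterNode("qty", 2), KernelNode(user_code)]
result = run_flow(flow, [{"price": 3.0, "qty": 1}, {"price": 2.5, "qty": 4}])
```

The point of the sketch is that `run_flow` never checks which kind of node it is running. Custom code sits in the graph on the same terms as a built-in transformation.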
Then the practical question: where do kernel outputs go? A kernel can produce a frame, a fitted model, a serialized object. That output has to live somewhere the next node in the flow can read it, and somewhere the user can find it again later. So I needed storage.
I’d been watching Unity Catalog
Here’s where I want to be honest about a thing the “natural path” framing tends to hide. I didn’t invent the catalog from first principles. I’d been watching Unity Catalog for a while. I’d thought about what a catalog does for a data platform — make tables discoverable, version them, track lineage, give other tools something to point at. I didn’t have a roadmap to build one. But the shape was in my head as a thing that worked.
So when kernel outputs needed somewhere to live, I didn’t build a flat key-value store. I built storage that could grow into a catalog, because the catalog shape was already there waiting for an excuse. The user-shaped need (kernel outputs need a home) met an idea I’d been carrying (catalogs are how good data platforms organize themselves), and the two combined into the early version of what’s now the Flowfile catalog.
If I hadn’t been thinking about Unity Catalog, I’d probably have built a kernel-specific storage layer and regretted it six months later when scheduling needed persistence too. The exposure mattered. The natural-path framing of “every feature fell out of the last one” is roughly true, but it leaves out the layer underneath: I’d seen what good looked like in adjacent tools and was making decisions accordingly, even when the immediate problem was small.
The database wants to be a catalog
Once you have storage tracking kernel outputs, you start putting more in it. Flow metadata. Source paths. Schemas. Run history. Lineage — which flow produced which artifact.
At some point the storage stops being “kernel outputs” and starts being a catalog, in the full sense. The same database that tracks “this kernel produced an output on Tuesday at 3am” can track “this table was produced by that flow on Tuesday at 3am.” Same information, observed from either side. The naming caught up with what the thing already was.
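A small sketch of "same information, observed from either side", using an in-memory SQLite table. The `runs` schema, the flow and artifact names, and the timestamps are all made up for illustration and are not Flowfile's real catalog schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE runs (flow_name TEXT, artifact TEXT, finished_at TEXT)"
)
# Illustrative rows only: one record per (flow run, artifact produced).
con.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [
        ("clean_orders", "orders_clean", "2024-01-02T03:00:00"),
        ("clean_orders", "orders_clean", "2024-01-09T03:00:00"),
        ("score_leads", "lead_scores", "2024-01-09T04:00:00"),
    ],
)


def outputs_of(flow_name: str) -> list[str]:
    """Producer side: what has this flow written?"""
    rows = con.execute(
        "SELECT DISTINCT artifact FROM runs WHERE flow_name = ? ORDER BY artifact",
        (flow_name,),
    ).fetchall()
    return [r[0] for r in rows]


def producer_of(artifact: str) -> tuple[str, str]:
    """Consumer side: which flow last produced this table, and when?"""
    return con.execute(
        "SELECT flow_name, finished_at FROM runs "
        "WHERE artifact = ? ORDER BY finished_at DESC LIMIT 1",
        (artifact,),
    ).fetchone()
```

Both functions read the same rows. Only the `WHERE` clause changes, which is roughly why the rename from "kernel storage" to "catalog" cost nothing.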
Scheduling was already half-built
Once the catalog tracks runs and outputs, scheduling becomes the obvious next step. If a flow can run once and the catalog knows what it produced, the only missing piece is triggering the run — on a cron, on a data change, on a manual button.
The state layer scheduling needs — run history, execution metadata, retry status — was already in the catalog. The scheduler just adds the trigger. If I’d tried to build scheduling first, I would have had to invent persistence for it. Building the catalog first meant the persistence was already there when the scheduler arrived. That wasn’t luck. It was the Unity-shaped instinct from earlier paying off later.
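Here is a rough sketch of "the scheduler just adds the trigger", assuming run history already exists somewhere the scheduler can read. `run_history`, `is_due`, and `tick` are hypothetical names, and a plain dict stands in for the catalog.

```python
from datetime import datetime, timedelta

# Hypothetical run history, as a catalog might already store it.
run_history: dict[str, list[datetime]] = {
    "clean_orders": [datetime(2024, 1, 9, 3, 0)],
    "score_leads": [],
}


def is_due(flow_name: str, every: timedelta, now: datetime) -> bool:
    """A flow is due if it has never run or its interval has elapsed."""
    runs = run_history.get(flow_name, [])
    if not runs:
        return True
    return now - max(runs) >= every


def tick(flows: dict[str, timedelta], now: datetime) -> list[str]:
    """Trigger every due flow and record the run in the existing history."""
    triggered = []
    for name, every in flows.items():
        if is_due(name, every, now):
            run_history.setdefault(name, []).append(now)  # stand-in for running the flow
            triggered.append(name)
    return triggered


flows = {"clean_orders": timedelta(days=1), "score_leads": timedelta(hours=1)}
first = tick(flows, datetime(2024, 1, 9, 12, 0))
second = tick(flows, datetime(2024, 1, 9, 12, 0))
third = tick(flows, datetime(2024, 1, 10, 3, 0))
```

Notice that the scheduler owns no state of its own in this sketch. It reads and appends to history that existed before it did.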
SQL was free
When the catalog has tables in it, people are going to want to query those tables. Not in a flow — interactively. Just open it up and write SELECT * FROM ....
Polars has SQLContext built in. You hand it a LazyFrame, you give it a table name, you write SQL. That’s the API. The hard part — parsing SQL, planning queries, executing them — Polars handles. All I needed on top was a UI and a thin layer mapping catalog tables to LazyFrames.
The SQL editor isn’t really a new capability. It’s the catalog already speaking the right language.
Visualization was almost free too
Once you can run SQL and get a table back, the next reasonable thing is to visualize the result. I didn’t want to build a charting library. Charting libraries are a tar pit.
But Flowfile already had a node for visual data exploration, built on GraphicWalker. The same component that already worked inside the visual editor could work on top of SQL results. Same component, different input.
So you write a SQL query in the catalog, drop the result into GraphicWalker, drag fields onto axes, and get a chart. The component didn’t change. It just had a new place to be useful.
Dashboards are just more than one chart
When people can save a chart, they want to combine charts. They want a layout. A filter that applies to all of them. A canvas where they can drop tiles.
That’s a dashboard. None of it is novel. The chart component already existed, the catalog already stored saved visualizations, and draggable layout primitives already existed in the canvas code. The dashboard view stitches together three things I already had.
It’s not a great BI tool. Real BI tools have sharing, embedding, alerting, row-level security, mobile rendering — and a decade head start. Flowfile’s dashboards are good enough to look at your own data. They’re not trying to be more.
A separate path: streaming
Scheduling unlocked another chain too, less obvious.
Once you can schedule a flow, you can ask: what if the schedule isn’t time-based? What if it’s data-based? What if a flow runs when its source data changes?
That’s change detection. And once you have change detection, certain integrations become natural that weren’t before. Specifically: keeping a Kafka topic in sync with an analytical dataset. You consume from Kafka, transform, write to a table. When the upstream offset advances, the flow runs. The table stays current with the topic.
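A minimal sketch of the offset check, with no Kafka client involved. An in-memory dict of partitions stands in for the topic, and `latest_offsets`, `has_new_data`, and `sync` are hypothetical names. A real consumer would get offsets from the broker instead.

```python
Topic = dict[int, list[dict]]  # partition -> messages, a stand-in for a Kafka topic


def latest_offsets(topic: Topic) -> dict[int, int]:
    """Offset of the newest message per partition (-1 if the partition is empty)."""
    return {p: len(msgs) - 1 for p, msgs in topic.items()}


def has_new_data(topic: Topic, committed: dict[int, int]) -> bool:
    """Change detection: has any partition advanced past what was last consumed?"""
    return any(
        latest > committed.get(p, -1)
        for p, latest in latest_offsets(topic).items()
    )


def sync(topic: Topic, committed: dict[int, int], table: list[dict]) -> int:
    """Consume only unread messages, append them to the table, advance offsets."""
    if not has_new_data(topic, committed):
        return 0
    written = 0
    for p, msgs in topic.items():
        start = committed.get(p, -1) + 1
        for msg in msgs[start:]:
            table.append({**msg, "partition": p})
            written += 1
        committed[p] = len(msgs) - 1
    return written


topic: Topic = {0: [{"id": 1}, {"id": 2}], 1: [{"id": 3}]}
committed: dict[int, int] = {}
table: list[dict] = []

first = sync(topic, committed, table)   # everything is new
second = sync(topic, committed, table)  # nothing changed, nothing runs
topic[1].append({"id": 4})
third = sync(topic, committed, table)   # only the new message is consumed
```

The trigger here is the comparison in `has_new_data`, not a clock. That is the only difference from a time-based schedule.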
This isn’t streaming-first ETL. I’m not trying to compete with Flink. But “keep an analytical table fresh from a Kafka source” is a sharp, common, useful problem, and the infrastructure to do it well already existed in the scheduler.
The thing about tables
This is the part I think about most.
A flow produces a table. That’s the obvious framing — pipeline runs, table appears.
Turn it around: the flow is the table’s definition. The procedure that produces the table is the table, observed from the consumer side. If you ask “what’s in this table?”, the honest answer is “whatever this flow produces when you run it.”
Once that’s true, you don’t have to materialize the table at all. You can store the plan and execute it on read. Filter pushdown crosses the flow boundary. Projection pushdown crosses the flow boundary. The table behaves like any other catalog table from the user’s perspective, but it’s the flow, run lazily.
That’s what virtual tables are in Flowfile. They deserve their own post, so I’ve written more about them separately in “A Flow Is a Table, Observed from the Other Side.” The point here is just that they didn’t require new infrastructure. They required noticing that the catalog already had everything they needed.
What this means
Best-in-class visual ETL is the goal. The bidirectional Python parity, the code generation, the visual debugging — that’s the part where I’m trying to win.
Everything else exists because it was the natural shape of the architecture. Scheduling, catalog, SQL, viz, dashboards: none of those are competing with the specialists. They’re trying to be good enough that you don’t need to leave the tool for small jobs. If you want best-in-class BI, use Superset. If you want production ML, use MLflow. If you want a real warehouse, buy Snowflake. And I’ll tell you when you’ve reached that point.
The test for whether to add a feature, I think, is: does it fall out of something I already have, or do I have to build new infrastructure for it? If it falls out, it’s probably worth doing. If I’d have to build a parallel system, it’s probably someone else’s problem.
But the heuristic only works if the things you’ve already built are good. The catalog was extensible because I’d seen Unity Catalog and built kernel storage with that shape in mind. The scheduler was easy because the catalog was good. The SQL editor was easy because Polars was already in the stack. None of that is luck. It’s exposure to the right adjacent ideas, applied early enough that the foundations could carry the weight of what came later.
So the real heuristic is two layers. The visible one: only build things that fall out of what you have. The invisible one underneath: read widely, watch what good systems look like, let the patterns sit in your head until a small problem gives you an excuse to use them. Without the second layer, the first layer just produces messes that compound.
That’s worked so far. We’ll see how long it lasts.
Related reads: “A Flow Is a Table, Observed from the Other Side” for the catalog flip in detail; “Three Releases In, Flowfile Stopped Being a Pipeline Tool” for the same arc told through release notes; “Your Lineage Graph Should Run Your Pipelines” for how scheduling collapsed into the catalog; and “Flowfile’s Kafka Source: A Streaming Story Told in Micro-Batches” for the streaming path off scheduling.
Frequently asked questions
- Is Flowfile trying to be a BI tool, an orchestrator, and an ML platform all at once?
- No. The visual ETL is the part trying to be best-in-class. The catalog, scheduling, SQL editor, saved visualizations, dashboards, and ML nodes exist because they fell out of the architecture, not because of a strategic decision. They're good enough that you don't need to leave the tool for small jobs. For real BI, real ML, or a real warehouse, use the specialists.
- Why does the catalog look so much like Unity Catalog?
- Because I'd been watching Unity Catalog for a while before kernel outputs needed somewhere to live. The shape was already in my head. When the user-shaped need arrived, I built storage that could grow into a catalog rather than a flat key-value store, because catalogs are how good data platforms organize themselves.
- What's the rule for adding a new feature?
- Two layers. The visible one: only build things that fall out of what's already there. If a feature would require a parallel system, it's probably someone else's problem. The invisible one underneath: read widely and watch what good systems look like, so when a small problem gives you an excuse, the right shape is already in your head.
- What's a kernel in Flowfile?
- An execution node that runs user-supplied Python (and Polars-aware Python) inside a flow, the same way Jupyter runs cells. The point is that there's always a custom case visual nodes can't cover — an ML model, a weird API call, a four-line transformation nobody else will ever need. Kernels make custom code first-class, not a hack.
- Where does this stop? Are dashboards Flowfile's competitor to Superset?
- No. Flowfile's dashboards are good enough to look at your own data. They're not trying to be Superset, Metabase, or Looker. Real BI tools have sharing, embedding, alerting, row-level security, mobile rendering — and a decade head start. If you need any of that, leave.