Polars vs DuckDB in 2026: Which to Pick for Local Analytics
Both are fast, columnar, and built on Arrow. Picking between Polars and DuckDB is less about speed and more about the shape of your work. An honest comparison.
TL;DR. Polars and DuckDB are the two best things to happen to local analytics in the last five years. They are also more similar than the marketing suggests — both columnar, both Arrow-based, both vectorised, both fast on the same workloads. The choice is less about benchmarks and more about which shape your work has: a DataFrame API in a Python pipeline (Polars) or a SQL engine you can embed anywhere (DuckDB). This post is the honest comparison.
The two-tool problem nobody warns you about
You sit down to do some analysis on a 30 GB Parquet folder. The Pandas instinct says “load it.” The instinct lasts about four seconds before the OOM kill arrives. Two tools have entered the conversation in the last few years to fix that. They look almost identical from a distance and turn out to be subtly different up close.
Polars is a DataFrame library written in Rust, with first-class Python and a Rust API. You write pl.scan_parquet(...).filter(...).group_by(...).collect() and the lazy planner figures out the rest.
DuckDB is a single-binary embedded SQL database. You write SELECT region, sum(amount) FROM 'orders.parquet' WHERE country = 'NL' GROUP BY region and DuckDB plans, executes, and hands you the rows back.
Both run on Apache Arrow. Both push down filters into Parquet. Both scale comfortably on a laptop into the tens of gigabytes. The interesting question isn’t which is faster. It’s which one fits your week.
What they actually share
Worth saying up front, because the rest of the post is about differences. The shared core is bigger than people expect:
- Arrow memory. Both engines hold data in Arrow’s columnar layout. That’s why handing a DataFrame from one to the other is zero-copy.
- Vectorised execution. Both process data in column-shaped batches with SIMD where the platform allows. The “doing the work in tight loops over packed arrays” part is the same idea on both sides.
- Lazy planning. DuckDB has always been a query optimiser. Polars caught up with LazyFrame. Both rewrite your query before running it: they push filters past joins, prune unused columns, and fuse adjacent projections.
- Parquet-native. Both can read a Parquet folder directly without an import step. Both honour partition pruning. Both stream through files larger than memory.
- Embeddable. Neither requires a server. pip install polars or pip install duckdb and the engine is in your process.
If you start with the wrong instinct — “DuckDB is the database, Polars is the DataFrame, the database must be slower because it has to parse SQL” — you’ll be wrong on the first benchmark you run.
Where the shapes diverge
The interesting part. Five places where the choice matters in practice.
1. The API is the product
DuckDB’s API is SQL. You can also use it through a Python relational API (duckdb.sql(...), .df(), .arrow()), but the canonical way to express logic is a query string. SQL is universally readable, ages well, and is what most analysts already know.
Polars’ API is method-chained DataFrame expressions. df.filter(pl.col("x") > 10).group_by("region").agg(pl.col("amount").sum()). Closer to Pandas but cleaner; closer to Spark but lazier; closer to dplyr if that’s your reference. It’s the right shape for code that lives in a Python file next to the rest of your business logic.
The distinction sounds taste-based. It’s actually load-bearing. SQL is a great language for describing the result you want. A DataFrame API is a great language for describing the steps to get there. Most pipelines need both at different points. Most teams pick one as the centre and use the other as a guest.
2. State and embedding
DuckDB is a database. It has a file format (.duckdb), system catalogs, persistent state. You can CREATE TABLE, you can open the same database file from a different process (one read-write process, or several read-only ones), you can run a query against a table that exists across sessions. The recent DuckLake catalog adds an open metadata layer on top.
Polars has no state. A DataFrame is a value, a LazyFrame is a plan, and when your Python process exits, both vanish. Persistence is delegated: write to Parquet, write to Delta, write to whatever your storage layer is. That’s a feature when you want a stateless transformation library and a friction point when what you wanted was “just give me a database to talk to.”
This is the deepest design difference between them. DuckDB is a place where your data lives. Polars is a tool that processes data on the way through.
3. SQL completeness
DuckDB has a broad, PostgreSQL-flavoured SQL dialect: window functions, recursive CTEs, a large library of aggregate and scalar functions, and custom UDFs in Python or C++. If you can write it in PostgreSQL, you can probably write it in DuckDB.
Polars has SQL too — there’s an SQLContext you can register tables against — but the surface is partial. Most analytics queries work; some Postgres-flavoured features don’t. The Polars SQL is best understood as a transpilation layer onto the expression API, not a separate full SQL implementation.
If your codebase or your team is SQL-first, DuckDB will be the lower-friction tool. If you’re going to write Python anyway, the Polars expression API is a strict superset of what you can do in its SQL.
4. Streaming and very large data
Both engines can process more than RAM, but they get there differently.
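Polars has a streaming engine. It runs the plan in chunks, operator by operator, holding a small working set in memory. Recent 1.x releases select it with LazyFrame.collect(engine="streaming"); the older collect(streaming=True) spelling is deprecated. Which engine runs by default depends on the release, so check the documentation for the version you have installed.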
DuckDB has its own streaming model — it uses out-of-core algorithms when the working set exceeds memory, spilling to disk via its temp directory. From the user’s perspective: just run the query, DuckDB handles the rest.
Both work. Polars gives you more direct control over the streaming boundary. DuckDB hides it more. Neither is a Spark-replacement; both will fall over on petabyte data.
5. Ecosystem fit
Polars sits inside the Python data ecosystem. df.to_pandas(), df.to_numpy(), scikit-learn pipelines, PyTorch tensors, plotting libraries — the integrations are direct because the data structure is a DataFrame.
DuckDB sits at the boundary of multiple ecosystems. There are clients for Python, R, Java, Rust, JavaScript (DuckDB-WASM), Go, .NET, even mobile. If your tool needs an embedded analytical engine and the host language isn’t Python, DuckDB is often the only realistic answer.
This is why a lot of products end up with DuckDB under the hood and a different API on top. It’s the SQLite of analytics.
A side-by-side benchmark, with caveats
Numbers from a recent run on a 64 GB MacBook over a 12 GB Parquet dataset (the streaming row uses a larger 80 GB input). Treat them as directional.
| Workload | DuckDB 1.x | Polars 1.x (lazy) | Notes |
|---|---|---|---|
| Filter + count, single column | ~120 ms | ~110 ms | Both push filters into Parquet |
| Group-by sum, 8 keys | ~1.4 s | ~1.6 s | DuckDB slightly ahead on this shape |
| Join 50M × 50M, 1 key | ~5.8 s | ~5.2 s | Polars hash-join wins narrowly |
| Window function over 100M rows | ~2.1 s | ~2.4 s | Window APIs differ; results equivalent |
| Streaming pipeline over 80 GB | ~85 s | ~78 s | Both spill; Polars has tighter control |
Two takeaways. First, the speed gap is small enough that “which is faster” is almost never the right question. Second, the variance you’ll see between query shapes inside one engine is bigger than the variance between engines on the same query.
Where each one wins
Pick DuckDB when:
- SQL is the language your team thinks in.
- You need an embedded database — a thing with a file, a catalog, persistent tables.
- You want a single binary with no language runtime, or you’re in a non-Python host.
- The work is heavy on full-SQL features: recursive CTEs, advanced window frames, exotic aggregates.
- You’re building a tool and you need analytical SQL inside it (the SQLite-of-analytics use case).
Pick Polars when:
- You’re writing Python pipelines and want the engine to fit in the same file as your logic.
- You care about the DataFrame ergonomics for ML / scientific Python interop.
- You want a lazy plan that you can inspect with .explain() and reason about as data flows.
- You want explicit control over the streaming boundary.
- You’re going to read or write Delta / Iceberg tables — the integrations on the Polars side are tighter.
Use both when:
- Your reporting/BI layer is SQL-first; your transformation layer is Python. Land Parquet between them and let each tool do what it’s best at. Zero-copy Arrow handoff is real.
Where Flowfile sits in this picture
Flowfile is built on Polars. Every visual node compiles to Polars expressions, the catalog runs on Delta Lake, and the export feature emits standalone Polars code with no Flowfile dependency. The choice was a fit-the-shape decision: most Flowfile users are writing Python, and the export-as-Polars story matters more than the SQL-completeness story.
DuckDB still shows up adjacent to Flowfile in real workflows. People run DuckDB to ingest something messy, write Parquet, and pick it up in a Flowfile graph. Or they read a Flowfile catalog table from DuckDB through its Delta extension. That’s the tools-being-good-citizens version of “which one wins”: you don’t pick.
If you’ve never used either, the Flowfile demo runs a Polars build in your browser via WASM — load a CSV, drag a few nodes, watch how a group-by behaves on real data. It’s the fastest way to feel the engine.

What I’d tell a friend choosing today
If you’re starting a new local-analytics project in 2026 and you have no constraint pulling you one direction or the other, pick the engine whose API you want to write in. They’re close enough on performance that the day-to-day ergonomics dominate. SQL people will be productive in DuckDB faster than they’d want to admit. Python people will be productive in Polars faster than they’d want to admit. Neither choice will age badly.
The decision you don’t have to make is “one or the other forever.” Most teams I see end up with both, talking through Parquet, sharing Arrow buffers, with each tool doing what it’s best at. That’s a sign the ecosystem is healthy, not a sign you couldn’t make up your mind.
Related reads: Polars vs Pandas in 2026 for the older sibling comparison, Demystifying Delta Lake for the storage layer underneath, and Big Data Is Dead for why local engines like these matter again in the first place.
Frequently asked questions
- Is DuckDB faster than Polars?
- On most analytics workloads (group-bys, joins, scans over Parquet) they land within ~2x of each other in either direction depending on the query shape. Both run on Apache Arrow, both vectorise, both push down filters and projections. When one is consistently slower than the other, the cause is usually a bug or a missing optimisation, not a fundamental gap.
- Can I use DuckDB and Polars together?
- Yes, and a lot of people do. Polars and DuckDB share Arrow buffers, so handing data between them is zero-copy. A common shape is DuckDB for the SQL-heavy reporting layer, Polars for the in-Python transformation logic, with Parquet on disk as the contract between them.
- Does DuckDB replace Pandas?
- Partly. DuckDB replaces the Pandas roles where you'd write SQL-shaped logic — aggregations, joins, ad-hoc analysis. It doesn't replace the parts of Pandas that hand a DataFrame to scikit-learn or matplotlib. Polars covers more of that surface because it's a DataFrame library, not a database engine.
- Should I use DuckDB or Polars in 2026 for new projects?
- Pick DuckDB if SQL is your primary language, your team already thinks in queries, or you want a single binary with no language runtime. Pick Polars if you write Python pipelines, want a DataFrame API that interops with the ML ecosystem, or value the lazy-plan / streaming story for very large data.
- Does Flowfile use Polars or DuckDB?
- Polars. Every node in Flowfile compiles to Polars expressions, and the catalog uses Delta Lake on top of Parquet. The SQL editor runs through Polars SQLContext. DuckDB is well-loved and well-engineered — Polars was the closer fit for a tool whose primary user is writing Python and whose primary export is a Polars script.