Fuzzy Match in Polars: Joining on Dirty Data with Flowfile
Every real dataset has 'Acme Corp' vs 'ACME Corporation' somewhere. Here's how Flowfile's fuzzy_join — built on Polars and Levenshtein — handles it without a regex in sight.
TL;DR. Every real dataset has Acme Corp vs ACME Corporation vs acme corp. somewhere. Exact-match SQL joins drop all three mismatches on the floor. Flowfile ships a fuzzy_join method — built on pl-fuzzy-frame-match and Polars — that matches similar strings using Levenshtein similarity, returns the similarity score as a column so you can audit the match, and runs over real-world data sizes without falling back to a Python loop. It’s a one-method fix for the single most common “why won’t this join?” problem in ETL.
The problem exact-match joins can’t solve
Every data-cleaning project hits this wall. You have two tables:
```text
orders                      customers
---------                   ---------
customer_name, amount       customer_name, region
"Acme Corp", 1200           "ACME Corporation", "EU"
"Globex LLC", 890           "Globex, LLC", "US"
"Initech", 450              "INITECH", "US"
```
You need to attach region to each order. You write:
```python
orders.join(customers, on="customer_name", how="left")
```
Every row fails to match. The keys are “the same” — to a human. To an exact-match string comparison, they’re three completely unrelated values.
The usual responses:
- Lowercase everything, strip whitespace, normalise punctuation. Catches 40% of cases. Breaks on `Acme Corp` vs `ACME Corporation`.
- Custom Python regex cleanup. Catches 60%. Takes a week to write. Breaks silently on the edge cases you didn’t think of.
- Give up and ask the source systems to clean their data. They won’t.
The honest answer is: the exact-match join is the wrong tool. You need similarity, not equality.
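To see why the first fix falls short, here is a minimal normalisation sketch in plain Python (the `normalise` helper is hypothetical, just the usual lowercase-and-strip-punctuation first attempt):

```python
import re

def normalise(name: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace: the usual first attempt.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", name.lower())).strip()

# Punctuation-only differences normalise away...
print(normalise("Globex LLC") == normalise("Globex, LLC"))      # True
# ...but "Corp" vs "Corporation" is a real textual difference, not noise.
print(normalise("Acme Corp") == normalise("ACME Corporation"))  # False
```

Normalisation removes noise; it cannot bridge genuinely different spellings. That gap is what similarity scoring is for.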
What Flowfile’s fuzzy_join does

Flowfile’s FlowFrame has a fuzzy_join method. The API is deliberately similar to a regular join:
```python
import flowfile as ff
from pl_fuzzy_frame_match.models import FuzzyMapping

orders = ff.FlowFrame({
    "id": [1, 2, 3],
    "customer_name": ["Acme Corp", "Globex LLC", "Initech"],
    "amount": [1200, 890, 450],
})
customers = ff.FlowFrame({
    "customer_name": ["ACME Corporation", "Globex, LLC", "INITECH"],
    "region": ["EU", "US", "US"],
})

# Threshold 50 (out of 100), so the weakest match below
# (Acme Corp vs ACME Corporation at 0.56) still clears it.
mapping = FuzzyMapping("customer_name", threshold_score=50)
enriched = orders.fuzzy_join(customers, [mapping]).collect()
print(enriched)
```
What you get back has the matched rows from both sides — but also a column telling you how good each match was:
```text
shape: (3, 6)
┌─────┬───────────────┬────────┬─────────────────────┬────────┬───────────────────────────────────────────────┐
│ id  │ customer_name │ amount │ customer_name_right │ region │ customer_name_vs_customer_name_right_leven... │
╞═════╪═══════════════╪════════╪═════════════════════╪════════╪═══════════════════════════════════════════════╡
│ 1   │ Acme Corp     │ 1200   │ ACME Corporation    │ EU     │ 0.56                                          │
│ 2   │ Globex LLC    │ 890    │ Globex, LLC         │ US     │ 0.82                                          │
│ 3   │ Initech       │ 450    │ INITECH             │ US     │ 1.00                                          │
└─────┴───────────────┴────────┴─────────────────────┴────────┴───────────────────────────────────────────────┘
```
Three things to notice:
- The join succeeded. You got your regions attached.
- The similarity score is right there in the output.
`Initech` vs `INITECH` matches at 1.00 after case normalisation; `Acme Corp` vs `ACME Corporation` matches at 0.56.
- Nothing escaped to a Python loop. The match computation ran in Polars, in Rust, across columns.
That third point matters more than people expect. rapidfuzz and fuzzywuzzy compare two strings; applying them to join two DataFrames row-by-row through Python means you pay Python function-call overhead N×M times. pl-fuzzy-frame-match expresses the same comparison as a Polars operation and stays in the vectorised engine. On realistic data volumes (tens of thousands by tens of thousands) this is the difference between seconds and hours.
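For contrast, here is what the row-by-row approach looks like in a sketch, using the standard library's `difflib.SequenceMatcher` as a stand-in for rapidfuzz (names and data are illustrative):

```python
from difflib import SequenceMatcher

left = ["Acme Corp", "Globex LLC", "Initech"]
right = ["ACME Corporation", "Globex, LLC", "INITECH"]

# The naive approach: N x M Python-level calls.
# Fine for 3 x 3, ruinous for 40K x 30K.
comparisons = 0
best = {}
for l in left:
    for r in right:
        comparisons += 1
        score = SequenceMatcher(None, l.lower(), r.lower()).ratio()
        if score > best.get(l, (None, 0.0))[1]:
            best[l] = (r, score)

print(comparisons)  # 9, and it grows as len(left) * len(right)
```

Every one of those iterations pays Python interpreter overhead. Expressing the same comparison as a Polars operation moves the whole loop into compiled code.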
The FuzzyMapping object
FuzzyMapping is the per-column config. A single name is the minimal form:
```python
FuzzyMapping("customer_name", threshold_score=60)
```
That says “match the customer_name column on both sides with a Levenshtein threshold of 60 (out of 100)”. You can also name left and right columns separately when they differ:
```python
FuzzyMapping(left_col="company", right_col="customer_name", threshold_score=75)
```
You pass a list of mappings to fuzzy_join, which lets you require multiple fuzzy matches simultaneously — useful for entity resolution where you want “same company name AND same city”:
```python
enriched = orders.fuzzy_join(
    customers,
    [
        FuzzyMapping("customer_name", threshold_score=60),
        FuzzyMapping("city", threshold_score=80),
    ],
    description="Resolve customers on name + city",
).collect()
```
The output gets one similarity-score column per mapping. You can post-filter on combined logic — e.g. require the combined score to exceed some value, or require both to individually clear their thresholds (which the threshold_score on each mapping already enforces).
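A post-filter on combined logic might look like this sketch over plain dict rows (the column names are hypothetical, mirroring the one-score-column-per-mapping pattern):

```python
# Hypothetical matched rows with one score column per mapping.
rows = [
    {"customer": "Acme Corp",  "name_score": 0.62, "city_score": 0.95},
    {"customer": "Globex LLC", "name_score": 0.82, "city_score": 0.55},
]

# Require the mean of both scores to clear 0.7, on top of the
# per-mapping thresholds the join already enforced.
confident = [
    r for r in rows
    if (r["name_score"] + r["city_score"]) / 2 >= 0.7
]
print([r["customer"] for r in confident])  # ['Acme Corp']
```

The same logic expresses naturally as a `filter` on the score columns of the joined FlowFrame.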
Tuning the threshold
threshold_score is expressed on a 0–100 scale. There is no universally correct value; it’s an empirical exercise. Some starting points that hold up in practice:
| Data shape | Suggested threshold | Why |
|---|---|---|
| Company names (short) | 70–80 | Small edit distances matter a lot: Acme and Ace are 25% apart and should not match |
| Personal names | 75–85 | Higher bar — name collisions are expensive |
| Addresses | 55–65 | Long strings, lots of noise (Rd/Road, St/Street), small edits are proportionally smaller |
| Email addresses | 85–95 | Typos matter; john@acme.com vs john@acne.com is a business incident |
| Free-text descriptions | 40–60 | Lots of noise; you’re mostly looking for overlap |
The engineering discipline is: always keep the similarity score column. Don’t drop it after the join. When you look at the output and see a 0.42 match between Acme Holdings Ltd and Apex Holdings Ltd, you want to be able to see that immediately and tune the threshold up, not ship the bad join.
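To make the `Acme` vs `Ace` row concrete, here is a pure-Python normalised Levenshtein (a sketch; the library's exact normalisation may differ, but the intuition carries):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    # One common normalisation: 1 - distance / longer length.
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(similarity("Acme", "Ace"))  # 0.75: clears a 70 threshold, fails an 80
```

One edit out of four characters already costs 25 points of similarity, which is why short strings need higher thresholds than long ones.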
How it scales
Naive fuzzy matching is O(N×M). For a 40K × 30K join you’re already at 1.2 billion string comparisons — and brute-forcing that takes minutes even in Polars. Most fuzzy-match libraries hit a wall here.
pl-fuzzy-frame-match doesn’t. It auto-switches at exactly 100 million comparisons:
```python
if cartesian_product_size >= 100_000_000:
    use_approximate_matching()  # ANN candidate selection, then exact scoring
else:
    use_exact_matching()        # straight Levenshtein on the cross-join
```
The hybrid path uses sparse-matrix cosine similarity on character n-grams (via polars-simed) to shortlist candidates, then runs whichever exact algorithm you picked (Levenshtein, Jaro-Winkler, etc.) on the survivors. Verified benchmarks from the library’s README:
| Comparisons | Brute force | Auto strategy | Speedup |
|---|---|---|---|
| 150M (15K × 10K) | 40.8 s | 1.45 s | 28× |
| 1.2B (40K × 30K) | 363.5 s | 4.75 s | 76× |
You don’t have to do anything to get this — the library detects the dataset size and picks the strategy. You can still apply your own blocking strategy on top (pre-partition by country, email domain, first letter, region) if you want to drive comparisons even lower; in a Flowfile visual flow, that’s an exact Join node feeding into the Fuzzy Match node.
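The effect of blocking is easy to see in a plain-Python sketch (first-letter blocking, illustrative data); in a Flowfile flow the same idea is the exact Join node in front of the Fuzzy Match node:

```python
from collections import defaultdict

left = ["Acme Corp", "Apex Ltd", "Globex LLC", "Initech"]
right = ["ACME Corporation", "Globex, LLC", "INITECH", "Ingram"]

# Without blocking: every left row against every right row.
unblocked = len(left) * len(right)  # 16

# Block on first letter (case-folded): only compare within a block.
blocks = defaultdict(list)
for r in right:
    blocks[r[0].lower()].append(r)
blocked = sum(len(blocks[l[0].lower()]) for l in left)

print(unblocked, blocked)  # 16 vs 5 comparisons
```

Any key you trust to be exact — country, email domain, region — works the same way, and the savings compound with the library's own candidate selection.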
The deeper algorithmic story — why ANN works, what the n-gram vectorisation looks like, where the trade-offs sit — is in 76× Faster Fuzzy Joins: How pl-fuzzy-frame-match Works.
Why this is a node, not a library
You can, of course, pip install pl-fuzzy-frame-match and use it directly. What Flowfile adds is three things:
- A visual node. Drag the Fuzzy Match node onto the canvas, pick columns on left and right, set thresholds, preview the matched rows with their similarity scores in a grid. Great for exploratory work and for analysts who don’t want to write code.
- Description on every node. Pass
description="Resolve customers on name + city"intofuzzy_join()and that description becomes the node label on the visual graph — see Connections, Secrets, and the Catalog for more on this. - It slots into the rest of the platform. The fuzzy-matched output can feed straight into a Group By, then a Pivot, then a Write to Catalog. Same flow, same canvas, same engine.
End-to-end example
A concrete one. You have a CRM export (customers) and a billing-system export (invoices). They agree on customer names in theory. In practice the CRM has ACME Corporation and the billing system has Acme Corp. You want to build a monthly summary.
```python
import flowfile as ff
from pl_fuzzy_frame_match.models import FuzzyMapping

crm = ff.read_csv("crm_customers.csv")
invoices = ff.read_csv("invoices.csv")

enriched = invoices.fuzzy_join(
    crm,
    [FuzzyMapping("customer_name", threshold_score=70)],
    description="Match billing names to CRM",
)

summary = (
    enriched
    .filter(
        ff.col("customer_name_vs_customer_name_right_levenshtein") >= 0.75,
        description="Keep only confident matches",
    )
    .group_by("region")
    .agg([
        ff.col("amount").sum().alias("total_billed"),
        ff.col("invoice_id").count().alias("num_invoices"),
    ])
)

summary.write_catalog_table("analytics.billing.monthly_by_region", write_mode="overwrite")
```
Five logical steps. The fuzzy match is one of them. The similarity score column survives all the way through the pipeline so you can post-filter on it. The output lands in the Delta-backed catalog with versioning for free.
Honest caveats
A few things to know before you reach for this:
- It’s an “eager” node. Flowfile marks Fuzzy Match as eager in the node registry, meaning it forces materialisation of its inputs. You can’t chain arbitrarily many lazy operations through it; plan accordingly.
- Low-confidence matches are silent. If no row on the right side clears the threshold, the fuzzy join drops that left row (or keeps it with nulls if you choose `how="left"`). Always check your row counts before and after.
- Levenshtein has blind spots. By raw edit distance it treats `Apple` and `Alpel` (distance 3) as closer than `Apple` and `Apple Inc` (distance 4) — scrambled letters cost substitutions, but every appended character costs a full edit, and a transposition costs two. For names and short strings, switch to `fuzzy_type="jaro_winkler"` on the FuzzyMapping. For text where transposition typos are common, `damerau_levenshtein` is the right pick. Levenshtein is the default because it’s the most general-purpose, not because it’s always right.
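You can check the blind spot directly with a pure-Python edit distance (a sketch, not the library's implementation):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic DP edit distance: insert, delete, substitute each cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Scrambled letters cost less than appended ones:
print(levenshtein("Apple", "Alpel"))      # 3
print(levenshtein("Apple", "Apple Inc"))  # 4
```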
Try it
```shell
pip install flowfile
```
Take any two datasets you know don’t quite join. Load them both into a Flowfile flow, drop a Fuzzy Match node between them, set the threshold at 70, run it, and look at the similarity-score column. You’ll spend more time tuning the threshold than you spent writing the node.
- Browser demo — has a Fuzzy Match node in the lightweight node set.
- Install locally — full version with all join / match options.
- GitHub — source and issues.
Related reads: 76× Faster Fuzzy Joins: How pl-fuzzy-frame-match Works for the algorithmic deep dive, Polars vs Pandas in 2026 for the engine that makes this fast, and From Alteryx to Flowfile: A Practical Migration for Alteryx users who relied on the Fuzzy Match tool.
Frequently asked questions
- What is a fuzzy match in data processing?
- A fuzzy match (also called approximate string matching) is a join or lookup that succeeds when two values are similar enough rather than exactly equal. It's how you deal with real-world data where 'Acme Corp', 'ACME Corporation', and 'acme corp.' are all the same company — no amount of exact-match SQL will join them.
- How does Flowfile's fuzzy_join work?
- Flowfile's fuzzy_join is built on the pl-fuzzy-frame-match library, which uses Polars expressions and Levenshtein similarity by default. You give it a column mapping and a threshold score (0–100); it returns matched rows along with a similarity-score column so you can audit what matched and what didn't.
- What similarity algorithm does it use?
- Six are supported: levenshtein (the default — edit distance), jaro and jaro_winkler (great for short strings like names), damerau_levenshtein (Levenshtein plus transpositions), hamming (equal-length strings only), and indel (insertion/deletion distance). You pick one per FuzzyMapping with the fuzzy_type parameter. The score is normalised to 0–1 and returned in a column named after the columns and algorithm involved.
- What threshold should I use?
- Start at 75 (0.75 similarity) for short strings like company names or email local-parts; drop to 60 for longer strings like addresses where small differences are proportionally smaller. Always inspect the score column and tune — there is no single right answer. Fuzzy matching is an empirical exercise, not a mathematical one.
- How is this different from rapidfuzz or fuzzywuzzy?
- rapidfuzz and fuzzywuzzy are Python libraries that compute similarity scores between two strings. Flowfile's fuzzy_join sits one level up: it joins two DataFrames on similarity. Under the hood, pl-fuzzy-frame-match uses Polars expressions (not Python-loop comparisons), which is materially faster for joining large DataFrames than applying rapidfuzz row-by-row.
- Does it scale to millions of rows?
- Yes — and it's smarter than you'd expect. pl-fuzzy-frame-match auto-switches strategy at 100M comparisons. Below that, it uses exact fuzzy matching directly because the overhead of the alternative isn't worth it. Above that, it switches to a hybrid approach: an approximate-nearest-neighbour pass using sparse-matrix cosine similarity on character n-grams (via polars-simed) narrows the candidate set, then exact fuzzy scoring runs on the survivors. Verified benchmarks: 40K × 30K (1.2B comparisons) drops from ~363 seconds brute-force to ~4.75 seconds — a 76× speedup. See the deep dive for the algorithm details.
- Can I use multiple fuzzy columns at once?
- Yes. fuzzy_join takes a list of FuzzyMapping objects, each with its own threshold. A common pattern for entity resolution is to fuzzy-match company name AND fuzzy-match address, requiring both to clear their respective thresholds. The resulting rows have one similarity-score column per mapping, so you can post-filter on combined logic.