
76× Faster Fuzzy Joins: How pl-fuzzy-frame-match Works

Brute-force fuzzy matching is O(N×M) — at 1.2 billion comparisons it falls over. Here's how a two-stage hybrid (ANN + exact scoring) reduces that to seconds while preserving accuracy.

TL;DR. Fuzzy joins on real-world data hit a wall fast. Brute-force similarity matching is O(N×M) — at 40,000 × 30,000 rows that’s 1.2 billion string comparisons. pl-fuzzy-frame-match (the engine behind Flowfile’s fuzzy_join) sidesteps the problem with a two-stage hybrid pipeline: a cheap approximate-nearest-neighbour pass narrows the candidate set using sparse-matrix cosine similarity on character n-grams, then exact fuzzy algorithms (Levenshtein, Jaro-Winkler, Damerau-Levenshtein, etc.) score the survivors. The library auto-switches between strategies at 100M comparisons. Verified: 28× faster at 150M, 76× faster at 1.2B — with comparable accuracy.

The Cartesian-product wall

Every fuzzy-match library starts the same way: for each row on the left, compare to each row on the right. That’s a Cartesian product. The number of comparisons grows as the product of the input sizes:

Left rows    Right rows    Comparisons
10,000       8,000         80 million
40,000       30,000        1.2 billion
400,000      10,000        4 billion

And each comparison is itself non-trivial. Levenshtein distance is O(n×m) where n and m are the string lengths. Jaro-Winkler isn’t free either. So a “1.2 billion comparisons” workload is really 1.2 billion runs of an inner string-similarity algorithm. Even on a modern multi-core machine you’re looking at minutes of CPU at best, and gigabytes of intermediate results threatening to spill to disk.
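The wall is easy to reproduce. Here is a minimal brute-force sketch, using Python's stdlib difflib as a stand-in similarity function (the library's actual scorers are Rust-backed and much faster per call, but the O(N×M) shape is the same):

```python
from difflib import SequenceMatcher

def brute_force_match(left: list[str], right: list[str], threshold: float = 0.8):
    """Naive O(N*M) fuzzy join: score every pair, keep those above threshold."""
    matches = []
    for l in left:                  # N iterations
        for r in right:             # times M iterations
            # The inner call is itself super-linear in string length.
            score = SequenceMatcher(None, l, r).ratio()
            if score >= threshold:
                matches.append((l, r, score))
    return matches

# 2 x 2 = 4 comparisons here; 40,000 x 30,000 would be 1.2 billion.
pairs = brute_force_match(["Acme Corp", "Globex LLC"],
                          ["Acme Corporation", "Initech"])
```

The double loop is the Cartesian product in code form: no pair is skipped, however obviously dissimilar.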

The standard responses are all unsatisfying:

  • “Just use a smaller subset” — fine for prototyping, useless in production.
  • “Block on a key” — works only if you have a clean partition key (country, email domain). Many real datasets don’t.
  • “Throw a Spark cluster at it” — the right answer if you actually need petabyte scale, but enormous overkill for the merely-large-data case.

There’s a fourth option that most teams don’t reach for, because it sounds too clever to actually work: don’t compute every comparison.

The hybrid approach: skip the work you don’t need

The insight is that most of those 1.2 billion comparisons are trivially false matches. Acme Corp doesn’t need to be exhaustively compared against Globex LLC — they share no character patterns; we can rule them out almost for free. The exact, expensive comparison is only worth running on pairs that might be a match.

So you split the work into two stages:

  1. Stage 1 — fast approximate pass. Use a cheap method to shortlist the candidates that have any chance of matching.
  2. Stage 2 — exact fuzzy scoring. Apply the precise algorithm (Levenshtein, Jaro-Winkler, etc.) only to the shortlisted candidates.

Stage 2 still produces the answer you wanted. Stage 1 just makes sure you’re not paying for billions of exact comparisons against pairs that were never going to match.

The clever bit is what you use for Stage 1.

Stage 1: sparse-matrix cosine similarity on n-grams

pl-fuzzy-frame-match uses polars-simed for the approximate stage. The mechanics:

Step A — n-gram vectorisation. Each string is split into overlapping character n-grams (typically trigrams):

"apple"  →  ["app", "ppl", "ple"]
"appel"  →  ["app", "ppe", "pel"]

Each unique n-gram becomes a column in a giant feature space; each string becomes a vector with 1 for the n-grams it contains and 0 everywhere else. Most strings have a handful of n-grams out of tens of thousands of possible ones — almost everything is zero. That’s why the matrix is sparse, and that’s what makes the next step efficient.
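A minimal sketch of the vectorisation step, in pure Python for clarity (the library builds actual sparse matrices via polars-simed; a set of vocabulary indices plays the same role here, since everything absent from the set is implicitly zero):

```python
def char_ngrams(s: str, n: int = 3) -> list[str]:
    """Overlapping character n-grams, e.g. trigrams of 'apple'."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def binary_vector(s: str, vocab: dict[str, int], n: int = 3) -> set[int]:
    """A string as the set of vocabulary indices it activates (its sparse row)."""
    return {vocab[g] for g in char_ngrams(s, n) if g in vocab}

strings = ["apple", "appel"]
# One column per unique n-gram across the corpus.
vocab = {g: i for i, g in
         enumerate(sorted({g for s in strings for g in char_ngrams(s)}))}
vectors = [binary_vector(s, vocab) for s in strings]
```

With real data the vocabulary runs to tens of thousands of n-grams while each string activates only a handful, which is exactly the sparsity the next step exploits.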

Step B — cosine similarity via sparse matrix multiplication. Cosine similarity between two vectors is:

similarity(A, B) = (A · B) / (||A|| × ||B||)

Where A · B is the dot product. One sparse-matrix multiply computes it for every left row against every right row at once. Modern sparse linear algebra kernels (BLAS-equivalent) are extremely good at this — they parallelise across CPU cores, skip all the zero-times-zero work entirely, and stay in cache where possible.

Step C — top-k selection. For each left row, keep only the top-k right rows by cosine similarity. Discard everything else. The output is no longer N×M — it’s N×k, where k is typically 100–500.

The result of Stage 1 is a candidate set: for each row on the left, a small list of the most promising matches on the right. Now you can afford to be expensive.
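Steps B and C can be sketched with plain sets standing in for sparse rows: for binary vectors the dot product is just the size of the n-gram intersection, and each norm is the square root of the n-gram count. The real implementation does this as one sparse matrix multiply, not a Python loop:

```python
import heapq
from math import sqrt

def ngram_set(s: str, n: int = 3) -> set[str]:
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def cosine(a: set[str], b: set[str]) -> float:
    """Binary-vector cosine: dot product = |A & B|, norm = sqrt(|A|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / (sqrt(len(a)) * sqrt(len(b)))

def top_k_candidates(left: list[str], right: list[str], k: int = 2):
    """For each left row, keep only the k best right rows by cosine similarity."""
    right_vecs = [ngram_set(r) for r in right]
    out = {}
    for l in left:
        lv = ngram_set(l)
        scored = [(cosine(lv, rv), r) for rv, r in zip(right_vecs, right)]
        out[l] = heapq.nlargest(k, scored)  # N x k survives, not N x M
    return out

candidates = top_k_candidates(["apple"], ["appel", "apply", "banana"], k=2)
```

"apple" vs "appel" share one trigram out of three each, giving cosine 1/3; "banana" shares none and scores zero, so it never reaches Stage 2.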

Stage 2: exact fuzzy scoring on the survivors

With a candidate set of ~N×k rows instead of N×M, the exact stage becomes tractable. pl-fuzzy-frame-match supports six similarity algorithms — pick the one that fits your data:

fuzzy_type             What it measures                  Best for
levenshtein            Edit distance                     General-purpose default
jaro                   Jaro similarity                   Short strings
jaro_winkler           Jaro with prefix bonus            Names, where prefixes are stable
damerau_levenshtein    Levenshtein + transpositions      Typos with letter swaps
hamming                Position-by-position differences  Equal-length strings only
indel                  Insertions and deletions only     Where substitutions don't apply

Stage 2 runs the exact algorithm against the candidate set, computes the precise similarity score, and emits the matches that clear your threshold_score. The output looks identical to what brute force would have produced — the exact algorithm did the actual scoring; the ANN stage just stopped you from running it on pairs that obviously didn’t match.
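Stage 2 on the survivors can be sketched with a textbook Levenshtein, normalised to the 0–100 scale that threshold_score uses (an illustrative implementation only; the library's scorers are far faster):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def score(a: str, b: str) -> float:
    """Normalised similarity in 0-100, mirroring the threshold_score scale."""
    m = max(len(a), len(b))
    return 100.0 if m == 0 else 100.0 * (1 - levenshtein(a, b) / m)

def stage2(candidates: dict[str, list[str]], threshold_score: float = 60.0):
    """Exact scoring only on shortlisted pairs; emit those above threshold."""
    return [(l, r, score(l, r))
            for l, rs in candidates.items()
            for r in rs
            if score(l, r) >= threshold_score]
```

Run over a candidate set of N×k pairs instead of N×M, this loop is the same exact computation brute force would have done, just on far fewer inputs.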

The auto-switch

You don’t configure any of this. The library detects when it’s worth using the hybrid pipeline:

if cartesian_product_size >= 100_000_000:
    use_approximate_matching()
else:
    use_exact_matching()

100 million comparisons is the crossover. Below it, brute-force exact matching is faster — the overhead of building sparse matrices, computing cosine similarities, and selecting top-k isn’t repaid because modern CPUs handle a few million Levenshtein comparisons easily. Above it, that overhead is dwarfed by the comparisons it lets you skip.
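The snippet above is pseudocode; the actual decision reduces to one comparison on the input sizes. A runnable sketch (the 100M constant is the library's documented crossover; the function name here is illustrative):

```python
APPROX_THRESHOLD = 100_000_000  # documented crossover for pl-fuzzy-frame-match

def choose_strategy(n_left: int, n_right: int) -> str:
    """Pick a strategy from the Cartesian-product size, mirroring the auto-switch."""
    return "approximate" if n_left * n_right >= APPROX_THRESHOLD else "exact"

choose_strategy(15_000, 10_000)  # 150M comparisons -> hybrid pipeline
choose_strategy(10_000, 8_000)   # 80M comparisons  -> brute force
```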

In practice this means: small joins are fast for the obvious reason (there’s not much work to do), and large joins are fast for a non-obvious one (the library quietly switched algorithms on you).

The benchmarks

From the library’s README, on commodity hardware:

Comparisons         Brute force    Auto strategy    Speedup
150M (15K × 10K)    40.82 s        1.45 s           28×
1.2B (40K × 30K)    363.50 s       4.75 s           76×

The 76× number isn’t a paper claim — it’s measurable on hardware you probably have. And the speedup grows with dataset size: the more comparisons brute force has to grind through, the more dramatic the win from skipping the ones that were never going to match.

Multi-column fuzzy joins

Real entity-resolution problems usually want more than one column to match: same company name AND same city, same person name AND same email domain. pl-fuzzy-frame-match handles this with a list of mappings:

from pl_fuzzy_frame_match.models import FuzzyMapping

fuzzy_maps = [
    FuzzyMapping(
        left_col="company_name",
        right_col="organization",
        threshold_score=85.0,
        fuzzy_type="jaro_winkler",
    ),
    FuzzyMapping(
        left_col="address",
        right_col="location",
        threshold_score=75.0,
        fuzzy_type="levenshtein",
    ),
]

Each subsequent mapping operates on the filtered output of the previous one — the candidate set shrinks at every step. So matching company name first (where Jaro-Winkler is good for organisation names) cheaply rules out most pairs; only the survivors get the more expensive address comparison. The order matters: put the most discriminating column first.
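The narrowing cascade is easy to picture in plain Python. This is a sketch of the filtering logic only, not the library's internals, and score is a hypothetical placeholder for whichever fuzzy_type each mapping uses:

```python
def cascade(pairs, mappings, score):
    """Apply fuzzy mappings in order; each one only sees the previous survivors.

    pairs:    candidate (left_row, right_row) tuples
    mappings: (left_col, right_col, threshold) triples, most discriminating first
    score:    hypothetical scorer (left_row, right_row, col_pair) -> 0-100
    """
    survivors = pairs
    for left_col, right_col, threshold in mappings:
        survivors = [(l, r) for l, r in survivors
                     if score(l, r, (left_col, right_col)) >= threshold]
        # The candidate set only shrinks, so later (pricier) columns
        # are scored against far fewer pairs.
    return survivors
```

With an exact-equality scorer standing in, a pair that passes on company name but fails on address drops out at the second mapping, which is the shrinking behaviour described above.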

For more complex logic the library exposes FuzzyMapExpr, which supports AND / OR composition (mapping_a & mapping_b, mapping_a | mapping_b) — useful when you want, for example, “name fuzzy-matches AND (email matches OR phone matches)”.

Why this matters in practice

Performance improvements of this magnitude don’t just speed things up — they enable analyses that weren’t practical before:

  • Real-time entity resolution. Match a million-row CRM against a million-row billing system in seconds, not hours. Run it on every batch, not as a quarterly clean-up project.
  • Iterative threshold tuning. When a fuzzy join takes 5 seconds, you’ll iterate 20 times to find the right threshold. When it takes 5 minutes, you’ll guess once.
  • Composable joins. Fuzzy match → group by → pivot → write to catalog as a single Flowfile pipeline. The fuzzy match step stops being the slow node people route around.

Where this fits in Flowfile

pl-fuzzy-frame-match powers two surfaces in Flowfile:

  1. The fuzzy_join method on FlowFrame — see the practical fuzzy match post for the user-facing API.
  2. The Fuzzy Match node in the visual canvas — same engine, no code, with a data preview that shows the similarity score column right after the match runs.

You don’t see the auto-switching, the n-grams, or the sparse matrices. You see your matched rows with their similarity scores. Everything in this post is implementation detail that the library handles for you.

But knowing it’s there matters for two reasons. First, you’ll trust the result more once you understand why it’s both fast and accurate. Second, you’ll set your expectations correctly: this is a tool that scales to billion-comparison joins on a laptop, not just a cute toy that works on small data.

Try it

pip install pl-fuzzy-frame-match
# or, with Flowfile:
pip install flowfile

The library works standalone with any two Polars DataFrames. Inside Flowfile, the same engine is wired into both the fuzzy_join method and the visual Fuzzy Match node — pick whichever surface fits your work.


Related reads: Fuzzy Match in Polars: Joining on Dirty Data for the practical user-facing API, Polars vs Pandas in 2026 for the engine choice that makes vectorised fuzzy matching feasible, and Why Your Data Should Stay on Your Laptop for why “76× faster on commodity hardware” matters more than “infinitely scalable on a cluster”.

Frequently asked questions

Why is brute-force fuzzy matching so slow at scale?
Naive fuzzy matching computes a string similarity score for every pair of records across both datasets — a Cartesian product. 10K × 8K is 80 million comparisons; 40K × 30K is 1.2 billion. Each comparison itself runs an O(n×m) string algorithm (Levenshtein, Jaro-Winkler), and intermediate results pile up in memory. By a few hundred million comparisons, you're looking at minutes-to-hours of CPU and gigabytes of RAM.
What does the hybrid approach actually do?
Two stages. First, an approximate-nearest-neighbour pass uses sparse-matrix cosine similarity on character n-grams to shortlist the top-k most likely matches per row — fast, parallelisable, and memory-efficient. Second, the exact fuzzy algorithm you chose (Levenshtein, Jaro-Winkler, Damerau-Levenshtein, etc.) runs only on those candidates. You go from billions of exact comparisons to millions of approximate ones plus a few hundred thousand exact ones.
When does pl-fuzzy-frame-match switch strategies?
At exactly 100 million comparisons. Below that threshold, brute-force is faster because the overhead of building sparse matrices isn't worth it. Above that, the library automatically switches to the hybrid pipeline. You don't configure anything — it inspects your dataset sizes and picks.
How accurate is the approximate stage compared to brute force?
Very close, in practice. Cosine similarity on character n-grams correlates strongly with edit-distance metrics — strings that are similar by Levenshtein are almost always similar by n-gram cosine, and dissimilar pairs are reliably dissimilar. Setting top-k high enough (the library defaults to a sensible value) means the true best matches are almost always in the candidate set, where the exact stage will then score them properly.
What does 'sparse-matrix cosine similarity' mean in plain English?
Each string becomes a vector where every position represents whether a particular character n-gram (e.g. trigram like 'app', 'ppl', 'ple') is present. Most positions are zero — that's the 'sparse' part. Cosine similarity between two such vectors measures how much their non-zero positions overlap, which is a fast proxy for 'do these strings share a lot of character patterns?'. Implemented as a sparse matrix multiply, this is one of the most-optimised operations on modern CPUs.
Why can't I just use blocking and skip the ANN step?
You can — and should, if a cheap exact key (country, email domain, first letter) genuinely partitions your data. Blocking and the hybrid approach are complementary, not alternatives. Blocking eliminates whole regions of the comparison space; the hybrid approach efficiently searches the regions that remain. For datasets where blocking isn't possible (single global namespace, no useful partition key), the hybrid approach is the only way down from O(N×M).
Does this work with multi-column fuzzy joins?
Yes. fuzzy_match_dfs takes a list of FuzzyMapping objects. Each subsequent mapping operates on the filtered candidates from the previous one, progressively refining matches. There's also a FuzzyMapExpr type that supports AND / OR composition (`mapping_a & mapping_b`, `mapping_a | mapping_b`) for more complex matching logic.