Kafka for Analysts: A Practical Guide to Streaming as Micro-Batches

Most analytics “streaming” is really a sequence of micro-batches. How to think about cleaning, combining, and enriching Kafka data without a streaming engine.

TL;DR. For analysts and analytics engineers, “streaming data” almost always means data that’s continuously arriving, not a stream of events you must react to in milliseconds. The right mental model for that shape of work is the micro-batch — a bounded chunk of recent records you treat exactly like you’d treat any other dataframe. This post is the practical guide: what Kafka actually is, why you probably don’t need a stream processor, and how to think about cleaning, combining, and enriching streaming data with the batch tools you already know.

What “streaming data” actually means for an analyst

The word “streaming” is overloaded. When a backend engineer says it, they often mean per-event latency budgets measured in milliseconds, state held in memory between events, windowed joins, watermarks, exactly-once semantics. When an analyst says it, they usually mean something much more tractable: “the data keeps showing up, and I want the dashboards to reflect it without a 24-hour delay.”

That second thing is not a streaming problem. It’s a frequency problem. The same transformations you’d write for a nightly batch — deduplicate, filter, join to a dimension table, roll up by day — are the transformations you want. You just want to run them on smaller chunks, more often.

Once you accept that, a lot of the scariness drops out of “streaming analytics”. You don’t need a cluster of Flink jobs with RocksDB state backends and checkpoint tuning. You need:

  • A source of new records (usually Kafka, sometimes Kinesis or Pub/Sub).
  • A way to pull a bounded chunk of them on a schedule.
  • A dataframe engine to transform the chunk.
  • A table format that tolerates frequent small writes.
  • A place to remember where you stopped, so you don’t re-read the same data next time.

That’s the whole architecture. Everything else is decoration.
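Those five pieces reduce to a single loop. Here is a deliberately toy, self-contained sketch — the in-memory `log` list stands in for a Kafka partition, the `checkpoint` dict for committed consumer-group offsets, and `run_once` for one scheduled run; all names and numbers are invented for illustration:

```python
# Toy micro-batch loop: a list stands in for one Kafka partition,
# a dict stands in for the committed consumer-group offset.
log = [{"order_id": i, "amount": float(i)} for i in range(10)]  # the "topic"
checkpoint = {"offset": 0}                                       # where we stopped
MAX_MESSAGES = 4                                                 # bound per-run work

def run_once():
    start = checkpoint["offset"]
    batch = log[start:start + MAX_MESSAGES]      # pull a bounded chunk
    if not batch:
        return 0                                 # nothing new arrived
    total = sum(r["amount"] for r in batch)      # "transform" the chunk
    # ... write the transformed result to the landing table here ...
    checkpoint["offset"] = start + len(batch)    # commit only after the write
    return len(batch)

# Three runs drain the 10-record backlog in bounded chunks: 4, 4, 2.
sizes = [run_once() for _ in range(3)]
```

Note the ordering: the offset advances only after the write succeeds. If a run dies mid-transform, the checkpoint is untouched and the next run simply re-reads the same chunk.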

Kafka, in one honest paragraph

Kafka is a distributed, durable, partitioned log. Producers append messages to topics; each topic is split into partitions, and each partition is an append-only ordered sequence of records. Consumers read from a position in the partition called an offset. Messages don’t vanish when a consumer reads them — they stay on the log for as long as retention allows. Multiple consumer groups can read the same topic independently, each tracking its own offsets. This is why Kafka gets used for both “event bus for microservices” and “fat pipe of data into the warehouse” workloads: the underlying shape is the same — a durable log with independent readers — and the use cases just differ in what they do with the records.

If you’ve been thinking of Kafka as “RabbitMQ, but bigger,” that mental model will lead you astray the first time you need to replay a topic, add a second consumer, or debug why a consumer group is lagging. Kafka is a log. Consumer groups are bookmarks into the log. That one sentence is worth more than any architecture diagram.

For an analyst, the practical consequence is that Kafka gives you cheap, non-destructive re-reads. If your pipeline has a bug, you fix the bug, reset the consumer group, and reprocess. The data didn’t go anywhere. That’s a very different experience from “a job failed halfway through and now you have to reason about which rows got written.”
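The “bookmarks into the log” idea is small enough to model directly. In this sketch (all names invented), two consumer groups read the same in-memory log at different paces, and resetting a group is nothing more than moving its bookmark:

```python
# Two consumer groups over one log: each group is just an independent position.
log = ["e0", "e1", "e2", "e3", "e4"]
offsets = {"dashboards": 0, "ml_features": 0}   # one bookmark per group

def poll(group, max_records):
    pos = offsets[group]
    records = log[pos:pos + max_records]
    offsets[group] = pos + len(records)         # commit the new position
    return records

fast = poll("dashboards", 5)    # reads all five events
slow = poll("ml_features", 2)   # reads two — the log itself is untouched
offsets["dashboards"] = 0       # "reset the consumer group" = move the bookmark
replay = poll("dashboards", 5)  # non-destructive re-read: the same five events
```

A slow group never blocks a fast one, and a replay never disturbs anyone else — each group only ever touches its own offset.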

The shape: true streaming ↔ micro-batches ↔ batch

It helps to have the whole spectrum in front of you:

  • True streaming — latency: ms–sec; complexity: high; per-run cost: low incremental, high fixed; typical use cases: fraud detection, real-time bidding, per-event anomaly scoring.
  • Micro-batches — latency: sec–min; complexity: low; per-run cost: low; typical use cases: continuous table loads, near-real-time dashboards, rolling aggregates.
  • Batch — latency: hours; complexity: low; per-run cost: low; typical use cases: nightly reporting, backfills, historical rollups.

True streaming earns its operational weight when the business decision has to happen per event. If the pipeline noticing an anomaly is what triggers a customer-facing action within a second, you need a real stream processor. Everything else tends to live in the middle column.

Micro-batching is the honest version of what most teams already call “streaming analytics”. You pick a cadence — every minute, every five minutes, every hour — and each run pulls whatever has arrived since the last run, transforms it, writes it, and records where it stopped. It’s a normal batch job that happens to run frequently on small inputs.

Why micro-batching is the right default for analytics

Concretely, the things you get when you commit to the micro-batch shape:

  • A pipeline that looks like every other pipeline you already run. Same transformations, same testing, same local-repro story. You can run the same code over a historical file to reproduce a production result.
  • Bounded memory and bounded work per run. Cap the number of messages per run and you cap the blast radius of a bad batch. Out-of-memory errors stop being a class of problem.
  • Trivial retry semantics. If a run fails, don’t commit offsets. The next run re-consumes the same window and tries again. At-least-once is a single config decision, not an architectural undertaking.
  • Straightforward observability. “This run pulled 47 200 messages, produced 46 981 rows, wrote 12 MB to the staging table, committed offset 884 712 on partition 0.” That’s a log line. Not a dashboard that shows you state-store sizes and checkpoint durations.
  • Cost that scales with throughput, not up-time. A scheduled job only runs when it needs to. A long-lived streaming worker runs 24/7, whether or not your topic is active.

The cost is latency. A micro-batch pipeline refreshed every 5 minutes will have up-to-5-minute staleness. For analytics — dashboards, reports, period-over-period KPIs, merges into curated tables — that’s almost always fine. For a personalised recommendation served in the middle of a page load, it’s not. Match the tool to the latency contract the downstream consumer actually signed.

Cleaning a micro-batch

Once you’ve pulled a chunk of records out of Kafka, the first job is usually cleaning. Streaming sources tend to be messier than warehouse tables: producers evolve faster than consumers, and nobody reviews the JSON schema in a PR.

The cleaning tasks are the same ones you do in any batch pipeline, just applied to a smaller chunk more often:

Deduplicate within the batch. The same event often arrives more than once — producers retry, upstream services double-publish on config reloads, a service restarts mid-flush. Within a micro-batch, you pick the business key (e.g., event_id, order_id) and keep one record per key. Usually you keep the latest by broker timestamp:

import polars as pl

# Polars — keep the latest record per order_id in this batch
cleaned = (
    batch
    .sort("_kafka_timestamp", descending=True)
    .unique(subset=["order_id"], keep="first", maintain_order=True)
)

Handle missing and mistyped fields. JSON is forgiving, which means producers get creative. An amount field can arrive as "42.00" sometimes and 42.0 other times. A status might disappear entirely when the producer forgets to include it. The batch is the right place to standardise:

normalised = (
    cleaned
    .with_columns([
        pl.col("amount").cast(pl.Float64, strict=False),
        pl.col("status").fill_null("unknown"),
        pl.col("currency").str.to_uppercase(),
    ])
)

Discard records that can’t be salvaged. Sometimes the right answer is to route bad records to a quarantine table rather than drop them silently. “This is a dead-letter queue for the analytics pipeline” is a boring, well-understood pattern that pays off the first time upstream ships a regression.
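A minimal sketch of that split, in plain Python (the validity rule — a non-null business key and a parseable amount — is invented for illustration; a real pipeline would express the same predicate as a Polars filter):

```python
# Route unsalvageable records to a quarantine list instead of dropping them.
def is_salvageable(rec):
    if rec.get("order_id") is None:        # no business key to merge on
        return False
    try:
        float(rec["amount"])               # amount must at least parse
        return True
    except (KeyError, TypeError, ValueError):
        return False

batch = [
    {"order_id": "a1", "amount": "42.00"},
    {"order_id": None, "amount": "9.99"},   # no key       -> quarantine
    {"order_id": "a2", "amount": "oops"},   # bad amount   -> quarantine
]
good = [r for r in batch if is_salvageable(r)]
quarantined = [r for r in batch if not is_salvageable(r)]
# `good` flows on to the curated table; `quarantined` lands in its own table
# with enough context (raw payload, batch id, failure reason) to debug later.
```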

Reconcile schema drift. If a new field appears, decide what to do — carry it through, ignore it, raise an alert — in the batch transformation, not by restarting a long-lived streaming job. Schema evolution is a code review, not an incident.

None of this requires streaming-engine machinery. Every one of these operations is a normal dataframe expression run on a bounded chunk.

Combining: joining streams to the rest of your data

The stream in isolation is usually not the thing you want to analyse. You want it joined to the customers table, the products table, the regions dimension, the currency-conversion reference data. This is where micro-batching really shines: each batch is small enough to join against anything.

Stream → dimension joins. This is the 80% case. The batch is tiny (thousands to hundreds of thousands of rows); the dimension is medium (tens of thousands to millions). The join is cheap, fits in memory, and yields enriched records on the other side.

enriched = (
    normalised
    .join(customers, on="customer_id", how="left")
    .join(products,  on="product_id",  how="left")
)

Union batches over time. Each run appends to a staging table. The staging table is the stream, just sitting at rest. If you never want to query history, keep the staging table short (e.g., last 7 days). If you want history, let it grow — disk is cheap, and a Delta or Iceberg table tolerates billions of rows fine.

Merge into a curated table. The more interesting pattern is merging recent records into a keyed, curated table:

-- Simplified Delta/Iceberg MERGE
MERGE INTO orders_curated AS target
USING batch_staged AS source
  ON target.order_id = source.order_id
WHEN MATCHED AND source._kafka_timestamp > target._kafka_timestamp
  THEN UPDATE SET *
WHEN NOT MATCHED
  THEN INSERT *

This is how you get “always the latest state per key” semantics without a streaming engine. Micro-batch in, merge out, curated table always correct.
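The MERGE’s “keep the latest state per key” rule is small enough to state in plain Python. A toy model of the statement above, with the curated table as a dict keyed on order_id:

```python
# Toy model of the MERGE: curated table as {order_id: record}, keeping
# whichever record carries the newer _kafka_timestamp.
curated = {
    "o1": {"order_id": "o1", "status": "pending", "_kafka_timestamp": 100},
}
batch = [
    {"order_id": "o1", "status": "shipped", "_kafka_timestamp": 150},  # newer  -> update
    {"order_id": "o1", "status": "stale",   "_kafka_timestamp": 90},   # older  -> ignored
    {"order_id": "o2", "status": "created", "_kafka_timestamp": 120},  # no key -> insert
]
for rec in batch:
    current = curated.get(rec["order_id"])
    if current is None or rec["_kafka_timestamp"] > current["_kafka_timestamp"]:
        curated[rec["order_id"]] = rec
```

The timestamp guard is what makes the merge idempotent: replaying an old batch can never overwrite newer state, so re-running after a failure is safe.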

What micro-batching is not great at. Stream-to-stream joins with late-arrival tolerance — “join every click with its corresponding impression, where the impression may arrive up to 30 minutes late” — are genuinely hard without windowing and state. If that’s the shape of the problem, use a stream processor. For analytics, this situation is rarer than you’d think, and the usual answer is “land both streams to tables and join the tables on a schedule.”

Enriching, transforming, aggregating

After cleaning and combining, the rest of the pipeline is just analytics. Bucket the event by time window:

hourly = (
    enriched
    .with_columns(
        pl.col("_kafka_timestamp").dt.truncate("1h").alias("hour")
    )
    .group_by(["hour", "region", "product_category"])
    .agg(
        pl.col("amount").sum().alias("gross_revenue"),
        pl.col("order_id").n_unique().alias("orders"),
        pl.col("customer_id").n_unique().alias("customers"),
    )
)

Classify, score, derive, pivot, whatever the analysis calls for. The point is that by the time you’re doing real analytical work, the streaming-ness of the source is irrelevant. You’re just transforming a dataframe.

If your aggregation needs to reflect the whole history rather than just the current batch — “rolling 30-day unique customers” — you have two options:

  1. Recompute the aggregate from the curated table on each run. Cheap if the table is well-partitioned and the engine (Polars, DuckDB, Spark) can push filters down.
  2. Maintain a separate aggregate table, and upsert into it from the batch. Trades complexity for speed.

Option 1 is almost always the right starting point. Reach for option 2 only when the recompute starts to feel slow.
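Option 1 is just a date filter plus a distinct count over the curated table. A stdlib sketch with invented data (a real pipeline would express the same thing as a Polars filter + n_unique):

```python
from datetime import date, timedelta

# Recompute "rolling 30-day unique customers" from the curated table each run.
today = date(2024, 6, 30)
curated = [
    {"order_date": date(2024, 6, 29), "customer_id": "c1"},
    {"order_date": date(2024, 6, 10), "customer_id": "c1"},  # same customer, counted once
    {"order_date": date(2024, 6, 5),  "customer_id": "c2"},
    {"order_date": date(2024, 4, 1),  "customer_id": "c3"},  # outside the window
]
cutoff = today - timedelta(days=30)
rolling_uniques = len({
    r["customer_id"] for r in curated if r["order_date"] > cutoff
})  # → 2
```

If the curated table is partitioned by date, the engine only scans the last 30 days’ partitions, which is why this stays cheap for a long time.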

Where the micro-batch lands

Landing zones for analytics-shaped streaming data are a solved problem:

  • Delta Lake / Apache Iceberg. The right default. Both support MERGE, schema evolution, time travel, and efficient small-write compaction. They turn “lots of small writes from a streaming source” from a problem into a feature.
  • Parquet on object storage. Lighter weight than a full table format, and fine for pure-append cases where you never need to update a row. Good for landing raw micro-batches before they get merged into a curated table.
  • A warehouse (Snowflake, BigQuery, Redshift). If you already have one, streaming inserts or COPY INTO from staging files work. The warehouse pays the small-file cost on your behalf.
  • A database (Postgres, DuckDB). For small-scale or embedded analytics, honestly fine. Don’t over-engineer.

The table format you pick matters less than the operational discipline: land raw data separately from curated data, compact small files on a schedule, treat schema changes as code changes.

Cadence: frequency is latency

The single most important config decision in a micro-batch pipeline is how often it runs. Everything else follows from it.

A useful way to think about it:

  • The pipeline’s end-to-end latency is at most one batch interval plus the batch’s processing time. Run every 5 minutes with runs that take about a minute, and your data is at most ~6 minutes stale.
  • More frequent ≠ better. Running every 30 seconds instead of hourly means 120× as many runs, 120× as many offset commits, 120× as many log lines, and 120× as many chances for an intermittent broker hiccup to create noise. If the dashboard only needs 10-minute freshness, run every 10 minutes.
  • Bound the per-run work. Cap the max messages per run so that a backlog can’t blow up memory. If a backlog forms, the next run eats the cap, the one after eats the next cap, and you catch up incrementally.
  • Retry-on-failure is the whole fault tolerance story. A run failed? The scheduler retries. The consumer group position hasn’t advanced. No data is lost.
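The capped catch-up behaviour is worth seeing once. A toy simulation (all numbers invented): 1,000 messages have backlogged, the per-run cap is 300, and the backlog drains over a few bounded runs instead of overwhelming one of them:

```python
# Simulate catch-up with a per-run cap: a 1,000-message backlog drains
# incrementally instead of blowing up a single run's memory.
MAX_PER_RUN = 300
backlog = 1_000
consumed_per_run = []
while backlog > 0:
    n = min(MAX_PER_RUN, backlog)   # never pull more than the cap
    backlog -= n
    consumed_per_run.append(n)
# Four bounded runs: [300, 300, 300, 100]
```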

Pick the cadence from the consumer’s actual latency requirement, not from “how real-time does it feel?”

Why this reuses all your batch tooling

The pattern above — pull a bounded chunk, clean, join, aggregate, land, commit, wait, repeat — deliberately doesn’t need new tools. Every piece has a mature, well-understood batch equivalent:

  • Dataframe engine: Polars, DuckDB, pandas, or Spark. Whatever you already use for batch analytics.
  • Table format: Delta or Iceberg, both native to Spark / DuckDB / Polars.
  • Scheduler: Airflow, Dagster, Prefect, a cron, a GitHub Actions workflow — whatever runs your current batch jobs.
  • Observability: Your existing batch-job monitoring works. Success/failure, row counts, runtime.
  • Testing: The exact same tests you write for batch jobs. Run the transformation over a fixture file, assert on the result.

The operational lesson is that a micro-batch pipeline is not a different thing from a batch pipeline. It’s a batch pipeline with two additional rules: bound the input, commit offsets at the end. Everything else is stuff you already know.

Where Flowfile fits

Flowfile is a visual and code-generating ETL tool built on Polars, and its Kafka source is deliberately shaped around everything in this post. You point it at a topic, set a max-messages cap, and it hands you a Polars LazyFrame — the same object you’d use in any other Flowfile flow. Clean it, join it to your dimension nodes, land it into a Delta table, and commit offsets on success. The pipeline is the micro-batch.

If you want the technical details — the poll-batch size, the spill-to-Arrow mechanics, the consumer-group offset tracking, the Kafka-to-Catalog sync endpoint, the security model — those live in the companion post: Flowfile’s Kafka Source: How Micro-Batching Actually Works.

Frequently asked questions

Is Kafka a message queue?
Not really. Kafka is a distributed, durable, partitioned log. Consumers read from positions (offsets) in that log at their own pace, and messages stay on the log after they're read for as long as the retention policy allows. Queues delete a message when a consumer takes it; Kafka doesn't. That difference is exactly what makes Kafka useful for analytics — you can replay, you can have multiple independent consumer groups, and a slow consumer doesn't block a fast one.
What's the difference between true streaming and micro-batching?
True stream processing (Flink, Kafka Streams) handles one record at a time with windowing, watermarks, and managed state. Micro-batching consumes a chunk of records every N seconds or N messages, processes it as a normal dataframe, and commits offsets at the end. For most analytics — Delta/Iceberg landings, dashboards, daily or hourly aggregates — micro-batching is simpler, cheaper, and good enough.
Do I need Flink or Kafka Streams for analytics?
Almost never. True stream processors earn their operational weight when you need sub-second per-event decisions, stateful stream-stream joins, or windowed aggregates with strict watermark semantics. Analytics rarely needs any of those. For 'land Kafka into a table, aggregate, refresh a dashboard every few minutes,' a scheduled micro-batch pipeline is the right shape.
How do I deduplicate streaming data?
Two layers. Within a micro-batch, use a dataframe dedup on the business key, keeping the latest record by broker timestamp. Across batches, rely on a table format that supports merges — Delta Lake and Apache Iceberg both do this. The natural pattern is 'stream into a staging zone, merge into the curated table on the key.'
How often should I run a micro-batch pipeline?
As often as the consumer of the downstream data actually needs. A dashboard refreshed hourly doesn't benefit from a pipeline that runs every minute. Start with the cadence the business needs and shorten only if there's a real requirement. 'More frequent' costs compute, complexity, and failure-handling surface area.
What tools do I actually need to process streaming data as an analyst?
A Kafka consumer (any language binding of the official client is fine), a dataframe library (Polars, pandas, DuckDB, Spark), a table format that supports merges (Delta or Iceberg), and a scheduler. That's it. If you find yourself reaching for a streaming engine, a state store, and a watermark configuration, double-check whether the problem actually needs them.