You deployed a notification service consuming from task-created. It works perfectly with one consumer. Then traffic spikes during a product launch, and you scale to three consumers. Suddenly, messages are processed twice. A rebalance occurred, your consumer hadn't committed its offset, and another consumer reprocessed the same messages.
This is one of the most common Kafka production issues. Understanding consumer groups and rebalancing isn't optional—it's essential for reliable event processing. In this chapter, you'll learn why rebalances happen, how to handle them safely, and how to diagnose the consumer lag that often triggers scaling decisions.
The pattern you'll learn here—committing offsets during rebalance callbacks—prevents duplicate processing in virtually every Kafka consumer you'll ever write. Master this, and you've solved one of distributed messaging's trickiest problems.
When multiple consumers share a group.id, Kafka treats them as a team working together. Instead of each consumer receiving every message (as in pub/sub fan-out), Kafka assigns each partition to a single consumer, so each message is delivered to exactly one consumer within the group.
Key rules for partition assignment:

- Each partition is assigned to exactly one consumer in the group at a time.
- One consumer can own multiple partitions.
- If there are more consumers than partitions, the extra consumers sit idle.
- Different consumer groups are independent: each group receives its own copy of every message.
This design ensures ordering within a partition. Messages with the same key always go to the same partition, so they're always processed by the same consumer in order.
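The key-to-partition relationship can be illustrated with a small sketch. This is a simplification: Kafka's default partitioner hashes keys with murmur2, not CRC32, but the invariant is the same — a given key always maps to the same partition.

```python
# Simplified sketch of keyed partitioning. Kafka's default partitioner
# actually uses murmur2 hashing; the deterministic mapping is the point.
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition deterministically."""
    return zlib.crc32(key) % num_partitions

# The same key always lands in the same partition, so one consumer in the
# group sees all events for that key, in order.
assert partition_for(b"task-42", 3) == partition_for(b"task-42", 3)
```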
A rebalance redistributes partitions when the group membership changes. The group coordinator (a Kafka broker) detects changes and orchestrates the redistribution.
Triggers that cause rebalancing:

- A new consumer joins the group (scaling up, a deployment).
- A consumer leaves cleanly (scaling down, graceful shutdown).
- A consumer is declared dead: it misses heartbeats for session.timeout.ms, or goes longer than max.poll.interval.ms between polls.
- The subscription changes, or partitions are added to a subscribed topic.
Here's why rebalancing causes duplicate processing:

1. A consumer polls a batch of messages and begins processing them.
2. Before it commits the offsets, a rebalance starts and its partition is revoked.
3. The partition is reassigned to another consumer.
4. The new consumer resumes from the last committed offset.
5. Every message the old consumer processed but never committed is processed again.
The window between processing and committing is the danger zone. Without proper handling, any messages processed but not committed will be reprocessed.
The on_revoke callback is your opportunity to commit offsets before losing partitions. The on_assign callback lets you set up state for newly assigned partitions.
During normal operation, the consumer simply processes and commits. When a new consumer joins and a rebalance begins, on_revoke fires first, giving the consumer a chance to commit before its partitions are handed to the new member.
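As a sketch, assuming the confluent-kafka Python client (the topic name task-created comes from the chapter's opening example; the callback signatures and the subscribe keyword arguments are confluent-kafka's):

```python
# Rebalance callbacks for a confluent-kafka consumer. The client invokes
# these with the consumer object and the affected partition list.

def on_assign(consumer, partitions):
    # New partitions arriving: a good place to (re)build per-partition state.
    print("assigned:", [p.partition for p in partitions])

def on_revoke(consumer, partitions):
    # About to lose these partitions: commit synchronously so the next
    # owner starts after everything we've already processed.
    consumer.commit(asynchronous=False)

def subscribe_with_callbacks(consumer, topics=("task-created",)):
    # confluent-kafka registers rebalance callbacks at subscribe time.
    consumer.subscribe(list(topics), on_assign=on_assign, on_revoke=on_revoke)
```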
Why synchronous commit in on_revoke? Asynchronous commits may not complete before the partition is reassigned. The new consumer would start from an old offset, causing duplicates. The brief blocking during synchronous commit is worth the guarantee.
Kafka supports two rebalancing protocols:
Eager Rebalancing (Legacy):

- Every consumer revokes all of its partitions, even ones that won't move.
- The entire group stops processing ("stop-the-world") until new assignments arrive.
- The larger the group, the more disruptive each rebalance becomes.
Cooperative (Incremental) Rebalancing (Modern):

- Only the partitions that actually need to move are revoked.
- Consumers keep processing their unaffected partitions throughout.
- The rebalance proceeds in small incremental steps instead of one global stop.
Cooperative rebalancing is enabled by setting partition.assignment.strategy to cooperative-sticky in your consumer configuration.
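A configuration sketch, assuming the confluent-kafka Python client (these are librdkafka option names; the broker address and group name are placeholders):

```python
# Consumer configuration opting in to incremental (cooperative) rebalancing.
conf = {
    "bootstrap.servers": "localhost:9092",   # placeholder broker address
    "group.id": "notification-service",      # placeholder group name
    # Only partitions that actually move are revoked during a rebalance.
    "partition.assignment.strategy": "cooperative-sticky",
}
```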
The cooperative-sticky strategy minimizes partition movement. If Consumer A had partitions 0 and 1, and Consumer B joins, the coordinator might only move partition 1 to B—Consumer A keeps processing partition 0 uninterrupted.
Every time a consumer restarts (even briefly), it triggers a rebalance. In Kubernetes, rolling updates cause repeated rebalances as pods restart. Static membership solves this by giving each consumer a persistent identity.
How static membership works:

- Each consumer sets group.instance.id to a stable, unique value that survives restarts.
- The group coordinator remembers the assignment for that instance id.
- If the consumer restarts and rejoins within session.timeout.ms, it gets its old partitions back with no rebalance.
- Only if it stays away longer than session.timeout.ms does the coordinator treat it as gone and rebalance.
When to use static membership:

- Kubernetes and other orchestrated environments where rolling updates restart consumers regularly.
- Stable, known-size consumer fleets where each instance can be given a persistent name.
- Less useful when membership is genuinely elastic (frequent autoscaling), since a departed instance's partitions stay parked until its session times out.
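A configuration sketch for static membership, again assuming the confluent-kafka client. The use of HOSTNAME as the instance id is an assumption that fits Kubernetes StatefulSets, where the pod name is a natural stable identity:

```python
import os

# Static membership: a stable group.instance.id lets a restarted consumer
# reclaim its old partitions without a rebalance, as long as it returns
# within session.timeout.ms.
conf = {
    "bootstrap.servers": "localhost:9092",    # placeholder broker address
    "group.id": "notification-service",       # placeholder group name
    # In Kubernetes, HOSTNAME defaults to the pod name (e.g. "notifications-0"),
    # which makes a convenient persistent identity; the fallback is illustrative.
    "group.instance.id": os.environ.get("HOSTNAME", "notifications-0"),
    # Give restarts this long to come back before partitions are reassigned.
    "session.timeout.ms": 30000,
}
```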
Consumer lag is the difference between the latest offset in a partition (the log end offset) and the last offset your consumer has committed. It's the most important metric for consumer health.
What lag tells you:

- Zero or small, stable lag: the consumer is keeping up.
- Steadily growing lag: messages are being produced faster than they're consumed.
- A sudden spike: the consumer stalled, crashed, or is stuck in rebalances.
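The arithmetic itself is simple, as this sketch shows. (With confluent-kafka you would fetch the log end offset from consumer.get_watermark_offsets() and the committed position from consumer.committed(); the function here is just the calculation.)

```python
def consumer_lag(log_end_offset: int, committed_offset: int) -> int:
    """Lag = latest offset in the partition minus the last committed offset.

    A negative committed offset means nothing has been committed yet, so
    the entire partition (from the configured starting point) is outstanding.
    """
    if committed_offset < 0:
        return log_end_offset
    return log_end_offset - committed_offset

# Partition holds messages up to offset 1500; consumer committed through 1200.
print(consumer_lag(1500, 1200))  # -> 300
```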
You can check lag from the command line with the kafka-consumer-groups tool, for example: kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group your-group. For each partition, the output shows CURRENT-OFFSET (the last committed offset), LOG-END-OFFSET (the latest message in the partition), LAG (the difference between the two), and which consumer currently owns the partition.
When you see growing lag, systematically diagnose:
1. Processing too slow?

Measure how long each message takes to handle. Slow external calls (databases, HTTP APIs) inside the poll loop are the usual culprit; batch them, make them asynchronous, or move heavy work out of the consumer.

2. Polling too infrequently?

If the time between poll() calls exceeds max.poll.interval.ms, the coordinator evicts the consumer and triggers a rebalance. Process smaller batches per iteration or raise the interval.

3. Not enough consumers?

If you have 3 partitions but only 1 consumer, one consumer handles all load. Scale to 3 consumers for parallel processing. (Remember: consumers beyond the partition count sit idle.)

4. Rebalancing too frequently?

Frequent rebalances interrupt processing. Check for: consumers exceeding max.poll.interval.ms, session timeouts caused by stalls or pauses, crash-looping instances, and rolling deployments without static membership.
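Two librdkafka settings worth reviewing when rebalances are frequent. The values shown are illustrative starting points, not universal recommendations:

```python
# Timing knobs that commonly drive unwanted rebalances.
conf = {
    # Maximum time allowed between poll() calls. If handling a batch takes
    # longer than this, the consumer is evicted and a rebalance begins.
    "max.poll.interval.ms": 300000,   # 5 minutes (the librdkafka default)
    # How long the coordinator waits for heartbeats before declaring the
    # consumer dead. Too low, and brief stalls or pauses cause rebalances.
    "session.timeout.ms": 45000,      # illustrative value
}
```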
Here's the complete pattern combining rebalance callbacks, cooperative rebalancing, static membership, and lag awareness:
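A sketch of what that combined pattern can look like with the confluent-kafka client. The broker address, group name, and the handle callback are placeholders; the lag report in on_assign uses the high watermark as a rough health signal:

```python
def build_config(instance_id: str) -> dict:
    """Consumer config combining cooperative rebalancing, static membership,
    and manual offset commits."""
    return {
        "bootstrap.servers": "localhost:9092",           # placeholder
        "group.id": "notification-service",              # placeholder
        "group.instance.id": instance_id,                # static membership
        "partition.assignment.strategy": "cooperative-sticky",
        "enable.auto.commit": False,                     # we commit explicitly
        "auto.offset.reset": "earliest",
    }

def on_assign(consumer, partitions):
    # Lag awareness: log the high watermark for each newly assigned partition.
    for tp in partitions:
        low, high = consumer.get_watermark_offsets(tp)
        print(f"assigned partition {tp.partition}, high watermark {high}")

def on_revoke(consumer, partitions):
    # Synchronous commit before losing partitions prevents duplicates.
    consumer.commit(asynchronous=False)

def run(consumer, handle, topics=("task-created",)):
    consumer.subscribe(list(topics), on_assign=on_assign, on_revoke=on_revoke)
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print("consumer error:", msg.error())
            continue
        handle(msg)
        # Async commit keeps throughput up on the happy path; the
        # synchronous commit in on_revoke covers rebalances.
        consumer.commit(message=msg, asynchronous=True)
```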
You built a kafka-events skill in Chapter 1. Test and improve it based on what you learned.
Ask yourself:
If you found gaps:
Setup: You have a Kafka consumer that's experiencing duplicate message processing after scaling events.
Prompt 1: Analyze a rebalance scenario
What you're learning: AI will trace the timeline showing how auto-commit with a 5-second interval creates a window where messages are processed but not committed when revocation occurs.
Prompt 2: Debug consumer lag
What you're learning: AI helps you think through asymmetric lag patterns—could be slow processing for certain message types, a hot partition with too much traffic, or a consumer that crashed and recovered.
Prompt 3: Design for your domain
What you're learning: AI collaborates on applying consumer group patterns to your specific constraints, suggesting configurations that balance throughput with your no-duplicate requirement.
Safety note: When testing rebalance scenarios, use a separate consumer group from production. Joining a production group with test consumers triggers rebalances that affect real traffic.