It's 3 AM. Your pager goes off. The order processing system stopped sending confirmation emails two hours ago. You check the notification service logs—no errors. The service is running, polling Kafka, and processing messages. So where are the orders?
You discover the consumer group has 47,000 messages of lag on one partition. Those orders are sitting in Kafka, unprocessed. Your consumer has been processing, but slower than the incoming rate. For two hours, the gap widened silently until a customer complained.
This scenario illustrates why Kafka monitoring isn't optional—it's your early warning system. In this chapter, you'll learn to monitor consumer lag, inspect topics and consumer groups with CLI tools, diagnose common failures, and configure alerts that catch problems before customers do.
Consumer lag is the difference between the log-end offset (where producers are writing) and the consumer group's committed offset (how far it has processed). It tells you whether your consumer is keeping up with the production rate.
Why lag matters more than throughput:
A consumer processing 1,000 msg/sec sounds fast—until you realize producers are writing 1,200 msg/sec. Your lag grows by 200 messages every second. Within an hour, you're 720,000 messages behind.
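That arithmetic is worth making concrete; a quick shell sketch using the rates above:

```shell
produce_rate=1200     # msg/sec written by producers
consume_rate=1000     # msg/sec the consumer can handle
deficit=$(( produce_rate - consume_rate ))       # lag grows 200 msg/sec
echo "lag after one hour: $(( deficit * 3600 ))" # prints 720000
```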
The primary tool for checking consumer lag is kafka-consumer-groups.sh. On a Strimzi cluster, you can execute it inside a Kafka pod:
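A sketch of the invocation, assuming a Strimzi broker pod named `my-cluster-kafka-0` in namespace `kafka` and a consumer group named `order-processor` (all names, offsets, and lag figures here are illustrative):

```shell
# Describe the group's offsets and lag, per partition
kubectl exec -n kafka my-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe --group order-processor

# Typical output shape (CONSUMER-ID, HOST, CLIENT-ID columns trimmed):
# GROUP            TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# order-processor  orders  0          150230          150245          15
# order-processor  orders  1          149880          196880          47000
```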
Reading the `--describe` output:

- `CURRENT-OFFSET`: the last offset the group has committed for that partition.
- `LOG-END-OFFSET`: the latest offset producers have written to that partition.
- `LAG`: the difference between the two, i.e. the number of messages still waiting to be processed.
- `CONSUMER-ID`, `HOST`, and `CLIENT-ID`: which consumer instance owns the partition; a `-` in these columns means no active member is assigned to it.
Different lag patterns indicate different problems:

- Steadily growing lag: consumers can't keep up with the produce rate; speed up processing or add consumers.
- Spiky lag that drains back down: bursty traffic; usually harmless as long as it recovers.
- Flat, non-zero lag that never moves: the consumer is stuck or has stopped committing offsets.
- Lag concentrated on one partition: a hot partition (skewed keys) or a single slow or stuck consumer instance.
Setting appropriate thresholds depends on your tolerance for processing delay:
Rule of thumb: Alert when lag exceeds what you can process in 1/3 of your retention period. If retention is 7 days (168 hours) and you process 10,000 msg/hour, alert around 560,000 lag.
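Applying the rule literally: a third of a 7-day retention period is 56 hours, and 56 hours of processing capacity sets the threshold:

```shell
retention_hours=$(( 7 * 24 ))   # 7-day retention = 168 hours
rate_per_hour=10000             # messages processed per hour
threshold=$(( retention_hours / 3 * rate_per_hour ))
echo "alert threshold: $threshold messages of lag"
```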
When troubleshooting, you often need to understand the topic structure—how many partitions, replication factor, and configuration:
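Two commands cover this; a sketch assuming the same illustrative pod and a topic named `orders`:

```shell
# Partition count, replication factor, and per-partition Leader/Replicas/Isr
kubectl exec -n kafka my-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --topic orders

# Topic-level configuration overrides (retention, cleanup policy, ...)
kubectl exec -n kafka my-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-configs.sh --bootstrap-server localhost:9092 \
  --describe --entity-type topics --entity-name orders
```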
Key information:

- `PartitionCount` and `ReplicationFactor` summarize the topic as a whole.
- Each partition line shows its `Leader` broker, the full `Replicas` list, and the `Isr` (in-sync replicas) list.
- On a healthy topic, every broker in `Replicas` also appears in `Isr`; a shorter `Isr` list means a replica has fallen behind.
Under-replicated partitions are partitions where one or more replicas have fallen behind the leader. This indicates broker health issues:
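`kafka-topics.sh` can filter for these directly (pod and namespace names illustrative):

```shell
# Prints one line per under-replicated partition; empty output means none
kubectl exec -n kafka my-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
```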
Healthy output (no problems): the command prints nothing, because no partition matches the filter.

Unhealthy output: one line per affected partition, with an `Isr` list that is shorter than the `Replicas` list.

This shows:

- which partitions are affected, and
- which broker is struggling: the broker ID that appears in `Replicas` but is missing from `Isr`.
Diagnosing under-replication:

- Check whether the missing broker is down (`kubectl get pods` on a Strimzi cluster).
- If the broker is up, check it for disk pressure, saturated network links, or long GC pauses.
- Remember that a follower is removed from the ISR once it fails to keep up with the leader for longer than `replica.lag.time.max.ms`.
When debugging, you often need to see what's actually in a topic:
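A sketch using `kafka-console-consumer.sh`, again with illustrative names:

```shell
# Read the first five messages from the topic, then exit
kubectl exec -n kafka my-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic orders --from-beginning --max-messages 5
```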
Output: the five message values, followed by the tool's own summary line:
Processed a total of 5 messages
Useful options:

- `--from-beginning`: start from the earliest retained offset instead of the latest.
- `--max-messages N`: exit after reading N messages.
- `--partition N --offset M`: read from one specific partition and offset.
- `--property print.key=true` and `--property print.timestamp=true`: print message keys and timestamps alongside the values.
Output with keys and timestamps:
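A sketch of the keyed variant; the two commented output lines are illustrative, but the `CreateTime:` prefix is what the tool prints for timestamps:

```shell
kubectl exec -n kafka my-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic orders --from-beginning --max-messages 2 \
  --property print.timestamp=true \
  --property print.key=true \
  --property key.separator=" | "

# CreateTime:1714060800000 | order-1001 | {"orderId":"1001","status":"confirmed"}
# CreateTime:1714060801500 | order-1002 | {"orderId":"1002","status":"confirmed"}
```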
The Kafka ecosystem has specific error patterns. Understanding them speeds up debugging:

- `NotEnoughReplicasException`: a producer with `acks=all` can't reach the required number of in-sync replicas; look at broker health, not the producer.
- `CommitFailedException`: the consumer spent longer than `max.poll.interval.ms` between polls and was ejected from the group; its work may be reprocessed.
- `OffsetOutOfRangeException`: the consumer requested an offset that retention has already deleted; `auto.offset.reset` decides what happens next.
- `LEADER_NOT_AVAILABLE`: a leader election is in progress; usually transient, and worrying only if it persists.
When a consumer is falling behind, use this systematic approach:
Step 1: Confirm the lag
Step 2: Check if lag is growing
Step 3: Check partition distribution
Step 4: Check consumer performance
Step 5: Scale if needed
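The five steps map onto commands like these (group, deployment, and label names illustrative; `kubectl top` assumes metrics-server is installed):

```shell
# 1. Confirm the lag
kubectl exec -n kafka my-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processor

# 2. Check if lag is growing: re-run after a minute and compare the LAG
#    column, or wrap the command in `watch -n 60`

# 3. Check partition distribution: in the output above, look for one
#    partition with far more lag than the rest, or a `-` CONSUMER-ID
#    (an unassigned partition)

# 4. Check consumer performance
kubectl logs -n kafka deploy/order-processor --tail=100
kubectl top pod -n kafka -l app=order-processor

# 5. Scale if needed (no benefit beyond the topic's partition count)
kubectl scale -n kafka deploy/order-processor --replicas=3
```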
Kafka exposes detailed metrics via JMX (Java Management Extensions). In production, you'll export these to Prometheus or another monitoring system.
Key broker metrics:

- `UnderReplicatedPartitions`: should be 0 at all times.
- `OfflinePartitionsCount`: partitions with no leader at all; should be 0.
- `ActiveControllerCount`: should be exactly 1 across the whole cluster.
- `MessagesInPerSec` and `BytesInPerSec`: throughput baselines for spotting anomalies.

Key consumer metrics:

- `records-lag-max`: the worst lag across the consumer's assigned partitions.
- `fetch-rate` and `records-consumed-rate`: whether the consumer is actually pulling data.
- `commit-latency-avg`: how long offset commits are taking.
Strimzi provides built-in support for Prometheus metrics. Enable them in your Kafka resource:
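The relevant fragment of the `Kafka` custom resource looks like this, assuming a ConfigMap named `kafka-metrics` with key `kafka-metrics-config.yml` (the names Strimzi's example files use):

```yaml
spec:
  kafka:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
```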
Then create the metrics ConfigMap:
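A minimal ConfigMap sketch with a single illustrative JMX-exporter rule; Strimzi's example repository ships a much fuller rule set you should start from:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-metrics
  namespace: kafka
data:
  kafka-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
      - pattern: "kafka.server<type=(.+), name=(.+)><>Value"
        name: kafka_server_$1_$2
```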
When alerts fire, you need clear steps. Here's a template runbook:
Alert: Consumer Lag Critical (> 10,000)

1. Confirm the lag with `kafka-consumer-groups.sh --describe` and note which partitions are affected.
2. Verify the consumer pods are running and the group has active members (no `-` in the CONSUMER-ID column).
3. Check consumer logs for errors and the pods for CPU or memory pressure.
4. If consumers are healthy but slow, scale the deployment, up to the topic's partition count.

Alert: Under-Replicated Partitions > 0

1. List affected partitions with `kafka-topics.sh --describe --under-replicated-partitions`.
2. Identify the broker missing from the `Isr` lists and check whether its pod is running.
3. If the broker is up, inspect its disk, network, and GC behavior.
4. Once the broker recovers, replicas rejoin the ISR automatically; don't reassign partitions as a first response.
You built a kafka-events skill in Chapter 1. Test and improve it based on what you learned.
Ask yourself:
If you found gaps:
Setup: You're on-call and receive an alert about your Kafka cluster.
Prompt 1: Interpret monitoring output
What you're learning: AI helps identify asymmetric lag patterns—partition 2 is significantly behind, suggesting either a hot partition, slow message processing, or an issue with the consumer assigned to it.
Prompt 2: Build a troubleshooting checklist
What you're learning: AI walks through ISR mechanics and helps you understand why this error occurs (at least one broker is not in sync), plus diagnostic steps for each scenario.
Prompt 3: Design alerting for your system
What you're learning: AI collaborates on translating business SLAs into technical alert thresholds, showing how to differentiate alert severity based on topic priority and processing requirements.
Safety note: When running diagnostic commands on production Kafka clusters, use read-only commands (--describe, --list) rather than commands that modify state. Never reset consumer offsets or delete topics without understanding the implications.