Your producer from Chapter 6 uses acks=all to wait for all in-sync replicas. But what does "in-sync" actually mean? How many replicas must acknowledge before Kafka considers a write durable? And when you're pushing thousands of events per second, how do you balance durability with the throughput your Task API needs?
These questions reveal a deeper configuration layer beneath the producer settings you've already learned. Kafka's reliability isn't just about producer acknowledgments---it's about how the cluster maintains replicas, when it considers them synchronized, and what happens when brokers fall behind or fail. Understanding these mechanisms lets you configure clusters that match your exact durability and performance requirements.
In this chapter, you'll configure min.insync.replicas for production durability, understand how ISR tracking affects your producer's success, and tune batching parameters to optimize the latency-throughput trade-off. By the end, you'll know exactly how to configure Kafka for scenarios ranging from "never lose a message" to "maximize throughput with acceptable latency."
Every Kafka partition has a leader and zero or more follower replicas. The leader handles all reads and writes. Followers replicate data from the leader to provide redundancy.
A replica is considered in-sync when it meets both of these conditions: the broker hosting it is alive and maintaining an active session with the cluster controller, and it has caught up to the leader's log end offset within the configured lag window.
The broker setting replica.lag.time.max.ms (default: 30 seconds) controls the lag threshold. If a follower doesn't fetch from the leader within this window, it's removed from the ISR.
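The rule is easy to state as code. Here's a minimal Python sketch of the lag check, a simplification of what the broker actually tracks internally:

```python
REPLICA_LAG_TIME_MAX_MS = 30_000  # broker default for replica.lag.time.max.ms

def is_in_sync(last_caught_up_ms: int, now_ms: int,
               max_lag_ms: int = REPLICA_LAG_TIME_MAX_MS) -> bool:
    """A follower stays in the ISR only if it has caught up to the
    leader's log end within the last max_lag_ms milliseconds."""
    return (now_ms - last_caught_up_ms) <= max_lag_ms

# A follower that caught up 5 seconds ago is still in sync...
print(is_in_sync(last_caught_up_ms=95_000, now_ms=100_000))  # True
# ...but one lagging by 45 seconds is removed from the ISR.
print(is_in_sync(last_caught_up_ms=55_000, now_ms=100_000))  # False
```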
When your producer sends a message with acks=all, it waits for acknowledgment from all replicas in the ISR, not all replicas in the cluster. This distinction is critical:
When ISR shrinks to just the leader, acks=all provides no more durability than acks=1. Your data survives only on one broker.
The min.insync.replicas setting prevents this degradation by requiring a minimum ISR size for writes to succeed.
To tolerate N broker failures while still accepting writes, you need: replication.factor >= min.insync.replicas + N. The common production baseline of replication.factor=3 with min.insync.replicas=2 tolerates one broker failure without blocking producers.
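As a quick sanity check, the arithmetic can be expressed as a tiny Python helper (illustrative only):

```python
def tolerable_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Brokers that can fail while acks=all writes still succeed."""
    return replication_factor - min_insync_replicas

# replication.factor=3, min.insync.replicas=2: one broker can fail.
print(tolerable_failures(3, 2))  # 1
# replication.factor=3, min.insync.replicas=3: any single failure blocks writes.
print(tolerable_failures(3, 3))  # 0
```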
You can set min.insync.replicas at three levels:
1. Broker-wide default (kafka-cluster.yaml with Strimzi):
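A sketch of the broker-wide default in a Strimzi `Kafka` custom resource (the cluster name `my-cluster` is a placeholder, and only the relevant fields are shown):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
```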
2. Topic-specific override (KafkaTopic CRD):
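A sketch of a per-topic override via the `KafkaTopic` CRD (the topic name `task-events` and cluster label are illustrative):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: task-events
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 3
  replicas: 3
  config:
    min.insync.replicas: 2
```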
3. Dynamic topic configuration (kubectl or Kafka CLI):
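The same override can be applied at runtime with `kafka-configs.sh`, for example by exec-ing into a broker pod (pod and topic names are illustrative):

```
# Apply the override dynamically
kubectl exec -it my-cluster-kafka-0 -- bin/kafka-configs.sh \
  --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name task-events \
  --alter --add-config min.insync.replicas=2

# Verify it took effect
kubectl exec -it my-cluster-kafka-0 -- bin/kafka-configs.sh \
  --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name task-events \
  --describe
```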
The --describe output lists the dynamic override, confirming that min.insync.replicas=2 is active for the topic.
When ISR shrinks below min.insync.replicas, producers using acks=all receive a NOT_ENOUGH_REPLICAS error, surfaced in the Java client as NotEnoughReplicasException.
This error means Kafka is protecting your data by refusing to accept writes that can't meet your durability requirements.
Your configuration: replication.factor=3, min.insync.replicas=2, acks=all (a typical production baseline). Scenario: one broker fails. The ISR shrinks to two replicas, which still satisfies the minimum, so writes continue. If a second broker fails, the ISR drops to one and Kafka rejects writes rather than accept data that would live on a single broker.
When you see NOT_ENOUGH_REPLICAS errors, investigate the ISR state:
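One way to check, again exec-ing into a broker pod (pod and topic names are illustrative):

```
kubectl exec -it my-cluster-kafka-0 -- bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe --topic task-events
```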
Output (healthy): every partition's Isr column lists all three brokers, matching its Replicas column (for example, Replicas: 0,1,2 Isr: 0,1,2). Output (degraded, Broker 2 is slow): Broker 2 still appears under Replicas but has dropped out of every Isr column (Replicas: 0,1,2 Isr: 0,1).
Notice Broker 2 is missing from all ISR lists. Common causes include an overloaded broker (saturated CPU or disk I/O), network problems between brokers, long garbage-collection pauses, and a broker that recently restarted and is still catching up.
What happens when the leader fails and no in-sync replicas exist?
By default, Kafka waits for an ISR member to come back online. This maintains data consistency but means the partition is unavailable for writes.
The setting unclean.leader.election.enable changes this behavior: when true, Kafka may elect an out-of-sync replica as the new leader. Writes resume immediately, but any messages the stale replica never copied are permanently lost. The default is false.
When is unclean election acceptable? For workloads where availability outweighs completeness, such as metrics, logs, or click-stream data, losing a short window of messages may be a fair price for staying writable. When is unclean election dangerous? For anything where a lost or reordered record causes real harm: payments, audit trails, or state-changing events like your task lifecycle stream.
Strimzi configuration:
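In Strimzi this lands in the `Kafka` custom resource's broker config (fragment shown; in production, keep the default of `false`):

```yaml
spec:
  kafka:
    config:
      unclean.leader.election.enable: false
```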
Here's a configuration comparison showing how settings typically differ between environments (the values are common choices, not mandates):

| Setting | Development | Production |
| --- | --- | --- |
| default.replication.factor | 1 | 3 |
| min.insync.replicas | 1 | 2 |
| unclean.leader.election.enable | true | false |
| Producer acks | 1 | all |
Strimzi development configuration:
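A development-flavored fragment of the `Kafka` custom resource might look like this (single-broker setup, favoring simplicity; values illustrative):

```yaml
spec:
  kafka:
    config:
      default.replication.factor: 1
      min.insync.replicas: 1
      unclean.leader.election.enable: true
```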
Strimzi production configuration:
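And a production-flavored fragment (three-way replication, strict durability; values illustrative):

```yaml
spec:
  kafka:
    config:
      default.replication.factor: 3
      min.insync.replicas: 2
      unclean.leader.election.enable: false
```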
Beyond durability, you'll often need to optimize for latency or throughput. Kafka's producer batching settings control this trade-off.
Your producer doesn't send each message immediately. It batches messages to reduce network overhead.
Two settings control when a batch is sent: linger.ms, how long the producer waits for more records before sending a batch (default 0, meaning send as soon as possible), and batch.size, the maximum number of bytes buffered per partition batch (default 16384).
A batch is sent when either condition is met: linger.ms expires OR batch.size is reached.
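The either/or trigger can be sketched in a few lines of Python (a simplification; the real producer also responds to buffer pressure and flush calls):

```python
def should_send(batch_bytes: int, waited_ms: float,
                batch_size: int = 16_384, linger_ms: float = 5.0) -> bool:
    """A batch is flushed as soon as it is full OR it has lingered long enough."""
    return batch_bytes >= batch_size or waited_ms >= linger_ms

print(should_send(batch_bytes=16_384, waited_ms=1.0))  # True  (size reached)
print(should_send(batch_bytes=4_096, waited_ms=5.0))   # True  (linger expired)
print(should_send(batch_bytes=4_096, waited_ms=1.0))   # False (keep filling)
```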
For real-time event streaming where latency matters more than throughput:
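As producer properties, a low-latency profile might look like this (standard producer config names; values are a starting point, not a prescription):

```properties
acks=all
linger.ms=0
batch.size=16384
```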
Trade-off: More network requests, higher broker CPU usage, but sub-millisecond message latency.
For bulk event processing where throughput matters more than latency:
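A throughput-oriented profile might look like this (values illustrative):

```properties
acks=all
linger.ms=50
batch.size=131072
compression.type=lz4
```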
Trade-off: Messages wait up to 50ms before sending, but far fewer network requests and better compression efficiency.
You can observe batching behavior through producer metrics such as batch-size-avg, batch-size-max, and records-per-request-avg, exposed via JMX and through the metrics API most client libraries provide.
In a high-throughput scenario, expect batch-size-avg to approach your configured batch.size and records-per-request-avg to climb well above 1.
Larger average batch sizes mean better network efficiency.
You've learned the individual settings, but how do they combine for your specific use case? Let's work through a realistic scenario.
The situation: your Task API emits two types of events with very different requirements: task lifecycle events, which must never be lost, and task view events, which arrive in high volume and are far less critical.
Initial approach:
For the lifecycle events, you might start with:
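A sketch of that starting point (standard producer and topic config names; values illustrative):

```properties
# Producer config
acks=all
enable.idempotence=true
# Topic config (replication.factor=3 at creation)
min.insync.replicas=2
```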
Questioning the configuration: does every event type really need the same guarantees? Consider these factors: the relative volume of each event type, how sensitive each is to latency, the business cost of losing a single message, and the broker resources each durability level consumes.
Evaluating alternatives:
If view events are truly non-critical, they could use a lower-durability configuration:
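One possible lower-durability profile for a separate view-events topic (illustrative):

```properties
# Producer config
acks=1
linger.ms=50
compression.type=lz4
# Topic config
min.insync.replicas=1
```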
What emerged from this analysis: one configuration cannot serve both workloads well. Splitting the event types into separate topics lets lifecycle events keep strict durability while view events trade some durability for throughput and lower cost.
You built a kafka-events skill in Chapter 1. Test and improve it based on what you learned.
Ask yourself: Does the skill explain how min.insync.replicas interacts with acks=all? Would it help you diagnose a NOT_ENOUGH_REPLICAS error? Does it cover the linger.ms versus batch.size trade-off?
If you found gaps, add the durability formulas, ISR diagnostics, and batching guidance from this chapter to the skill.
Apply what you've learned to configure reliability for your specific scenarios.
Setup: Open Claude Code or your preferred AI assistant with your Kafka project context.
Prompt 1: Design Your Durability Strategy
What you're learning: Mapping business requirements to Kafka durability configurations. Different event types warrant different trade-offs between durability and performance.
Prompt 2: Diagnose an ISR Problem
What you're learning: Diagnosing production ISR issues and understanding the relationship between ISR size and write availability. This error pattern is common during broker maintenance or resource contention.
Prompt 3: Optimize Latency-Throughput Trade-off
What you're learning: Tuning producer batching for specific workload patterns. The same producer settings that work for real-time events can waste resources for batch processing, and vice versa.
Safety Note: Configuration changes to min.insync.replicas and unclean.leader.election.enable affect data durability. Test changes in a development cluster first, and understand that increasing min.insync.replicas can cause write failures if your cluster doesn't have enough healthy brokers.