Your Chapter 6 producer successfully sent messages to Kafka. But successful delivery to the broker's memory is different from durable delivery to disk with replication. When your Task API creates a task and publishes a task.created event, what happens if a broker crashes before replicating the message? What if the network drops during transmission and your producer retries?
Production event-driven systems handle millions of events daily. A 0.1% message loss rate means losing 1,000 events per million. For critical workflows---payment processing, inventory updates, audit logs---that's unacceptable. This lesson covers the reliability configurations that separate prototype producers from production-grade ones.
We'll examine three critical dimensions: how many brokers must acknowledge your message (acks), how to prevent duplicate messages during retries (idempotent producer), and how to handle delivery failures gracefully. By the end, you'll configure producers that match your data's criticality.
When your producer sends a message, Kafka can acknowledge it at three different points in the durability spectrum. Each level trades latency for safety.
acks=0 (Fire and Forget)
The producer sends the message and immediately considers it delivered. It doesn't wait for any acknowledgment:
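A minimal sketch of a fire-and-forget configuration, assuming the confluent-kafka Python client (the broker address is an assumption):

```python
# Fire-and-forget: no broker acknowledgment is requested.
# This dict would be passed to confluent_kafka.Producer(...).
fire_and_forget_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "acks": "0",  # send and move on; the producer never learns the outcome
}
```

Passing this dict to `Producer(...)` maximizes throughput at the cost of silent loss.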
Problem: The message might have been lost in transmission, the broker might have crashed while receiving it, or the write to the partition log might have failed. You'll never know.
When it's acceptable: High-volume metrics collection where individual data points don't matter. If you're sending 10,000 sensor readings per second, losing a few readings is tolerable.
acks=1 (Leader Acknowledgment)
The producer waits for the partition leader to write the message to its local log:
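A sketch of the leader-only setting, again assuming the confluent-kafka client:

```python
# Leader-only acknowledgment: lower latency, loss possible on leader failover.
leader_ack_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "acks": "1",  # the leader appends to its local log, then acknowledges
}
```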
Problem: If the leader crashes before replicating to followers, the message is lost. A new leader is elected from the followers, but none of them has your message.
When it's acceptable: Analytics events, user activity tracking, recommendation system inputs---events where occasional loss doesn't break business logic.
acks=all (Full ISR Acknowledgment)
The producer waits for all in-sync replicas (ISR) to acknowledge the write:
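A sketch of the full-ISR setting (broker address and topic name are assumptions):

```python
# Full ISR acknowledgment: every in-sync replica must confirm the write.
full_isr_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "acks": "all",
}

# With the confluent-kafka client, the produce/flush flow would look like:
# producer = confluent_kafka.Producer(full_isr_config)
# producer.produce("task.created", key=b"task-123", value=b'{"id": "task-123"}')
# producer.flush()  # block until all outstanding deliveries complete
```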
Why it matters: Even if the leader crashes immediately after acknowledgment, followers have the message. The new leader (promoted from followers) won't lose data.
When to use: Payment events, order creation, inventory adjustments, audit logs---any event where loss causes business problems.
When you produce with acks=all, the delivery callback reports the full acknowledgment:

Output:
The offset (@ 42) confirms the message was durably written and replicated.
Here's what the acks settings mean in practice:

- acks=0: no acknowledgment; lowest latency; messages can disappear silently.
- acks=1: the leader's local write is confirmed; low latency; loss possible if the leader fails before replicating.
- acks=all: every in-sync replica confirms; highest latency; survives leader failure.
The latency difference between acks=1 and acks=all is typically 5-20ms---the time to replicate to followers. For most applications, this is negligible. The durability difference, however, is significant.
Decision Framework:
Ask yourself: "If this message is lost, what breaks?" If the answer is "nothing," acks=0 may suffice. If it's "analytics get slightly less accurate," acks=1 is reasonable. If it's "a customer is charged without a record," use acks=all.
Even with acks=all, network failures can cause duplicates. Here's the scenario: the producer sends a message, the broker writes it and replies with an acknowledgment, but the acknowledgment is lost on the network. The producer sees a timeout and retries, and the broker appends a second copy of the same message.
Kafka's idempotent producer prevents this by assigning each message a sequence number. The broker detects and deduplicates retries:
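A sketch of the idempotent configuration (broker address is an assumption):

```python
# Idempotent producer: the broker deduplicates the producer's internal retries.
idempotent_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "acks": "all",                 # required when idempotence is enabled
    "enable.idempotence": True,    # assign sequence numbers; broker rejects duplicates
}
```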
Output:
Even if the producer retried internally due to network issues, the partition contains exactly one copy of the message.
Under the hood, each idempotent producer is assigned a ProducerID, and every message carries a per-partition sequence number. The broker tracks the highest <ProducerID, SequenceNumber> pair it has written for each partition and rejects anything at or below it:
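The bookkeeping above can be illustrated with a toy simulation (class and field names are hypothetical; the real logic lives inside the broker):

```python
class PartitionLog:
    """Toy model of a partition that deduplicates by (producer_id, sequence)."""

    def __init__(self):
        self.messages = []
        self.last_seq = {}  # producer_id -> highest sequence number accepted

    def append(self, producer_id, sequence, value):
        # A retry re-sends the same sequence number; the broker rejects it
        # as a duplicate instead of appending a second copy.
        if self.last_seq.get(producer_id, -1) >= sequence:
            return "duplicate"
        self.last_seq[producer_id] = sequence
        self.messages.append(value)
        return "appended"

log = PartitionLog()
log.append("pid-1", 0, "task.created")  # first attempt is appended
log.append("pid-1", 0, "task.created")  # a network retry is detected as a duplicate
```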
Idempotence has constraints: it requires acks=all, retries must be enabled, and max.in.flight.requests.per.connection must be at most 5. It also deduplicates only the producer's internal retries within a single session; it won't catch your application calling produce() twice.
Configuration that violates requirements:
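For example, pairing idempotence with leader-only acks is self-contradictory, and the client should reject it at startup (exact error text varies by client version):

```python
# Contradictory: idempotence requires acks=all, but this asks for leader-only acks.
broken_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "enable.idempotence": True,
    "acks": "1",  # conflicts with idempotence; the client refuses this combination
}
```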
Output:
Even with acks=all and idempotence, messages can fail to deliver. Your producer must handle these failures gracefully.
Every produce() call should include a delivery callback:
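A minimal callback sketch for the confluent-kafka client (topic name is an assumption; the client invokes the callback from `poll()` or `flush()`):

```python
def delivery_callback(err, msg):
    """Invoked once per produced message with the delivery result."""
    if err is not None:
        # Delivery failed permanently: retries were exhausted or the error is fatal.
        print(f"Delivery FAILED: {err}")
    else:
        print(f"Delivered: {msg.topic()}[{msg.partition()}] @ {msg.offset()}")

# producer.produce("task.created", value=b"...", callback=delivery_callback)
# producer.poll(0)  # serve completed delivery reports without blocking
```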
Output (success):
Output (network failure):
The producer's retry behavior is controlled by several settings:
How timeouts interact:
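A sketch of the retry-related settings (values are illustrative, not prescriptive). The key interaction: the overall delivery budget must cover at least one full request attempt, or the producer can give up before a single send completes:

```python
retry_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "acks": "all",
    "enable.idempotence": True,  # retries won't create duplicates
    # Total time budget per message, covering queueing, retries, and in-flight requests:
    "delivery.timeout.ms": 120000,  # fail the delivery callback after 2 minutes
    "request.timeout.ms": 30000,    # per-request wait for a broker acknowledgment
    "retry.backoff.ms": 100,        # pause between retry attempts
}
# delivery.timeout.ms should be >= linger.ms + request.timeout.ms so that at
# least one complete request attempt fits inside the overall budget.
```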
For messages that fail permanently, implement a dead letter queue (DLQ):
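One possible DLQ sketch: wrap the failed message with failure context and publish it to a dead letter topic (the topic name and record fields are assumptions, not a fixed convention):

```python
import json
import time

def send_to_dlq(producer, err, msg, dlq_topic="task.created.dlq"):
    """Publish a permanently failed message, plus failure context, to a DLQ topic."""
    dlq_record = {
        "original_topic": msg.topic(),
        "original_value": msg.value().decode("utf-8", errors="replace"),
        "error": str(err),          # why delivery failed
        "failed_at": time.time(),   # when we gave up
    }
    producer.produce(dlq_topic, value=json.dumps(dlq_record).encode("utf-8"))
```

Called from the delivery callback on permanent failures, this preserves everything needed to investigate and replay the event later.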
Output (authorization failure):
The DLQ message contains all information needed to investigate and replay the failed event.
Here's the complete production-ready producer configuration:
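A sketch of what that configuration might look like, assuming the confluent-kafka client (broker address and compression choice are assumptions):

```python
production_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "acks": "all",                  # wait for all in-sync replicas
    "enable.idempotence": True,     # broker deduplicates internal retries
    "delivery.timeout.ms": 120000,  # total per-message time budget
    "request.timeout.ms": 30000,    # per-request broker wait
    "retry.backoff.ms": 100,        # pause between retries
    "compression.type": "lz4",      # optional; trades CPU for network bandwidth
}
```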
Output:
You've configured a reliable producer, but is it right for your specific use case? Let's explore how to refine the configuration.
Your initial request:
"I need to configure Kafka for my Task API. Tasks must never be lost."
Exploring requirements:
Before settling on configuration, consider these questions:
Evaluating the initial configuration:
The configuration we built uses acks=all with the idempotent producer. This provides: durability across broker failures, automatic retries on transient errors, and deduplication of those retries.
Questioning the approach:
For a Task API handling hundreds of events per second, consider:
Refining based on production context:
If your notification consumer can handle duplicates (checking task_id before sending), you might simplify:
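One possible simplification, under the assumption that consumers deduplicate by task_id, is to drop idempotence while keeping acks=all for durability (a judgment call, not a recommendation):

```python
simplified_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "acks": "all",  # still durable: all in-sync replicas must acknowledge
    # enable.idempotence omitted: retries may produce duplicates, which the
    # consumer tolerates by checking task_id before acting.
}
```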
If you need atomic writes across multiple topics (task-created, audit-log), you need transactions (covered in Chapter 12).
What emerged from this exploration:
You built a kafka-events skill in Chapter 1. Test and improve it based on what you learned.
Ask yourself:
If you found gaps:
Apply what you've learned by configuring producers for different scenarios.
Setup: Open Claude Code or your preferred AI assistant in your Kafka project directory.
Prompt 1: Analyze Your Use Case
What you're learning: Matching reliability configuration to business requirements. Different event types within the same application may need different producers.
Prompt 2: Debug a Delivery Failure
What you're learning: Understanding the relationship between producer acks, topic replication, and ISR. This error is common in production when brokers are unhealthy.
Prompt 3: Design Error Handling Strategy
What you're learning: Production error handling requires more than printing errors. You need observability, retry logic, and fallback mechanisms.
Safety Note: Always test producer configurations in a development environment before production. Incorrect timeout settings can cause message backlogs or premature failures during normal broker maintenance.