Architecture January 5, 2026 · 10 min

Apache Kafka in production: what the tutorials leave out

We run Kafka for event-driven systems processing 2M+ events/day. The tutorials teach producers and consumers. Production teaches everything else.


The first time we deployed Kafka, it took 30 minutes to get a producer and consumer working. It took 3 months to get the operational story right: partition strategies, consumer lag monitoring, schema evolution, exactly-once semantics in practice, and the realization that Kafka is infrastructure that requires the same care as a database.

We run Kafka for two client systems -- a logistics platform processing 2M+ delivery events per day and a healthcare system streaming patient data updates across 6 downstream services. The lessons below come from running these systems for 14+ months.

The problem Kafka solves

Kafka is not a message queue. It's a distributed commit log. The distinction matters:

  • A message queue (RabbitMQ, SQS) delivers a message to one consumer and discards it. Good for task distribution.
  • A commit log (Kafka) appends events to a durable, ordered stream. Multiple consumers read the same stream independently, at their own pace. Good for event sourcing, audit logs, and decoupling services.

If you need one service to process a task, use a queue. If you need multiple services to react to the same event independently -- billing, notifications, analytics, audit -- use Kafka.

Our architecture

Delivery Service ---produces---> [order.created] ---consumes---> Billing Service
                                                 ---consumes---> Notification Service
                                                 ---consumes---> Analytics Pipeline
                                                 ---consumes---> Audit Log Writer

Patient Portal ---produces---> [patient.updated] ---consumes---> Care Plan Service
                                                  ---consumes---> Scheduling Service
                                                  ---consumes---> Compliance Logger

Each arrow is an independent consumer group. If the analytics pipeline goes down for an hour, it resumes from its last offset when it comes back. No events are lost. Other consumers are unaffected.
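The offset mechanics behind this are easy to model in a few lines of plain Python. This is a toy sketch of the idea, not the broker's actual implementation: one append-only log, with a separately committed read position per consumer group.

```python
class Topic:
    """Toy model of a Kafka topic: an append-only log plus an
    independently committed offset per consumer group."""

    def __init__(self):
        self.log = []        # the durable, ordered stream
        self.offsets = {}    # group.id -> next offset to read

    def produce(self, event):
        self.log.append(event)

    def poll(self, group, max_events=10):
        start = self.offsets.get(group, 0)
        batch = self.log[start:start + max_events]
        self.offsets[group] = start + len(batch)   # commit the new offset
        return batch


topic = Topic()
for i in range(3):
    topic.produce({"order_id": f"o-{i}"})

topic.poll("billing-service")        # billing reads all three events
topic.produce({"order_id": "o-3"})   # a new event arrives meanwhile
late = topic.poll("analytics")       # analytics was "down" until now; it
                                     # still gets every event from offset 0
```

Because each group commits its own offset, the analytics outage in the example above costs it nothing: it reads all four events once it polls again, and billing's position is untouched.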

What the tutorials teach

import json

from confluent_kafka import Producer, Consumer

# Producer
producer = Producer({"bootstrap.servers": "kafka:9092"})
producer.produce("order.created", key=order_id.encode(), value=json.dumps(event).encode())
producer.flush()

# Consumer
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "billing-service",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["order.created"])
while True:
    msg = consumer.poll(1.0)
    if msg and not msg.error():
        process(json.loads(msg.value()))

This works for a tutorial. Production requires everything below.

What production teaches

1. Partition count is a one-way door

You can increase partitions. You cannot decrease them without recreating the topic. And partition count determines your maximum consumer parallelism.

We start with 12 partitions for high-throughput topics and 3 for low-throughput ones. The logistics platform's delivery.status topic started with 6 partitions. At 500K events/day, we needed 12. The repartitioning required rebalancing all consumer groups and caused a 20-minute processing delay.
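The reason repartitioning is disruptive is that the producer routes each key with hash(key) mod partition count, so changing the count reassigns keys. Here is an illustrative sketch using CRC32 as a stand-in for Kafka's actual murmur2-based default partitioner; the principle is the same:

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    # CRC32 stands in for Kafka's murmur2 default partitioner here;
    # the routing principle -- hash(key) mod partition count -- is identical.
    return zlib.crc32(key) % num_partitions


keys = [f"order-{i}".encode() for i in range(1000)]
before = {k: partition_for(k, 6) for k in keys}
after = {k: partition_for(k, 12) for k in keys}

# Many keys land on a different partition after the bump, which is why
# per-key ordering guarantees effectively reset when you repartition.
moved = sum(1 for k in keys if before[k] != after[k])
```

This is also why picking a generous partition count up front is cheaper than growing it later: extra partitions cost little, but remapping live keys costs ordering and a rebalance.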

2. Schema registry is not optional

Producers and consumers evolve independently. Without schema enforcement, a producer can change the event format and break every consumer silently.

We use Confluent Schema Registry with Avro schemas. Every topic has a registered schema. Producers validate against it before publishing. Consumers deserialize with it. Schema evolution rules (backward, forward, full) prevent breaking changes.

{
  "type": "record",
  "name": "OrderCreated",
  "namespace": "com.client.events",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "total", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"},
    {"name": "created_at", "type": "long", "logicalType": "timestamp-millis"}
  ]
}

The currency field was added 6 months after launch. Because it has a default value, existing consumers continued working without modification. This is backward-compatible schema evolution, and it only works if you enforce it from day one.
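What "backward-compatible" means mechanically: a reader using the new schema fills any field missing from an old record with that field's declared default. The toy resolver below illustrates the rule with a trimmed-down copy of the schema above; in production this is handled by the Avro library and Schema Registry, not hand-rolled code.

```python
import json

# Trimmed copy of the OrderCreated schema, after the currency field was added.
ORDER_CREATED_V2 = json.loads("""
{
  "type": "record",
  "name": "OrderCreated",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "total", "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"},
    {"name": "created_at", "type": "long"}
  ]
}
""")


def resolve(record: dict, reader_schema: dict) -> dict:
    """Simplified Avro-style schema resolution: fields absent from an
    old record are filled from the reader schema's defaults."""
    out = {}
    for field in reader_schema["fields"]:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value or default for {field['name']}")
    return out


# An event produced before the currency field existed:
old_event = {"order_id": "o-17", "total": 49.50, "created_at": 1736000000000}
resolve(old_event, ORDER_CREATED_V2)["currency"]   # "USD"
```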

3. Idempotent consumers are mandatory

Kafka guarantees at-least-once delivery. Network retries, consumer rebalances, and offset commit timing can all result in duplicate delivery. Every consumer must handle duplicates gracefully.

Our pattern: every event has a UUID. Consumers maintain a processed-events set in Redis with a 24-hour TTL. Before processing, check the set. After processing, add to the set. Simple, effective, and catches 99.9% of duplicates.
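A minimal sketch of the pattern, with an in-memory dict standing in for Redis (the class name and `handle` wrapper are illustrative, not our actual code):

```python
import time


class DedupeStore:
    """In-memory stand-in for the Redis processed-events set. In Redis
    this is a single atomic SET <event_id> 1 NX EX 86400 per event."""

    def __init__(self):
        self._expiry = {}   # event_id -> expiry timestamp

    def first_seen(self, event_id: str, ttl: float = 86400.0) -> bool:
        now = time.monotonic()
        # Drop entries whose TTL has elapsed.
        self._expiry = {k: t for k, t in self._expiry.items() if t > now}
        if event_id in self._expiry:
            return False            # duplicate within the TTL window
        self._expiry[event_id] = now + ttl
        return True


def handle(event: dict, store: DedupeStore) -> bool:
    if not store.first_seen(event["event_id"]):
        return False                # already processed; skip
    # ... actual processing (charge the order, send the email, ...) ...
    return True
```

One design note: Redis's SET with NX and EX makes the check and the mark a single atomic operation, which closes the race window that a separate check-then-add would leave open between two consumer instances.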

4. Monitor consumer lag religiously

Consumer lag -- the gap between the latest produced offset and the consumer's committed offset -- is the most important Kafka metric. If lag grows, your consumer is falling behind. If it grows unboundedly, you'll eventually run out of retention and lose events.

We alert on:

  • Lag > 10,000 events for 5 minutes (warning)
  • Lag > 100,000 events for 5 minutes (critical)
  • Lag growth rate > 1,000 events/minute sustained (critical -- consumer is slower than producer)

Prometheus's kafka_consumergroup_lag metric (via kafka-exporter) feeds these alerts.
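The thresholds above reduce to a small decision function. This sketch shows the classification logic only; the "for 5 minutes" hold time lives in the Prometheus alerting rule, not in code:

```python
def alert_level(total_lag: int, growth_per_min: float = 0.0) -> str:
    """Map a consumer group's total lag and sustained growth rate to the
    alert tiers listed above."""
    if total_lag > 100_000 or growth_per_min > 1_000:
        return "critical"   # backlog is huge, or consumer can't keep up
    if total_lag > 10_000:
        return "warning"
    return "ok"
```

The growth-rate clause matters more than the absolute thresholds: a lag of 5,000 that grows by 2,000 events/minute is a worse sign than a stable lag of 50,000.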

5. Compacted topics for entity state

Not every topic is an event stream. For entity state (the current state of a patient record, the current status of a delivery), use log compaction. Kafka retains only the latest value per key, giving you a materialized view of entity state that consumers can bootstrap from.

Our healthcare system uses compacted topics for patient records. When a new service joins the system, it reads the compacted topic from the beginning and builds its local state in minutes rather than replaying months of events.
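What a bootstrapping consumer does with a compacted topic is a simple fold: keep the latest value per key. A sketch (with hypothetical patient events for illustration):

```python
def bootstrap_state(events):
    """Fold a (key, value) stream into latest-value-per-key state --
    what a consumer gets by reading a compacted topic from the
    beginning. A None value is a tombstone: compaction eventually
    removes that key entirely."""
    state = {}
    for key, value in events:
        if value is None:
            state.pop(key, None)    # tombstone: forget this key
        else:
            state[key] = value      # last write wins
    return state


stream = [
    ("patient-1", {"status": "admitted"}),
    ("patient-2", {"status": "admitted"}),
    ("patient-1", {"status": "discharged"}),
    ("patient-2", None),                      # tombstone
]
bootstrap_state(stream)   # {"patient-1": {"status": "discharged"}}
```

Compaction runs the same fold broker-side, so by the time a new service reads the topic, most superseded records are already gone and the bootstrap is close to one read per entity.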

The tradeoffs

  • Operational weight. Kafka is a distributed system. ZooKeeper (or KRaft in newer versions), brokers, partition replication, ISR management -- each is a failure domain. We allocate 4-6 hours/month for Kafka maintenance across our deployments.
  • Not suitable for request-reply. Kafka is asynchronous by nature. If service A needs a synchronous response from service B, use gRPC or HTTP. Kafka handles event propagation, not RPC.
  • Cost at small scale. A 3-broker Kafka cluster is overkill for processing 100 events/day. For low-throughput systems, Redis Streams or even PostgreSQL's LISTEN/NOTIFY are simpler alternatives.
  • Consumer group rebalancing. When consumers join or leave, Kafka rebalances partitions. During rebalancing, processing pauses. For latency-sensitive systems, this pause (5-30 seconds) matters.

Our recommendation

Use Kafka when you have multiple independent consumers that need to react to the same events, when you need event replay capability, or when your system produces more than 10,000 events/day and needs durable, ordered delivery.

For simple task queues, use Redis or RabbitMQ. For low-throughput event passing between 2-3 services, use PostgreSQL LISTEN/NOTIFY or Redis Streams. Kafka earns its operational cost at scale and when the event-driven architecture pattern -- multiple consumers, replay, audit -- is genuinely needed.

Budget for operating it. This is infrastructure, not a library.

CommitX Technology (OPC) Pvt Ltd