Event-Driven Backbones with Kafka
When to reach for Kafka, what the outbox pattern really buys you, partitioning for ordering, and the things that break in year two.
Kafka isn't a queue. It's a durable, replayable log. If you internalize that one idea, most of the common mistakes go away.
When to reach for it
You want an event bus when services should be decoupled in time (producers never block on consumers), when you need to rebuild read models, or when you need a clean audit trail of state transitions. You don't want one just because your architecture diagram needs a box with arrows coming out of it.
The outbox pattern, honestly
The outbox pattern looks like extra machinery until the first time it saves you. The domain DB and an outbox row commit atomically. A relay publishes those rows to Kafka, marking them as sent. Consumers are idempotent and dedupe by event id.
This gives you exactly-once effects without distributed transactions. The atomic commit is local; the rest is careful retries and dedupe.
The operational cost is real: a small table that can grow, a relay to run, and idempotency to enforce everywhere. It's worth it.
Partitioning for ordering
Ordering in Kafka is per-partition. Partition by whatever must be ordered together — usually an aggregate id (orderId, userId, merchantId). Don't partition by timestamp unless you want to learn about hot partitions the hard way.
Things that break in year two
- Hot partitions. One merchant is 50x the rest. You need a plan — partition splitting, virtual partitions, or keyed outbox overrides.
- Consumer lag. Measure it, alert on it, and make sure consumers scale horizontally without losing ordering.
- Schema drift. Use a schema registry with backward-compatible evolution. Breaking changes are rolled out behind a version gate.
- Replay tooling. The log is replayable in theory; make it replayable in practice with CLIs and safe defaults.
What I pick
For most teams on AWS or GCP, I reach for managed Kafka (MSK, Confluent Cloud) with an outbox relay in each service, Avro or Protobuf schemas in a registry, and per-aggregate partitioning. Simple, boring, and it scales.