What Kafka Actually Is β Distributed Append-Only Log, Not a Message Queue
Almost every Kafka interview opens with the misconception that Kafka is "just another queue like RabbitMQ". It isn't. Kafka is a distributed, partitioned, replicated commit log with consumers that track their own position. That single architectural choice changes everything downstream.
Kafka is a distributed append-only log. Producers append records to a topic; the broker keeps every record on disk for a configured retention period (default 7 days). Multiple consumer groups can read the same topic independently, each tracking its own offset. Nothing is "consumed" or deleted on read.
RabbitMQ / SQS are queues: a message is routed to a consumer, acked, and removed. Once it's gone, no other consumer sees it.
| Property | Kafka | RabbitMQ / SQS |
|---|---|---|
| Data model | Append-only log | Queue / exchange |
| Storage | On disk, retained for days | In memory until acked |
| Multi-consumer | Each group reads independently | Each message goes to one consumer |
| Ordering | Per partition | FIFO per queue (varies) |
| Replay | Reset offset, re-read | Not natively supported |
| Throughput | Millions msg/sec | Tens of thousands |
Topics, Partitions & Offsets β The Three-Level Address
Every Kafka record has a three-part address: topic, partition, offset. If you can explain why all three are needed and what each one buys you, you've already cleared the first 5 minutes of any Kafka interview.
- Topic β a named stream of records. Logical grouping (e.g.,
orders,payments). - Partition β an ordered, immutable sequence of records inside a topic. A topic has N partitions; each partition lives on a broker. Partitioning is what makes Kafka horizontally scalable.
- Offset β a monotonically increasing 64-bit ID assigned to each record within a partition. Once written, the offset never changes.
Records are ordered within a partition, never across partitions. So if order matters (a user's events, a single order's lifecycle), they must all land on the same partition β usually by using the same key.
Topic: orders (4 partitions, replication factor 3) P0: [m0][m1][m2][m3]... offsets 0..3 P1: [m0][m1][m2]... offsets 0..2 P2: [m0][m1][m2][m3][m4][m5]... offsets 0..5 P3: [m0][m1]... offsets 0..1 producer.send(new ProducerRecord("orders", "order-42", payload)) // hash("order-42") % 4 β partition 2 // β broker that owns P2 β appended at next offset
The Producer β What Actually Happens When You Call send()
Most candidates think producer.send(record) sends one message over the network. It doesn't. There's a serializer, a partitioner, an in-memory buffer, a batching loop, an I/O thread, and a retry handler β all between your call and the broker. Knowing the path is what separates a junior from a senior.
linger.ms=0, batch.size=16384. "You're sending one message per request β there's no batching. Set linger to 5ms and watch throughput jump 10x."producer.send() and the message hitting the broker.- Serialize the key and value (StringSerializer, AvroSerializer, etc.).
- Partition β if you set a key, partition =
murmur2(key) % numPartitions. No key β sticky partitioner picks a partition and sticks with it for the current batch. - Append to in-memory record accumulator β one queue per partition. The
send()call returns a Future immediately. - I/O thread (Sender) drains accumulator batches when either:
batch.sizebytes filled, orlinger.mselapsed since the first record arrived. - Send to leader broker. The broker writes to the partition log, replicates per
acks, returns ack. - If the request fails (timeout, NotLeaderForPartition), the producer retries per
retries+retry.backoff.ms.
linger.ms=5 # wait up to 5ms to batch β single biggest throughput knob batch.size=65536 # max batch bytes per partition (64KB is a sane default) compression.type=lz4 # compress the batch β 3-5x throughput gain over none acks=all # durability β see section 4 enable.idempotence=true # de-dup retries β see section 10 max.in.flight.requests.per.connection=5
send() is non-blocking, returns a Future, and joins a batch. The I/O thread sends batches when full or after linger.ms. Throughput is mostly tuned by linger.ms, batch.size, and compression.type.acks=0, 1, all β The Durability Knob Most Devs Get Wrong
Three settings, three very different durability stories. The interviewer wants to know not just what each does, but the failure scenario each one allows.
acks control? Which would you use for a payment service?| acks | What gets confirmed | You can lose data when... | Latency |
|---|---|---|---|
0 | Nothing β fire and forget | Producer's network blip, broker crash, leader rebalance β anything | Lowest |
1 | Leader wrote it to its local log | Leader crashes before a follower replicates β the message is gone | Low |
all (-1) | Leader + every in-sync replica (ISR) wrote it | Only if every ISR fails simultaneously β practically never | Higher |
acks=all alone is not enough. You also need min.insync.replicas >= 2. With min.insync=1, "all ISR" can mean just the leader if every follower has fallen behind β and you're back to acks=1 guarantees without realizing it. See section 9.For a payment service: acks=all + min.insync.replicas=2 + enable.idempotence=true. You'll pay 1β2ms more latency, but you'll never lose a transaction record on a single broker failure.
For metrics / clickstream where one in a million dropped events is fine, acks=1 is reasonable. acks=0 is for "we'd rather drop the data than slow the producer" β IoT telemetry at the edge, that kind of thing.
all for anything financial; 1 for analytics; 0 only when you'd genuinely rather lose the data than block.Consumer Groups & Partition Assignment β How Kafka Scales Reads
Consumer groups are how Kafka turns a fan-out log into parallel processing. The rule is simple: each partition is read by exactly one consumer in a group. That single rule gives you both scale and ordering.
orders has 6 partitions. She deploys 3 consumer instances, all in group order-processor. Each instance gets 2 partitions. Black Friday hits β she scales to 6 instances; each gets 1 partition. She tries 12 instances β 6 get 1 partition each, the other 6 sit idle. Why?- A consumer group is a set of consumers that share a
group.id. - For each topic the group subscribes to, every partition is assigned to exactly one consumer in the group.
- If
numConsumers < numPartitions: some consumers get multiple partitions. - If
numConsumers == numPartitions: 1 partition each β maximum parallelism. - If
numConsumers > numPartitions: extras sit idle. Partitions are the upper bound on consumer parallelism.
Offset Commits β Auto vs Manual, and the Duplicate-Processing Bug
Offset management is where most consumer bugs live. The default of enable.auto.commit=true sounds convenient. It is also responsible for 9 out of 10 "we processed the order twice" production incidents.
enable.auto.commit the hard way.Auto-commit (enable.auto.commit=true): every auto.commit.interval.ms (default 5s), the consumer commits the latest polled offset β regardless of whether you finished processing those records. If you crash mid-batch, the offset has already moved past records you didn't actually process. You get at-most-once with potential data loss, or duplicates on the previous batch.
Manual commit with commitSync() or commitAsync() after processing: you commit only after the work is done. If you crash, you re-read and re-process. You get at-least-once with possible duplicates β the standard pattern.
props.put("enable.auto.commit", "false"); try (KafkaConsumer<String, String> c = new KafkaConsumer<>(props)) { c.subscribe(List.of("orders")); while (running) { ConsumerRecords<String, String> recs = c.poll(Duration.ofMillis(200)); for (var r : recs) processOrder(r); // idempotent! c.commitSync(); // commit AFTER processing } }
Rebalancing β Why Your Consumer Pauses Every Deploy
Rebalancing is the protocol that redistributes partitions when group membership changes β a consumer joins, leaves, or fails. It's also the source of the dreaded "stop-the-world" pause every team eventually hits.
Triggers: a consumer joins (deploy), leaves (shutdown), times out (session.timeout.ms with no heartbeat), or topic metadata changes (partition count grew).
Eager (default before 2.4): all consumers stop polling, give up all partitions, group coordinator reassigns, all consumers resume. Total stop-the-world. A 200-consumer group can pause for seconds.
Cooperative-sticky (default 3.0+): only the partitions that need to move are revoked; everyone else keeps polling. Massively reduces pause duration. Set partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor.
processOrder() that exceeds max.poll.interval.ms (default 5min) will get the consumer kicked out of the group β triggering a rebalance even though nothing else changed. Fix: shrink your batch (max.poll.records) or do work asynchronously and call poll() on schedule.Brokers, Replication & ISR β How Kafka Doesn't Lose Data
Replication is what makes Kafka durable, and ISR (in-sync replicas) is the concept that makes replication useful. Mess this up in an interview and the rest of the conversation gets harder.
Replication factor = how many copies of each partition exist across brokers. RF=3 means 3 copies on 3 different brokers.
Leader / Followers β for each partition, exactly one replica is leader. All reads and writes go to the leader. Followers pull from the leader to stay in sync.
ISR (In-Sync Replicas) β the subset of replicas that are caught up with the leader within replica.lag.time.max.ms (default 30s). A follower that falls behind is removed from ISR. The leader itself is always in ISR.
Leader failure: controller picks a new leader from the current ISR. If unclean.leader.election.enable=false (default since 2.5), and ISR is empty, the partition goes unavailable rather than electing a non-ISR replica that may have stale data.
min.insync.replicas β The Durability Triangle
This one config is the hidden third leg of the durability story. Without it, acks=all is a lie.
acks=all on the producer, do I have full durability?Not quite. acks=all means "wait for every replica currently in ISR". If two of your three replicas have fallen behind and dropped out of ISR, then "all ISR" = just the leader. You acked from one box. If that box dies before any follower catches up, you lose data.
min.insync.replicas=2 on the topic forces the producer to fail with NotEnoughReplicasException rather than silently downgrading durability. The producer can retry; meanwhile no record is acked unless at least 2 replicas have it.
# Topic config replication.factor=3 min.insync.replicas=2 # Producer config acks=all enable.idempotence=true # de-dup retries (section 10)
Idempotent Producer β What enable.idempotence=true Actually Does
Without idempotence, a network blip can turn one logical send into multiple identical records on the broker. Idempotence is what makes producer retries safe.
enable.idempotence=true is the one-line fix.Without idempotence, retries can produce duplicates: the broker received the original send, replied with an ack that got lost, the producer retried, the broker wrote the message a second time.
Idempotent producer adds three pieces:
- Producer ID (PID) β assigned by the broker on connect. Stable for the producer's lifetime.
- Sequence number per (PID, partition) β monotonically increasing.
- Broker-side de-dup window β broker tracks the highest sequence per PID per partition; rejects out-of-order or duplicate sequences.
Result: even if the producer retries the same batch 5 times, the broker writes it once.
Transactions & Exactly-Once Semantics β The Whole Truth
"Exactly-once" is the most-asked, most-misunderstood Kafka feature. It exists, it works, but it works in a specific way β and it requires the consumer to participate.
Yes β for the read-process-write loop within Kafka. End-to-end EOS to a non-Kafka sink (a database, an HTTP API) requires your application to make the sink-side write idempotent or transactional.
The transactional producer adds:
transactional.idβ a stable identifier across producer restarts. Kafka uses it to fence zombie producers.initTransactions(),beginTransaction(),send(),sendOffsetsToTransaction(),commitTransaction()β APIs that make a batch atomic across multiple topic-partitions.- Read-process-write atomicity: in one transaction, you read input from topic A, write output to topic B, and commit your input offsets β all atomically. Either everything happens or nothing does.
Consumers must set isolation.level=read_committed to skip records belonging to aborted transactions.
producer.initTransactions(); while (running) { ConsumerRecords<String,String> recs = consumer.poll(...); producer.beginTransaction(); for (var r : recs) { producer.send(new ProducerRecord("output", transform(r.value()))); } producer.sendOffsetsToTransaction(offsetsOf(recs), consumer.groupMetadata()); producer.commitTransaction(); // atomic: writes + offset commit }
Retention vs Log Compaction β Two Different Cleanup Strategies
Topics get cleaned up in one of two ways. Mixing them up will trip you up in any interview that goes past basics.
Time / Size Retention (default)
Records older than retention.ms (default 7 days) or beyond retention.bytes get deleted in segment-sized chunks.
Use for: events (orders, page views, audit log). Care about every record for a window, then it's garbage.
Log Compaction (cleanup.policy=compact)
For every key, keep only the latest record. Older records with the same key get tombstoned and eventually deleted.
Use for: state. "What is user X's profile right now?" The topic is a changelog β last write per key wins.
user-profiles topic, before compaction: [user-1, name=Aman] [user-2, name=Riya] [user-1, name=Aman Patel] // update [user-3, name=Karthik] [user-1, null] // tombstone β delete user-1 After compaction: [user-2, name=Riya] [user-3, name=Karthik] // user-1 fully removed after delete.retention.ms
KTable is backed by a compacted topic. Internal offset/group state in __consumer_offsets is also a compacted topic. If you see "compaction" in a Kafka design discussion, think state, not events.cleanup.policy=compact,delete.Partition Key Strategy β How Hot Partitions Are Born
The choice of partition key is the single biggest determinant of whether your Kafka cluster scales gracefully or melts down at peak.
customer_country as the partition key for her order topic. 80% of customers are in India. So 80% of records land on one partition. That partition's broker is at 95% CPU; the others are idle. Her consumer for that partition is 20 minutes behind. Diagnosis: hot partition.Hot partition = one partition receiving disproportionate traffic because the partition key has a skewed distribution. Symptoms: one consumer falls behind, one broker is hot, others are cool.
- Pick a high-cardinality key.
order_idovercountry.user_idoverregion. - Salt the key if cardinality is bounded. Append a random 0βN suffix to artificially split a hot key β at the cost of losing per-key ordering for that hot key.
- Custom partitioner. If your domain has known hot keys, route them to a sub-partition pool.
- Match key to ordering needs. If order matters per
order_id, useorder_id; don't pickcountryfor "scale" and then complain about ordering.
Message Ordering Guarantees β Per Partition, Never Across
"I need messages in order" is one of the most common requirements β and one of the most commonly mishandled in Kafka.
Kafka guarantees ordering within a single partition. Records appended to partition P0 are read in append order. Period.
Across partitions there is no global order. Two records in P0 and P1 may be read in either order by a consumer reading both.
So: if you need "events for order-42 in order", make sure they all go to the same partition. The producer does this automatically when you set key=order-42 β every record with that key is hashed to the same partition.
max.in.flight.requests.per.connection β The Subtle Ordering Trap
A producer config most people set to "5 (default)" and never think about β until it silently breaks per-key ordering on a retry.
retries=3 and max.in.flight.requests.per.connection=5, can my messages get out of order?Without idempotence: yes. With 5 in-flight batches and retries enabled, batch 1 can fail and be retried, while batches 2β5 already succeeded. Now batch 1's record arrives after batch 2's β out of order.
With enable.idempotence=true: no. The broker uses sequence numbers per (PID, partition) and rejects out-of-order writes; the producer waits and reorders. As of Kafka 3.0, idempotence allows up to 5 in-flight requests safely (older versions required β€ 5 only without idempotence; 1 with).
max.in.flight.requests.per.connection=1 β at a 5β10x throughput cost. The right answer is "leave idempotence on".enable.idempotence=true always. It gives you de-dup AND in-order retries with high in-flight concurrency. There's no reason to disable it.Consumer Lag β What It Is & What Causes Spikes
Consumer lag is THE health metric of any Kafka consumer. The interviewer wants to see if you understand both the math and the operational picture.
Lag = log_end_offset β consumer_offset, per partition. It's how many records the consumer is behind the producer's latest write.
Common spike causes:
- Producer burst β Black Friday, viral spike, retried fan-out from upstream.
- Slow downstream β DB call inside the consumer suddenly takes 200ms instead of 5ms.
- Rebalance β paused all consumers for a few seconds; backlog built up.
- Hot partition β one partition's lag spikes while others are fine; key skew (section 13).
- Consumer crash / GC pause β JVM full GC stalled poll() for 30s.
- Network throttling / quota hit β the broker is shaping your consumer.
Backpressure & Tuning β fetch.min.bytes, max.poll.records, Quotas
A handful of consumer configs control how Kafka shapes traffic. Knowing the right knobs differentiates ops experience from textbook knowledge.
| Config | Default | What it does |
|---|---|---|
fetch.min.bytes | 1 | Wait for at least this many bytes before returning a fetch. Bump to ~50KB to amortize round trips. |
fetch.max.wait.ms | 500 | ...but don't wait longer than this. Tune with fetch.min.bytes. |
max.poll.records | 500 | Max records returned per poll. Drop to 100 if processing is heavy and rebalance pauses bite. |
max.partition.fetch.bytes | 1MB | Per-partition response cap. Raise for large messages. |
max.poll.interval.ms | 5min | Max time between polls before group coordinator marks consumer dead. Bump for slow batches. |
session.timeout.ms | 10s | Heartbeat-based death detection; pair with heartbeat.interval.ms. |
producer_byte_rate, consumer_byte_rate) cap a client by user/clientID. Use them to prevent one bad team from saturating your shared cluster.Schema Registry & Avro β Schema Evolution Without Breaking Consumers
Once you have more than two services on a topic, schema becomes a contract. Schema Registry + Avro / Protobuf is how that contract is versioned and enforced.
Without schemas, every consumer parses raw bytes and prays the producer didn't change anything. Add a field, drop a field, rename β every downstream consumer breaks silently. Schema Registry stores versioned schemas keyed by topic-subject; producers serialize with the registered schema ID embedded in the message; consumers fetch the schema by ID to deserialize.
Compatibility modes (set per subject):
- BACKWARD (default) β new schema can read data written by the previous schema. Producers can upgrade; consumers stay on old schema.
- FORWARD β old schema can read data written by the new schema. Consumers can upgrade; producers stay on old.
- FULL β both directions. Adding optional fields with defaults is the safe lane.
- NONE β anything goes. Don't.
Kafka Streams vs Plain Consumer β When to Reach for Streams
Both read records from topics. Streams adds stateful operations, windowed joins, and exactly-once semantics β at the cost of state-store complexity.
Use a plain consumer when you're doing stateless work: read a record, transform it, write to a DB or call an API. No joins, no aggregations across records.
Use Kafka Streams when:
- You need to join two streams (orders β¨ payments).
- You need windowed aggregations (5-minute rolling sum of clicks).
- You need state and want it backed by Kafka (changelog topic + RocksDB local store).
- You want exactly-once processing across multi-step pipelines for free.
Streams primitives: KStream (event stream), KTable (changelog / current value per key), GlobalKTable (broadcast lookup table). State stores back KTables and aggregations; they're persisted as compacted topics so a restarted node can rebuild state.
Kafka Connect & Dead Letter Queues β Moving Data Without Writing Code
Connect is the official "ingest from / egest to external systems" framework. DLQs are the operational must-have nobody mentions in the docs.
Kafka Connect runs as a separate cluster of workers that host plug-in connectors. Source connectors pull from MySQL/Postgres/MongoDB/S3 into Kafka; sink connectors push from Kafka to Snowflake/Elasticsearch/JDBC. You configure them with JSON over a REST API; no code.
A poison pill is a record the connector cannot process β bad schema, parse error, FK violation. Without DLQ config, the connector retries forever and stalls the entire partition.
Configure errors.tolerance=all + errors.deadletterqueue.topic.name=my-connector-dlq and bad records get routed to a separate topic for manual triage; the main pipeline keeps moving.
ZooKeeper vs KRaft β Why Kafka Dropped ZK
A dated interview clichΓ© is "Kafka uses ZooKeeper". As of Kafka 3.3+, that's no longer the recommended setup. KRaft (Kafka Raft) is in.
ZooKeeper held controller election state and topic metadata. Operating two distributed systems (Kafka + ZK) was painful: separate config, separate JVM tuning, separate failure modes, slow metadata propagation.
KRaft bakes the same metadata-replication and controller-election protocol into Kafka itself, using the Raft consensus algorithm. A small set of broker nodes act as controllers (analogous to ZK ensemble); they replicate the metadata log among themselves; clients no longer need any external coordinator.
Benefits: faster controller failover, faster cluster start-up, simpler operations, supports far more partitions per cluster (millions vs ~200K).
Security β SASL, SSL, ACLs
For any production deployment, "who can do what" is mandatory. Three pieces fit together.
- SSL/TLS β encrypts data in flight between clients and brokers, and between brokers themselves. Set
security.protocol=SSLorSASL_SSL. - SASL β authenticates clients. Common mechanisms:
PLAIN(username/password over TLS),SCRAM-SHA-256/512(challenge-response),GSSAPI(Kerberos),OAUTHBEARER. - ACLs β authorize. Per principal (user) per resource (topic, group, cluster) per operation (Read, Write, Describe, Create). Stored in Kafka itself (KRaft) or ZK (legacy).
# Allow user "order-svc" to produce to topic "orders" kafka-acls --bootstrap-server kafka:9092 \ --add --allow-principal User:order-svc \ --operation Write --topic orders # Allow user "analytics" to consume any topic in group "analytics-group" kafka-acls --add --allow-principal User:analytics \ --operation Read --topic '*' --group analytics-group
Kafka vs RabbitMQ vs SQS β When to Pick What
A classic "compare and contrast" question. Don't just list features β pick a scenario and reason about it.
| Scenario | Pick | Why |
|---|---|---|
| High-throughput event streaming, multiple consumer groups, replay needed | Kafka | Append-only log, retained on disk, multi-consumer-group, millions msg/sec |
| Task queue with priority & complex routing | RabbitMQ | Exchange types (direct/topic/fanout/headers), priority queues, per-message TTL |
| Simple managed queue, AWS-native, no ops overhead | SQS | Zero ops, pay-per-message, FIFO mode for ordering |
| Event sourcing / CDC / stream processing | Kafka | Log compaction, Streams, Connect, schema registry |
| Low-volume, reliable per-message acks, dead-lettering by default | RabbitMQ / SQS | Per-message ack/nack with built-in DLQ. Kafka can do it but with offset gymnastics |
| Cross-region replay, audit log, compliance "show me yesterday's traffic" | Kafka | Long retention + replay is its native model |
Production Pitfalls β What Real Operators Trip Over
A grab-bag of issues that turn up in incident reports across every Kafka shop.
- auto.offset.reset surprises β set to
latestby default. New consumer groups skip history. Set toearliestif you want backfill. - Slow consumer triggers rebalance β work that exceeds
max.poll.interval.mskicks the consumer out. Either shrink the work, or do it async and call poll on schedule. - Topic with too many partitions β every partition is a file handle on every broker. 200K+ partitions/cluster gets painful (KRaft is better than ZK here).
- Compacted topic + tombstone never cleaned β
delete.retention.msdefault 24h. Setmin.cleanable.dirty.ratioproperly or compaction never runs. - JVM heap too big β Kafka brokers want 6β8GB heap, NOT 32GB. The OS page cache does the heavy lifting; a huge heap just lengthens GC pauses.
- Mixed message sizes β one 10MB message in a stream of 1KB messages will block the partition consumer for seconds. Use a separate topic for large blobs, or store in S3 and pass the URL.
- Insufficient disk on broker β Kafka grows linearly with retention Γ throughput.
log.retention.bytescaps it; without it, the broker fills the disk and goes read-only. - Cross-region producers without compression β every byte crosses an expensive WAN.
compression.type=zstdorlz4is mandatory.
Tricky Gotchas & Rapid-Fire Q&A
A grab-bag of the questions that come at the end of a Kafka round, usually paired with a smile and "let's see if you really know this stuff".
sendfile() from page cache straight to socket buffer (no userspace copy), batching at every layer, and reliance on OS page cache rather than a custom buffer pool. Disk is only "slow" for random I/O β Kafka does pure append.consumer.seek(partition, offset) after assignment. Or seekToBeginning() / seekToEnd(). This is how you implement "replay last hour" without changing groups.delete.topic.enable=true (default), the topic is marked for deletion and removed asynchronously by the controller. In-flight producer sends fail with UnknownTopicOrPartitionException. Always drain producers first.SyncGroupResponse. The consumer then calls onPartitionsAssigned and starts polling those partitions. On rebalance, it gets onPartitionsRevoked first, then a new assignment.poll() returning empty and a consumer being assigned no partitions?poll(timeout). No partitions = the group has more consumers than partitions; coordinator gave you nothing. Check consumer.assignment() to distinguish.num.standby.replicas>0) or fully idle. They take over on failure but contribute no processing capacity. Same partition cap as plain consumers.transactional.id implicitly enables enable.idempotence=true. The two work together; the question is whether you also need transactional atomicity beyond just retry-safety.transactional.id, the broker tracks an epoch per ID β when a new instance starts with the same transactional.id, it bumps the epoch and any send from the old (lower-epoch) producer is rejected with ProducerFenced. That's how exactly-once survives instance failures.seek + commitSync, which is how "replay" works.