Why is Kafka always late?

Beyond Basic Monitoring: Deep Diving into Lag and Theory

Stéphane Derosiaux
Oct 24, 2024

Summary

A quick Google search turns up plenty of good articles about tools to monitor Kafka, but I think it’s important to dig deeper. Here, I talk about the real concepts behind Kafka lag, the different “times” that affect data flow, and strategies for handling backpressure. Instead of focusing on monitoring tools, I want to explore why Kafka is always “late”.

  • Quick intro to Apache Kafka (why lag, where are the queues)
  • How do applications consume data? (partition strategies)
  • Multiple lags exist in Kafka (Little’s Law, Offset, Time)
  • The Multiple “Times” of Kafka (and end-to-end lag)
  • Can lag be stabilized? Can we backpressure?
  • How to achieve minimal lag?

Quick Intro to Apache Kafka

Kafka is many things. At its core, it’s a scalable real-time messaging platform capable of processing thousands of messages per second easily. Technically speaking, Kafka is a data exchange system based on a distributed, partitioned, replicated, immutable commit log. This might sound complex, but it essentially means Kafka provides a reliable way to work with streams of data efficiently.

Kafka consumers rely on a pull model to read data. They request data from Kafka brokers instead of having Kafka brokers continuously push data to them. This is why lag is generated.

Decoupling

One of the key features of Kafka is the decoupling of producers and consumers. Producers are applications that send data to Kafka, and consumers are applications that read data from Kafka. They don’t interact with each other directly; instead, they communicate through Kafka. This decoupling allows both producers and consumers to operate independently, scaling and evolving without affecting each other.

Replayability & Ordering

Kafka supports infinite storage and allows for the replayability of events with guaranteed ordering. This means you can store data as long as you need and reprocess it whenever necessary, maintaining the exact sequence of events. This capability is particularly interesting for auditing and compliance purposes.
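
For instance, replaying a whole topic is just a matter of rewinding offsets. A minimal sketch, assuming an already-configured KafkaConsumer called consumer (“audit-topic” is a placeholder name):

import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

consumer.subscribe(Collections.singletonList("audit-topic"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // nothing to do for this example
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        consumer.seekToBeginning(partitions); // rewind to the first available offset
    }
});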

Why Lag exists

However, this decoupling is precisely why lag exists in Kafka. Since producers and consumers operate independently, consumers might not always keep up with the rate at which producers send data. This gap between production and consumption is what we refer to as lag.

How Kafka partitions its work

Kafka’s architecture is designed to handle high-throughput, real-time data streams. To be able to do that, it organizes data into topics, which are further divided into partitions (1 topic can have between 1 and thousands of partitions). This allows Kafka to distribute data across multiple servers (brokers), enabling horizontal scalability and fault tolerance. If one broker fails, others can take over, ensuring continuous data flow without loss.

Diagram: https://kafkademy.com (by the author)

Offsets

By storing data on disk and replicating it across brokers, Kafka ensures durability and reliability. Consumers keep track of their position in each partition using offsets, which are unique identifiers for each message. If a consumer restarts or fails, it can resume processing from the last known offset, ensuring no data is missed.

Queues for Kafka

An exciting initiative is KIP-932, which introduces the concept of share groups to Kafka: basically, queues. Multiple consumers will be able to consume the same topic without requiring exclusive access to specific partitions, and messages will be acknowledged individually instead of relying on offset commits.

It’s perfect for the classic worker-processing architectures, and we’ll be able to have more consumer instances than partitions, which made no sense before this change. This is massive!

TLDR: Lag exists because of the decoupling between producers and consumers.


How do applications consume data from Kafka?

Applications consume data from Kafka by acting as consumers within a consumer group. When you build an application to read data from Kafka, it becomes part of a consumer group identified by a unique group.id. Kafka uses it to manage the coordination among multiple instances of your application, distributing the partitions of the topic among them.

By maintaining offsets, consumers can resume processing from where they left off in case of failures or restarts, ensuring no data is missed or duplicated. Again, this is where lag is created.

Partition Assignment

99% of applications use automatic partition assignment by subscribing to topics (using .subscribe()). Kafka automatically assigns partitions to consumers within the group, handling the distribution and redistribution as consumers join or leave. When the membership of the consumer group changes, for example if a consumer instance crashes or a new one starts, Kafka triggers a rebalance.

Consumer Rebalance = Lag

Traditionally, during a rebalance, consumers stop processing temporarily while partitions are reassigned. This pause can introduce lag, as messages accumulate and aren’t processed until the rebalance completes.

To mitigate this, Kafka introduced cooperative incremental rebalancing with KIP-429 in Kafka 2.4. It allows consumers to continue processing their assigned partitions during a rebalance. It’s useful in environments where consumer group membership changes frequently, such as auto-scaling systems (Kubernetes or serverless technologies).

The default is to use it since Kafka 3.0 (see Kafka Options Explorer). If you’re not sure, add it to your consumer configuration:

props.put("partition.assignment.strategy", "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");

Offsets Commits = Reducing the Lag

Consumer offsets act like bookmarks: they keep track of the position in each partition and help consumers resume processing after a failure or restart.

When we commit offsets to Kafka, we mark messages as processed. We are reducing the lag. Applications within a consumer group commit offsets to signal either successful processing (at-least-once semantics) or successful receipt (at-most-once semantics) of messages.

Here’s a simple example of setting up a Kafka consumer in Java:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-consumer-group");
props.put("enable.auto.commit", "false"); // we commit manually after processing
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("my-topic"));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    process(records); // your business logic (placeholder)
    consumer.commitSync(); // mark the whole batch as successfully processed
}
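
Note that commitSync() runs once per poll() batch here, not once per record. Keep that in mind: we’ll see below why committing for every single record is a bad idea.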

Consumers = Lag

Lag occurs when consumers can’t keep up with the rate at which producers send data. Several reasons:

  • If your application processes messages slower than they arrive, lag accumulates (thinking of Little’s Law here).
  • During rebalances, consumers pause processing, causing temporary lag.
  • Delays in network communication can slow down message fetching.
  • If the message processing involves calls to external systems or APIs, their latency will affect speed and therefore lag.
  • Limited CPU, memory, or network bandwidth can slow down message processing.

Lag can stabilize, or keep increasing like crazy until your application is so late that it impacts the business and you’re on call, or you hope it will catch up during the night (when there is less traffic).


Multiple Lags exist in Kafka

Lag is one of the main concepts in Kafka. It’s the delay between when a message is produced and when it is consumed by an application. It impacts how “real-time” your application is.

Lag = Gap between latest produced and processed message

Lag refers to the gap between the most recent message produced (the latest available message in a partition) and the most recent message committed by a consumer in a partition (the last message processed or read).

For instance, if a partition contains 100 messages (offset 100) and your consumer has committed up to the 90th message (offset 90), the consumer group lag for that partition would be 10. Ten messages haven’t been processed yet.
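
You can compute this offset lag yourself from inside a consumer. A minimal sketch, assuming a KafkaConsumer named consumer with an active assignment:

import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

Set<TopicPartition> assignment = consumer.assignment();
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(assignment);
Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(assignment);

for (TopicPartition tp : assignment) {
    OffsetAndMetadata c = committed.get(tp);
    long committedOffset = (c == null) ? 0L : c.offset(); // nothing committed yet
    long lag = endOffsets.get(tp) - committedOffset;      // e.g. 100 - 90 = 10
    System.out.printf("%s lag=%d records%n", tp, lag);
}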

Lag is never zero. Except when no more data is coming.

What does “good” lag look like? Two dimensions:

  • Lag in “Offset”: The number of records that have not yet been processed. As in our example, it’s an integer, often stable over time when things are running properly, but hard to interpret.
  • Lag in “Time”: The time difference between when records were produced and when they are consumed. This helps us, as humans, comprehend the delay in terms of minutes or seconds. “My consumer is 3 minutes late”.

Producers send messages at any rate, independent of how fast consumers can process them. If consumers can’t keep up with the incoming data rate, lag accumulates. It’s rare for lag to be zero, especially under high throughput or if processing involves external API calls. Lag may stabilize over time or grow until periods of lower activity when consumers can catch up.

Little’s Law

It’s the duty of developers and ops to monitor both offset lag and time lag, with appropriate alerts based on their requirements and use-cases.

An alert can be set if:

  • the time lag exceeds “10 seconds”
  • or if the offset lag is greater than “1000 records”

It depends so much on the use case that there is no rule of thumb here. What works for one application may be completely stupid for another application.

Understanding the relationship between lag, throughput, and processing time is interesting. I like Little’s Law (L = λW) to help here. It relates the average number of items in a system (L, our lag) to the average arrival rate (λ, the producer throughput) and the average time an item spends in the system (W, the consumer processing time). Give it a try by computing λW and see if it matches your current lag metrics!
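
A quick worked example: if producers write λ = 1,000 records per second and each record spends W = 3 seconds in the system (queueing plus processing), Little’s Law predicts a steady-state lag of L = 1,000 × 3 = 3,000 records. If your measured offset lag sits well above λW, records are spending more time in the system than you think.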

Lag has consequences

Lag is not just a technical concern; it has an impact on the business.

  • High lag means consumers are acting on outdated information, which is not acceptable for time-sensitive or user-facing applications.
  • In applications where timely data is critical, like real-time analytics or monitoring systems, lag will negatively impact user experience and lead to wrong decisions.

This is why it’s mandatory to monitor the lag of your applications.


The Multiple “Times” of Kafka

Lag can be defined at various points — or nodes — in a data processing pipeline. At each node, lag can be calculated using the formula:

lag(Msg, Node) = currentTime(Msg, Node) - eventTime(Msg)

  • eventTime(): The timestamp when the message was produced.
  • currentTime(): The time when the message is being processed at a particular node.

By calculating the difference between the current time at each node and the event time of the message, you obtain the lag at each node. You can build the lineage of your data pipeline by mapping each node along with its associated lag and identify which nodes are introducing the most delay.
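
At the consumer node, this is straightforward to measure, since every Kafka record carries a timestamp. A minimal sketch running inside the poll loop from the earlier example (by default, record.timestamp() is the producer’s CreateTime; reportTimeLag() is a placeholder for your metrics library):

for (ConsumerRecord<String, String> record : records) {
    long lagMs = System.currentTimeMillis() - record.timestamp();
    reportTimeLag(record.topic(), record.partition(), lagMs); // placeholder metrics call
}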

The end-to-end (e2e) lag is the most critical metric. It represents the total time it takes for a message to travel from the producer to a consumer, encompassing all stages of the pipeline:

  • Ingest Time: Time taken to produce and send the message to Kafka.
  • Transit Time: Time the message spends within Kafka.
  • Processing Time: Time taken by the consumer application to process the message.
  • Departure Lag: Total time elapsed from when the message was created to when it leaves the node. It includes all the time the message has spent being processed and waiting at that node, representing the cumulative delay up to that point in the pipeline.

Measuring the P95 or P99 latency at each stage helps identify outliers and ensure your system meets performance targets (SLO/SLI/SLA).


Can lag be stabilized? Can we backpressure?

Increasing lag is often due to slow consumers reading records produced by a fast producer. It’s typically the case when the consumer’s processing involves API calls for enrichment, or complex transformations that depend on an external state fetched remotely.

In scenarios where consumers consistently lag and cannot catch up, you’ll need to wait for the night (for the traffic to decrease) or find a strategy to mitigate it. Several approaches:

1. Implementing back-pressure to slow down the producer → Not possible in Kafka! Producers and Consumers don’t know each other.

BUT, a producer could subscribe to a topic where the consumer produces a record saying “Hey, calm down for 5 minutes”, and this is how you create back-pressure. I don’t see why anyone would do that, as Kafka can store an unlimited amount of data, but I’m sure someone has already implemented it for good reasons.

2. Buffering and regrouping messages to absorb the incoming flow.

You can process chunks of data together instead of handling messages one by one. By buffering messages, you can then make a single external API call for an entire batch rather than individual calls for each record. It’s effective when you can combine events together (aggregating or merging messages).

// inside the poll loop from the consumer example above;
// MAX_BUFFER_SIZE, combineRecords() and externalApiCall() are placeholders for your own logic
List<ConsumerRecord<String, String>> buffer = new ArrayList<>();

for (ConsumerRecord<String, String> r : records) {
    buffer.add(r);
    if (buffer.size() >= MAX_BUFFER_SIZE) {
        var combined = combineRecords(buffer);    // merge/aggregate the batch
        var response = externalApiCall(combined); // one remote call for the whole batch
        buffer.clear();                           // reset the buffer after flushing
    }
}

3. Scaling out by adding more consumers

This is the classic strategy: adding more instances of your consumer application to process partitions in parallel. You can scale out up to the number of partitions, at most.

You can also use confluentinc/parallel-consumer, an excellent project that lets consumers process messages in parallel via a single Kafka Consumer. It means we can increase consumer parallelism without increasing the number of partitions in the topic.
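
As a rough idea of what it looks like, here is a hedged sketch adapted from the project’s README (the API evolves between versions, so verify the exact class and method names against the repo):

import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder;
import io.confluent.parallelconsumer.ParallelStreamProcessor;

ParallelConsumerOptions<String, String> options = ParallelConsumerOptions.<String, String>builder()
        .ordering(ProcessingOrder.KEY) // ordering guaranteed per key instead of per partition
        .maxConcurrency(100)           // many worker threads behind a single Kafka Consumer
        .consumer(kafkaConsumer)       // your regular KafkaConsumer instance
        .build();

ParallelStreamProcessor<String, String> processor =
        ParallelStreamProcessor.createEosStreamProcessor(options);
processor.subscribe(Collections.singletonList("my-topic"));
processor.poll(context -> process(context.getSingleConsumerRecord()));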

4. Dropping messages

You can avoid processing low-value records: generic metrics or logs, or cases where you only care about the last value of something (so you can disregard the previous values). It depends on your use case, but you could decide to “disregard all incoming records whose eventTime is older than 5 minutes”, as in the sketch below.
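
A minimal sketch of that last rule, running inside the poll loop from the earlier example and based on the record’s event timestamp:

long cutoff = System.currentTimeMillis() - Duration.ofMinutes(5).toMillis();

for (ConsumerRecord<String, String> record : records) {
    if (record.timestamp() < cutoff) {
        continue; // stale, low-value record: skip the expensive processing
    }
    process(record); // placeholder for your business logic
}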


How to achieve minimal lag?

First, there are a lot of configuration options to control how we commit: kafka-options-explorer/?search=commit

The default with Kafka consumers is enable.auto.commit = true, committing every auto.commit.interval.ms = 5 seconds (tip: the auto-commit schedule is only checked when you poll()).

Each framework supporting Kafka consumers provides its own way of committing offsets manually after processing the records (meaning enable.auto.commit = false and manual calls to .commit()). This gives more control, precision, and reactivity over the commits.

Never commitSync() for every record. I remember lag issues with a team I was working with: they were committing for EVERY single record they processed, which introduced massive overhead for the brokers, latency spikes, and ultimately increasing lag on their consumers.

In the end, you do not “control” lag, but you either:

  • reduce the time it takes to process each record
  • scale-out your consumer application to process more in parallel and control how many partitions to read from

Conclusion

Lag in Kafka is an inherent aspect of its design, a consequence of decoupling producers from consumers to achieve flexibility and scalability.

This lag manifests in two forms: offset lag, the difference in message count between producers and consumers, and time lag, the delay between when messages are produced and when they are consumed.

While we can’t eliminate lag, many strategies exist to control and minimize it: consumer configurations, batching and buffering techniques, parallel consumption, or scaling out.

P.S. At Conduktor, we’ve crafted a monitoring solution so slick, out-of-the-box, and user-friendly, it might just make you forget lag ever existed! Check it out, it’s free.


Written by Stéphane Derosiaux

Co-Founder & CPTO at conduktor.io | Advanced Data Management built on Kafka for modern organizations https://www.linkedin.com/in/stephane-derosiaux/
