Kafka message ordering is a real nemesis for any application that must process messages in (near) real time while also ensuring they are processed in the exact order in which they were generated. Satisfying both requirements simultaneously is very challenging. Processing the messages is not a challenge at all; processing them in the desired sequence (the combined order of messages across multiple partitions in Kafka) is where most architects compromise, either by altering the solution architecture or by moving away from real-time ingestion. This blog, Kafka Message Ordering Part 1, is the first in a four-part series in which I discuss the factors that distort Kafka message ordering before proceeding to possible solution architectures.
Although I mention Kafka in this blog because of its popularity in the big data world, the entire discussion applies to any other enterprise integration middleware responsible for message transfer. In reality, Kafka (or any other EI tool) has no power to exert control over messages outside its boundaries, i.e. on the producer or consumer side. So, essentially, our discussion revolves around how we can address the message ordering problem outside Kafka's boundaries.
What is message ordering: In layman's terms, it means ensuring that messages are received or processed by the consumer in the exact same order in which they were produced by the producers. Now let's understand why this is considered such a dreaded problem.
Assume that you have a number of sensors installed on the sea bed, forming an IoT cloud that transmits acoustic data generated by the collision of water waves with the sea bed. These sensors can be used for a wide variety of applications: predicting the presence of oil/gas underneath, predicting earthquakes, analyzing underwater flora and fauna, tracking cargo movement, and so on. Pictorially, it can be represented as:
As depicted, the three sensors send a total of six events (two per sensor) at different times (represented by t1, t2, etc.). The correct sequence of events per this representation is e1, e2, e3, e4, e5, e6, and the intent is to process these events in exactly that sequence. Do you know what can go awry here? At a high level, issues can arise on the producer side, on the consumer side, or during message transmission from producers to consumers.
Producer-Side Factors: Although producers are the originators of the events, any ordering issue introduced here has a crippling effect on the entire subsystem. In general, if something goes wrong on the producer side, it is not possible to recover from it in downstream systems at all; the only remedy is to fix the problem at the source (producer) itself. Two main producer-side issues can disrupt message ordering:
- Clock sync: Generally, it is the actual producer of the event/message/stimulus that sends the event's timestamp along with the event data. Now imagine the impact on a real-time system when, among tens of thousands of sensors, a subset has clock differences of even a few microseconds. In the picture above, if sensor 3's clock runs 100 microseconds behind sensor 1's, then every reading taken by sensor 3 within 100 microseconds after a sensor 1 reading would appear as a pre-sensor-1 reading (i.e., a reading that came before sensor 1's reading), which is incorrect.
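To make the skew effect concrete, here is a minimal, self-contained Python sketch. Everything in it (the `Event` record, the `SKEW_US` table, the timestamp values) is illustrative and not part of any Kafka API: a consumer that orders events by the producer-reported timestamp ends up placing sensor 3's later reading before sensor 1's earlier one.

```python
from dataclasses import dataclass

@dataclass
class Event:
    sensor: str
    true_time_us: int      # wall-clock time the reading was actually taken
    reported_time_us: int  # timestamp stamped by the sensor's own (skewed) clock

# Hypothetical skew: sensor 3's clock runs 100 microseconds behind sensor 1's
SKEW_US = {"sensor1": 0, "sensor3": -100}

def make_event(sensor: str, true_time_us: int) -> Event:
    return Event(sensor, true_time_us, true_time_us + SKEW_US[sensor])

# Sensor 3 actually takes its reading 50 microseconds AFTER sensor 1 ...
e_first = make_event("sensor1", 1_000)
e_second = make_event("sensor3", 1_050)

# ... but sorting by the reported timestamps puts sensor 3's reading first
ordered = sorted([e_first, e_second], key=lambda e: e.reported_time_us)
print([e.sensor for e in ordered])  # ['sensor3', 'sensor1'] -- wrong order
```

The consumer has no way to detect this from the data alone, which is why clock skew must be fixed at the source (e.g., via NTP/PTP synchronization) rather than downstream.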
- Producer partitioning the events: Most applications have a throughput requirement that defines the number of events/messages to be sent out per second (or per a smaller unit of time in real-time systems). But dependence on other integrated third-party systems can limit the achievable throughput. To overcome limitations imposed by third-party (or legacy) systems, producers parallelize data collection and transmission, which means data from a single producer is accumulated and transferred in parallel. For example, assume the requirement is to send out 1000 events from each sensor every second, every sensor produces 2000 events per second, and the third party allows only a given number of events (or bytes) per message (say, 500 events per message). Then data collection and transmission have to be parallelized to achieve the required throughput. It looks something like:
This is now a bigger problem: the network packets containing these messages might arrive out of order at the consumer side (or be processed in a parallel/interlaced fashion by consumer threads). I will discuss solutions to these kinds of issues in the subsequent blog.
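The interlacing can be reproduced without any network at all. The sketch below (all names and sizes are illustrative, chosen to match the 500-events-per-message example above) splits one second's worth of events into fixed-size batches and ships them on parallel threads with variable latency; the receiving queue may then see a later batch before an earlier one.

```python
import threading
import queue
import random
import time

BATCH_SIZE = 500             # assumed third-party limit: max events per message
events = list(range(1000))   # one second's worth of readings from one sensor

received = queue.Queue()     # stands in for the consumer-side arrival order

def send_batch(batch_id: int, batch: list) -> None:
    # Simulate variable network latency per batch
    time.sleep(random.uniform(0, 0.01))
    received.put((batch_id, batch))

threads = [
    threading.Thread(
        target=send_batch,
        args=(i, events[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]),
    )
    for i in range(len(events) // BATCH_SIZE)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

arrival_order = [received.get()[0] for _ in range(received.qsize())]
print("batches arrived in order:", arrival_order)  # may be [1, 0]
```

Run it a few times and batch 1 will occasionally land before batch 0, even though both left the same producer; that is exactly the reordering the consumer must be prepared to handle.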
Multiple producers: This is a slight variation of the problem above, in which multiple sensors produce the same type of data/events and, on the consumer side, the ordering of events is not limited to intra-sensor readings but extends to inter-sensor readings. Essentially, the data from multiple sensors can be interwoven by virtue of the time lag between events. In the first figure above, the intra-node sequences are:
Sensor 1: e1, e4
Sensor 2: e2, e5
Sensor 3: e3, e6
And the inter-node sequence is: e1, e2, e3, e4, e5, e6.
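The sequences above can be sketched as a merge problem: each sensor's stream is internally ordered, and the consumer must merge the streams by event time to recover the global inter-node sequence. This is a hedged illustration (the `(time, event_id)` tuples are made up to match the figure), not how Kafka itself orders messages.

```python
import heapq

# Each sensor's stream, already in intra-node order, as (event_time, event_id)
sensor1 = [(1, "e1"), (4, "e4")]
sensor2 = [(2, "e2"), (5, "e5")]
sensor3 = [(3, "e3"), (6, "e6")]

# heapq.merge combines already-sorted iterables into one sorted stream
merged = [event_id for _, event_id in heapq.merge(sensor1, sensor2, sensor3)]
print(merged)  # ['e1', 'e2', 'e3', 'e4', 'e5', 'e6']
```

Of course, this only works if the per-stream timestamps are trustworthy, which brings us back to the clock-sync problem described earlier.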
I will discuss the solution architecture after covering the other factors that can alter message ordering on the consumer side as well as at the infrastructure layer. That is the topic of the next blog, Kafka Message Ordering Part 2.