Handling Data Loss.
How Kafka handles Data Loss — Replication and In-Sync Replica (ISR)
Learn how Kafka handles data loss in the event of failure.
--
Here we have a Kafka cluster, a representation of how a topic is distributed across the cluster, and some records present in the file system of each broker.
As we know, clients, that is, producers and consumers, always talk to the leader of a partition to produce and retrieve data.
Let’s say broker-1 goes down for some reason. Right now this broker is the leader of partition-0, and all the data written to partition-0 resides in the file system of broker-1. Once it goes down, there is no way for clients to access this data. This is data loss, and it is a big problem.
How does Kafka handle data loss?
→ Kafka handles this issue of data loss using Replication. Note that we used --replication-factor 3 in the topic-creation command.
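As a refresher, the topic in this example could have been created with a command like the following (the topic name and partition count are illustrative; running it requires a local ZooKeeper and Kafka cluster):

```shell
.\bin\windows\kafka-topics.bat --zookeeper localhost:2181 --create ^
  --topic test-topic-replicated --partitions 3 --replication-factor 3
```

The --replication-factor 3 flag is what asks Kafka to keep three copies of every message.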
The Kafka producer sends a message to partition-0, so it goes to the leader, which is broker-1. After broker-1 receives the message, it persists it into its file system. Broker-1 holds the leader replica.
— Again, we have --replication-factor 3, right? At this point we have one copy of the message. Since the replication factor is 3, we need two more copies of the same message. So the replication factor is equal to the number of copies of each message.
Next, the same message is copied to broker-2 and written into its file system. Broker-2 is a follower of partition-0, and its copy is known as a follower replica. The same step is repeated for broker-3.
Now we have three copies of the same data available in all the brokers.
→ In Kafka's terminology, this concept is called Replication: the leader's copy is called the leader replica and the other two copies are called follower replicas.
→ The same technique applies to broker-2: any message whose partition leader is on broker-2 is copied to broker-1 and broker-3, which act as follower replicas.
— And the same holds for broker-3.
Scenario — Broker-1 failure
Now every partition has a leader replica and its follower replicas.
Broker-1 is down.
Let’s say broker-1 is down; the data of its partitions is still available on broker-2 and broker-3.
Now ZooKeeper detects the failure and notifies the controller, which assigns a new leader. Broker-2 becomes the leader of partition-0 and partition-1. This leader assignment is taken care of by the controller node, which is itself part of the cluster.
So client requests to produce to and consume from partition-0 will go to broker-2 from now on. This is how Kafka handles data loss: that is replication in Kafka.
In-Sync Replica (ISR)
- This represents the set of replicas in the cluster that are in sync with each other.
- This includes both the leader replica and the follower replicas.
- The number of in-sync replicas is recommended to be greater than 1.
- The ideal state of the replication is —
ISR == Replication Factor
- The minimum number of in-sync replicas required to accept a write can be controlled by the
min.insync.replicas
property, which can be set at the broker or topic level.
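To set it at the topic level, a command along these lines can be used (a sketch, assuming the same local cluster and topic name as above):

```shell
.\bin\windows\kafka-configs.bat --zookeeper localhost:2181 --alter ^
  --entity-type topics --entity-name test-topic-replicated ^
  --add-config min.insync.replicas=2
```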
Let's look at the replication factor, the leader, and the in-sync replicas of our topics.
Describe all topics:
.\bin\windows\kafka-topics.bat --zookeeper localhost:2181 --describe
- __consumer_offsets is created by Kafka itself.
- The two additional topics we created are listed below.
Describe a single topic:
.\bin\windows\kafka-topics.bat --zookeeper localhost:2181 --describe --topic test-topic-replicated
Here you can see partition-0, partition-1, and partition-2; the Leader column shows 0, 1, or 2, which is nothing but the broker id.
- for partition-0 the leader is 2.
- for partition-1 the leader is 0.
- and for partition-2 the leader is 1.
Replicas (see command prompt)
Replicas: 2, 1, 0 — broker 2 holds the leader replica; brokers 1 and 0 hold follower replicas.
Replicas: 0, 2, 1 — broker 0 holds the leader replica; brokers 2 and 1 hold follower replicas.
Replicas: 1, 0, 2 — broker 1 holds the leader replica; brokers 0 and 2 hold follower replicas.
In-sync replica
The in-sync replica list is basically equal to the replica list.
- The replicas 0,2,1 and the ISR 0,2,1 are the same, meaning the in-sync replicas are at 100%: the data is replicated to all available brokers, equal to the replication factor.
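Putting the leader, replica, and ISR values above together, the describe output looks roughly like this (the exact column layout varies by Kafka version; the values are the ones from our example):

```
Topic: test-topic-replicated  PartitionCount: 3  ReplicationFactor: 3
  Partition: 0  Leader: 2  Replicas: 2,1,0  Isr: 2,1,0
  Partition: 1  Leader: 0  Replicas: 0,2,1  Isr: 0,2,1
  Partition: 2  Leader: 1  Replicas: 1,0,2  Isr: 1,0,2
```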
Bringing a broker down.
Let's shut down one broker and run the describe command again to see what happens:
.\bin\windows\kafka-topics.bat --zookeeper localhost:2181 --describe --topic test-topic-replicated
- Now ZooKeeper has received the signal that one broker is down; the controller will reassign the leaders and the ISR list will shrink.
- So broker-0 is now the leader for partition-1 and partition-2.
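Assuming it was broker-1 that we stopped (which is consistent with the new leader assignments above), the describe output would look roughly like this: broker 1 disappears from every ISR list, and partition-2's leadership moves to broker 0:

```
Topic: test-topic-replicated  PartitionCount: 3  ReplicationFactor: 3
  Partition: 0  Leader: 2  Replicas: 2,1,0  Isr: 2,0
  Partition: 1  Leader: 0  Replicas: 0,2,1  Isr: 0,2
  Partition: 2  Leader: 0  Replicas: 1,0,2  Isr: 0,2
```

Note that the Replicas column still lists broker 1: it remains an assigned replica and will rejoin the ISR once it catches up after restarting.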
Summary: We have learned how we can prevent data loss using Replication.
- There are advanced setups you can use to require that at least two replicas are always available: if you set
min.insync.replicas=2
then data can be produced to the topic only when at least two replicas are in sync.
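min.insync.replicas takes effect together with the producer's acks setting (both are standard Kafka configuration properties). A sketch of a matching producer configuration:

```
# Producer configuration (sketch)
bootstrap.servers=localhost:9092
# Wait until all in-sync replicas acknowledge each write
acks=all
```

With acks=all and min.insync.replicas=2, a produce request fails with a NotEnoughReplicas error when fewer than two replicas are in sync, instead of silently risking data loss.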