Saturday, August 10, 2024

Kafka Ecosystem

 




1. 📌 Topics:

๐Ÿ‘‰๐ŸผA stream of messages belonging to a particular category is called a Topic.

๐Ÿ‘‰๐ŸผIts is a logical feed name where to which records are published(Similar to Table in DB )

๐Ÿ‘‰๐ŸผUnique identification of table is called name of the topic - can not be duplicated

๐Ÿ‘‰๐ŸผA topic is a storage mechanism for a sequence of events

๐Ÿ‘‰๐ŸผEvents are immutable

๐Ÿ‘‰๐Ÿผkeep events in the same order as they occur in time. So, each new event is always added to the end of the Message. 


2. 📌 Partitions:

๐Ÿ‘‰๐ŸผTopics are split into partition

๐Ÿ‘‰๐ŸผAll the messages within a partition are ordered and immutable

๐Ÿ‘‰๐ŸผAll the messages within the partition has a unique ID associated is called OFFSET.

๐Ÿ‘‰๐ŸผKafka uses topic partitioning to improve scalability.

๐Ÿ‘‰๐ŸผKafka guarantees the order of the events within the same topic partition. However, by default, it does not guarantee the order of events across all partitions.


3. 📌 Replicas:

๐Ÿ‘‰๐ŸผReplicas are backs of partition

๐Ÿ‘‰๐ŸผReplicas are never read or write data

๐Ÿ‘‰๐ŸผThey are used to prevent data loss (Fault - Tolerance)



4. 📌 Producer:

๐Ÿ‘‰๐ŸผProducers publish messages by appending to the end of a topic partition.

๐Ÿ‘‰๐ŸผEach message will be stored in the broker disk and will receive an offset (unique identifier). This offset is unique at the partition level, each partition has its owns offsets. That is one more reason that makes Kafka so special, it stores the messages in the disk (like a database, and in fact, Kafka is a database too) to be able to recover them later if necessary. Different from a messaging system, that the message is deleted after being consumed;

๐Ÿ‘‰๐ŸผThe producers use the offset to read the messages, they read from the oldest to the newest. In case of consumer failure, when it recovers, will start reading from the last offset;

๐Ÿ‘‰๐ŸผBy default, if a message contains a key (i.e. the key is NOT null), the hashed value of the key is used to decide in which partition the message is stored.

๐Ÿ‘‰๐ŸผProducers publish messages by appending to the end of a topic partition. By default, if a message contains a key (i.e. the key is NOT null), the hashed value of the key is used to decide in which partition the message is stored.

๐Ÿ‘‰๐ŸผAll messages with the same key will be stored in the same topic partition. This behavior is essential to ensure that messages for the same key are consumed and processed in order from the same topic partition.

๐Ÿ‘‰๐ŸผProducers write the messages to the topic level(All the partitions of that topic) or specific partition of the topic using Producing API's

๐Ÿ‘‰๐ŸผIf the key is null, the producer behaves differently according to the Kafka version:

  • up to Kafka 2.3: a round-robin partitioner is used to balance the messages across all partitions

  • Kafka 2.4 or newer: a sticky partitioner is used which leads to larger batches and reduced latency and is particularly beneficial for very high throughput scenarios
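
The sketch below mirrors the computation the default partitioner applies to non-null keys: a murmur2 hash of the serialized key, modulo the number of partitions. The key and the partition count here are illustrative.

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class KeyPartitionDemo {
    public static void main(String[] args) {
        int numPartitions = 3;        // assumed partition count of the topic
        String key = "customer-42";   // illustrative key
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // Same hashing scheme the default partitioner uses for non-null keys.
        int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        System.out.println("key '" + key + "' -> partition " + partition);
    }
}

Because the result depends on the partition count, adding partitions to an existing topic changes where keys land; key-based ordering is only preserved while the partition count stays fixed.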


5. 📌 Consumer:

๐Ÿ‘‰๐ŸผConsumers are applications which read/consume data from the topics within a cluster using consuming API's

๐Ÿ‘‰๐ŸผConsumers can read either on the topic level (All the partitions of the topic) or specific partition of the topics

๐Ÿ‘‰๐ŸผEach message published to a topic is delivered to a consumer that is subscribed to that topic.

๐Ÿ‘‰๐ŸผA consumer can read data from any position of the partition, and internally the position is stored as a pointer called offset. In most of the cases, a consumer advances its offset linearly, but it could be in any order, or starting from any given position.

๐Ÿ‘‰๐ŸผEach consumer belongs to a consumer group. A consumer group may consist of multiple consumer instances.

๐Ÿ‘‰๐ŸผThis is the reason why a consumer group can be both, fault tolerant and scalable.

๐Ÿ‘‰๐Ÿผ If one of several consumer instances in a group dies, the topic partitions are reassigned to other consumer instances such that the remaining ones continue to process messages form all partitions.

๐Ÿ‘‰๐Ÿผ If a consumer group contains more than one consumer instance, each consumer will only receive messages from a subset of the partitions. When a consumer group only contains one consumer instance, this consumer is responsible for processing all messages of all topic partitions.

๐Ÿ‘‰๐ŸผMessage consumption can be parallelized in a consumer group by adding more consumer instances to the group, up to the number of a topic’s partitions.

๐Ÿ‘‰๐Ÿผ if a topic has 8 partitions, a consumer group can support up to 8 consumer instances which all consume in parallel, each from 1 topic partition.

๐Ÿ‘‰๐ŸผAdding more consumers in a consumer group than the number of partitions for a topic, then they will stay in an idle state, without getting any message

๐Ÿ‘‰๐Ÿผ Consumer will pull the message by consumer.poll()

๐Ÿ‘‰๐Ÿผmax.pull.size is 15- in one instance consumer can pull up to 15 no of message at a instance

๐Ÿ‘‰๐Ÿผconsumer.commit() will help in committing





6. 📌 Kafka Broker:

๐Ÿ‘‰๐ŸผThat Kafka broker is a program that runs on the Java Virtual Machine (Java version 11+)

๐Ÿ‘‰๐ŸผA Kafka broker is used to manage storing the data records/messages on the topic. It can be understood as the mediator between the two


๐Ÿ‘‰๐ŸผThe Kafka broker is responsible for transferring the conversation that the publisher is pushing in the Kafka log commit and the subscriber shall be consuming these messages.

๐Ÿ‘‰๐ŸผEnabling the delivery of the data records/ message to process to the right consumer.



7. 📌 Kafka Cluster:

๐Ÿ‘‰๐ŸผAn ensemble of Kafka brokers working together is called a Kafka cluster. Some clusters may contain just one broker or others may contain three or potentially hundreds of brokers. Companies like Netflix and Uber run hundreds or thousands of Kafka brokers to handle their data.

๐Ÿ‘‰๐ŸผA broker in a cluster is identified by a unique numeric ID. In the figure below, the Kafka cluster is made up of three Kafka brokers.

๐Ÿ‘‰๐ŸผTo configure the number of the partitions in each broker, we need to configure something called Replication Factor when creating a topic. Let’s say that we have three brokers in our cluster, a topic with three partitions and a Replication Factor of three, in that case, each broker will be responsible for one partition of the topic.



๐Ÿ‘‰๐ŸผAs you can see in the above image, Topic_1 has three partitions, each broker is responsible for a partition of the topic, so, the Replication Factor of the Topic_1 is three.

๐Ÿ‘‰๐ŸผIt’s very important that the number of the partitions match the number of the brokers, in this way, each broker will be responsible for a single partition of the topic


Replication Factor:


https://www.educba.com/kafka-replication/


In this scenario, we have a replication factor of 2, which means each partition has two copies for redundancy.

  • Let’s consider what happens if Broker 3 fails. Since Broker 3 was the leader for partition 3, the connection to that partition is lost. However, Kafka handles this by automatically selecting one of the in-sync replicas (in this case, there is only one remaining replica) to become the new leader for partition 3.

  • When Broker 3 comes back online, it can try to rejoin as the leader. Kafka maintains the in-sync replica (ISR) list for each partition by monitoring the lag of each replica.

  • When a producer sends a message, the leader broker writes it and replicates it to all in-sync replicas. A message is only considered committed once it has been successfully replicated to all in-sync replicas.

  • Producers Sending Messages without a Key:

  • Just like in the messaging world, producers in Kafka are the ones who produce and send the messages to the topics.

  • As said before, keyless messages are sent in a round-robin way. Ex: message 01 goes to partition 0 of Topic 1, and message 02 to partition 1 of the same topic. It means that we can't guarantee that messages produced by the same producer will always be delivered to the same partition. If we need that guarantee, we specify a key when sending the message; Kafka generates a hash based on that key and uses it to decide which partition to deliver the message to.

  • That hash takes into consideration the number of partitions of the topic, which is why adding partitions to a topic after it is created changes where keys map and breaks the existing key-to-partition assignment.

  • In Kafka, acknowledgements (ACKs) are crucial for ensuring data reliability. Depending on your use case, you can choose from three types of ACK settings:

  • acks=0: The producer does not wait for any acknowledgement from the broker. While this setting provides the lowest latency, it also risks data loss as there's no guarantee that the message was received or replicated.

  • acks=1: The producer gets an acknowledgement after the leader broker receives the message. However, if the leader fails before the followers replicate the record, there's potential for partial data loss.

  • acks=all: The producer receives an acknowledgement only after all in-sync replicas have received the message. This setting provides the highest durability, ensuring no data loss, though it may introduce some latency.
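
In code, this is a single producer property. A minimal sketch of the most durable setting, assuming a local broker (the retries value is illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all");           // wait until all in-sync replicas have the record
        props.put("retries", "2147483647"); // retry transient send failures
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... send records here
        }
    }
}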

https://static.javatpoint.com/tutorial/kafka/images/apache-kafka-producer5.png



  • Three Variations of Offset (a sketch reading all three follows below):

  1. Log-End Offset: the offset of the last message written to a log/partition.

  2. Current Offset: a pointer just past the records Kafka has already returned to the consumer in the most recent poll; by itself, it is never committed.

  3. Committed Offset: marking an offset as consumed is called committing the offset.
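
All three can be read through the consumer API. A minimal sketch, assuming a topic named "orders", partition 0, and a local broker:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetInspector {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "offset-inspector");        // assumed group name
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("orders", 0); // assumed topic/partition
            consumer.assign(Collections.singleton(tp));
            consumer.poll(Duration.ofMillis(500)); // fetch some records to advance the position
            long logEnd = consumer.endOffsets(Collections.singleton(tp)).get(tp); // log-end offset
            long current = consumer.position(tp);                                 // current offset
            OffsetAndMetadata committed = consumer.committed(Set.of(tp)).get(tp); // committed offset (null if none)
            System.out.printf("log-end=%d current=%d committed=%s%n", logEnd, current, committed);
        }
    }
}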


  • Kafka Consumer Group

  • A consumer group is a logical entity in the Kafka ecosystem that mainly provides parallel processing / scalable message consumption to consumer clients.

  • Each consumer instance must be associated with some consumer group.

  • Kafka ensures there is no duplication of consumption among consumers that are part of the same consumer group: each partition is read by only one member of the group.

  • The ideal is to have the same number of consumers in a group as we have partitions in a topic; in this way, every consumer reads from only one partition.

  • When adding consumers to a group, you need to be careful: if the number of consumers is greater than the number of partitions, some consumers will not be assigned any partition and will stay idle.


https://docs.cloudera.com/cdp-private-cloud-base/latest/kafka-developing-applications/topics/kafka-develop-groups-fetching.html

Consumer Group Rebalancing

  1. The process of redistributing partitions to the consumers within a consumer group (a listener sketch follows below).

  2. Rebalancing of a consumer group happens in the following cases:

  • A consumer joins the consumer group

  • A consumer leaves the consumer group

  • Partitions are added to a topic these consumers are interested in

  • A partition goes into an offline state
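
Applications can react to rebalances by registering a ConsumerRebalanceListener when subscribing. A minimal sketch; the broker address, group name, and topic are assumptions:

import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RebalanceAwareConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "order-processors");        // assumed group name
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("orders"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Called before a rebalance takes partitions away:
                    // commit offsets or flush state here.
                    System.out.println("Revoked: " + partitions);
                }
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Called after the rebalance with the new assignment.
                    System.out.println("Assigned: " + partitions);
                }
            });
            while (true) {
                consumer.poll(Duration.ofMillis(500));
            }
        }
    }
}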


Group Coordinator

  • Brokers in the Kafka cluster are assigned as group coordinators for a subset of the consumer groups.

  • The group coordinator maintains a list of consumer groups.

  • The group coordinator initiates the rebalance process.

  • The group coordinator communicates the new assignment of partitions to all consumers.

  • Until the rebalancing process has finished, consumers within the consumer group being rebalanced are blocked from reading any messages.


Group Leader

  • The first consumer to join the consumer group is elected as the Group Leader.

  • The Group Leader has the list of active members and the selected assignment strategy.

  • The Group Leader computes the new assignment of partitions and sends it to the Group Coordinator.

  • In effect, the Group Leader executes the assignment step of the rebalance process.


What Happens When a Consumer Joins a Consumer Group:

  • When a consumer starts, it sends a FindCoordinator request to discover the group coordinator responsible for its group.

  • The consumer initiates the rebalance protocol by sending a JoinGroup request.

  • Next, all the members of that consumer group send a SyncGroup request to the coordinator.

  • Each consumer periodically sends a heartbeat to the group coordinator to keep the session alive (the relevant settings are sketched below).
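
The heartbeat and session behavior is governed by standard consumer settings. A minimal sketch with commonly used values; the group name is an assumption, and these properties would be merged into the usual consumer configuration shown earlier:

import java.util.Properties;

public class GroupSessionConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("group.id", "order-processors");   // assumed group name
        props.put("heartbeat.interval.ms", "3000");  // how often heartbeats are sent
        props.put("session.timeout.ms", "45000");    // no heartbeat for this long => consumer considered dead
        props.put("max.poll.interval.ms", "300000"); // max gap between poll() calls before a rebalance is triggered
        props.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}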

Zookeeper:

How Kafka and ZooKeeper Work Together

#1 Controller Election.

  • ZooKeeper is used by Kafka to coordinate the cluster, maintain metadata, track broker availability, elect leaders, manage consumer groups, and apply dynamic configuration changes.

  • To choose the controller, each broker attempts to create an "ephemeral node" (a znode that exists only until the session that created it ends) called /controller.

  • The controller maintains the leader/follower relationship among all the partitions of the topics.

  • If a node goes down, this coordination ensures that one of the other replicas takes over as the leader.

#2 Cluster Membership

  • Every broker has a unique identification ID throughout the Kafka cluster.

  • A group znode is generated when brokers connect to their respective ZooKeeper instances, and each broker then creates an ephemeral znode inside this group znode.

  • ZooKeeper thereby maintains the list of functioning brokers in the cluster.

#3 Topic Configuration.

  • Each Kafka topic has its own configuration.

  • The replication factor, message size limit, unclean leader election, flush rate, and message retention are all determined by these settings.

  • ZooKeeper maintains the location of replicas and the number of partitions of each topic.

#4 Access Control Lists (ACLs).

  • Apache Kafka includes an authorizer that supports Access Control Lists via ZooKeeper.

  • It maintains the ACLs for all topics, including who is allowed to write to or read from each topic, the list of consumer groups, the members of each group, and the most recent offset each consumer group received from each partition.

#5 Quotas.

  • ZooKeeper tracks how much data each client is allowed to read/write.


Apache Kafka Examples

  1. Website activity tracking

  2. Web shop (e-commerce)

  3. Kafka as a message queue



Kafka best practices:

  • Understand the data rate of your partitions to ensure you have the correct retention space.

  • Choose an appropriate number of partitions:

  • Use key-based partitioning when necessary

  • Consider data skew and load balancing

  • Plan for scalability

  • Error handling: when errors are non-retriable, such as RecordTooLargeException or TopicAuthorizationException, the producer will return the error to the application, e.g. via a callback.

  • Leave the topic-level compression.type at its default value, producer, meaning the broker retains the original compression codec set by the producer; specify compression.type in the producer instead.

  • Idempotence: to enable the idempotent producer, set enable.idempotence to true.

  • Optimise the batch size via batch.size and linger.ms (see the tuning sketch after this list).

  • Select a meaningful message key. Keys determine the partition to which a message is sent, which can be important for maintaining message order within a partition.

  • Set an appropriate replication factor

  • Ensure that your Kafka producer’s version is compatible with your Kafka broker version to avoid compatibility issues

  • Avoid frequent partition changes

  • Monitor and tune as needed: batch-size-avg, records-per-request-avg, record-queue-time-avg, record-send-rate, record-size-avg, compression-rate-avg, request-rate, requests-in-flight, request-latency-avg.

  • If your consumers are running versions of Kafka older than 0.10, upgrade them.

  • Tune your consumer socket buffers for high-speed ingest (socket.receive.buffer.bytes).

  • Design high-throughput consumers to implement back-pressure when warranted

  • When running consumers on a JVM, be wary of the impact that garbage collection can have on your consumers (off-heap memory comes into the picture to handle this situation).
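
Several of these knobs live in the producer configuration. A minimal tuning sketch, assuming a local broker; the specific batch.size, linger.ms, and compression values are illustrative starting points, not recommendations from this post:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;

public class TunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("enable.idempotence", "true"); // exactly-once, in-order writes per partition
        props.put("compression.type", "lz4");    // compress on the producer side
        props.put("batch.size", "65536");        // larger batches, fewer and fuller requests (bytes)
        props.put("linger.ms", "20");            // wait briefly so batches can fill up
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ... send records here
        }
    }
}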

More to come.

Refer:

https://docs.cloudera.com/cdp-private-cloud-base/latest/kafka-developing-applications/topics/kafka-develop-rebalance.html

