Tuesday, July 30, 2024

๐—ฅ๐—ฒ๐—ฝ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป ๐˜ƒ๐˜€ ๐—–๐—ผ๐—ฎ๐—น๐—ฒ๐˜€๐—ฐ๐—ฒ

๐Ÿšฉ๐—ค๐˜‚๐—ฒ๐˜€๐˜๐—ถ๐—ผ๐—ป

๐Ÿš€ ๐—ฅ๐—ฒ๐—ฝ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป ๐˜ƒ๐˜€ ๐—–๐—ผ๐—ฎ๐—น๐—ฒ๐˜€๐—ฐ๐—ฒ


----------------------------------------------------------------------------

๐Ÿ“Œ๐—ฅ๐—ฒ๐—ฝ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป:


๐Ÿ‘‰๐ŸปDefault partition for RDD/DataFrame

๐Ÿ‘‰๐Ÿปspark.sql.files.maxpartitionBytes-128MB(

๐Ÿ‘‰๐Ÿปspark.default.parallelism-8 partitons by default(creating data within spark)


๐Ÿ‘‰๐ŸปRepartition is used to increase or decrease the partition in spark

๐Ÿ‘‰๐ŸปRepartition shuffle the data and build a new partition from scratch

repartition is always result equal size partition

๐Ÿ‘‰๐ŸปDue to full shuffle its not good for performance(Depend upon the use case)

๐Ÿ‘‰๐ŸปUse repartition to redistribute data evenly before the join, ensuring a balanced workload

๐Ÿ‘‰๐ŸปApply repartition before grouping to enhance data distribution.


๐Ÿ“Œ๐—–๐—ผ๐—ฎ๐—น๐—ฒ๐˜€๐—ฐ๐—ฒ:


๐Ÿ‘‰๐ŸปCoalesce will only reduces the no of partitions

๐Ÿ‘‰๐ŸปCoalesce does not required full shuffle

๐Ÿ‘‰๐ŸปUnlike repartition, it tries to minimize data movement and avoids a full shuffle when reducing partitions.


๐Ÿ‘‰๐ŸปDue to partition merge - creates uneven no of partitions

๐Ÿ‘‰๐Ÿป๐—ฃ๐—ฟ๐—ผ๐—ฏ๐—น๐—ฒ๐—บ High partition count at the end of processing, leading to numerous small output files

๐Ÿ‘‰๐Ÿป๐—ฆ๐—ผ๐—น๐˜‚๐˜๐—ถ๐—ผ๐—ป we can Use coalesce to decrease partitions before saving the final result.


๐Ÿ‘‰๐Ÿป๐—ฃ๐—ฟ๐—ผ๐—ฏ๐—น๐—ฒ๐—บ: Numerous small files causing storage and reading inefficiencies.

๐Ÿ‘‰๐Ÿป๐—ฆ๐—ผ๐—น๐˜‚๐˜๐—ถ๐—ผ๐—ป: Utilize coalesce to reduce output file count



๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—–๐—ผ๐—ป๐˜๐—ฒ๐˜…๐˜ ๐˜ƒ๐˜€ ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—ฆ๐—ฒ๐˜€๐˜€๐—ถ๐—ผ๐—ป


  • Spark Context:




🔶SparkContext is the traditional entry point to any Spark application

🔶It represents the connection to the Spark cluster and is where the user configures common properties for the entire application and creates RDDs

🔶SparkContext is designed for low-level programming and fine-grained control over the Spark application; with it you can:

🔶Get the current status of the Spark application

🔶Access various services

🔶Cancel a job

🔶Cancel a stage

🔶Perform closure cleaning

🔶Register a SparkListener

🔶Programmatically control dynamic allocation

🔶Access persistent RDDs
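
A minimal sketch of the low-level SparkContext entry point; the app name, master, and data are made up for illustration:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("context-demo").setMaster("local[8]")
sc = SparkContext(conf=conf)

# Fine-grained, low-level control: create an RDD directly.
rdd = sc.parallelize(range(1_000_000))   # sized by spark.default.parallelism
print(rdd.getNumPartitions())            # 8 with local[8]

# Current status of the application, and job cancellation by group.
print(sc.statusTracker().getActiveJobsIds())
sc.setJobGroup("etl", "demo job group")
sc.cancelJobGroup("etl")

sc.stop()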


  • Common Challenges 

🔶Managing multiple contexts

🔶Contexts must be initialized properly

🔶Failing to manage contexts correctly leads to performance issues

🔶Before Spark 2.0, you had to create a specific context for each kind of interaction (HiveContext, SQLContext, StreamingContext)

🔶In a multi-user, multi-application setting, conflicts can arise when several users or applications try to use the same context

🔶Legacy code bases

  • Spark Session:

🔶SparkSession was introduced in Spark 2.0

🔶No need to create multiple contexts

🔶SparkSession wraps the SparkContext and provides a high-level API for working with structured data through SQL and with streaming data through Spark Streaming

🔶It improves performance via the Catalyst optimizer, predominantly for Spark SQL queries

🔶DataFrames provide a tabular structure for data

🔶SparkSession integrates well with Jupyter Notebook

🔶SparkSession is a combination of all three contexts; internally it creates a single SparkContext used for all operations

🔶SparkSession addresses the issue of multiple users accessing the same SparkContext

🔶SparkSessions handle isolation and resource management more efficiently
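
A minimal sketch of the unified entry point (the app name and config value are illustrative):

from pyspark.sql import SparkSession

# One builder replaces the separate SQL/Hive/Streaming contexts; getOrCreate()
# reuses an existing session, which is what makes multi-user access safe.
spark = (SparkSession.builder
         .appName("session-demo")
         .config("spark.sql.shuffle.partitions", "200")
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.createOrReplaceTempView("t")
spark.sql("SELECT count(*) FROM t").show()   # goes through the Catalyst optimizer

print(spark.sparkContext.applicationId)      # the SparkContext still lives inside
spark.stop()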


Saturday, July 27, 2024

Kafka Ecosystem




1. 📌Topics:

๐Ÿ‘‰๐ŸผA stream of messages belonging to a particular category is called a Topic.

๐Ÿ‘‰๐ŸผIts is a logical feed name where to which records are published(Similar to Table in DB )

๐Ÿ‘‰๐ŸผUnique identification of table is called name of the topic - can not be duplicated

๐Ÿ‘‰๐ŸผA topic is a storage mechanism for a sequence of events

๐Ÿ‘‰๐ŸผEvents are immutable

๐Ÿ‘‰๐Ÿผkeep events in the same order as they occur in time. So, each new event is always added to the end of the Message. 


2. 📌Partitions:

๐Ÿ‘‰๐ŸผTopics are split into partition

๐Ÿ‘‰๐ŸผAll the messages within a partition are ordered and immutable

๐Ÿ‘‰๐ŸผAll the messages within the partition has a unique ID associated is called OFFSET.

๐Ÿ‘‰๐ŸผKafka uses topic partitioning to improve scalability.

๐Ÿ‘‰๐ŸผKafka guarantees the order of the events within the same topic partition. However, by default, it does not guarantee the order of events across all partitions.


3. 📌Replicas:

👉🏼Replicas are backups of partitions

👉🏼Clients never read from or write to replicas directly

👉🏼They are used to prevent data loss (fault tolerance)



4. 📌Producer:

๐Ÿ‘‰๐ŸผProducers publish messages by appending to the end of a topic partition.

๐Ÿ‘‰๐ŸผEach message will be stored in the broker disk and will receive an offset (unique identifier). This offset is unique at the partition level, each partition has its owns offsets. That is one more reason that makes Kafka so special, it stores the messages in the disk (like a database, and in fact, Kafka is a database too) to be able to recover them later if necessary. Different from a messaging system, that the message is deleted after being consumed;

๐Ÿ‘‰๐ŸผThe producers use the offset to read the messages, they read from the oldest to the newest. In case of consumer failure, when it recovers, will start reading from the last offset;

๐Ÿ‘‰๐ŸผBy default, if a message contains a key (i.e. the key is NOT null), the hashed value of the key is used to decide in which partition the message is stored.

๐Ÿ‘‰๐ŸผProducers publish messages by appending to the end of a topic partition. By default, if a message contains a key (i.e. the key is NOT null), the hashed value of the key is used to decide in which partition the message is stored.

๐Ÿ‘‰๐ŸผAll messages with the same key will be stored in the same topic partition. This behavior is essential to ensure that messages for the same key are consumed and processed in order from the same topic partition.

๐Ÿ‘‰๐ŸผProducers write the messages to the topic level(All the partitions of that topic) or specific partition of the topic using Producing API's

๐Ÿ‘‰๐ŸผIf the key is null, the producer behaves differently according to the Kafka version:

  • up to Kafka 2.3: a round-robin partitioner is used to balance the messages across all partitions

  • Kafka 2.4 or newer: a sticky partitioner is used which leads to larger batches and reduced latency and is particularly beneficial for very high throughput scenarios
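
A short producer sketch using the kafka-python client; the broker address, topic, keys, and payloads are assumptions for illustration:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Same key -> same partition (hash of the key), so per-key ordering holds.
producer.send("orders", key="customer-42", value={"amount": 10})
producer.send("orders", key="customer-42", value={"amount": 20})

# Or target a specific partition explicitly instead of key-based routing.
producer.send("orders", partition=0, key="customer-7", value={"amount": 5})

producer.flush()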


5. 📌Consumer:

๐Ÿ‘‰๐ŸผConsumers are applications which read/consume data from the topics within a cluster using consuming API's

๐Ÿ‘‰๐ŸผConsumers can read either on the topic level (All the partitions of the topic) or specific partition of the topics

๐Ÿ‘‰๐ŸผEach message published to a topic is delivered to a consumer that is subscribed to that topic.

๐Ÿ‘‰๐ŸผA consumer can read data from any position of the partition, and internally the position is stored as a pointer called offset. In most of the cases, a consumer advances its offset linearly, but it could be in any order, or starting from any given position.

๐Ÿ‘‰๐ŸผEach consumer belongs to a consumer group. A consumer group may consist of multiple consumer instances.

๐Ÿ‘‰๐ŸผThis is the reason why a consumer group can be both, fault tolerant and scalable.

๐Ÿ‘‰๐Ÿผ If one of several consumer instances in a group dies, the topic partitions are reassigned to other consumer instances such that the remaining ones continue to process messages form all partitions.

๐Ÿ‘‰๐Ÿผ If a consumer group contains more than one consumer instance, each consumer will only receive messages from a subset of the partitions. When a consumer group only contains one consumer instance, this consumer is responsible for processing all messages of all topic partitions.

๐Ÿ‘‰๐ŸผMessage consumption can be parallelized in a consumer group by adding more consumer instances to the group, up to the number of a topic’s partitions.

๐Ÿ‘‰๐Ÿผ if a topic has 8 partitions, a consumer group can support up to 8 consumer instances which all consume in parallel, each from 1 topic partition.

๐Ÿ‘‰๐ŸผAdding more consumers in a consumer group than the number of partitions for a topic, then they will stay in an idle state, without getting any message



6. 📌Kafka Broker:

๐Ÿ‘‰๐ŸผThat Kafka broker is a program that runs on the Java Virtual Machine (Java version 11+)

๐Ÿ‘‰๐ŸผA Kafka broker is used to manage storing the data records/messages on the topic. It can be understood as the mediator between the two


๐Ÿ‘‰๐ŸผThe Kafka broker is responsible for transferring the conversation that the publisher is pushing in the Kafka log commit and the subscriber shall be consuming these messages.

๐Ÿ‘‰๐ŸผEnabling the delivery of the data records/ message to process to the right consumer.



7. 📌Kafka Cluster:

๐Ÿ‘‰๐ŸผAn ensemble of Kafka brokers working together is called a Kafka cluster. Some clusters may contain just one broker or others may contain three or potentially hundreds of brokers. Companies like Netflix and Uber run hundreds or thousands of Kafka brokers to handle their data.

๐Ÿ‘‰๐ŸผA broker in a cluster is identified by a unique numeric ID. In the figure below, the Kafka cluster is made up of three Kafka brokers.

๐Ÿ‘‰๐ŸผTo configure the number of the partitions in each broker, we need to configure something called Replication Factor when creating a topic. Let’s say that we have three brokers in our cluster, a topic with three partitions and a Replication Factor of three, in that case, each broker will be responsible for one partition of the topic.



๐Ÿ‘‰๐ŸผAs you can see in the above image, Topic_1 has three partitions, each broker is responsible for a partition of the topic, so, the Replication Factor of the Topic_1 is three.

๐Ÿ‘‰๐ŸผIt’s very important that the number of the partitions match the number of the brokers, in this way, each broker will be responsible for a single partition of the topic

More to follow in a future post.


"๐Ÿš€ Delta Lake's Vectorized Delete: The Secret to 10x Faster Data Operations!"

"๐Ÿš€ Delta Lake's Vectorized Delete: The Secret to 10x Faster Data Operations!" Big news for data engineers! Delta Lake 2.0+ in...