Posts

Showing posts from July, 2024

𝗥𝗲𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝘃𝘀 𝗖𝗼𝗮𝗹𝗲𝘀𝗰𝗲

🚩𝗤𝘂𝗲𝘀𝘁𝗶𝗼𝗻 🚀 𝗥𝗲𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝘃𝘀 𝗖𝗼𝗮𝗹𝗲𝘀𝗰𝗲
----------------------------------------------------------------------------
📌𝗥𝗲𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻:
👉🏻Default partitioning for an RDD/DataFrame:
👉🏻spark.sql.files.maxPartitionBytes: 128 MB by default (the maximum number of bytes packed into a single partition when reading files)
👉🏻spark.default.parallelism: the default number of partitions when data is created within Spark. It depends on the available cores, for example 8 partitions on an 8-core local machine
👉🏻Repartition is used to increase or decrease the number of partitions in Spark
👉🏻Repartition shuffles the data and builds the new partitions from scratch. When given only a partition count, it produces roughly equal-sized partitions
👉🏻Because of the full shuffle, it is expensive, so whether it pays off depends on the use case
👉🏻Use repartition to redistribute data evenly before a join, ensuring a balanced workload
👉🏻Apply repartition before grouping to improve data distribution.
📌𝗖𝗼𝗮𝗹𝗲𝘀𝗰𝗲:
👉🏻Coalesce can only reduce the number of partitions
👉🏻Coalesce does not require a full shuffle
👉🏻Unlike repartition, it tries to minimize data movement and avoids a full shuffle when reducing partitions.
👉...
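The difference between the two operations can be sketched with a toy model in plain Python. This is not Spark code: the functions `repartition` and `coalesce` below are hypothetical stand-ins that only imitate the behaviour described above, with lists standing in for partitions. The round-robin dealing and the "merge neighbouring partitions" rule are simplifying assumptions.

```python
from itertools import chain


def repartition(partitions, n):
    """Toy full shuffle: deal every record round-robin into n new partitions."""
    new_parts = [[] for _ in range(n)]
    for i, record in enumerate(chain.from_iterable(partitions)):
        new_parts[i % n].append(record)
    return new_parts


def coalesce(partitions, n):
    """Toy coalesce: merge whole partitions into n groups without splitting them."""
    if n >= len(partitions):
        # Coalesce can only reduce the count, so a larger request changes nothing.
        return [list(p) for p in partitions]
    new_parts = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring partitions are merged; records inside a partition stay together.
        new_parts[i * n // len(partitions)].extend(part)
    return new_parts


# Four unevenly filled partitions holding ten records in total.
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]

print([len(p) for p in repartition(skewed, 2)])  # balanced sizes
print([len(p) for p in coalesce(skewed, 2)])     # sizes inherited from the inputs
print(len(repartition(skewed, 8)), len(coalesce(skewed, 8)))
```

In this model, repartition touches every record and ends up with balanced partitions, while coalesce moves whole partitions only, so any existing skew carries over to the result.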

𝗦𝗽𝗮𝗿𝗸 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝘃𝘀 𝗦𝗽𝗮𝗿𝗸 𝗦𝗲𝘀𝘀𝗶𝗼𝗻

Spark Context:
🔶 SparkContext is the traditional entry point to any Spark application
🔶 It represents the connection to the Spark cluster and is where the user configures properties common to the entire application and creates RDDs
🔶 SparkContext is designed for low-level programming and fine-grained control over the Spark application
🔶 Get the current status of the Spark application
🔶 Access various services
🔶 Cancel a job
🔶 Cancel a stage
🔶 Closure cleaning
🔶 Register a SparkListener
🔶 Programmable dynamic allocation
🔶 Access persistent RDDs
Common Challenges
🔶 Managing multiple contexts
🔶 Contexts need proper initialization
🔶 Failing to manage the context correctly leads to performance problems
🔶 Before Spark 2.0, you had to create a separate context for each other kind of interaction (Hive, SQL, and Streaming contexts)
🔶 In a multi-user, multi-application setup, conflicts can arise when multiple users or applications try to use the same context
🔶 Legacy code base
Spark S...

Kafka Eco System

1. 📌𝗧𝗼𝗽𝗶𝗰𝘀:
👉🏼A stream of messages belonging to a particular category is called a topic.
👉🏼It is a logical feed name to which records are published (similar to a table in a database)
👉🏼A topic is identified by its name, which cannot be duplicated
👉🏼A topic is a storage mechanism for a sequence of events
👉🏼Events are immutable
👉🏼Events are kept in the same order as they occur in time, so each new event is always appended to the end of the log.
2. 📌𝗣𝗔𝗥𝗧𝗜𝗧𝗜𝗢𝗡𝗦:
👉🏼Topics are split into partitions
👉🏼All the messages within a partition are ordered and immutable
👉🏼Each message within a partition has a unique ID called the offset.
👉🏼Kafka uses topic partitioning to improve scalability.
👉🏼Kafka guarantees the order of the events within the same topic partition. However, by default, it does not guarantee the order of events across a...
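The relationship between topics, partitions, and offsets can be sketched with a small in-memory model. This is not the Kafka client API: the `Topic` class, its `publish` and `read` methods, the topic name `orders`, and the use of `zlib.crc32` as the partitioner are all assumptions made for illustration.

```python
import zlib


class Topic:
    """Toy in-memory model of a topic; not the Kafka client API."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an append-only list; the list index plays the role of the offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Stand-in partitioner (an assumption): a stable hash of the key picks the partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        # Events are stored as tuples, so they cannot be modified once written.
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset):
        return self.partitions[partition][offset]


topic = Topic("orders", num_partitions=3)
first = topic.publish("customer-1", "created")
second = topic.publish("customer-1", "paid")
other = topic.publish("customer-2", "created")

print(first, second, other)
print(topic.read(*second))
```

Events that share a key land in the same partition and receive increasing offsets, so their relative order is preserved. Nothing in the model orders events that sit in different partitions.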