Tuesday, July 30, 2024

๐—ฅ๐—ฒ๐—ฝ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป ๐˜ƒ๐˜€ ๐—–๐—ผ๐—ฎ๐—น๐—ฒ๐˜€๐—ฐ๐—ฒ

🚩Question

🚀 Repartition vs Coalesce


----------------------------------------------------------------------------

๐Ÿ“Œ๐—ฅ๐—ฒ๐—ฝ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป:


๐Ÿ‘‰๐ŸปDefault partition for RDD/DataFrame

๐Ÿ‘‰๐Ÿปspark.sql.files.maxpartitionBytes-128MB(

๐Ÿ‘‰๐Ÿปspark.default.parallelism-8 partitons by default(creating data within spark)


๐Ÿ‘‰๐ŸปRepartition is used to increase or decrease the partition in spark

๐Ÿ‘‰๐ŸปRepartition shuffle the data and build a new partition from scratch

repartition is always result equal size partition

๐Ÿ‘‰๐ŸปDue to full shuffle its not good for performance(Depend upon the use case)

๐Ÿ‘‰๐ŸปUse repartition to redistribute data evenly before the join, ensuring a balanced workload

๐Ÿ‘‰๐ŸปApply repartition before grouping to enhance data distribution.


๐Ÿ“Œ๐—–๐—ผ๐—ฎ๐—น๐—ฒ๐˜€๐—ฐ๐—ฒ:


๐Ÿ‘‰๐ŸปCoalesce will only reduces the no of partitions

๐Ÿ‘‰๐ŸปCoalesce does not required full shuffle

๐Ÿ‘‰๐ŸปUnlike repartition, it tries to minimize data movement and avoids a full shuffle when reducing partitions.


๐Ÿ‘‰๐ŸปDue to partition merge - creates uneven no of partitions

๐Ÿ‘‰๐Ÿป๐—ฃ๐—ฟ๐—ผ๐—ฏ๐—น๐—ฒ๐—บ High partition count at the end of processing, leading to numerous small output files

๐Ÿ‘‰๐Ÿป๐—ฆ๐—ผ๐—น๐˜‚๐˜๐—ถ๐—ผ๐—ป we can Use coalesce to decrease partitions before saving the final result.


๐Ÿ‘‰๐Ÿป๐—ฃ๐—ฟ๐—ผ๐—ฏ๐—น๐—ฒ๐—บ: Numerous small files causing storage and reading inefficiencies.

๐Ÿ‘‰๐Ÿป๐—ฆ๐—ผ๐—น๐˜‚๐˜๐—ถ๐—ผ๐—ป: Utilize coalesce to reduce output file count


