𝗥𝗲𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻 𝘃𝘀 𝗖𝗼𝗮𝗹𝗲𝘀𝗰𝗲
----------------------------------------------------------------------------
📌𝗥𝗲𝗽𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻:
👉🏻Default partitioning for RDDs/DataFrames:
👉🏻spark.sql.files.maxPartitionBytes - 128 MB (controls the maximum partition size when reading files)
👉🏻spark.default.parallelism - defaults to the total number of cores (e.g. 8 on an 8-core machine) when creating data within Spark
👉🏻Repartition can increase or decrease the number of partitions in Spark
👉🏻Repartition shuffles the data and builds new partitions from scratch
👉🏻Repartition always results in roughly equal-sized partitions
👉🏻Due to the full shuffle, it can hurt performance (depends on the use case)
👉🏻Use repartition to redistribute data evenly before a join, ensuring a balanced workload
👉🏻Apply repartition before grouping to improve data distribution
📌𝗖𝗼𝗮𝗹𝗲𝘀𝗰𝗲:
👉🏻Coalesce can only reduce the number of partitions
👉🏻Coalesce does not require a full shuffle
👉🏻Unlike repartition, it minimizes data movement by merging existing partitions instead of rebuilding them
👉🏻Because partitions are merged, coalesce can produce unevenly sized partitions
👉🏻𝗣𝗿𝗼𝗯𝗹𝗲𝗺: High partition count at the end of processing, leading to numerous small output files
👉🏻𝗦𝗼𝗹𝘂𝘁𝗶𝗼𝗻: Use coalesce to decrease the number of partitions before saving the final result
👉🏻𝗣𝗿𝗼𝗯𝗹𝗲𝗺: Numerous small files causing storage and read inefficiencies
👉🏻𝗦𝗼𝗹𝘂𝘁𝗶𝗼𝗻: Use coalesce to reduce the output file count