Question
Repartition Vs Coalesce
----------------------------------------------------------------------------
Repartition:
- Default partitioning for an RDD/DataFrame:
- spark.sql.files.maxPartitionBytes: 128 MB by default; it caps the bytes packed into a single partition when reading files
- spark.default.parallelism: depends on the environment; in local mode it equals the number of cores (for example, 8 partitions on an 8-core machine) and applies to data created within Spark
- Repartition can increase or decrease the number of partitions in Spark
- Repartition shuffles the data and builds the new partitions from scratch
- repartition(n) produces roughly equal-sized partitions; repartitioning by a column can still be uneven if the key itself is skewed
- Because of the full shuffle it is expensive, so whether it pays off depends on the use case
- Use repartition to redistribute data evenly before a join, ensuring a balanced workload
- Apply repartition before grouping to improve data distribution.
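The round-robin behaviour described above can be illustrated with a small pure-Python model. This is only a sketch of the idea, not Spark's implementation: the function name `repartition` and the list-of-lists representation of partitions are assumptions made for illustration. In Spark itself the corresponding call would be `df.repartition(n)`.

```python
from typing import List


def repartition(partitions: List[list], num_partitions: int) -> List[list]:
    """Toy full shuffle: every row is reassigned, round-robin, to a new partition."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    new_partitions: List[list] = [[] for _ in range(num_partitions)]
    position = 0
    for part in partitions:
        for row in part:
            # Each row moves, regardless of which partition it came from.
            new_partitions[position % num_partitions].append(row)
            position += 1
    return new_partitions


# Three skewed input partitions holding 90, 5 and 5 rows.
skewed = [list(range(0, 90)), list(range(90, 95)), list(range(95, 100))]
balanced = repartition(skewed, 4)
balanced_sizes = [len(p) for p in balanced]
print(balanced_sizes)  # [25, 25, 25, 25]
```

Every row is touched and moved, which is why the result is balanced and also why the operation is costly on real data.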
Coalesce:
- Coalesce only reduces the number of partitions
- Coalesce does not require a full shuffle
- Unlike repartition, it tries to minimize data movement and avoids a full shuffle when reducing partitions.
- Because it merges existing partitions, the resulting partitions can be uneven in size
- Problem: A high partition count at the end of processing leads to numerous small output files, which make storage and reading inefficient.
- Solution: Use coalesce to decrease the partition count before saving the final result, which reduces the output file count.
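To see why merging partitions avoids a full shuffle but can leave uneven sizes, here is a small pure-Python model. It is a simplified sketch, not Spark's actual algorithm (which also takes data locality into account): the function name `coalesce`, the grouping rule and the list-of-lists representation are assumptions made for illustration. In Spark the corresponding call would be `df.coalesce(n)` just before the write.

```python
from typing import List


def coalesce(partitions: List[list], num_partitions: int) -> List[list]:
    """Toy coalesce: merge whole neighbouring partitions, never split one."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be >= 1")
    if num_partitions >= len(partitions):
        # Coalesce cannot increase the partition count.
        return [list(p) for p in partitions]
    merged: List[list] = [[] for _ in range(num_partitions)]
    for index, part in enumerate(partitions):
        # Map each old partition, as a whole, onto one of the new ones.
        target = index * num_partitions // len(partitions)
        merged[target].extend(part)
    return merged


# Six partitions (think: six small output files) with uneven row counts.
input_sizes = [50, 10, 10, 10, 10, 10]
small_parts = []
next_row = 0
for size in input_sizes:
    small_parts.append(list(range(next_row, next_row + size)))
    next_row += size

coalesced = coalesce(small_parts, 2)
coalesced_sizes = [len(p) for p in coalesced]
print(coalesced_sizes)  # [70, 30]
```

Two output partitions instead of six means fewer files on write, at the cost of the 70/30 imbalance that a repartition would have evened out.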