Tuesday, July 30, 2024

๐—ฅ๐—ฒ๐—ฝ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป ๐˜ƒ๐˜€ ๐—–๐—ผ๐—ฎ๐—น๐—ฒ๐˜€๐—ฐ๐—ฒ

🚩Question

🚀 Repartition vs Coalesce


----------------------------------------------------------------------------

๐Ÿ“Œ๐—ฅ๐—ฒ๐—ฝ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป:


๐Ÿ‘‰๐ŸปDefault partition for RDD/DataFrame

๐Ÿ‘‰๐Ÿปspark.sql.files.maxpartitionBytes-128MB(

๐Ÿ‘‰๐Ÿปspark.default.parallelism-8 partitons by default(creating data within spark)


๐Ÿ‘‰๐ŸปRepartition is used to increase or decrease the partition in spark

๐Ÿ‘‰๐ŸปRepartition shuffle the data and build a new partition from scratch

repartition is always result equal size partition

๐Ÿ‘‰๐ŸปDue to full shuffle its not good for performance(Depend upon the use case)

๐Ÿ‘‰๐ŸปUse repartition to redistribute data evenly before the join, ensuring a balanced workload

๐Ÿ‘‰๐ŸปApply repartition before grouping to enhance data distribution.


๐Ÿ“Œ๐—–๐—ผ๐—ฎ๐—น๐—ฒ๐˜€๐—ฐ๐—ฒ:


๐Ÿ‘‰๐ŸปCoalesce will only reduces the no of partitions

๐Ÿ‘‰๐ŸปCoalesce does not required full shuffle

๐Ÿ‘‰๐ŸปUnlike repartition, it tries to minimize data movement and avoids a full shuffle when reducing partitions.


๐Ÿ‘‰๐ŸปDue to partition merge - creates uneven no of partitions

๐Ÿ‘‰๐Ÿป๐—ฃ๐—ฟ๐—ผ๐—ฏ๐—น๐—ฒ๐—บ High partition count at the end of processing, leading to numerous small output files

๐Ÿ‘‰๐Ÿป๐—ฆ๐—ผ๐—น๐˜‚๐˜๐—ถ๐—ผ๐—ป we can Use coalesce to decrease partitions before saving the final result.


๐Ÿ‘‰๐Ÿป๐—ฃ๐—ฟ๐—ผ๐—ฏ๐—น๐—ฒ๐—บ: Numerous small files causing storage and reading inefficiencies.

๐Ÿ‘‰๐Ÿป๐—ฆ๐—ผ๐—น๐˜‚๐˜๐—ถ๐—ผ๐—ป: Utilize coalesce to reduce output file count


