groupByKey vs reduceByKey
Using combiners in PySpark’s reduceByKey can significantly optimize the shuffling process by performing local aggregation on each partition before the data is shuffled across the network. This technique helps reduce the amount of data transferred, leading to better performance.
How Combiners Work in reduceByKey
1. Local Aggregation: Each partition performs a local aggregation of the values for each key. This step reduces the amount of data that needs to be shuffled.
2. Shuffle Phase: The partially aggregated results from each partition are shuffled across the network to the appropriate reducers.
3. Final Aggregation: The reducers perform the final aggregation of the values for each key.
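The three steps above can be illustrated with a minimal word-count sketch (the sample data, app name, and partition count are hypothetical):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "ReduceByKeyExample")  # hypothetical app name

# Hypothetical sample data: (word, 1) pairs spread across two partitions.
pairs = sc.parallelize(
    [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)],
    numSlices=2,
)

# reduceByKey runs the add function map-side (the combiner) within each
# partition first, so at most one partial sum per key per partition is
# shuffled; the same function then merges the partial sums on the reducers.
counts = pairs.reduceByKey(lambda x, y: x + y)

print(counts.collect())  # e.g. [('b', 2), ('c', 1), ('a', 3)]

sc.stop()
```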
Benefits
- Reduced Data Shuffling: By aggregating data locally before shuffling, the amount of data transferred across the network is minimized.
- Improved Performance: Less data shuffling leads to faster execution times and reduced memory usage.
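As a back-of-the-envelope illustration of why this matters (the numbers below are hypothetical), the shuffle volume for reduceByKey is bounded by distinct keys times partitions rather than by the total record count:

```python
# Hypothetical numbers: 1,000,000 records, 100 distinct keys, 8 partitions.
records, keys, partitions = 1_000_000, 100, 8

# groupByKey shuffles every record; reduceByKey shuffles at most one
# partially aggregated record per key per partition.
print("groupByKey shuffle records:", records)                   # 1000000
print("reduceByKey shuffle records (max):", keys * partitions)  # 800
```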
groupByKey
- groupByKey is a transformation operation on Pair RDDs in PySpark
- Groups all values associated with the same key into a single sequence
- Shuffles data across the network to group values by key, which makes it a costly operation, especially for large datasets
- Every record is transferred over the network, since no map-side combining takes place
- If the list of values for a key is too large to fit in one partition, it can cause disk spills
- By default, groupByKey produces hash-partitioned RDDs
- Less efficient than reduceByKey because it shuffles all the data across the network, which can be very expensive in terms of memory and time (see the sketch after this list)
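For contrast, here is a minimal sketch of the same word count using groupByKey (same hypothetical data and app name as before). It produces an identical result, but every input pair is shuffled before any summing happens:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "GroupByKeyExample")  # hypothetical app name

pairs = sc.parallelize(
    [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)],
    numSlices=2,
)

# groupByKey performs no map-side combining: all six pairs cross the
# network, and the values for each key are only summed after the shuffle.
grouped = pairs.groupByKey().mapValues(sum)

print(grouped.collect())  # e.g. [('b', 2), ('c', 1), ('a', 3)]

sc.stop()
```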
Reference:
https://sparkbyexamples.com/spark/spark-groupbykey-vs-reducebykey/