Memory Overhead In pySpark
In Spark, memory overhead refers to the additional memory allocated beyond the user-defined executor memory. This overhead is crucial for managing various internal operations and ensuring smooth execution of tasks.
What is Memory Overhead?
Memory overhead in Spark includes memory used for:
1. Task Execution Management
- Tracks the status, context, and metadata of tasks being executed
- Allocates space for task-related information such as input splits, intermediate results, and shuffle output
2. Shuffle Operations
- During shuffle operations, intermediate data is exchanged between nodes; this requires additional memory for buffer management and data serialization
3. Broadcast Variables
- Memory overhead ensures that broadcast variables are efficiently stored and managed, reducing redundant data transfer
4. Internal Data Structures
- Spark's internal data structures, such as task metadata, storage bookkeeping, and job details, require additional memory
5. Network Buffers
- During data exchange between nodes, network buffers are used to temporarily hold data that is being sent or received

This overhead is controlled by the spark.executor.memoryOverhead setting.
How Much Memory is Allocated?
The amount of memory allocated for overhead is typically a fraction of the total executor memory. By default, Spark allocates 10% of the executor memory for overhead, with a minimum of 384 MB. This can be configured using the spark.executor.memoryOverhead parameter (older YARN deployments used spark.yarn.executor.memoryOverhead). For example:
- If an executor has 4 GB of memory, the default overhead would be roughly 400 MB (10% of 4 GB, which is above the 384 MB minimum).
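The default can be sketched in plain Python. The helper below is hypothetical (not a Spark API); it just mirrors the documented rule of 10% of executor memory with a 384 MB floor:

```python
def default_overhead_mb(executor_mem_mb: int) -> int:
    """Approximate Spark's default executor memory overhead:
    10% of executor memory, with a 384 MB minimum."""
    return max(int(executor_mem_mb * 0.10), 384)

# A 4 GB (4096 MB) executor gets about 409 MB of overhead by default,
# while a 1 GB executor is bumped up to the 384 MB minimum.
print(default_overhead_mb(4096))  # 409
print(default_overhead_mb(1024))  # 384
```

Note that small executors pay a proportionally larger overhead because of the floor, which is worth keeping in mind when sizing many small executors.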
from pyspark import SparkConf, SparkContext

# Set executor memory and explicitly raise the per-executor overhead
# from the ~400 MB default to 512 MB before creating the context.
conf = SparkConf().setAppName("MemoryOverheadExample") \
    .set("spark.executor.memory", "4g") \
    .set("spark.executor.memoryOverhead", "512m")
sc = SparkContext(conf=conf)
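The same settings can also be passed at submit time instead of in code; a sketch, where the script name is a placeholder:

```shell
spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.executor.memoryOverhead=512m \
  my_app.py
```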