Spark’s Driver and Executors
Spark follows a master-slave architecture with one central coordinator and multiple distributed worker nodes. The central coordinator is called the Spark Driver, and it communicates with all the Workers.
Each Worker node hosts one or more Executors, which are responsible for running Tasks. Executors register themselves with the Driver, so the Driver has up-to-date information about all the Executors at all times.
This working combination of Driver and Workers is known as a Spark Application.
The Spark Application is launched with the help of the Cluster Manager, which can be any one of the following:
- Spark Standalone Mode
- YARN
- Mesos
- Kubernetes
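For reference, the choice of Cluster Manager is expressed through the master URL passed to Spark; the host names and ports below are placeholders:

--master spark://host:7077             (Spark Standalone)
--master yarn                          (YARN; the cluster is read from HADOOP_CONF_DIR)
--master mesos://host:5050             (Mesos)
--master k8s://https://host:6443       (Kubernetes)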
DRIVER
The Driver is a JVM process. This is the process where the main() method of our Scala, Java, or Python program runs. It executes the user code and creates a SparkSession (or SparkContext), which is then used to create DataFrames, Datasets, and RDDs, execute SQL, perform Transformations and Actions, and so on.
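For illustration, a minimal Scala entry point might look like the following sketch; the object name and application name are placeholders:

import org.apache.spark.sql.SparkSession

object MyApp {
  def main(args: Array[String]): Unit = {
    // Everything in main() runs inside the Driver process.
    val spark = SparkSession.builder()
      .appName("MyApp")
      .getOrCreate()

    // The SparkSession is the entry point for DataFrames, Datasets, SQL, etc.
    val df = spark.range(1, 100).toDF("id")
    df.show()

    spark.stop()
  }
}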
Responsibility of DRIVER
- The main() method of our program runs in the Driver process. It creates the SparkSession or SparkContext.
- Conversion of the user code into Tasks (transformations and actions): the Driver inspects the user code and determines the Tasks to be performed, i.e. the number of Tasks is decided by the Driver (a short sketch follows this list).
- Builds the Lineage, the Logical Plan, and the Physical Plan.
- Once the Physical Plan is generated, the Driver schedules the execution of the tasks by coordinating with the Cluster Manager.
- Coordinates with all the Executors for the execution of Tasks. It looks at the current set of Executors and schedules our tasks.
- Keeps track of the data (in the form of metadata) which was cached (persisted) in Executor’s (worker’s) memory.
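As a minimal Scala sketch of the transformation/action split the Driver reasons about (the input path is a placeholder): transformations are only recorded in the Lineage, and it is the action that makes the Driver build the plans and create Tasks.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LineageDemo").getOrCreate()

// Transformations: recorded in the Lineage, nothing executes yet.
val lines  = spark.read.textFile("hdfs:///data/input.txt")  // placeholder path
val errors = lines.filter(_.contains("ERROR"))

// Action: the Driver now builds the Logical and Physical Plans
// and schedules Tasks on the Executors.
val numErrors = errors.count()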
EXECUTOR
Executors reside on the Worker nodes. Executors are launched at the start of a Spark Application in coordination with the Cluster Manager.
They can also be launched and removed dynamically by the Driver as required.
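This dynamic behavior is governed by Spark’s dynamic allocation settings. A typical configuration sketch follows; the values are illustrative, and dynamic allocation generally also requires the external shuffle service (or shuffle tracking on newer Spark versions):

spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=10
spark.dynamicAllocation.executorIdleTimeout=60s
spark.shuffle.service.enabled=true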
Responsibility of EXECUTOR
- To run an individual Task and return the result to the Driver.
- To cache (persist) data in the Worker node’s memory or disk.
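Caching is requested from user code, but the cached blocks physically live on the Executors, and the Driver tracks their locations. A minimal Scala sketch (the StorageLevel choice and the data are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("CacheDemo").getOrCreate()

val data = spark.range(1, 1000000).persist(StorageLevel.MEMORY_ONLY)
data.count()  // first action computes the data and stores the blocks in Executor memory
data.count()  // second action is served from the cached blocks on the Executors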
CLUSTER MANAGER
Spark depends on the Cluster Manager to launch the Executors, and also the Driver (in cluster deploy mode).
Spark can be run with any of the Cluster Managers mentioned above.
Spark provides a script named “spark-submit” that connects to the chosen Cluster Manager and controls the resources the application gets: it decides the number of Executors to be launched, how much CPU and memory should be allocated to each Executor, and so on (an example follows below).
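The same resources can equivalently be passed as configuration properties instead of dedicated flags. The values below are illustrative, and spark.executor.instances takes effect on YARN and Kubernetes:

spark-submit --master <Spark master URL> \
  --conf spark.executor.memory=2g \
  --conf spark.executor.cores=4 \
  --conf spark.executor.instances=5 \
  WordCount-assembly-1.0.jar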
Working Process
spark-submit --master <Spark master URL> --executor-memory 2g --executor-cores 4 WordCount-assembly-1.0.jar
- Let’s say a user submits a job using “spark-submit”.
- “spark-submit” will in turn launch the Driver, which executes the main() method of our code.
- The Driver contacts the Cluster Manager and requests resources to launch the Executors.
- The Cluster Manager launches the Executors on behalf of the Driver.
- Once the Executors are launched, they establish a direct connection with the Driver.
- The Driver determines the total number of Tasks by examining the Lineage.
- The Driver creates the Logical and Physical Plans.
- Once the Physical Plan is generated, Spark allocates the Tasks to the Executors.
- Tasks run on the Executors, and each Task returns its result to the Driver upon completion.
- When all Tasks are completed, the main() method running in the Driver exits, i.e. the main() method invokes sparkContext.stop().
- Finally, Spark releases all the resources from the Cluster Manager.
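Putting the walkthrough together, a word-count application matching the jar above might look like the following sketch (the input and output paths are placeholders); explain() prints the Physical Plan the Driver generates before Tasks are shipped to the Executors:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split}

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()

    // Transformations: recorded lazily in the Lineage.
    val counts = spark.read.textFile("hdfs:///data/input.txt")  // placeholder path
      .select(explode(split(col("value"), "\\s+")).as("word"))
      .groupBy("word")
      .count()

    counts.explain()                        // prints the Physical Plan built by the Driver
    counts.write.csv("hdfs:///data/output") // action: Tasks run on the Executors

    spark.stop()                            // main() ends; resources are released
  }
}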