Tuesday, July 23, 2024

๐—จ๐—ป๐—ฑ๐—ฒ๐—ฟ๐˜€๐˜๐—ฎ๐—ป๐—ฑ๐—ถ๐—ป๐—ด ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—๐—ผ๐—ฏ๐˜€, ๐—ฆ๐˜๐—ฎ๐—ด๐—ฒ๐˜€, ๐—ฎ๐—ป๐—ฑ ๐—ง๐—ฎ๐˜€๐—ธ๐˜€



๐—จ๐—ป๐—ฑ๐—ฒ๐—ฟ๐˜€๐˜๐—ฎ๐—ป๐—ฑ๐—ถ๐—ป๐—ด ๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ ๐—๐—ผ๐—ฏ๐˜€, ๐—ฆ๐˜๐—ฎ๐—ด๐—ฒ๐˜€, ๐—ฎ๐—ป๐—ฑ ๐—ง๐—ฎ๐˜€๐—ธ๐˜€


🚀 1. Spark Job

 🔶 A Spark job is a complete computation task submitted to a Spark cluster; it includes all the transformations and actions you want to perform on your data.

 🔶 A job consists of multiple stages, each containing a set of tasks.

 🔶 A job is triggered by an action (e.g., collect(), save(), count()); the action prompts execution of all the preceding transformations.

 🔶 Common actions include collect(), save(), count(), and take().

 🔶 Jobs are the high-level units of work in Spark: they represent the full data-processing task, from reading data through transformations to writing the output (see the sketch below).
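
A minimal PySpark sketch of how an action triggers a job (the toy dataset, app name, and column expressions below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-demo").getOrCreate()

# Transformations are lazy: these lines only build the lineage, no job runs yet.
df = spark.range(1_000_000)                   # toy dataset for illustration
doubled = df.selectExpr("id * 2 AS value")
filtered = doubled.filter("value % 3 = 0")

# count() is an action, so it triggers a job that executes the whole lineage above.
print(filtered.count())

spark.stop()

Because only one action is called, the Spark UI shows exactly one job for this script.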


🚀 2. Stage

 🔶 A stage is a set of tasks that can be executed together without shuffling data across the network.

 🔶 Stages are created during a job's execution based on the transformations and actions in the data lineage.

 🔶 Stages are determined by shuffle boundaries: transformations that rearrange data, such as groupBy or reduceByKey, create shuffle dependencies and therefore separate stages.

 🔶 Stages help optimize processing by reducing data movement between tasks, minimizing network I/O.

 🔶 Stages are executed sequentially within a job, but the tasks within a stage run in parallel (see the sketch below).
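
A small PySpark sketch of how a shuffle creates a stage boundary (the app name and the "bucket" column are arbitrary choices for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stage-demo").getOrCreate()

# Narrow transformations (withColumn, filter) stay within a single stage.
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
narrow = df.filter(F.col("id") > 100)

# groupBy requires a shuffle, which ends one stage and starts another.
agg = narrow.groupBy("bucket").count()

# The physical plan shows an Exchange node at the shuffle (stage) boundary.
agg.explain()

# Triggering the action runs the job; the Spark UI shows it split into stages.
agg.collect()

spark.stop()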


🚀 3. Task

 🔶 A task is the smallest unit of work in Spark; it represents a single operation on one partition of the data.

 🔶 Tasks are created within stages and run on individual partitions of the distributed data.

 🔶 Spark automatically splits data into partitions and processes them in parallel, with each task handling one partition.

 🔶 The number of tasks typically equals the number of partitions, and tasks are distributed across the cluster's worker nodes.

 🔶 Tasks are fault-tolerant: if a task fails, it can be retried on a different node.

 🔶 Examples of tasks: reading a partition of a file, applying a map function, or filtering a partition of a dataset (see the sketch below).
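
A quick PySpark sketch of the one-task-per-partition relationship (the partition count of 8 is an arbitrary example value):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("task-demo").getOrCreate()

# Each partition of the data is processed by one task within a stage.
df = spark.range(1_000_000)
print("default partitions:", df.rdd.getNumPartitions())

# Repartitioning changes how many tasks the next stage launches.
df8 = df.repartition(8)
print("after repartition:", df8.rdd.getNumPartitions())

# The stage that scans df8 runs 8 tasks, one per partition, in parallel across
# the available executor cores; a failed task is retried, possibly on another node.
print(df8.count())

spark.stop()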


#dataengineer

#Pyspark

#Pysparkinterview

#Bigdata

#BigDataengineer

#dataanalytics

#data

#interview

#sparkdeveloper

#sparkbyexample

#pandas
