Understanding Spark Jobs, Stages, and Tasks
1. Spark Job
- A Spark job is a complete unit of computation submitted to the cluster: all the transformations needed to produce the result of a single action.
- It consists of one or more stages, each containing a set of tasks.
- A job is triggered by an action (e.g., collect(), save(), count(), take()); actions prompt the execution of all the preceding lazy transformations.
- Jobs are the high-level units of work in Spark, representing a full data-processing pass, from reading the data through transforming and writing it.
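To make the laziness concrete, here is a minimal PySpark sketch (the file name events.csv and the status column are hypothetical): nothing substantial runs until the count() action, which submits one job covering all the preceding transformations.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("job-demo").getOrCreate()

# Transformations are lazy. (With header=True, Spark may run a tiny
# job just to read the first line, but the real work has not started.)
df = spark.read.csv("events.csv", header=True)
filtered = df.filter(F.col("status") == "ok")   # transformation: still lazy

n = filtered.count()   # action: this submits a Spark job
print(n)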
2. Stage
- A stage is a set of tasks that can be executed together without shuffling data across the network.
- Stages are carved out of a job's lineage of transformations when Spark's DAG scheduler plans its execution.
- Stage boundaries are shuffle boundaries: wide transformations that rearrange data, like groupBy or reduceByKey, create shuffle dependencies and therefore start a new stage.
- Stages help optimize processing by keeping data movement between tasks, and thus network I/O, to a minimum.
- Dependent stages execute sequentially within a job, but the tasks within a stage run in parallel.
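A short sketch of a shuffle boundary, assuming the SparkSession from the previous example: reduceByKey forces a shuffle, so the job splits into two stages, and toDebugString() prints the lineage with the boundary visible in its indentation.

rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
counts = rdd.reduceByKey(lambda x, y: x + y)   # wide transformation: starts a new stage

print(counts.toDebugString().decode())   # indentation marks the stage boundary
counts.collect()                         # action: runs both stages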
3. Task
- A task is the smallest unit of work in Spark: a single operation applied to one partition of the data.
- Tasks are created within stages and run on individual partitions of the distributed data.
- Spark automatically splits data into partitions and processes them in parallel, with each task handling one partition.
- The number of tasks in a stage typically equals the number of partitions, and tasks are distributed across the cluster's worker nodes.
- Tasks are fault-tolerant; if a task fails, it can be retried on a different node.
- Examples: reading one partition of a file, applying a map function to it, or filtering the records in a partition.
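A minimal sketch of the task/partition relationship, again assuming an existing SparkSession: the partition count dictates how many tasks each stage runs.

rdd = spark.sparkContext.parallelize(range(100), numSlices=4)
print(rdd.getNumPartitions())        # 4 partitions -> 4 tasks per stage

rdd8 = rdd.repartition(8)            # shuffle the data into 8 partitions
print(rdd8.getNumPartitions())       # the next stage runs 8 tasks
rdd8.map(lambda x: x * 2).count()    # each task maps exactly one partition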
#dataengineer #Pyspark #Pysparkinterview #Bigdata #BigDataengineer #dataanalytics #data #interview #sparkdeveloper #sparkbyexample #pandas