Tuesday, July 30, 2024

Spark Context vs Spark Session


  • Spark Context:

🔶SparkContext is the traditional entry point to any Spark application

🔶It represents the connection to the Spark cluster and is where the user configures common properties for the entire application and creates RDDs

🔶SparkContext is designed for low-level programming and fine-grained control over the Spark application (a minimal sketch follows this list)

🔶Get the current status of the Spark application

🔶Access various services

🔶Cancel a job

🔶Cancel a stage

🔶Closure cleaning

🔶Register a SparkListener

🔶Programmatic dynamic allocation

🔶Access persistent RDDs
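
As an illustration, here is a minimal PySpark sketch of this traditional entry point; the app name and local master URL are placeholder values:

    from pyspark import SparkConf, SparkContext

    # Configure common properties for the whole application
    conf = SparkConf().setAppName("legacy-entry-point").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # Fine-grained, low-level work directly on an RDD
    rdd = sc.parallelize(range(10)).map(lambda x: x * 2)
    print(rdd.collect())

    # Inspect the current status of the application
    print(sc.applicationId)
    print(sc.statusTracker().getActiveJobsIds())

    # Cancel any running jobs, then shut the context down
    sc.cancelAllJobs()
    sc.stop()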


  • Common Challenges:

🔶Managing multiple contexts

🔶Requires proper context initialization

🔶Failing to manage the context correctly leads to performance degradation

🔶Before Spark 2.0, you had to create separate contexts for each kind of interaction (HiveContext, SQLContext, and StreamingContext); the legacy pattern is sketched after this list

🔶In a multi-user, multi-application scenario, conflicts can arise when multiple users or applications try to use the same context

🔶Legacy code bases
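
To make the pre-2.0 pain concrete, here is a rough sketch of the legacy pattern, where each workload needed its own context built on top of the SparkContext (app name and sample data are placeholders):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(conf=SparkConf().setAppName("pre-2.0-style").setMaster("local[*]"))

    sql_ctx = SQLContext(sc)        # separate context for SQL / DataFrames
    ssc = StreamingContext(sc, 10)  # separate context for DStreams (10-second batches)
    # A HiveContext(sc) would likewise be required for Hive tables

    df = sql_ctx.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.show()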

  • Spark Session:

🔶Spark Session was introduced in Spark 2.0

🔶No need to create multiple contexts

🔶SparkSession wraps the SparkContext and provides a high-level API for working with structured data through SQL and with streaming data through Spark's streaming APIs

🔶Improves performance through the Catalyst optimizer, predominantly for Spark SQL queries

🔶DataFrames provide a tabular structure over the data

🔶SparkSession integrates easily with Jupyter notebooks

🔶SparkSession combines all three legacy contexts; internally it creates a single SparkContext used for all operations

🔶SparkSession addresses the problem of multiple users accessing the same SparkContext

🔶Spark sessions handle isolation and resource management more efficiently (see the sketch after this list)
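
Putting it together, a minimal sketch of the unified entry point (app name and sample rows are placeholder values):

    from pyspark.sql import SparkSession

    # One unified entry point since Spark 2.0
    spark = (SparkSession.builder
             .appName("unified-entry-point")
             .master("local[*]")
             .getOrCreate())

    # High-level structured API: DataFrames and SQL
    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE id = 2").show()

    # The underlying SparkContext is still available for low-level access
    print(spark.sparkContext.applicationId)

    # Isolated sessions for multiple users sharing one SparkContext
    other = spark.newSession()  # shares the context; separate temp views and config

    spark.stop()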

