- Spark Context:
🔶SparkContext is the traditional entry point to any Spark application
🔶It represents the connection to the Spark cluster and is where the user configures the common properties for the entire application and creates RDDs
🔶SparkContext is designed for low-level programming & fine-grained control over the Spark application (see the sketch after this list)
🔶Get the current status of the Spark application
🔶Access various services
🔶Cancel a job
🔶Cancel a stage
🔶Closure cleaning
🔶Register a SparkListener
🔶Programmatic dynamic allocation
🔶Access persistent RDDs
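To make the list above concrete, here is a minimal PySpark sketch of working directly with SparkContext; the app name, master URL and job-group id are illustrative assumptions, not values from the post.

```python
from pyspark import SparkConf, SparkContext

# Configure common properties for the whole application (values assumed)
conf = (SparkConf()
        .setAppName("sparkcontext-demo")
        .setMaster("local[*]"))

sc = SparkContext(conf=conf)

# Low-level, fine-grained control: build and cache an RDD
rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)
rdd.persist()
print(rdd.sum())

# A few of the capabilities listed above
print(sc.applicationId)                 # current application id
print(sc.uiWebUrl)                      # Spark UI, useful for status checks
sc.setJobGroup("demo-group", "illustrative job group")
# sc.cancelJobGroup("demo-group")       # would cancel running jobs in the group

sc.stop()
```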
- Common Challenges:
🔶Managing multiple contexts
🔶Requires proper context initialization
🔶Failing to manage the context correctly leads to performance problems
🔶Before Spark 2.0, you had to create a separate context for each kind of interaction (HiveContext, SQLContext & StreamingContext), as shown in the sketch after this list
🔶In a multi-user, multi-application setup, conflicts can arise when multiple users or applications try to use the same context
🔶Legacy code bases
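As a hedged illustration of the pre-2.0 situation described above, the sketch below shows one application juggling SQLContext, HiveContext and StreamingContext on top of a single SparkContext; the names and batch interval are assumed, and these classes are deprecated (DStreams removed entirely) in the newest Spark releases.

```python
# Pre-Spark-2.0 style: a separate entry point per API, all on one SparkContext
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, HiveContext
from pyspark.streaming import StreamingContext

conf = SparkConf().setAppName("pre-2.0-demo").setMaster("local[2]")
sc = SparkContext(conf=conf)

sql_context = SQLContext(sc)          # DataFrames / SQL
hive_context = HiveContext(sc)        # Hive tables (deprecated since 2.0)
ssc = StreamingContext(sc, 10)        # DStreams with 10-second batches

# Each context is created, configured and shut down separately,
# and they all share the single SparkContext in the JVM.
```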
- Spark Session:
🔶Spark Session was introduced in Spark 2.0
🔶No need to create multiple contexts
🔶SparkSession integrates the Spark contexts and provides a high-level API for working with structured data through SQL and with streaming data through Spark Streaming
🔶Improves performance through the Catalyst optimizer, predominantly for Spark SQL queries
🔶DataFrames provide a tabular structure for the data
🔶SparkSession can be used directly from a Jupyter Notebook
🔶SparkSession is a combination of all three contexts; internally it creates a single SparkContext used for all operations (see the sketch after this list)
🔶SparkSession addresses the issue of multiple users accessing the same SparkContext
🔶SparkSession handles isolation and resource management more efficiently
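For comparison, here is a minimal PySpark sketch of the unified SparkSession entry point; the app name, master URL and sample data are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# Single unified entry point since Spark 2.0 (app name / master assumed)
spark = (SparkSession.builder
         .appName("sparksession-demo")
         .master("local[*]")
         .getOrCreate())

# High-level structured API: DataFrames and SQL
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.createOrReplaceTempView("t")
spark.sql("SELECT id, label FROM t WHERE id > 1").show()

# The underlying SparkContext is still reachable for low-level work
sc = spark.sparkContext
print(sc.applicationId)

# Isolated sessions can share one SparkContext (multi-user scenarios)
other = spark.newSession()            # separate SQL conf and temp views
print(other.sparkContext is sc)       # True: same underlying context

spark.stop()
```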