Monday, April 29, 2019

Spark - Datasets Components and Optimizer

SparkSQL Components
- Catalyst
- Execution Core
    - Query planner that translates logical query plans into physical Dataset operations
    - SparkSession (2.x) and SQLContext (1.x) are defined in the execution core
- Hive Integration

Catalyst Optimizer
- Frontend agnostic - optimizes plans produced by both SQL queries and Dataset code
- Manipulation of trees of relational operators and expressions
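As a sketch of Catalyst in action (assuming a running Spark 2.x session named `spark`, with hypothetical table and column names), `explain(true)` prints the parsed, analyzed, optimized logical, and physical plans, so you can see the tree rewrites applied to both SQL and Dataset code:

```scala
// Sketch - assumes an existing Spark 2.x SparkSession named `spark`
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
df.createOrReplaceTempView("t")

// The same logical plan can come from SQL or from Dataset operations;
// Catalyst optimizes both through the same rule-based tree rewrites
val viaSql     = spark.sql("SELECT value FROM t WHERE id > 1")
val viaDataset = df.filter($"id" > 1).select($"value")

// Prints the parsed, analyzed, optimized logical, and physical plans
viaSql.explain(true)
```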

Execution Core: SparkSession

Spark 1.X - SQLContext

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(master, appName)
val sqlContext = new SQLContext(sc)

Spark 2.X - SparkSession, which also creates (or reuses) the underlying SparkContext

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master(master).appName(name).getOrCreate()
val sc = spark.sparkContext
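Putting the two styles together, a minimal sketch (master URL and app name are illustrative assumptions): SparkSession wraps both the SparkContext and a legacy SQLContext, so 1.x-style code can still run against it.

```scala
// Sketch: SparkSession (2.x) wraps SparkContext and SQLContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")        // assumption: local mode for illustration
  .appName("session-demo")   // hypothetical app name
  .getOrCreate()

val sc = spark.sparkContext          // underlying SparkContext
val sqlContext = spark.sqlContext    // legacy SQLContext for 1.x-style code

spark.stop()
```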


Hive Integration

- Interoperate with Hadoop Hive tables and the Hive metastore
    - Spark 1.x: use HiveContext, an extension of SQLContext
    - Spark 2.x: enable Hive support on SparkSession
- Create, read, and drop Hive tables
- Use Hive SerDes and UDFs
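A minimal sketch of the 2.x approach, assuming Spark was built with Hive support and a Hive metastore (or local warehouse directory) is available; the table name is hypothetical:

```scala
// Sketch: Spark 2.x replacement for the 1.x HiveContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-demo")
  .enableHiveSupport()   // wires SparkSession to the Hive metastore
  .getOrCreate()

// Hive DDL and queries run through the same SparkSession
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
spark.sql("SELECT * FROM src").show()
```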
