Datasets
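- A distributed collection of structured data, analogous to a relational table.
- Conceptually similar to pandas DataFrames in Python and data frames in R.
- Generally deliver better performance than RDDs, because queries pass through an optimizer before execution.
- Available in Scala, Java, Python and R (the typed Dataset API is Scala and Java only; Python and R work with DataFrames).
- Use the Catalyst query optimizer, which applies aggressive internal optimizations.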
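To see the optimizer at work, one option is to ask Spark to print the plans it builds for a query. The following is a minimal sketch, assuming a local Spark installation with the `spark-sql` dependency on the classpath; the object name, the sample rows and the column names are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object CatalystPlanSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; a cluster would use a different master.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("catalyst-plan-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: (name, age) pairs turned into a DataFrame.
    val people = Seq(("Ann", 34), ("Bob", 19), ("Cid", 52)).toDF("name", "age")

    // A simple query: keep adults, then project a single column.
    val adultNames = people.filter($"age" >= 21).select("name")

    // explain(true) prints the parsed, analyzed and optimized logical plans
    // followed by the physical plan that Spark will actually run.
    adultNames.explain(true)

    spark.stop()
  }
}
```

Comparing the analyzed and optimized plans in the printed output is a quick way to check what Catalyst rewrote for a given query.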
Datasets vs DataFrames
type DataFrame = Dataset[Row]
- The original API was called DataFrames, which uses Row objects to represent the rows in the data frame, but Row loses type safety: each column is effectively untyped.
- Dataset was introduced in Spark 1.6 to restore type safety, so that errors can be caught at compile time.
- We can still work with DataFrames as if they were a regular type, but in reality we are working with Dataset[Row] instances.
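The difference between the typed and untyped views can be sketched as follows. This is an illustration under assumptions, not a definitive pattern: the `Person` case class, the object name and the sample rows are hypothetical, and a local Spark setup with `spark-sql` on the classpath is assumed.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, defined at top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object DatasetVsDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-vs-dataframe-sketch")
      .getOrCreate()
    import spark.implicits._

    // Typed view: every element is a Person.
    val people: Dataset[Person] = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

    // The lambda is checked by the Scala compiler; a misspelled field
    // such as p.agee would be rejected at compile time.
    val adults: Dataset[Person] = people.filter(p => p.age >= 21)

    // Untyped view: a DataFrame is a Dataset[Row], and columns are looked up by name.
    // A misspelled column name here would only fail when Spark analyzes the query.
    val df: DataFrame = people.toDF()
    val adultRows: DataFrame = df.filter($"age" >= 21)

    // Going back to the typed view requires naming the target type.
    val typedAgain: Dataset[Person] = df.as[Person]

    adults.show()
    adultRows.show()
    typedAgain.show()

    spark.stop()
  }
}
```

Both filters express the same query; the typed version trades some optimizer visibility into the lambda for compile-time checking of field names and types.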