Spark - DataSet
About
A Dataset is an interface to the Spark engine, added in Spark 1.6, that combines:
- the benefits of RDDs (strong typing, the ability to use powerful lambda functions)
- with the benefits of Spark SQL's optimized execution engine.
When running SQL against the Spark Thrift Server, the Dataset interface is used in the background.
A Dataset is a strongly typed collection of domain-specific objects.
Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. A DataFrame is therefore just a Dataset[Row].
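A minimal Scala sketch of the two views; the Person case class, the sample data, and the local SparkSession are assumptions for illustration only:

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Hypothetical domain class used for illustration
case class Person(name: String, age: Long)

val spark = SparkSession.builder()
  .appName("dataset-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A strongly typed Dataset of domain objects
val people: Dataset[Person] = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()

// Its untyped view: a DataFrame, i.e. a Dataset[Row]
val df: Dataset[Row] = people.toDF()

// Converting back to the typed view
val typedAgain: Dataset[Person] = df.as[Person]
```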
Benefit
- Access the fields of a row by name
Management
Creation
A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).
- Scala:
val people = spark.read.parquet("...").as[Person]
- Java: a Dataset of a JavaBean (Bean) class
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class));
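The functional transformations mentioned above can be sketched as follows in Scala; the Person case class and sample data are assumptions for illustration:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical domain class used for illustration
case class Person(name: String, age: Long)

val spark = SparkSession.builder()
  .appName("dataset-transformations")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val people: Dataset[Person] = Seq(Person("Alice", 29), Person("Bob", 12)).toDS()

// Strongly typed lambdas: the compiler checks p.age and p.name
val adultNames: Dataset[String] = people
  .filter(p => p.age >= 18)   // keep only adults
  .map(p => p.name)           // project to a Dataset[String]
```

Because the lambdas operate on Person objects rather than untyped rows, a typo such as p.agee is caught at compile time instead of failing at runtime.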
- Python does not support the Dataset API. However, due to Python's dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can naturally access the fields of a row by name: row.columnName).
- R: similar to Python, R does not support the Dataset API.