Spark - Dataset
Dataset is an interface added to the Spark engine in Spark 1.6 that combines:
- the benefits of RDDs (strong typing, the ability to use powerful lambda functions)
- with the benefits of Spark SQL's optimized execution engine.
A Dataset is a strongly typed collection of domain-specific objects.
- the fields of a row can be accessed by name
A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).
Scala:
val people = spark.read.parquet("...").as[Person]
Java:
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class));
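A fuller sketch of the idea above, in Scala. This is not from the original notes: the `Person` case class, the local `SparkSession`, and the sample data are assumptions for illustration; it shows constructing a Dataset from JVM objects and applying typed functional transformations (filter, map).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical domain type (an assumption, not from the notes);
// case classes get Encoders automatically via spark.implicits._
case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // local session, for illustration only
      .appName("dataset-sketch")
      .getOrCreate()
    import spark.implicits._       // brings the Encoder[Person] and .toDS into scope

    // Construct a Dataset[Person] from JVM objects...
    val people = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

    // ...then manipulate it with typed lambdas; _.age and _.name
    // are checked at compile time, unlike untyped DataFrame columns.
    val adultNames = people.filter(_.age >= 21).map(_.name)

    adultNames.show()
    spark.stop()
  }
}
```

Because the lambdas operate on `Person` objects rather than generic `Row`s, a typo such as `_.agee` fails at compile time instead of at runtime.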
- Python does not have support for the Dataset API. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (e.g. you can access the field of a row by name naturally: row.columnName).
- The case for R is similar to Python.