Spark - DataSet
About
A Dataset is an interface to the Spark engine, added in Spark 1.6, that combines:
- the benefits of RDDs (strong typing, the ability to use powerful lambda functions)
- with the benefits of Spark SQL's optimized execution engine.
When running SQL against the Spark Thrift Server, the Dataset interface is used in the background.
A Dataset is a strongly typed collection of domain-specific objects.
Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. A DataFrame is therefore just a Dataset[Row].
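A minimal Scala sketch of the two views; the Person case class, the sample data, and the local SparkSession are assumptions for illustration only:

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

// Hypothetical domain class used for illustration
case class Person(name: String, age: Long)

val spark = SparkSession.builder()
  .appName("dataset-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A strongly typed Dataset of domain objects
val people: Dataset[Person] = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()

// Its untyped view: a DataFrame, i.e. a Dataset[Row]
val df: Dataset[Row] = people.toDF()

// Converting back to the typed view
val typedAgain: Dataset[Person] = df.as[Person]
```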
Benefit
- Access the fields of a row by name
Management
Creation
A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).
- Scala:
val people = spark.read.parquet("...").as[Person]
- Java: a Dataset of a JavaBean (Bean) class
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class));
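The functional transformations mentioned above can be sketched as follows in Scala; the Person case class and sample data are assumptions for illustration:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical domain class used for illustration
case class Person(name: String, age: Long)

val spark = SparkSession.builder()
  .appName("dataset-transformations")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val people: Dataset[Person] = Seq(Person("Alice", 29), Person("Bob", 12)).toDS()

// Strongly typed lambdas: the compiler checks p.age and p.name
val adultNames: Dataset[String] = people
  .filter(p => p.age >= 18)   // keep only adults
  .map(p => p.name)           // project to a Dataset[String]
```

Because the lambdas operate on Person objects rather than untyped rows, a typo such as p.agee is caught at compile time instead of failing at runtime.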
- Python does not support the Dataset API. However, due to Python's dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can naturally access the fields of a row by name: row.columnName).
- R: similar to Python, R does not support the Dataset API.