Spark - DataSet

1 - About

Dataset is a interface to the Spark Engine added in Spark 1.6 that provides:

When running a SQL against Spark Thrift Server, the dataset interface is used in the background

A Dataset is a strongly typed collection of domain-specific objects.

Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. A dataframe is then just a dataset.

3 - Benefit

  • access the field of a row by name

4 - Management

4.1 - Creation

A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).

  • Scala

val people = spark.read.parquet("...").as[Person]


Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); 

  • Python does not have the support for the Dataset API. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally row.columnName).
  • R is similar to Python

5 - Documentation / Reference


Data Science
Data Analysis
Statistics
Data Science
Linear Algebra Mathematics
Trigonometry

Powered by ComboStrap