Spark Engine - Data Structure (DataSet, DataFrame and RDD)

About

Spark has many logical representation for a relation (table).

Spark - DataSet
Spark DataSet - Data Frame (a dataset of rows)
Spark - Resilient Distributed Datasets (RDDs) (Archaic: Previously SchemaRDD (cf. Spark < 1.3)).

This data structure are all:

distributed
and present a abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas) using functional transformations (map, flatMap, filter, etc.)

A dataframe is a wrapper around a RDD that holds a sql connection.

Articles Related

Type

The Dataset

The Dataset can be considered a combination of DataFrames and RDDs. It provides the typed interface that is available in RDDs while providing a lot of conveniences of DataFrames. It will be the core abstraction going forward.

The DataFrame

The DataFrame is collection of distributed Row types. Similar in concept to Python pandas and R DataFrames

The RDD (Resilient Distributed Dataset)

RDD or Resilient Distributed Dataset is the original data structure of Spark.

It's a sequence of data objects that consist of one or more types that are located across a variety of machines in a cluster.

New users should focus on Datasets as those will be supersets of the current RDD functionality.

Data Structure

The below script gives the same functionality and computes an average.

RDD

With an Spark - Resilient Distributed Datasets (RDDs)

data = sc.textFile(...).split("\t") 
data.map(lambda x: (x[0], [int(x[1]), 1])) \ 
   .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \ 
   .map(lambda x: [x[0], x[1][0] / x[1][1]]) \ 
   .collect()

Data Frame

Using DataFrames: Write less code with a dataframe

 
sqlCtx.table("people") \ 
   .groupBy("name") \ 
   .agg("name", avg("age")) \ 
   .collect()