Spark Engine - Data Structure (DataSet, DataFrame and RDD)


Spark has many logical representation for a relation (table).

This data structure are all:

  • distributed
  • and present a abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas) using functional transformations (map, flatMap, filter, etc.)

A dataframe is a wrapper around a RDD that holds a sql connection.


The Dataset

The Dataset can be considered a combination of DataFrames and RDDs. It provides the typed interface that is available in RDDs while providing a lot of conveniences of DataFrames. It will be the core abstraction going forward.

The DataFrame

The DataFrame is collection of distributed Row types. Similar in concept to Python pandas and R DataFrames

The RDD (Resilient Distributed Dataset)

RDD or Resilient Distributed Dataset is the original data structure of Spark.

It's a sequence of data objects that consist of one or more types that are located across a variety of machines in a cluster.

New users should focus on Datasets as those will be supersets of the current RDD functionality.

Data Structure

The below script gives the same functionality and computes an average.


data = sc.textFile(...).split("\t") x: (x[0], [int(x[1]), 1])) \ 
   .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \ 
   .map(lambda x: [x[0], x[1][0] / x[1][1]]) \ 

Data Frame

  • Using DataFrames: Write less code with a dataframe
sqlCtx.table("people") \ 
   .groupBy("name") \ 
   .agg("name", avg("age")) \ 

Powered by ComboStrap