Spark Engine - Data Structure (DataSet, DataFrame and RDD)

About

Spark has several logical representations of a relation (table).

These data structures all represent the same relational data.

A DataFrame is a wrapper around an RDD that holds a SQL connection.

Type

The Dataset

The Dataset can be considered a combination of DataFrames and RDDs. It offers the typed interface available in RDDs while providing many of the conveniences of DataFrames. It will be the core abstraction going forward.

The DataFrame

The DataFrame is a collection of distributed Row types, similar in concept to Python pandas and R DataFrames.
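The "collection of Row types" idea can be sketched in plain Python. The `Row` namedtuple below is a stand-in for `pyspark.sql.Row`, and the sample records are hypothetical:

```python
from collections import namedtuple

# A DataFrame is conceptually a distributed collection of Row objects,
# each carrying named columns -- like a list of records in plain Python.
Row = namedtuple("Row", ["name", "age"])  # stand-in for pyspark.sql.Row

people = [Row("Alice", 30), Row("Bob", 25), Row("Alice", 40)]
print(people[0].name)  # columns are accessed by name -> Alice
```

In Spark the collection is partitioned across the cluster rather than held in one local list, and the schema (column names and types) is tracked centrally so queries can be optimized.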

The RDD (Resilient Distributed Dataset)

RDD or Resilient Distributed Dataset is the original data structure of Spark.

It's a sequence of data objects, of one or more types, partitioned across the machines of a cluster.

New users should focus on Datasets as those will be supersets of the current RDD functionality.

Data Structure

The scripts below provide the same functionality: each computes an average age per name.

RDD

data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
   .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
   .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
   .collect()
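What the pipeline above computes can be traced in plain Python. The tab-separated sample lines are hypothetical, standing in for the file read by sc.textFile:

```python
# Input lines are assumed to be "name<TAB>age" records (hypothetical sample).
lines = ["alice\t30", "bob\t25", "alice\t40"]

# map: parse each line into (key, [value, count]) pairs
pairs = [(f[0], [int(f[1]), 1]) for f in (line.split("\t") for line in lines)]

# reduceByKey: sum the values and the counts per key
totals = {}
for key, (value, count) in pairs:
    sum_, cnt = totals.get(key, [0, 0])
    totals[key] = [sum_ + value, cnt + count]

# final map: divide sum by count to get the average per key
averages = {key: v[0] / v[1] for key, v in totals.items()}
print(averages)  # {'alice': 35.0, 'bob': 25.0}
```

Carrying the `[sum, count]` pair through `reduceByKey` is what lets the average be computed in a single distributed pass: both fields combine associatively, so partitions can be merged in any order.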

DataFrame

 
from pyspark.sql.functions import avg  # needed for the avg() aggregate

sqlCtx.table("people") \
   .groupBy("name") \
   .agg(avg("age")) \
   .collect()