Spark Engine - Data Structure (DataSet, DataFrame and RDD)


About

Spark has several logical representations of a relation (table).

These data structures are all:

  • distributed
  • and present an abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas) using functional transformations (map, flatMap, filter, etc.)

A DataFrame is a wrapper around an RDD that holds a SQL connection.


Type

The Dataset

The Dataset can be considered a combination of DataFrames and RDDs: it provides the typed interface available in RDDs while offering many of the conveniences of DataFrames. It will be the core abstraction going forward.

The DataFrame

The DataFrame is a distributed collection of Row objects, similar in concept to pandas DataFrames in Python and data frames in R.

The RDD (Resilient Distributed Dataset)

RDD or Resilient Distributed Dataset is the original data structure of Spark.

It is a collection of data objects, of one or more types, partitioned across the machines of a cluster.

New users should focus on Datasets, as the Dataset API is a superset of the current RDD functionality.

Data Structure

The scripts below provide the same functionality: each computes an average per key.

RDD

# split must be applied per line via map: an RDD itself has no split method
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
   .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
   .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
   .collect()
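The (sum, count) accumulator pattern that reduceByKey relies on above can be sketched in plain Python, with no Spark required (the sample key/value pairs are made up for illustration):

```python
# Sample (key, value) pairs, as produced by the first map step.
pairs = [("a", 1), ("a", 3), ("b", 4)]

# Merge a [sum, count] accumulator per key, as reduceByKey does
# when it combines partial results across partitions.
acc = {}
for key, value in pairs:
    s, c = acc.get(key, (0, 0))
    acc[key] = (s + value, c + 1)

# Final map step: divide sum by count to get the average per key.
averages = {key: s / c for key, (s, c) in acc.items()}
print(averages)  # {'a': 2.0, 'b': 4.0}
```

Because sums and counts can be merged in any order, this aggregation parallelizes cleanly across partitions, which is why the RDD version carries both values through the pipeline.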

DataFrame

  • Using DataFrames, you write less code:
 
from pyspark.sql.functions import avg

# the grouping column "name" is included in the result automatically
sqlCtx.table("people") \
   .groupBy("name") \
   .agg(avg("age")) \
   .collect()
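What groupBy/agg computes can likewise be sketched in plain Python, grouping rows by name and averaging ages (the sample rows standing in for the "people" table are made up for illustration):

```python
from collections import defaultdict
from statistics import mean

# Sample rows, standing in for the "people" table.
rows = [
    {"name": "alice", "age": 30},
    {"name": "bob", "age": 40},
    {"name": "alice", "age": 34},
]

# groupBy("name"): collect the ages belonging to each name.
ages_by_name = defaultdict(list)
for row in rows:
    ages_by_name[row["name"]].append(row["age"])

# agg(avg("age")): average each group.
result = {name: mean(ages) for name, ages in ages_by_name.items()}
```

The declarative DataFrame version lets Spark's optimizer pick the execution strategy, whereas the RDD version above spells out the accumulator logic by hand.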




