Spark DataSet - Data Frame


A data frame is a dataset of rows, that is, data organized into named columns.

Technically, a data frame is an untyped view of a dataset.

A SparkDataFrame is a distributed collection of data organized into named columns.

It is conceptually equivalent to:

  • a table in a relational database,
  • or a data frame in R/Python.

Management / Operations

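Operations available on Datasets follow the Spark pattern: transformations, which lazily define a new Dataset, and actions, which trigger the computation and return a result.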


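A data frame provides a unified interface for reading and writing data in a variety of formats (JDBC, JSON, CSV, …) through a reader (DataFrameReader) and a writer (DataFrameWriter).

A data frame can be constructed from sources such as: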

  • structured data files,
  • tables in Hive,
  • external databases,
  • or existing RDDs.

How a DataFrame is represented depends on the language:

  • Scala: a DataFrame is represented by a Dataset of Rows. DataFrame is a type alias of Dataset[Row].
  • Java: a DataFrame is represented by a Dataset of Rows, written Dataset<Row>.
  • Python: all Datasets in Python are Dataset[Row], and they are called DataFrame to be consistent with the data frame concept in Pandas and R.

// From a sqlContext:
sqlContext.createDataFrame(RDD[Row], Schema)
people ="...")
textFile ="")


df =   
  .option("samplingRatio", "0.1")   




ETL (Read and Write)

ETL Using Custom Data Sources 
  .option("url", "") 
  .option("user", "marmbrus") 
  .option("password", "*******") 
  .option("query", """ 
    |project = SPARK AND  
    |component = SQL AND  
    |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin) 


  • the load function creates a data frame
  • that is then saved


A data frame has various domain-specific language (DSL) functions, defined in the classes DataFrame, Column, and functions, such as:

  • group by,
  • order,
  • plus, …


people.col("age").plus(10);  // in Java
