Spark DataSet - Data Frame

1 - About

A data frame is a dataset of rows, i.e. data organized into named columns.

Technically, a data frame is an untyped view of a dataset.

A Spark DataFrame is a distributed collection of data organized into named columns.

It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.

3 - Management / Operations

Operations available on DataFrames follow the general Spark Dataset pattern.

3.1 - Creating

A DataFrame provides a unified interface for reading and writing data in a variety of formats (JDBC, JSON, CSV, …) from

sources such as:

  • structured data files,
  • tables in Hive,
  • external databases,
  • or existing RDDs.
  • Spark - Scala API
// a DataFrame is represented by a Dataset of Rows,
// i.e. a type alias of Dataset[Row]:
type DataFrame = Dataset[Row]
// in Java, a DataFrame is represented by a Dataset<Row>

// From a sqlContext:
sqlContext.createDataFrame(RDD[Rows], Schema)
  • Python DataFrame. All Datasets in Python are Dataset[Row], and we call it DataFrame to be consistent with the data frame concept in Pandas and R
people = spark.read.load("...")   # reader method elided in the original; load() is one possibility
textFile = spark.read.text("")    # path elided in the original

3.2 - Reading

df = spark.read \
  .option("samplingRatio", "0.1") \
  ...   # source format and load target elided in the original

3.3 - Writing

A DataFrame is written out through its writer interface (df.write). See:

  • Spark DataSet - Parquet

3.4 - Etl (Read and Write)

ETL Using Custom Data Sources
spark.read
  .format("...")                  // custom source name elided in the original
  .option("url", "")
  .option("user", "marmbrus")
  .option("password", "*******")
  .option("query", """
    |project = SPARK AND
    |component = SQL AND
    |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin)
  .load()


  • the load function creates a data frame
  • that is then saved

3.5 - Operations

It has various domain-specific-language (DSL) functions defined in: DataFrame (this class), Column, and functions such as:

  • group by,
  • order,
  • plus, …


people.col("age").plus(10);  // in Java

4 - Documentation / Reference

Data Science
Data Analysis
Linear Algebra Mathematics
