About
A DataFrame is a dataset of Rows (i.e. data organized into named columns).
Technically, a DataFrame is an untyped view of a Dataset: a type alias for Dataset[Row].
A SparkDataFrame is a distributed collection of data organized into named columns.
It is conceptually equivalent to:
- a table in a relational database,
- or a data frame in R/Python.
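The relationship between the typed and untyped views can be sketched as follows. This is a minimal sketch, assuming a local SparkSession and a hypothetical Person case class; the point is only that DataFrame is the untyped alias Dataset[Row]:

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

case class Person(name: String, age: Long)

object DataFrameVsDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A typed Dataset: each element is a Person
    val ds: Dataset[Person] = Seq(Person("Ada", 36), Person("Linus", 54)).toDS()

    // The untyped view: DataFrame is just an alias for Dataset[Row]
    val df: Dataset[Row] = ds.toDF()

    df.printSchema()
    spark.stop()
  }
}
```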
Management / Operations
Operations available on Datasets follow the Spark execution pattern: lazy transformations that build up a logical plan, and actions that trigger execution.
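The pattern can be sketched as follows. This sketch assumes an existing SparkSession `spark` and a DataFrame `people` with `name` and `age` columns:

```scala
import spark.implicits._

// Transformations are lazy: nothing is computed here,
// Spark only builds a logical plan.
val adults = people
  .filter($"age" >= 18)
  .select($"name")

// Actions trigger execution of the accumulated plan.
adults.show()
```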
Creating
A DataFrame offers a unified interface for reading and writing data in a variety of formats (JDBC, JSON, CSV, Parquet, …) from sources such as:
- structured data files,
- tables in Hive,
- external databases,
- or existing RDDs.
// In Scala, a DataFrame is represented by a Dataset of Rows:
// DataFrame is a type alias of Dataset[Row]
type DataFrame = Dataset[Row]

// In Java, there is no type alias; a DataFrame is represented by:
Dataset<Row>

// From a SQLContext, given an RDD of Rows and a schema:
sqlContext.createDataFrame(rowRDD, schema)

// From data sources:
val people = spark.read.parquet("...")
val textFile = spark.read.text("README.md")
Reading
df = sqlContext.read
.format("json")
.option("samplingRatio", "0.1")
.load("/home/michael/data.json")
Writing
df.write
.format("parquet")
.mode("append")
.partitionBy("year")
.saveAsTable("fasterData")
where:
- mode("append") appends the rows to any existing data,
- partitionBy("year") partitions the output by the year column,
- saveAsTable saves the result as a managed table.
ETL (Read and Write)
ETL Using Custom Data Sources
sqlContext.read
.format("com.databricks.spark.jira")
.option("url", "https://issues.apache.org/jira/rest/api/latest/search")
.option("user", "marmbrus")
.option("password", "*******")
.option("query", """
|project = SPARK AND
|component = SQL AND
|(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin)
.load()
.repartition(1)
.write
.format("parquet")
.saveAsTable("sparkSqlJira")
where:
- load creates a DataFrame from the custom JIRA data source,
- repartition(1) coalesces it into a single partition,
- which is then saved as a Parquet table.
Operations
A DataFrame has various domain-specific-language (DSL) functions defined in the DataFrame, Column, and functions classes, such as:
- groupBy,
- orderBy,
- plus, …
Example:
people.col("age").plus(10); // in Java
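A few more DSL operations can be sketched in Scala. This assumes a hypothetical DataFrame `people` with `name`, `age`, and `department` columns:

```scala
import org.apache.spark.sql.functions._

// Column arithmetic: add 10 to every age
people.select(col("name"), col("age").plus(10).as("age_in_10_years"))

// Aggregation with groupBy, then ordering the result
people
  .groupBy(col("department"))
  .agg(avg(col("age")).as("avg_age"))
  .orderBy(col("avg_age").desc)
  .show()
```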