Spark has several logical representations for a relation (table):
- Spark Dataset
- Spark DataFrame (a Dataset of rows)
- Spark Resilient Distributed Dataset (RDD); its schema-aware predecessor was called SchemaRDD (Spark < 1.3)
These data structures all present an abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, pandas) using functional transformations (map, flatMap, filter, etc.).
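These functional transformations behave like their plain-Python counterparts. A minimal local sketch (no Spark required; the data and names are illustrative):

```python
# Local stand-ins for the transformations Spark's abstractions expose.
records = ["alice 34", "bob 36", "alice 30"]

# map: transform each element
pairs = [line.split() for line in records]

# filter: keep only elements matching a predicate
adults = [(name, int(age)) for name, age in pairs if int(age) >= 18]

# flatMap: map each element to several, then flatten one level
tokens = [tok for line in records for tok in line.split()]
```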
The Dataset can be considered a combination of DataFrames and RDDs: it provides the typed interface available in RDDs while keeping many of the conveniences of DataFrames. It is the core abstraction going forward.
The DataFrame is a distributed collection of Row objects, similar in concept to DataFrames in Python's pandas and in R.
The RDD (Resilient Distributed Dataset)
The RDD is the original data structure of Spark: a sequence of data objects, of one or more types, partitioned across the machines of a cluster.
New users should focus on Datasets, which are a superset of the current RDD functionality.
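Conceptually, an RDD is just a partitioned sequence, and transformations run on each partition independently. A toy sketch in plain Python (no cluster; the partition layout is illustrative):

```python
# Toy model of an RDD: a list of partitions, each holding part of the data.
# In a real cluster each partition would live on a different machine.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8]]

def rdd_map(parts, f):
    # A map transformation is applied to each partition independently,
    # which is what lets Spark parallelize it across machines.
    return [[f(x) for x in part] for part in parts]

doubled = rdd_map(partitions, lambda x: x * 2)

# collect() gathers the partitions back into one local sequence.
flat = [x for part in doubled for x in part]
```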
# Average value per key with the RDD API:
# pair each value with a count, sum both per key, then divide.
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
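The sum-and-count-then-divide logic of the RDD pipeline above can be checked locally in plain Python (illustrative data, no cluster needed):

```python
# Average value per key, mirroring the reduceByKey pipeline:
# accumulate (total, count) per key, then divide.
rows = [("alice", 30), ("bob", 40), ("alice", 34)]

sums = {}
for key, value in rows:
    total, count = sums.get(key, (0, 0))
    sums[key] = (total + value, count + 1)  # the reduceByKey step

averages = {key: total / count for key, (total, count) in sums.items()}
# averages == {"alice": 32.0, "bob": 40.0}
```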
- Using DataFrames: the same aggregation takes much less code
# requires: from pyspark.sql.functions import avg
sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()