Spark has several logical representations of a relation (table).
These data structures are the Dataset, the DataFrame, and the RDD:
The Dataset can be considered a combination of the DataFrame and the RDD. It provides the typed interface available in RDDs while offering many of the conveniences of DataFrames, and it will be the core abstraction going forward.
The DataFrame is a distributed collection of Row objects, similar in concept to pandas DataFrames in Python and data frames in R.
The RDD, or Resilient Distributed Dataset, is the original data structure of Spark. It is a collection of data objects, of one or more types, distributed across the machines of a cluster (a short sketch contrasting RDDs and DataFrames follows this overview).
New users should focus on Datasets, as they will be a superset of the current RDD functionality.
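To make the contrast concrete, here is a minimal sketch that builds the same small dataset as an RDD and as a DataFrame; the SparkSession (named spark), the sample rows, and the column names are illustrative assumptions rather than part of the original examples.

from pyspark.sql import SparkSession

# Illustrative local session; an existing SparkContext/SQLContext works as well.
spark = SparkSession.builder.appName("spark-abstractions").getOrCreate()
sc = spark.sparkContext

# RDD: an untyped, distributed collection of plain Python objects.
rdd = sc.parallelize([("Alice", 30), ("Bob", 25), ("Alice", 40)])

# DataFrame: a distributed collection of Row objects with a named schema.
df = spark.createDataFrame(rdd, ["name", "age"])
df.printSchema()   # name: string, age: long
df.show()

# A DataFrame can always be viewed as an RDD of Rows again.
first_rows = df.rdd.take(2)

Because the DataFrame carries a schema, Spark can optimize declarative operations such as the groupBy and agg calls in the second script below, whereas the RDD version spells out every step by hand.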
The two scripts below provide the same functionality, computing an average value per key (here, the average age per name): the first uses the RDD API and the second uses the DataFrame API.
# RDD version: split each tab-separated line, then average the second field by key.
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
sqlCtx.table("people") \
.groupBy("name") \
.agg("name", avg("age")) \
.collect()
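The DataFrame snippet above assumes a registered table named people and an existing sqlCtx. For readers who want to run it end to end, here is a minimal self-contained sketch; it assumes Spark 2.x, where a SparkSession (named spark here) replaces the older SQLContext, and the people rows are made-up sample data.

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.appName("people-average").getOrCreate()

# Made-up rows standing in for the "people" table.
people = spark.createDataFrame(
    [("Alice", 30), ("Bob", 25), ("Alice", 40)],
    ["name", "age"],
)
people.createOrReplaceTempView("people")  # lets spark.table("people") resolve it

result = spark.table("people") \
    .groupBy("name") \
    .agg(avg("age").alias("avg_age")) \
    .collect()
# result is a list of Row objects, e.g. [Row(name='Alice', avg_age=35.0), ...]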