Spark RDD - (Creation|Construction|Initialization)

About

RDD type

Example

List

One

data = [1,2,3,4,5]
rDD = sc.parallelize(data,4)

No computation occurs with sc.parallelize(): Spark only records how to create the RDD, here split into four partitions. The elements are computed only when an action such as collect() is called.

>>> rDD
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:229

Several

sc.parallelize([[1,2],[3,4]]).collect()
[[1, 2], [3, 4]]

Key Value

rdd = sc.parallelize([(1, 2), (3, 4)]) 
RDD: [(1, 2), (3, 4)]

File

distFile = sc.textFile("README.md", 4)

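where "README.md" is the path of the file and 4 is the minimum number of partitions.

The resulting RDD is a collection of strings, one element per line of the file.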

A file can come from