Spark RDD - Creation




From a Python collection (parallelized collection)




data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data, 4)

No computation occurs with sc.parallelize(): Spark only records how to create the RDD with four partitions (lazy evaluation).

>>> rdd
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:229
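To illustrate how the four partitions are laid out, here is a minimal pure-Python sketch (not Spark code) of the way a local list can be sliced into evenly sized partitions; the function name is made up for this example.

```python
def split_into_partitions(data, num_partitions):
    """Sketch of slicing a local list into num_partitions contiguous chunks,
    spreading any remainder over the later partitions."""
    n = len(data)
    return [
        data[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]

partitions = split_into_partitions([1, 2, 3, 4, 5], 4)
# Five elements over four partitions: [[1], [2], [3], [4, 5]]
```

Each chunk is what one worker would process independently; the actual placement in Spark depends on the scheduler and cluster layout.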


Each element of the list becomes one element of the RDD, so a nested list produces an RDD of lists:

rdd = sc.parallelize([[1, 2], [3, 4]])
RDD: [[1, 2], [3, 4]]

Key-Value pairs

rdd = sc.parallelize([(1, 2), (3, 4)]) 
RDD: [(1, 2), (3, 4)]
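Key-value RDDs unlock transformations such as reduceByKey, which groups tuples by their first element and folds the values. Here is a minimal pure-Python sketch of that semantics (the helper name is invented for illustration; it is not the Spark API):

```python
from collections import defaultdict
from functools import reduce
from operator import add

def reduce_by_key(pairs, func):
    """Sketch of reduceByKey semantics: group values by key, then fold
    each group with func. Real Spark does this per-partition first."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted((k, reduce(func, vs)) for k, vs in grouped.items())

result = reduce_by_key([(1, 2), (3, 4), (1, 10)], add)
# [(1, 12), (3, 4)]
```

In PySpark the equivalent would be rdd.reduceByKey(add), with the work distributed across partitions.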


From a file

distFile = sc.textFile("", 4)

  • the first argument is a comma-separated list of paths. Example: /my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file
  • the second argument is the minimum number of partitions.

The resulting RDD is a list of strings (one element per line).

A file can come from:

  • HDFS,
  • a local text file,
  • Hypertable,
  • Amazon S3,
  • Apache HBase,
  • SequenceFiles,
  • or even a whole directory or a wildcard pattern.
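To make the "list of strings, one per line" shape concrete, here is a pure-Python sketch that writes a small sample file and reads it back the way textFile would expose it, as one string element per line with the newline stripped (the file content is made up for this example):

```python
import os
import tempfile

# Write a small sample file to read back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("first line\nsecond line\nthird line\n")
    path = f.name

# Read it back: one string element per line, newline stripped,
# mirroring the shape of sc.textFile(path).collect().
with open(path) as f:
    lines = [line.rstrip("\n") for line in f]

os.remove(path)
# lines == ["first line", "second line", "third line"]
```

In PySpark, sc.textFile(path).count() on this file would return 3, one per line.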

