Spark - Key-Value RDD



Spark supports key-value pair RDDs in Python through a list of tuples.

A count of an RDD of tuples returns the number of tuples. A tuple can be seen as a row.


Spark RDD - (Creation|Construction|Initialization)

rdd = sc.parallelize([(1, 2), (3, 4)]) 
RDD: [(1, 2), (3, 4)]


Some Key-Value Transformations


Documentation / Reference
