Spark - (RDD) Transformation

Spark Pipeline


transformation function in RDD


Transformations and their descriptions:

filter: returns a new data set formed by selecting those elements of the source on which a function returns true.
distinct([numTasks]): returns a new data set that contains the distinct elements of the source data set.
map and flatMap: return a new distributed data set formed by passing each element of the source through a function.
zip (optionally with index or id): returns key-value pairs built from the i-th element of each RDD: <math>\forall i\in \{0, \dots, N\}: (rdd1_i, rdd2_i)</math>
split: splits the data set.
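The semantics of these transformations can be sketched on plain Python lists (a stand-in for an RDD, so the sketch runs without a Spark cluster; on a real RDD the equivalent calls would be rdd.filter(f), rdd.distinct(), rdd.map(f), rdd.flatMap(f) and rdd.zip(other_rdd)):

```python
# Sketch of RDD transformation semantics using plain Python lists.
# In PySpark the same operations run distributed on an RDD.

rdd1 = [1, 2, 2, 3, 4]
rdd2 = ["a", "b", "b", "c", "d"]

# filter: keep the elements for which the function returns true
evens = [x for x in rdd1 if x % 2 == 0]               # like rdd.filter(...)

# distinct: the unique elements (ordering is not guaranteed on a real RDD)
uniques = sorted(set(rdd1))                           # like rdd.distinct()

# map: exactly one output element per input element
squares = [x * x for x in rdd1]                       # like rdd.map(...)

# flatMap: each input element may yield zero or more output elements,
# and the results are flattened into a single collection
flat = [y for x in [1, 2] for y in (x, x * 10)]       # like rdd.flatMap(...)

# zip: pair the i-th element of each data set: (rdd1_i, rdd2_i)
zipped = list(zip(rdd1, rdd2))                        # like rdd1.zip(rdd2)
```

Note that the two zipped data sets must have the same number of elements (and, on a real RDD, the same number of partitions).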

Discover More
PySpark - Closure

Spark automatically creates closures: for functions that run on RDDs at workers, and for any global variables that are used by those workers. One closure is sent per worker for every task. closures...
RDD - Pipe

pipe is a transformation that returns an RDD created by piping the elements of the source RDD to a forked external process. Example with a...
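What pipe does can be sketched with Python's subprocess module (the external command tr is an illustrative assumption; on a real RDD the call would be rdd.pipe("tr a-z A-Z")):

```python
import subprocess

# Sketch of pipe semantics: each element of the data set is written,
# one line per element, to the stdin of a forked external process, and
# the process's stdout lines become the elements of the result.
elements = ["hello", "spark", "pipe"]

proc = subprocess.run(
    ["tr", "a-z", "A-Z"],          # the forked external process
    input="\n".join(elements),
    capture_output=True,
    text=True,
)
piped = proc.stdout.splitlines()   # one result element per output line
```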

An in-memory map-reduce and streaming framework. The library entry point, which is also a connection object, is called a session (also known as a context). Components: DAG scheduler, ...
Spark - Accumulator

Accumulators can only be written by workers and read by the driver program. They allow us to aggregate values from workers back to the driver. Only the driver can access the value of the accumulator...
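The write-only-for-workers, read-only-for-driver contract can be sketched with a small stand-in class (a hypothetical Accumulator, not the PySpark one; in PySpark this would be acc = sc.accumulator(0), workers calling acc.add(1) inside rdd.foreach, and the driver reading acc.value):

```python
# Sketch of accumulator semantics: workers may only add to the value,
# and only the driver reads the final result back.
class Accumulator:
    def __init__(self, initial=0):
        self._value = initial

    def add(self, amount):
        # the only operation a worker task may perform
        self._value += amount

    @property
    def value(self):
        # read back on the driver after the tasks have run
        return self._value

acc = Accumulator(0)
for partition in [[1, 2], [3], [4, 5, 6]]:   # each partition runs on a worker
    for _ in partition:
        acc.add(1)                           # worker side: write only

total = acc.value                            # driver side: read the aggregate
```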
Spark - Dense Vector

A DenseVector is a class within the module pyspark.mllib.linalg....
Spark - Distinct

distinct([numTasks]) is a transformation that returns a new data set (RDD) that contains the distinct elements of the source data set.
Spark - Function

function = transformation ?
Spark - Key-Value RDD

Spark supports Key-Value pair RDDs in Python through a list of tuples. A count of an RDD of tuples will return the number of tuples. A tuple can be seen as a row. Some Key-Value Transformations...
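The list-of-tuples representation and a typical key-value transformation can be sketched in plain Python (no Spark cluster assumed; in PySpark the equivalent would be sc.parallelize(pairs).reduceByKey(lambda a, b: a + b)):

```python
from collections import defaultdict

# A key-value data set in Python is a list of tuples; each tuple is one
# element of the RDD and can be seen as a row.
pairs = [("a", 1), ("b", 2), ("a", 3)]

# count: the number of tuples, i.e. the number of elements
n = len(pairs)

# reduceByKey: combine the values of each key with a function (here: sum)
sums = defaultdict(int)
for key, value in pairs:
    sums[key] += value
result = dict(sums)
```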
Spark - Resilient Distributed Datasets (RDDs)

Resilient distributed datasets are one of the data structures in Spark: you write programs in terms of operations on distributed datasets, which are partitioned collections of objects spread across a cluster, stored...
