Spark - Distinct



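distinct([numTasks]) is a transformation that returns a new data set (RDD) containing only the distinct elements of the source data set. The optional numTasks argument sets the number of tasks (partitions) used for the operation. The order of the returned elements is not guaranteed.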


rdd2 = sc.parallelize([1, 4, 2, 2, 3])  # sc is an existing SparkContext
rdd2.distinct().collect()
# [1, 4, 2, 2, 3] → [1, 4, 2, 3] (element order may differ)
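
To make the idea behind distinct more concrete, here is a minimal plain-Python sketch that runs without a Spark cluster. It imitates the usual approach: pair every element with a dummy value, send equal keys to the same bucket by hash, keep one entry per key, then return the keys. The function name distinct_sketch and its num_tasks parameter are hypothetical names chosen for this illustration; they are not part of the Spark API, and the real implementation distributes this work across executors.

```python
def distinct_sketch(elements, num_tasks=2):
    """Plain-Python imitation of a distributed distinct (hypothetical helper, not a Spark API)."""
    # Step 1: pair each element with a dummy value, like map(x -> (x, None)).
    pairs = [(x, None) for x in elements]

    # Step 2: simulate the shuffle. Equal keys always hash to the same bucket,
    # so duplicates meet in one place and collapse into a single dict entry.
    buckets = [dict() for _ in range(num_tasks)]
    for key, value in pairs:
        buckets[hash(key) % num_tasks][key] = value

    # Step 3: drop the dummy values and keep only the keys of every bucket.
    return [key for bucket in buckets for key in bucket]


result = distinct_sketch([1, 4, 2, 2, 3])
# Sort before printing because the bucket order, like Spark's output order,
# should not be relied upon.
print(sorted(result))
```

Changing num_tasks in this sketch changes only how the elements are grouped into buckets, not which elements come back, which mirrors the role of the optional argument of distinct.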

