Spark - Sort

(Diagram: Spark Pipeline)
See also: takeOrdered



sortByKey() returns a new dataset of (K, V) pairs, sorted by key in ascending order.

rdd2 = sc.parallelize([(1, 'a'), (2, 'c'), (1, 'b')])
rdd2.sortByKey().collect()
# RDD: [(1, 'a'), (2, 'c'), (1, 'b')] → [(1, 'a'), (1, 'b'), (2, 'c')]
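Passing ascending=False reverses the order. Since a live SparkContext is not available in a page like this, here is a plain-Python sketch of the same semantics, with sorted() standing in for sortByKey():

```python
# Plain-Python analogue of rdd2.sortByKey(ascending=False).collect().
# (Assumption: sketch only; a real run needs a SparkContext.)
pairs = [(1, 'a'), (2, 'c'), (1, 'b')]

# sortByKey(ascending=False) sorts by key, descending; Python's sort is
# stable, so pairs with equal keys keep their original relative order.
desc = sorted(pairs, key=lambda kv: kv[0], reverse=True)
print(desc)  # [(2, 'c'), (1, 'a'), (1, 'b')]
```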


sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: x)
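The keyfunc parameter transforms each key before comparison. A plain-Python sketch of the idea (assumption: no SparkContext here; sorted() stands in for sortByKey()), sorting string keys case-insensitively:

```python
# Plain-Python sketch of sortByKey's keyfunc parameter.
# keyfunc is applied to each key before comparison, so the pairs below
# are ordered as if their keys were all lower-case.
pairs = [('b', 1), ('A', 2), ('a', 3)]

# Equivalent of rdd.sortByKey(keyfunc=lambda k: k.lower()).collect()
result = sorted(pairs, key=lambda kv: kv[0].lower())
print(result)  # [('A', 2), ('a', 3), ('b', 1)]
```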

If the key is not unique

If the key is not unique, sortByKey may return the pairs that share a key in different orders from one run to the next.

tmp1 = [(1, u'alpha'), (2, u'alpha'), (2, u'beta'), (3, u'alpha'), (1, u'epsilon'), (1, u'delta')]
tmp2 = [(1, u'delta'), (2, u'alpha'), (2, u'beta'), (3, u'alpha'), (1, u'epsilon'), (1, u'alpha')]

oneRDD = sc.parallelize(tmp1)
twoRDD = sc.parallelize(tmp2)
oneSorted = oneRDD.sortByKey(True).collect()
twoSorted = twoRDD.sortByKey(True).collect()
print(oneSorted)
print(twoSorted)
assert set(oneSorted) == set(twoSorted)     # Both lists have the same elements
assert twoSorted[0][0] < twoSorted[-1][0]   # Check that the result is sorted by key
assert oneSorted[0:2] != twoSorted[0:2]     # But the first two elements do not match

Output:
[(1, u'alpha'), (1, u'epsilon'), (1, u'delta'), (2, u'alpha'), (2, u'beta'), (3, u'alpha')]
[(1, u'delta'), (1, u'epsilon'), (1, u'alpha'), (2, u'alpha'), (2, u'beta'), (3, u'alpha')]


A better technique is to sort the RDD by both the key and value, which we can do by combining the key and value into a single string and then sorting on that string.

def sortFunction(pair):
    """ Construct the sort string (does not perform the actual sorting)
        pair: (rating, MovieName)
        returns: the value to sort with, 'rating MovieName'
    """
    key = '{0:.3f}'.format(pair[0])
    value = pair[1]
    return key + ' ' + value

print(oneRDD.sortBy(sortFunction, True).collect())
print(twoRDD.sortBy(sortFunction, True).collect())

Output:
[(1, u'alpha'), (1, u'delta'), (1, u'epsilon'), (2, u'alpha'), (2, u'beta'), (3, u'alpha')]
[(1, u'alpha'), (1, u'delta'), (1, u'epsilon'), (2, u'alpha'), (2, u'beta'), (3, u'alpha')]
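An alternative not shown above, but equivalent in effect: instead of concatenating a sort string, the keyfunc can simply return a (key, value) tuple, which Python compares field by field. A plain-Python sketch (assumption: no SparkContext here; sorted() stands in for sortBy()):

```python
# Sketch of rdd.sortBy(lambda kv: (kv[0], kv[1])).collect() in plain
# Python. Tuples compare element by element, so ties on the key are
# broken deterministically by the value.
tmp = [(1, 'epsilon'), (2, 'alpha'), (1, 'delta')]
out = sorted(tmp, key=lambda kv: (kv[0], kv[1]))
print(out)  # [(1, 'delta'), (1, 'epsilon'), (2, 'alpha')]
```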


sortBy() sorts the RDD by the given keyfunc.

sortBy(keyfunc, ascending=True, numPartitions=None)
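Because sortBy passes each whole element to keyfunc (not just the key), it can order a key-value RDD by its values. A plain-Python sketch of the semantics (assumption: no SparkContext here; sorted() stands in for sortBy()):

```python
# Plain-Python sketch of sortBy: keyfunc receives the whole element,
# so kv[1] sorts the pairs by value rather than by key.
data = [(1, 'delta'), (3, 'alpha'), (2, 'beta')]

# Equivalent of rdd.sortBy(lambda kv: kv[1]).collect()
by_value = sorted(data, key=lambda kv: kv[1])
print(by_value)  # [(3, 'alpha'), (2, 'beta'), (1, 'delta')]
```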

