# Spark - Sort

## Example

### sortByKey

`sortByKey()` returns a new dataset of (K, V) pairs sorted by key, in ascending order by default.

```
# sc is an existing SparkContext (for example, the one the pyspark shell provides)
rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')])
rdd2.sortByKey()
```
```
RDD: [(1,'a'), (2,'c'), (1,'b')] → [(1,'a'), (1,'b'), (2,'c')]
```

Syntax:

```
sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: x)
```
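
To make the parameters concrete, here is a minimal plain-Python sketch of the ordering that `ascending` and `keyfunc` produce. It uses the built-in `sorted` on a local list rather than Spark, and `sort_by_key_local` is a hypothetical helper, not part of the PySpark API. `numPartitions` only sets how many partitions the resulting RDD has, so it has no counterpart here.

```python
def sort_by_key_local(pairs, ascending=True, keyfunc=lambda k: k):
    """Local stand-in for rdd.sortByKey(): order (K, V) pairs by keyfunc(K) only."""
    return sorted(pairs, key=lambda kv: keyfunc(kv[0]), reverse=not ascending)


pairs = [("b", 1), ("A", 2), ("C", 3)]

# Default: plain key comparison, so uppercase letters sort before lowercase ones
default_order = sort_by_key_local(pairs)
# ascending=False reverses the order
descending = sort_by_key_local(pairs, ascending=False)
# keyfunc transforms the key before comparing (here: case-insensitive)
case_insensitive = sort_by_key_local(pairs, keyfunc=str.lower)

print(default_order)     # [('A', 2), ('C', 3), ('b', 1)]
print(descending)        # [('b', 1), ('C', 3), ('A', 2)]
print(case_insensitive)  # [('A', 2), ('b', 1), ('C', 3)]
```

With an RDD, the case-insensitive variant would be written as `rdd.sortByKey(keyfunc=lambda k: k.lower())`.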

#### If the key is not unique

##### Different orders

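`sortByKey()` compares keys only, so pairs that share a key can come back in a different relative order depending on how the input was ordered. The two RDDs in this example hold the same elements but do not sort to the same list: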

```
tmp1 = [(1, u'alpha'), (2, u'alpha'), (2, u'beta'), (3, u'alpha'), (1, u'epsilon'), (1, u'delta')]
tmp2 = [(1, u'delta'), (2, u'alpha'), (2, u'beta'), (3, u'alpha'), (1, u'epsilon'), (1, u'alpha')]

oneRDD = sc.parallelize(tmp1)
twoRDD = sc.parallelize(tmp2)
oneSorted = oneRDD.sortByKey(True).collect()
twoSorted = twoRDD.sortByKey(True).collect()
print(oneSorted)
print(twoSorted)
assert set(oneSorted) == set(twoSorted)     # Note that both lists have the same elements
assert twoSorted[0][0] < twoSorted[-1][0]   # Check that the first key is smaller than the last key
assert oneSorted[0:2] != twoSorted[0:2]     # Note that the subset consisting of the first two elements does not match
```
```
[(1, u'alpha'), (1, u'epsilon'), (1, u'delta'), (2, u'alpha'), (2, u'beta'), (3, u'alpha')]
[(1, u'delta'), (1, u'epsilon'), (1, u'alpha'), (2, u'alpha'), (2, u'beta'), (3, u'alpha')]
```
##### Solution

A better technique is to sort the RDD by both the key and value, which we can do by combining the key and value into a single string and then sorting on that string.

```
def sortFunction(pair):
    """ Construct the sort string (does not perform actual sorting)
    Args:
        pair: (rating, MovieName)
    Returns:
        sortString: the value to sort with, 'rating MovieName'
    """
    key = u'%.3f' % pair[0]   # format the key with three decimals
    value = pair[1]
    return (key + u' ' + value)

print(oneRDD.sortBy(sortFunction, True).collect())
print(twoRDD.sortBy(sortFunction, True).collect())
```
```
[(1, u'alpha'), (1, u'delta'), (1, u'epsilon'), (2, u'alpha'), (2, u'beta'), (3, u'alpha')]
[(1, u'alpha'), (1, u'delta'), (1, u'epsilon'), (2, u'alpha'), (2, u'beta'), (3, u'alpha')]
```
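
One caveat with a string sort key is that strings compare character by character, so a key of 10 formatted as `'10.000'` sorts before `'2.000'`. A possible alternative, sketched here in plain Python on a local list rather than an RDD, is to sort on the `(key, value)` tuple itself. The helper names and the sample data are assumptions made for illustration.

```python
def string_sort_key(pair):
    """String key in the style of the solution above: 'key value', key formatted to 3 decimals."""
    return '%.3f' % pair[0] + ' ' + pair[1]


def pair_sort_key(pair):
    """Hypothetical alternative: compare the key numerically, then the value."""
    return (pair[0], pair[1])


data = [(10, 'alpha'), (2, 'beta'), (2, 'alpha')]

by_string = sorted(data, key=string_sort_key)
by_tuple = sorted(data, key=pair_sort_key)

print(by_string)  # [(10, 'alpha'), (2, 'alpha'), (2, 'beta')]  ('1...' sorts before '2...')
print(by_tuple)   # [(2, 'alpha'), (2, 'beta'), (10, 'alpha')]
```

In Spark the tuple key should carry over directly, for example `oneRDD.sortBy(lambda kv: (kv[0], kv[1]))`, since tuples compare element by element.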

### sortBy

`sortBy()` sorts the RDD by the value that `keyfunc` returns for each element.

```
sortBy(keyfunc, ascending=True, numPartitions=None)
```
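
As far as the PySpark source goes, `sortBy` is built on `sortByKey`: each element is paired with `keyfunc(element)`, the pairs are sorted by that key, and the keys are dropped again. The plain-Python sketch below mirrors those three steps on a local list. `sort_by_local` is a hypothetical helper for illustration, not Spark code.

```python
def sort_by_local(elements, keyfunc, ascending=True):
    """Local sketch of sortBy: key each element, sort by that key, drop the key."""
    # Step 1: pair every element with its sort key (like rdd.keyBy(keyfunc))
    keyed = [(keyfunc(x), x) for x in elements]
    # Step 2: order the pairs by the key only (like sortByKey(ascending))
    keyed_sorted = sorted(keyed, key=lambda kv: kv[0], reverse=not ascending)
    # Step 3: keep only the original elements (like .values())
    return [x for _, x in keyed_sorted]


pairs = [(1, 'b'), (3, 'a'), (2, 'c')]

by_value = sort_by_local(pairs, keyfunc=lambda kv: kv[1])
by_key_desc = sort_by_local(pairs, keyfunc=lambda kv: kv[0], ascending=False)

print(by_value)     # [(3, 'a'), (1, 'b'), (2, 'c')]
print(by_key_desc)  # [(3, 'a'), (2, 'c'), (1, 'b')]
```

The corresponding RDD calls would be `rdd.sortBy(lambda kv: kv[1])` and `rdd.sortBy(lambda kv: kv[0], ascending=False)`.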