RDD - Calling a Worker (Local|External) Process

How to call a (forked) external process from Spark



Spark - (Map|flatMap)

import socket
import subprocess

def callEcho(x):
    # Fork an external process, capture its output, and report the worker host
    p = subprocess.Popen(['echo', 'Hello World', str(x), '!'],
                         stdin=None, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                         universal_newlines=True)  # decode stdout/stderr to str
    out, err = p.communicate()
    return out, err, socket.gethostname()

# An RDD with three partitions
rdd = sc.parallelize([1, 2, 3], 3)
rdd.map(callEcho).collect()
[('Hello World 1 !\n', '', 'wn4-hddev2'), ('Hello World 2 !\n', '', 'wn1-hddev2'), ('Hello World 3 !\n', '', 'wn1-hddev2')]
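The heading also mentions flatMap: when the external process emits several lines, a helper can split its stdout so that flatMap flattens one RDD element per line. A minimal sketch under that assumption (callEchoLines is a hypothetical helper, not part of the example above):

```python
import subprocess

def callEchoLines(x):
    # Hypothetical helper: fork the process, then return one element per
    # stdout line so that rdd.flatMap(callEchoLines) flattens them all
    # into a single RDD of lines
    p = subprocess.Popen(['echo', 'Hello World', str(x), '!'],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                         universal_newlines=True)
    out, _ = p.communicate()
    return out.splitlines()

# The helper can be tested outside Spark:
print(callEchoLines(1))  # ['Hello World 1 !']
```

On the three-partition RDD above, rdd.flatMap(callEchoLines).collect() would return one flat list of output lines instead of one tuple per element.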


RDD - Pipe

pipe is a transformation: it returns an RDD created by piping the elements of each partition to a forked external process. Example with a Spark RDD and the Spark Context (sc, sparkContext):

sc = spark.sparkContext
sc.parallelize(['1', '2', '', '3']).pipe('cat').collect()
[u'1', u'2', u'', u'3']
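Conceptually, pipe forks the command once per partition, writes each element to the process's stdin (one per line), and reads its stdout back line by line. That behaviour can be sketched without a cluster (pipe_partition is a hypothetical illustration, not Spark API):

```python
import subprocess

def pipe_partition(elements, command):
    # Hypothetical sketch of what RDD.pipe does for one partition:
    # feed each element to the forked process's stdin, one per line,
    # and return the lines of its stdout as the new partition
    p = subprocess.Popen(command, shell=True,
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                         universal_newlines=True)
    out, _ = p.communicate('\n'.join(elements) + '\n')
    return out.splitlines()

print(pipe_partition(['1', '2', '', '3'], 'cat'))  # ['1', '2', '', '3']
```

With 'cat' the elements come back unchanged, which matches the collect() output above; any line-oriented command (a shell script, grep, tr, ...) can be substituted.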
