RDD - Calling a Worker (Local|External) Process

About

How to call a (forked) external process from Spark

Snippet

map

Spark - (Map|flatMap)

import socket
import subprocess

def callEcho(x):
    # Fork an external 'echo' process on the worker and capture its output
    p = subprocess.Popen(['echo', 'Hello World', str(x), '!'],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    # Decode the bytes to str (Python 3) and report which worker ran the call
    return out.decode(), err.decode(), socket.gethostname()

# An RDD with three partitions
rdd = sc.parallelize([1,2,3],3)
rdd.map(callEcho).collect()
[('Hello World 1 !\n', '', 'wn4-hddev2'), ('Hello World 2 !\n', '', 'wn1-hddev2'), ('Hello World 3 !\n', '', 'wn1-hddev2')]
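One pitfall of calling a process from map is that a non-zero exit code is silently ignored: the task succeeds and the error only shows up in the collected stderr. A minimal sketch of a stricter variant that checks the return code (the helper name call_with_check is hypothetical, not part of any API):

import subprocess

def call_with_check(args):
    # Run the command and fail the Spark task if it exits non-zero,
    # so the error surfaces on the driver instead of being collected silently
    p = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    if p.returncode != 0:
        raise RuntimeError('command failed: ' + err.decode())
    return out.decode()

call_with_check(['echo', 'ok'])   # returns 'ok\n'

Used inside map (for example rdd.map(lambda x: call_with_check(['echo', str(x)]))), a failing command now fails the task and is retried by Spark rather than returning a partial result.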

pipe

RDD - Pipe

Example with a Spark Context (sc, sparkContext)

sc = spark.sparkContext
sc.parallelize(['1', '2', '', '3']).pipe('cat').collect()
['1', '2', '', '3']
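Conceptually, pipe feeds each element of a partition to the command's stdin as one line and yields each line of the command's stdout as one output element. A rough sketch of that per-partition behavior with plain subprocess, runnable without Spark (the helper name pipe_partition is hypothetical; 'cat' is assumed to be on the PATH):

import subprocess

def pipe_partition(elements, command):
    # Feed each element as a line on stdin and collect stdout lines,
    # mirroring roughly what RDD.pipe does for a single partition
    p = subprocess.Popen(command, stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, text=True)
    out, _ = p.communicate('\n'.join(elements) + '\n')
    return out.splitlines()

pipe_partition(['1', '2', '', '3'], ['cat'])   # returns ['1', '2', '', '3']

This also explains why the empty string survives the round trip above: it is sent as an empty line and comes back as one.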
