Broadcast variables are an efficient way of sending data once that would otherwise be sent multiple times automatically in closures.
Broadcast variables allow us to keep a read-only variable cached at a worker.
We can send it to the worker only once instead of having to send it with each task that we perform at that worker.
Now usually, broadcast variables are distributed using very efficient broadcast algorithms.
The most common usage is to give every worker a large data set or table.
broadcastVar = sc.broadcast([1, 2, 3])
broadcastVar.value [1, 2, 3]
Lookup the locations of the call signs on the RDD contactCounts. We load a list of call sign prefixes to country code to support this lookup
signPrefixes = loadCallSignTable()
def processSignCount(sign_count, signPrefixes):
country = lookupCountry(sign_count[0], signPrefixes)
count = sign_count[1]
return (country, count)
countryContactCounts = (contactCounts
.map(processSignCount)
.reduceByKey((lambda x, y: x+ y))
)
signPrefixes = sc.broadcast(loadCallSignTable())
def processSignCount(sign_count, signPrefixes.value):
country = lookupCountry(sign_count[0], signPrefixes)
count = sign_count[1]
return (country, count)
countryContactCounts = (contactCounts
.map(processSignCount)
.reduceByKey((lambda x, y: x+ y))
)