About
Broadcast variables are an efficient way to send data to the workers once, instead of letting Spark ship it automatically with every closure.
- They let you efficiently send a large, read-only value to all of the workers.
- The value is cached at each worker for use in one or more Spark operations.
- It's like sending a large, read-only lookup table to all the nodes.
- The value is shipped to each worker only once, instead of with each task.
Broadcast variables let us keep a read-only variable cached at each worker: the value is sent to a worker only once, rather than with each task we run on that worker.
Spark distributes broadcast variables using efficient broadcast algorithms to reduce communication cost.
The most common usage is to give every worker a copy of a large data set or lookup table.
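For instance, here is a minimal self-contained sketch (the lookup table, names and sample call signs below are made up for illustration):
from pyspark import SparkContext

sc = SparkContext("local[*]", "broadcast-demo")

# Hypothetical lookup table: call-sign prefix -> country
country_by_prefix = {"KK": "USA", "VE": "Canada", "W1": "USA"}
prefixes = sc.broadcast(country_by_prefix)   # shipped to each worker only once

calls = sc.parallelize(["KK6JKQ", "VE3UOW", "W1AW"])
countries = calls.map(lambda sign: prefixes.value.get(sign[:2], "Unknown")).collect()
# ['USA', 'Canada', 'USA']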
API
- At the driver:
broadcastVar = sc.broadcast([1, 2, 3])
broadcastVar.value   # [1, 2, 3]
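- At the workers, the broadcast value is read through .value inside a task. A small illustrative sketch (the RDD below is made up):
rdd = sc.parallelize([0, 2, 1, 2])
rdd.map(lambda i: broadcastVar.value[i]).collect()   # [1, 3, 2, 3]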
Example
Look up the country of each call sign in the RDD contactCounts. We load a table mapping call sign prefixes to country codes to support this lookup.
Without broadcasting
signPrefixes = loadCallSignTable()

def processSignCount(sign_count, signPrefixes):
    country = lookupCountry(sign_count[0], signPrefixes)
    count = sign_count[1]
    return (country, count)

# signPrefixes is captured in the closure and re-sent with every task
countryContactCounts = (contactCounts
                        .map(lambda sign_count: processSignCount(sign_count, signPrefixes))
                        .reduceByKey(lambda x, y: x + y))
With broadcast
signPrefixes = sc.broadcast(loadCallSignTable())   # shipped to each worker only once

def processSignCount(sign_count, signPrefixes):
    country = lookupCountry(sign_count[0], signPrefixes.value)
    count = sign_count[1]
    return (country, count)

countryContactCounts = (contactCounts
                        .map(lambda sign_count: processSignCount(sign_count, signPrefixes))
                        .reduceByKey(lambda x, y: x + y))
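When the lookup table is no longer needed, the cached copies can be released (a sketch: unpersist removes the executor-side copies and the variable can be re-broadcast if it is used again; destroy removes it for good):
signPrefixes.unpersist()
# signPrefixes.destroy()   # also removes the driver-side copy; the variable cannot be used afterwards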