db:spark:rdd:broadcast

About

Broadcast variables are an efficient way of sending data once that would otherwise be sent multiple times automatically in closures.

Enable to efficiently send large read-only values to all of the workers.
Saved at workers for use in one or more Spark operations
It's like sending a large, read-only lookup table to all the nodes
Ship to each worker only once instead of with each task

Broadcast variables allow us to keep a read-only variable cached at a worker.

We can send it to the worker only once instead of having to send it with each task that we perform at that worker.

Now usually, broadcast variables are distributed using very efficient broadcast algorithms.

The most common usage is to give every worker a large data set or table.

Articles Related

API

At the driver:

broadcastVar = sc.broadcast([1, 2, 3])

At a worker (in code passed via a closure)

broadcastVar.value [1, 2, 3]

Example

Lookup the locations of the call signs on the RDD contactCounts. We load a list of call sign prefixes to country code to support this lookup

Without broadcasting

signPrefixes = loadCallSignTable()

def processSignCount(sign_count, signPrefixes): 
	country = lookupCountry(sign_count[0], signPrefixes) 
	count = sign_count[1] 
	return (country, count) 

countryContactCounts = (contactCounts 
	.map(processSignCount)
	.reduceByKey((lambda x, y: x+ y))
	)

With broadcast

signPrefixes = sc.broadcast(loadCallSignTable())

def processSignCount(sign_count, signPrefixes.value): 
	country = lookupCountry(sign_count[0], signPrefixes) 
	count = sign_count[1] 
	return (country, count) 

countryContactCounts = (contactCounts 
	.map(processSignCount)
	.reduceByKey((lambda x, y: x+ y))
	)

Documentation / Reference

Learning Spark: Lightning-Fast Big Data Analysis

Table of Contents

Spark - Broadcast variables