Spark - Driver


About

The driver is a (daemon|service) wrapper created when you get a Spark context (connection). It looks after the lifecycle of the Spark job.

Spark supports many cluster managers. In the context of Yarn, the driver is a synonym for the application master.

The driver:

  • starts as its own service (daemon),
  • connects to a cluster manager,
  • manages its workers (executors). It listens for and accepts incoming connections from its workers (executors) throughout its lifetime. As such, the driver program must be network-addressable from the worker nodes.


You can then create your Spark application interactively.

Every driver has a single SparkContext object.

See Spark - Cluster
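
For instance, launching an interactive shell starts a driver process on the local machine and exposes its single SparkContext as sc (a minimal sketch; the yarn master URL is just an example):

# Starting spark-shell creates the driver process, which holds the single SparkContext (available as sc)
spark-shell --master yarn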

Management

Memory

Spark - Configuration

  • spark.driver.memory

The driver in Yarn is the application master. See application master memory.

  • PermGen is controlled by spark.driver.extraJavaOptions:
spark-shell --conf spark.driver.extraJavaOptions="-XX:MaxPermSize=384m"
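
For instance, the driver heap size (spark.driver.memory) can also be set at launch time with the equivalent --driver-memory flag (a minimal sketch; 2g is just an example value):

spark-shell --driver-memory 2g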

Core

Number of threads (i.e. cores) used by the driver.

Spark - Configuration:

  • spark.driver.cores (only applies in cluster mode)
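
For instance, when submitting in cluster mode (a minimal sketch; the application jar and class name are placeholders):

spark-submit --master yarn --deploy-mode cluster --driver-cores 2 --class com.example.MyApp my-app.jar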

PID

# Directory where Spark daemons store their pid files (default: /tmp); typically set in spark-env.sh
export SPARK_PID_DIR=/var/run/spark2

Service port

Spark - Configuration

Conf              | Default          | Desc
spark.driver.host | (local hostname) | Hostname or IP address for the driver. This is used for communicating with the executors and the standalone Master.
spark.driver.port | (random)         | Port for the driver to listen on. This is used for communicating with the executors and the standalone Master.
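
For instance, to pin the driver to a fixed address and port so that the executors can reach it (for example through a firewall), a minimal sketch where the address and port are just example values:

spark-shell --conf spark.driver.host=192.168.1.10 --conf spark.driver.port=40000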

UI

Defaults to 4040.

See Spark - Web UI (Driver UI)
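
If port 4040 is already in use (for instance by another driver on the same machine), the UI port can be changed via spark.ui.port (a minimal sketch; 4041 is just an example value):

spark-shell --conf spark.ui.port=4041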

Machine

The driver machine is the single machine where the driver runs (and therefore the machine that initiates the Spark job and where summary results are collected).

It can be:

  • client: the local machine (the machine from which the job was submitted)
  • cluster: a machine (node) allocated by the resource manager inside the cluster

For instance, on the Yarn cluster manager, see Yarn deployment mode.
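
A minimal sketch of both deploy modes (the application jar and class name are placeholders):

# client mode: the driver runs on the local (submitting) machine
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar

# cluster mode: the driver runs on a node allocated by the resource manager
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar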




