Spark - Executor (formerly Worker)

When running on a cluster, each Spark application gets an independent set of executor JVMs that only run tasks and store data for that application.

Workers (now called executors) are processes that run computations and store data for your application.

Worker programs run on cluster nodes (or, in local mode, as threads inside the driver JVM).

There's no communication between workers. See Spark - Cluster

When you perform transformations and actions that use functions, Spark automatically ships a closure containing that function to the workers so that it can run there. One closure is sent to each worker for every task.

Any modifications to global variables made at the workers are not sent back to the driver or to other workers.


Partition and executor

Figure: an RDD with 5 partitions distributed across 3 executors.
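With more partitions than executors, some executors process more than one partition. A minimal round-robin sketch of such an assignment (illustrative only; Spark's actual scheduler also considers data locality):

```python
# Illustrative sketch, not Spark's scheduler: 5 partitions, 3 executors.
partitions = list(range(5))                              # partitions 0..4
executors = ["executor-0", "executor-1", "executor-2"]

assignment = {e: [] for e in executors}
for i, p in enumerate(partitions):
    # Assign partition i to an executor round-robin.
    assignment[executors[i % len(executors)]].append(p)

print(assignment)
# {'executor-0': [0, 3], 'executor-1': [1, 4], 'executor-2': [2]}
```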



Executor memory is set via the spark.executor.memory property. See Spark - Configuration.

Example with spark-shell

spark-shell --conf "spark.executor.memory=4g"
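The same setting can be passed to spark-submit via its --executor-memory flag, which is equivalent to spark.executor.memory (the application file name below is a placeholder):

```shell
# Equivalent to --conf "spark.executor.memory=4g"
spark-submit --executor-memory 4g my_app.py
```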


Number of threads (i.e. cores)

The number of cores per executor is set via the spark.executor.cores property. See Spark - Configuration.
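Analogously to the memory setting, this property can be set on the command line (a sketch; 2 is just an example value):

```shell
spark-shell --conf "spark.executor.cores=2"
```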

Discover More

- PySpark - Closure
- Spark - (Executor) Cache
- Spark - (Reduce|Aggregate) function
- Spark - Accumulator
- Spark - Application Execution Configuration
- Spark - Broadcast variables
- Spark - Cluster
- Spark - Connection (Context)
- Spark - Core (Slot)
- Spark - Daemon
