Spark - (Executor) Cache



Caching in Spark is performed per executor: each executor has its own cache in which RDD partitions can be stored.

The cache usage of each executor can be inspected in the Spark - Web UI (Driver UI), under the Storage tab.

Spark Caching


  • Without cache(), lines is recomputed for every action:
lines = sc.textFile("...", 4)
comments = lines.filter(isComment)
print(lines.count())
print(comments.count())  # lines is recomputed from the source file
  • With cache(), lines is NOT recomputed but read from the cache:
lines = sc.textFile("...", 4)
lines.cache()  # mark lines for caching; the cache is filled on the first action
comments = lines.filter(isComment)
print(lines.count())     # first action: computes lines and fills the cache
print(comments.count())  # lines is read from the cache, not recomputed
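The recompute-versus-cache behavior above can be illustrated without a cluster. The sketch below uses a hypothetical LazyData class (an assumption for illustration, not Spark's API) that, like an RDD, is lazy and recomputed on every action unless cache() is called; a counter tracks how often the base data is materialized.

```python
# Minimal stand-in for a lazy, recomputed dataset (NOT Spark itself).
class LazyData:
    def __init__(self, compute):
        self._compute = compute    # function that produces the records
        self._cached = None        # filled by cache()
        self.materializations = 0  # how many times compute() has run

    def _records(self):
        if self._cached is not None:
            return self._cached    # served from the cache
        self.materializations += 1
        return self._compute()     # recomputed from scratch

    def cache(self):
        # Note: real Spark caching is lazy (filled on the first action);
        # here we fill the cache eagerly to keep the sketch short.
        self.materializations += 1
        self._cached = self._compute()
        return self

    def filter(self, pred):
        # A dependent dataset re-reads its parent on every action.
        return LazyData(lambda: [r for r in self._records() if pred(r)])

    def count(self):
        return len(self._records())

data = ["# a comment", "code", "# another comment"]

# Without cache(): each action re-materializes the base dataset.
lines = LazyData(lambda: list(data))
comments = lines.filter(lambda l: l.startswith("#"))
lines.count()
comments.count()
assert lines.materializations == 2  # recomputed for the second action

# With cache(): later actions read the cached records.
cached = LazyData(lambda: list(data)).cache()
cached.filter(lambda l: l.startswith("#")).count()
cached.count()
assert cached.materializations == 1  # computed once, then cached
```

This mirrors the two bullets above: the first `lines` is materialized once per action, while the cached dataset is computed exactly once.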

