Spark - Configuration


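The configuration of Spark is mostly a set of key-value properties (for instance spark.master). This page shows how to list them and how to set them.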



Web UI

The Environment tab of the Web UI lists the Spark properties of the running application.

Spark SQL

set;
spark.app.id    local-1530396649793
spark.app.name  SparkSQL::
spark.driver.port       4850
spark.executor.id       driver
spark.master    local
spark.sql.catalogImplementation hive
spark.sql.hive.version  1.2.1
spark.sql.warehouse.dir C:\spark-2.2.0-metastore\spark-warehouse
spark.submit.deployMode client


PySpark - pyspark shell (command line)

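# sc is the SparkContext that the pyspark shell creates at startup
confs = sc.getConf().getAll()
# Same as with a spark session
# confs = spark.sparkContext.getConf().getAll()
# getAll() returns a list of (key, value) tuples
for key, value in confs:
    print(key, value)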
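
Since getAll() returns a list of (key, value) tuples, the list can be filtered like any other Python list. The sketch below keeps only the properties under a given prefix. To run without a Spark installation, it uses a hard-coded sample copied from the Spark SQL output above instead of a live context, and filter_conf is a hypothetical helper name.

```python
# Sample of what getAll() returns: a list of (key, value) tuples.
# Values are copied from the Spark SQL output above so that the
# sketch runs without Spark.
confs = [
    ("spark.driver.port", "4850"),
    ("spark.master", "local"),
    ("spark.sql.catalogImplementation", "hive"),
    ("spark.sql.hive.version", "1.2.1"),
    ("spark.submit.deployMode", "client"),
]


def filter_conf(confs, prefix):
    """Keep the properties whose key starts with prefix (hypothetical helper)."""
    return {key: value for key, value in confs if key.startswith(prefix)}


# Keep only the Spark SQL properties
sql_conf = filter_conf(confs, "spark.sql.")
for key in sorted(sql_conf):
    print(key, sql_conf[key])
```

In the pyspark shell, the sample list would be replaced by sc.getConf().getAll().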



  • The spark-submit script can pass configuration from the command line or from a properties file



See the Config file section below.
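
As a rough sketch of both options, the Python snippet below builds a spark-submit command line: each property is passed with the --conf key=value option, and a properties file with the --properties-file option. The helper name build_spark_submit_args, the script name app.py and the file name my-spark.conf are hypothetical and chosen only for illustration.

```python
import shlex


def build_spark_submit_args(script, conf=None, properties_file=None):
    """Build the argument list of a spark-submit call (hypothetical helper)."""
    args = ["spark-submit"]
    if properties_file is not None:
        # Read the properties from this file
        args += ["--properties-file", properties_file]
    for key, value in (conf or {}).items():
        # One --conf option per property
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


args = build_spark_submit_args(
    "app.py",  # hypothetical script name
    conf={"spark.master": "local", "spark.executor.memory": "2g"},
    properties_file="my-spark.conf",  # hypothetical file name
)
# Print the command as it would be typed in a shell
print(shlex.join(args))
```

The resulting list can be passed as-is to subprocess.run. When the same property is given in both places, the value passed with --conf should take precedence over the one in the properties file.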

Config file

The config files (spark-defaults.conf, etc.) are searched in order of precedence at the following locations:

  • SPARK_CONF_DIR environment variable
  • SPARK_HOME/conf

On Linux, a redirection is used to set the conf directory to /etc/spark2/conf
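
The lookup order above can be sketched in Python as follows. This is a simplified illustration, not the code of the Spark launcher scripts: resolve_conf_dir is a hypothetical helper, and /opt/spark is an assumed installation path used only for the example.

```python
import os


def resolve_conf_dir(env=None):
    """Return the directory where the config files would be looked up.

    Simplified sketch: SPARK_CONF_DIR wins, otherwise SPARK_HOME/conf.
    """
    env = os.environ if env is None else env
    conf_dir = env.get("SPARK_CONF_DIR")
    if conf_dir:
        return conf_dir
    spark_home = env.get("SPARK_HOME")
    if spark_home:
        return os.path.join(spark_home, "conf")
    return None  # neither variable is set


# Hypothetical environments, for illustration only
print(resolve_conf_dir({"SPARK_CONF_DIR": "/etc/spark2/conf", "SPARK_HOME": "/opt/spark"}))
print(resolve_conf_dir({"SPARK_HOME": "/opt/spark"}))
```

Called without argument, the function reads the environment of the current process.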


Example of a spark-defaults.conf for a YARN cluster (spark.master yarn):

spark.driver.extraJavaOptions -Dhdp.version= -Detwlogger.component=sparkdriver -DlogFilter.filename=SparkLogFilters.xml -DpatternGroup.filename=SparkPatternGroups.xml -Dlog4jspark.root.logger=INFO,console,RFA,ETW,Anonymizer -Dlog4jspark.log.dir=/var/log/sparkapp/${} -Dlog4jspark.log.file=sparkdriver.log -Dlog4j.configuration=file:/usr/hdp/current/spark2-client/conf/ -XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:InitiatingHeapOccupancyPercent=45
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.eventLog.dir wasb:///hdp/spark2-events
spark.eventLog.enabled true
spark.executor.cores 3
spark.executor.extraJavaOptions -Dhdp.version= -Detwlogger.component=sparkexecutor -DlogFilter.filename=SparkLogFilters.xml -DpatternGroup.filename=SparkPatternGroups.xml -Dlog4jspark.root.logger=INFO,console,RFA,ETW,Anonymizer -Dlog4jspark.log.dir=/var/log/sparkapp/${} -Dlog4jspark.log.file=sparkexecutor.log -Dlog4j.configuration=file:/usr/hdp/current/spark2-client/conf/ -XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:InitiatingHeapOccupancyPercent=45
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.instances 10
spark.executor.memory 9728m
spark.history.fs.logDirectory wasb:///hdp/spark2-events
spark.history.kerberos.keytab none
spark.history.kerberos.principal none
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.master yarn
spark.yarn.access.namenodes hdfs://mycluster
spark.yarn.appMasterEnv.PYSPARK3_PYTHON /usr/bin/anaconda/envs/py35/bin/python3
spark.yarn.appMasterEnv.PYSPARK_PYTHON /usr/bin/anaconda/bin/python
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.jars local:///usr/hdp/current/spark2-client/jars/*
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3


Example of a spark-defaults.conf for a standalone cluster (spark.master spark://master:7077):

spark.master                     spark://master:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8021/directory
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              5g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
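
In the examples above, each line holds a property name followed by whitespace and then the value, which may itself contain spaces. A minimal Python sketch that reads this layout into a dictionary could look as follows. It is a simplification: it only splits on the first run of whitespace and treats lines starting with # as comments, so it does not cover every syntax that Spark itself accepts. parse_spark_defaults is a hypothetical helper name.

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf style text into a dict (simplified sketch)."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # The key is the first token, the value is the rest of the line
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        properties[key] = value
    return properties


# Sample taken from the example above
sample = """
spark.master                     spark://master:7077
spark.eventLog.enabled           true
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
"""
props = parse_spark_defaults(sample)
print(props["spark.master"])
print(props["spark.executor.extraJavaOptions"])
```

Splitting only once keeps values such as the Java options intact, spaces and quotes included.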


