Spark - Configuration


About

The configuration of Spark is mostly done through: Spark properties (spark-defaults.conf or programmatic settings), environment variables (spark-env.sh), and logging settings (log4j.properties).

Management

List

Web UI

The Environment tab of the Spark Web UI shows the Spark properties of a running application.

Spark SQL

In the spark-sql shell, running the set command without an argument lists the current configuration:

set;
spark.app.id    local-1530396649793
spark.app.name  SparkSQL::192.168.71.10
spark.driver.host       192.168.71.10
spark.driver.port       4850
spark.executor.id       driver
spark.jars
spark.master    local
spark.sql.catalogImplementation hive
spark.sql.hive.version  1.2.1
spark.sql.warehouse.dir C:\spark-2.2.0-metastore\spark-warehouse
spark.submit.deployMode client

PySpark

In the pyspark shell (command line), the configuration can be listed from the SparkContext (available as sc):

confs = sc.getConf().getAll()
# Same as with a Spark session:
# confs = spark.sparkContext.getConf().getAll()
for conf in confs:
    print(conf[0], conf[1])

Set

Submit

  • The spark-submit script can pass configuration from the command line or from a properties file, as in the sketch below
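A minimal sketch of both forms (the application name, property value, and properties file path are illustrative assumptions, not values from this page):

spark-submit --master yarn --conf spark.executor.memory=4g my_app.py
spark-submit --properties-file /path/to/my-spark.conf my_app.py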

Code
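Properties can also be set programmatically when building the session. A minimal PySpark sketch (the application name and property values are illustrative assumptions):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("config-example")                    # illustrative name
         .config("spark.executor.memory", "4g")        # example property
         .config("spark.sql.shuffle.partitions", "50")
         .getOrCreate())

# Read a property back from the running session
print(spark.conf.get("spark.sql.shuffle.partitions"))

Note that some properties (such as spark.driver.memory in client mode) must be set before the driver JVM starts and therefore cannot be changed from application code.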

File

See the config file section below.

Config file

The configuration files (spark-defaults.conf, spark-env.sh, log4j.properties, etc.) are searched for, in order of precedence, in the following locations:

  • the directory given by the SPARK_CONF_DIR environment variable
  • SPARK_HOME/conf

On Linux, a redirection (typically a symbolic link) is used to set the conf directory to /etc/spark2/conf.
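The search location can also be made explicit by exporting the variable before launching Spark (the path below reuses the directory mentioned above):

export SPARK_CONF_DIR=/etc/spark2/conf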

Spark

Example of a spark-defaults.conf file from a YARN cluster:

spark.driver.extraJavaOptions -Dhdp.version= -Detwlogger.component=sparkdriver -DlogFilter.filename=SparkLogFilters.xml -DpatternGroup.filename=SparkPatternGroups.xml -Dlog4jspark.root.logger=INFO,console,RFA,ETW,Anonymizer -Dlog4jspark.log.dir=/var/log/sparkapp/${user.name} -Dlog4jspark.log.file=sparkdriver.log -Dlog4j.configuration=file:/usr/hdp/current/spark2-client/conf/log4j.properties -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:InitiatingHeapOccupancyPercent=45
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.eventLog.dir wasb:///hdp/spark2-events
spark.eventLog.enabled true
spark.executor.cores 3
spark.executor.extraJavaOptions -Dhdp.version= -Detwlogger.component=sparkexecutor -DlogFilter.filename=SparkLogFilters.xml -DpatternGroup.filename=SparkPatternGroups.xml -Dlog4jspark.root.logger=INFO,console,RFA,ETW,Anonymizer -Dlog4jspark.log.dir=/var/log/sparkapp/${user.name} -Dlog4jspark.log.file=sparkexecutor.log -Dlog4j.configuration=file:/usr/hdp/current/spark2-client/conf/log4j.properties -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:InitiatingHeapOccupancyPercent=45
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.instances 10
spark.executor.memory 9728m
spark.history.fs.logDirectory wasb:///hdp/spark2-events
spark.history.kerberos.keytab none
spark.history.kerberos.principal none
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.master yarn
spark.yarn.access.namenodes hdfs://mycluster
spark.yarn.appMasterEnv.PYSPARK3_PYTHON /usr/bin/anaconda/envs/py35/bin/python3
spark.yarn.appMasterEnv.PYSPARK_PYTHON /usr/bin/anaconda/bin/python
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address hn0.internal.cloudapp.net:18080
spark.yarn.jars local:///usr/hdp/current/spark2-client/jars/*
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3

Template

The spark-defaults.conf.template file shipped in the conf directory shows the expected format:

spark.master                     spark://master:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8021/directory
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              5g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

Hive

Spark SQL reads its Hive settings (metastore location, transport, etc.) from a hive-site.xml file placed in the configuration directory. Example:

<configuration>
	<property>
	  <name>hive.exec.scratchdir</name>
	  <value>hdfs://mycluster/tmp/hive</value>
	</property>
	<property>
	  <name>hive.metastore.client.connect.retry.delay</name>
	  <value>5</value>
	</property>
	<property>
	  <name>hive.metastore.client.socket.timeout</name>
	  <value>1800</value>
	</property>
	<property>
	  <name>hive.metastore.uris</name>
	  <value>thrift://hn0.internal.cloudapp.net:9083,thrift://hn1.internal.cloudapp.net:9083</value>
	</property>
	<property>
	  <name>hive.server2.enable.doAs</name>
	  <value>false</value>
	</property>
	<property>
	  <name>hive.server2.thrift.http.path</name>
	  <value>/</value>
	</property>
	<property>
	  <name>hive.server2.thrift.http.port</name>
	  <value>10002</value>
	</property>
	<property>
	  <name>hive.server2.thrift.port</name>
	  <value>10016</value>
	</property>
	<property>
	  <name>hive.server2.transport.mode</name>
	  <value>http</value>
	</property>
</configuration>
