Spark - Configuration

About

The configuration of Spark consists mostly of:

  • Spark properties (set per application)
  • environment variables (set per machine, for example in spark-env.sh)
  • logging settings (log4j.properties)

Management

List

Web UI

The configuration of a running application is listed in the Environment tab of its web UI.

Spark SQL

In the Spark SQL shell, the SET command (without arguments) lists the configuration properties that have been set:

set;
spark.app.id    local-1530396649793
spark.app.name  SparkSQL::192.168.71.10
spark.driver.host       192.168.71.10
spark.driver.port       4850
spark.executor.id       driver
spark.jars
spark.master    local
spark.sql.catalogImplementation hive
spark.sql.hive.version  1.2.1
spark.sql.warehouse.dir C:\spark-2.2.0-metastore\spark-warehouse
spark.submit.deployMode client

PySpark

In the pyspark shell (command line):

# sc is the SparkContext created by the pyspark shell
confs = sc.getConf().getAll()
# Same with a Spark session:
# confs = spark.sparkContext.getConf().getAll()
for key, value in confs:
    print(key, value)
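A single property can also be read through the session's runtime configuration. A minimal sketch, assuming an existing spark session (as in the pyspark shell); the property names are only illustrative:

# Read one property from the runtime configuration;
# the second argument is a default returned when the key is not set.
app_name = spark.conf.get("spark.app.name", "unknown")
shuffle_partitions = spark.conf.get("spark.sql.shuffle.partitions", "200")
print(app_name, shuffle_partitions)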

Set

Submit

  • The spark-submit script can pass configuration from the command line (--conf key=value) or from a properties file (--properties-file), as in the sketch below
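A minimal sketch; the master URL, property values, properties file and application script are placeholders:

spark-submit \
  --master yarn \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=3 \
  --properties-file my-spark.conf \
  my_app.py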

Code
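Configuration can also be set programmatically when building the context or session. A minimal PySpark sketch; the application name, master URL and property values are only illustrative:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Properties set in code take precedence over spark-defaults.conf
conf = SparkConf() \
    .setAppName("my-app") \
    .setMaster("local[2]") \
    .set("spark.executor.memory", "2g")

spark = SparkSession.builder \
    .config(conf=conf) \
    .getOrCreate()

# Runtime-changeable properties can also be set on an existing session
spark.conf.set("spark.sql.shuffle.partitions", "8")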

File

See the config file section below.

Config file

The configuration files (spark-defaults.conf, spark-env.sh, log4j.properties, etc.) are searched, in order of precedence, at the following locations:

  • the directory given by the SPARK_CONF_DIR environment variable
  • SPARK_HOME/conf

On Linux, the conf directory is often redirected (for instance with a symbolic link) to /etc/spark2/conf.
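A minimal sketch of pointing Spark at an explicit configuration directory (the path is only illustrative):

# SPARK_CONF_DIR takes precedence over SPARK_HOME/conf
export SPARK_CONF_DIR=/etc/spark2/conf
pyspark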

Spark

Example of a spark-defaults.conf file:

spark.driver.extraJavaOptions -Dhdp.version= -Detwlogger.component=sparkdriver -DlogFilter.filename=SparkLogFilters.xml -DpatternGroup.filename=SparkPatternGroups.xml -Dlog4jspark.root.logger=INFO,console,RFA,ETW,Anonymizer -Dlog4jspark.log.dir=/var/log/sparkapp/${user.name} -Dlog4jspark.log.file=sparkdriver.log -Dlog4j.configuration=file:/usr/hdp/current/spark2-client/conf/log4j.properties -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:InitiatingHeapOccupancyPercent=45
spark.driver.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.eventLog.dir wasb:///hdp/spark2-events
spark.eventLog.enabled true
spark.executor.cores 3
spark.executor.extraJavaOptions -Dhdp.version= -Detwlogger.component=sparkexecutor -DlogFilter.filename=SparkLogFilters.xml -DpatternGroup.filename=SparkPatternGroups.xml -Dlog4jspark.root.logger=INFO,console,RFA,ETW,Anonymizer -Dlog4jspark.log.dir=/var/log/sparkapp/${user.name} -Dlog4jspark.log.file=sparkexecutor.log -Dlog4j.configuration=file:/usr/hdp/current/spark2-client/conf/log4j.properties -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -XX:+UseG1GC -XX:+PrintFlagsFinal -XX:+PrintReferenceGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:InitiatingHeapOccupancyPercent=45
spark.executor.extraLibraryPath /usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.instances 10
spark.executor.memory 9728m
spark.history.fs.logDirectory wasb:///hdp/spark2-events
spark.history.kerberos.keytab none
spark.history.kerberos.principal none
spark.history.provider org.apache.spark.deploy.history.FsHistoryProvider
spark.history.ui.port 18080
spark.master yarn
spark.yarn.access.namenodes hdfs://mycluster
spark.yarn.appMasterEnv.PYSPARK3_PYTHON /usr/bin/anaconda/envs/py35/bin/python3
spark.yarn.appMasterEnv.PYSPARK_PYTHON /usr/bin/anaconda/bin/python
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.yarn.executor.memoryOverhead 384
spark.yarn.historyServer.address hn0.internal.cloudapp.net:18080
spark.yarn.jars local:///usr/hdp/current/spark2-client/jars/*
spark.yarn.preserve.staging.files false
spark.yarn.queue default
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.submit.file.replication 3

Template

Example values from the conf/spark-defaults.conf.template file shipped with the Spark distribution:

spark.master                     spark://master:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8021/directory
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              5g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

Hive

Example of a hive-site.xml file (placed in the conf directory) that configures the Hive metastore used by Spark SQL:

<configuration>
	<property>
	  <name>hive.exec.scratchdir</name>
	  <value>hdfs://mycluster/tmp/hive</value>
	</property>
	<property>
	  <name>hive.metastore.client.connect.retry.delay</name>
	  <value>5</value>
	</property>
	<property>
	  <name>hive.metastore.client.socket.timeout</name>
	  <value>1800</value>
	</property>
	<property>
	  <name>hive.metastore.uris</name>
	  <value>thrift://hn0.internal.cloudapp.net:9083,thrift://hn1.internal.cloudapp.net:9083</value>
	</property>
	<property>
	  <name>hive.server2.enable.doAs</name>
	  <value>false</value>
	</property>
	<property>
	  <name>hive.server2.thrift.http.path</name>
	  <value>/</value>
	</property>
	<property>
	  <name>hive.server2.thrift.http.port</name>
	  <value>10002</value>
	</property>
	<property>
	  <name>hive.server2.thrift.port</name>
	  <value>10016</value>
	</property>
	<property>
	  <name>hive.server2.transport.mode</name>
	  <value>http</value>
	</property>
</configuration>
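In code, Hive support is enabled on the session so that Spark SQL uses the Hive metastore configured above. A minimal PySpark sketch; the application name and warehouse path are only illustrative:

from pyspark.sql import SparkSession

# enableHiveSupport() sets spark.sql.catalogImplementation to hive,
# so the session uses the Hive metastore (configured via hive-site.xml)
spark = SparkSession.builder \
    .appName("hive-example") \
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW TABLES").show()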
