Spark - Yarn


About

Yarn is a cluster manager supported by Spark.

Mode

The deployment mode sets where the driver will run:

Mode                 Client               Cluster
Interactive coding   Yes                  No
Driver machine       The client machine   The cluster
Process              Synchronous          Asynchronous (background)

Example:

./bin/spark-shell --master yarn --deploy-mode client
./bin/spark-submit --master yarn --deploy-mode cluster

Steps

Configuration

The HADOOP_CONF_DIR or YARN_CONF_DIR environment variable points to the directory which contains the (client side) configuration files for the Hadoop cluster.

These configuration files allow Spark to read:

  • the HDFS configuration, in order to connect to HDFS (HADOOP_CONF_DIR)
  • the YARN configuration, in order to connect to the YARN ResourceManager (YARN_CONF_DIR). The ResourceManager's address is picked up from this Hadoop configuration.
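
On a Linux client, this typically amounts to exporting the variables before launching Spark; a minimal sketch, assuming the client configuration files were copied to /etc/hadoop/conf (the path is an assumption):

# the path is an assumption: point it to wherever the client config files live
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=$HADOOP_CONF_DIR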

Example: set the YARN_CONF_DIR on Windows

set YARN_CONF_DIR=C:\Users\gerardn\Downloads\YARN_CLIENT
  • Copy the yarn-site.xml file into this conf directory. If you start from the default file, change at a minimum:
    • yarn.resourcemanager.hostname to the hostname of the ResourceManager
    • yarn.client.nodemanager-connect.max-wait-ms to 10000 (10 sec)
    • yarn.resourcemanager.connect.max-wait.ms to 10000 (10 sec, the total time to retry before failing)
    • yarn.resourcemanager.connect.retry-interval.ms to 10000 (10 sec, the interval between connection retries)
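
A minimal yarn-site.xml sketch with these four overrides (the hostname value is a placeholder):

<configuration>
      <!-- placeholder: replace with the hostname of your ResourceManager -->
      <property>
            <name>yarn.resourcemanager.hostname</name>
            <value>resourcemanager.example.com</value>
      </property>
      <!-- fail fast (10 sec) instead of the long default waits -->
      <property>
            <name>yarn.client.nodemanager-connect.max-wait-ms</name>
            <value>10000</value>
      </property>
      <property>
            <name>yarn.resourcemanager.connect.max-wait.ms</name>
            <value>10000</value>
      </property>
      <property>
            <name>yarn.resourcemanager.connect.retry-interval.ms</name>
            <value>10000</value>
      </property>
</configuration>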

Deployment mode

Cluster

With Yarn, the master value is yarn and not a cluster URL; the ResourceManager's address is picked up from the Hadoop configuration.

With spark-submit

./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar \
    10

where:

  • --class is the entry point of the application (here org.apache.spark.examples.SparkPi)
  • --master yarn selects Yarn as the cluster manager
  • --deploy-mode cluster runs the driver inside the cluster
  • --driver-memory, --executor-memory and --executor-cores size the driver and the executor containers
  • --queue is the Yarn queue the application is submitted to
  • <app jar> is the application jar and [app options] are the arguments passed to its main class (10 in the example)

Client

Shell feedback

  • A YARN client program is started along with an Application Master (in the above example, the default one).
  • The client periodically polls the Application Master for status updates and displays them in the console.
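
To follow or troubleshoot the application outside the console, the standard Yarn CLI can also be used; the application id below is only illustrative (it is the one visible in the staging paths further down) and yarn logs requires log aggregation to be enabled:

# list the running Yarn applications, then fetch the logs of one of them
yarn application -list
yarn logs -applicationId application_1553465137181_5816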

Spark installation on Yarn.

  • Start a shell (the client machine needs to be on the same network and reachable from all nodes)
:: HADOOP_HOME is needed to locate winutils.exe (in %HADOOP_HOME%\bin)
set HADOOP_HOME=C:\spark-2.2.0-bin-hadoop2.7
REM delete the files under HADOOP_HOME\conf if you don't want them to be used

REM Then
set HADOOP_CONF_DIR=%HADOOP_HOME%\confAap
set YARN_CONF_DIR=%HADOOP_HOME%\confAap
set HADOOP_BIN=%HADOOP_HOME%\bin

REM the user
set HADOOP_USER_NAME=gnicolas

cd %HADOOP_BIN%

spark-shell.cmd --master yarn --deploy-mode client
REM or
pyspark.cmd --master yarn --deploy-mode client

Note

Azure Conf

  • Remove the decryption properties from core-site.xml:
<property>
      <name>fs.azure.account.keyprovider.basisinfrasharedrgp122.blob.core.windows.net</name>
      <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>
<property>
      <name>fs.azure.shellkeyprovider.script</name>
      <value>/usr/lib/hdinsight-common/scripts/decrypt.sh</value>
</property>
  • Add the Azure JAR files needed for the storage (see the sketch below).
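
A sketch of how the storage JARs could be passed to the shell with --jars; the file names, versions and paths are assumptions that depend on the Hadoop build in use:

REM the jar locations and versions below are assumptions
spark-shell.cmd --master yarn --deploy-mode client ^
  --jars C:\jars\hadoop-azure-2.7.3.jar,C:\jars\azure-storage-2.0.0.jar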

FYI: Conf file sent to the cluster

FYI: Example of the conf file sent by Spark in client deploy mode, where 10.0.75.1 is the IP of the host machine (the client).

  • Sent by the Spark shell:
spark.yarn.cache.visibilities=PRIVATE
spark.yarn.cache.timestamps=1553518131341
spark.executor.id=driver
spark.driver.host=10.0.75.1
spark.yarn.cache.confArchive=file\:/C\:/Users/gerard/.sparkStaging/application_1553465137181_5816/__spark_conf__.zip
spark.yarn.cache.sizes=208833138
spark.jars=
spark.sql.catalogImplementation=hive
spark.home=C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..
spark.submit.deployMode=client
spark.yarn.queue=root.development
spark.master=yarn
spark.yarn.cache.filenames=file\:/C\:/Users/gerard/AppData/Local/Temp/spark-3a55ab80-2afe-4de2-be7b-0f5cc792c168/__spark_libs__9157723267130265104.zip\#__spark_libs__
spark.yarn.cache.types=ARCHIVE
spark.driver.appUIAddress=http\://10.0.75.1\:4040
spark.repl.class.outputDir=C\:\\Users\\gerard\\AppData\\Local\\Temp\\spark-3a55ab80-2afe-4de2-be7b-0f5cc792c168\\repl-66e09de6-41c3-47ab-9589-f8f95578432c
spark.app.name=Spark shell
spark.repl.class.uri=spark\://10.0.75.1\:10361/classes
spark.driver.port=10361
  • Sent by PySpark:
spark.executorEnv.PYTHONPATH=C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..\\python\\lib\\py4j-0.10.4-src.zip;C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..\\python;<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.4-src.zip
spark.yarn.cache.visibilities=PRIVATE,PRIVATE,PRIVATE
spark.yarn.cache.timestamps=1553513892305,1498864159000,1498864159000
spark.executor.id=driver
spark.driver.host=10.0.75.1
spark.yarn.cache.confArchive=file\:/C\:/Users/gerard/.sparkStaging/application_1553465137181_5377/__spark_conf__.zip
spark.yarn.isPython=true
spark.yarn.cache.sizes=208833138,480115,74096
spark.sql.catalogImplementation=hive
spark.submit.deployMode=client
spark.master=yarn
spark.yarn.cache.filenames=file\:/C\:/Users/gerard/AppData/Local/Temp/spark-c5350af3-fabd-469e-bfc3-565eb0f6ed4b/__spark_libs__2786045563156883095.zip\#__spark_libs__,file\:/C\:/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip\#pyspark.zip,file\:/C\:/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip\#py4j-0.10.4-src.zip
spark.serializer.objectStreamReset=100
spark.yarn.cache.types=ARCHIVE,FILE,FILE
spark.driver.appUIAddress=http\://10.0.75.1\:4040
spark.rdd.compress=True
spark.app.name=PySparkShell
spark.driver.port=6067
