About
Yarn is a cluster manager supported by Spark.
Mode
The deployment mode sets where the driver will run:
- In client mode, the driver runs in the client process (i.e. on the current machine), and the application master is only used to request resources from YARN. This is the mode to use with a REPL (e.g. the Spark shell) for interactive coding.
- In cluster mode, the driver runs inside an application master process managed by YARN on the cluster, and the client can go away after initiating the application. Cluster mode lets YARN choose the driver machine.
| Mode | Client | Cluster |
|---|---|---|
| Interactive coding | Yes | No |
| Driver machine | The client machine | The cluster |
| Process | Synchronous | Asynchronous (background) |
Example:
- with the Spark - Shell (client mode):
./bin/spark-shell --master yarn --deploy-mode client
- with spark-submit (cluster mode):
./bin/spark-submit --master yarn --deploy-mode cluster
Steps
Configuration
The HADOOP_CONF_DIR or YARN_CONF_DIR environment variable points to the directory which contains the (client side) configuration files for the Hadoop cluster.
These configs permit Spark to read the configuration:
- of YARN, in order to connect to the YARN ResourceManager (YARN_CONF_DIR). The ResourceManager’s address is picked up from the Hadoop configuration.
Example: Set the HADOOP_CONF_DIR or YARN_CONF_DIR
set YARN_CONF_DIR=C:\Users\gerardn\Downloads\YARN_CLIENT
- Copy the yarn-site.xml file into the conf directory. If you start from the default file, change at minimum:
- yarn.resourcemanager.hostname to the ResourceManager hostname
- yarn.client.nodemanager-connect.max-wait-ms to 10000 (10 sec)
- yarn.resourcemanager.connect.max-wait.ms to 10000 (10 sec, total time to retry before failing)
- yarn.resourcemanager.connect.retry-interval.ms to 10000 (10 sec)
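Putting the properties above together, a minimal yarn-site.xml might look like this (the hostname rm-host.example.com is a placeholder to replace with your ResourceManager host):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Placeholder: the hostname of your ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm-host.example.com</value>
  </property>
  <!-- Fail fast instead of retrying for the long defaults -->
  <property>
    <name>yarn.client.nodemanager-connect.max-wait-ms</name>
    <value>10000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.connect.max-wait.ms</name>
    <value>10000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.connect.retry-interval.ms</name>
    <value>10000</value>
  </property>
</configuration>
```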
Deployment mode
Cluster
The master value is yarn, not a cluster URL: the ResourceManager’s address is picked up from the Hadoop configuration.
With spark-submit
./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
examples/jars/spark-examples*.jar \
10
where:
- master points to the cluster manager. See Spark - Master (Connection URL)
- deploy-mode is the deployment mode
- queue is the YARN queue to submit to. See Yarn - Queue
- driver-memory sets the memory of the driver. See Spark - Driver
- executor-memory and executor-cores configure the executors. See Spark - Executor (formerly Worker)
Client
Shell feedback
- A YARN client program is started along with an Application Master (in the above example, the default one)
- The client periodically polls the Application Master for status updates and displays them in the console.
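Besides the console feedback, the application can also be followed from the YARN CLI on any machine that can reach the cluster (the application id below is the one from the conf example further down; yours comes from the submission output):

```shell
# List the running YARN applications
yarn application -list
# Status of one application
yarn application -status application_1553465137181_5816
# Aggregated container logs (once log aggregation is enabled)
yarn logs -applicationId application_1553465137181_5816
```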
Spark installation on Yarn.
- Start a shell (you need to be on the same network and reachable from all nodes)
:: To locate winutils
set HADOOP_HOME=C:\spark-2.2.0-bin-hadoop2.7
REM delete the HADOOP_HOME\conf files if you don't want them to be used
REM Then
set HADOOP_CONF_DIR=%HADOOP_HOME%\confAap
set YARN_CONF_DIR=%HADOOP_HOME%\confAap
set HADOOP_BIN=%HADOOP_HOME%\bin
REM the user
set HADOOP_USER_NAME=gnicolas
cd %HADOOP_BIN%
spark-shell.cmd --master yarn --deploy-mode client
REM or
pyspark.cmd --master yarn --deploy-mode client
Note
Azure Conf
- Remove the decryption properties in core-site.xml
<property>
<name>fs.azure.account.keyprovider.basisinfrasharedrgp122.blob.core.windows.net</name>
<value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>
<property>
<name>fs.azure.shellkeyprovider.script</name>
<value>/usr/lib/hdinsight-common/scripts/decrypt.sh</value>
</property>
- Add the Azure jar files for the storage
FYI: Conf file sent to the cluster
Example of the conf file sent by Spark in client deploy mode, where 10.0.75.1 is the IP of the host machine (the client):
- sent by the Spark shell
spark.yarn.cache.visibilities=PRIVATE
spark.yarn.cache.timestamps=1553518131341
spark.executor.id=driver
spark.driver.host=10.0.75.1
spark.yarn.cache.confArchive=file\:/C\:/Users/gerard/.sparkStaging/application_1553465137181_5816/__spark_conf__.zip
spark.yarn.cache.sizes=208833138
spark.jars=
spark.sql.catalogImplementation=hive
spark.home=C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..
spark.submit.deployMode=client
spark.yarn.queue=root.development
spark.master=yarn
spark.yarn.cache.filenames=file\:/C\:/Users/gerard/AppData/Local/Temp/spark-3a55ab80-2afe-4de2-be7b-0f5cc792c168/__spark_libs__9157723267130265104.zip\#__spark_libs__
spark.yarn.cache.types=ARCHIVE
spark.driver.appUIAddress=http\://10.0.75.1\:4040
spark.repl.class.outputDir=C\:\\Users\\gerard\\AppData\\Local\\Temp\\spark-3a55ab80-2afe-4de2-be7b-0f5cc792c168\\repl-66e09de6-41c3-47ab-9589-f8f95578432c
spark.app.name=Spark shell
spark.repl.class.uri=spark\://10.0.75.1\:10361/classes
spark.driver.port=10361
- Sent by pySpark
spark.executorEnv.PYTHONPATH=C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..\\python\\lib\\py4j-0.10.4-src.zip;C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..\\python;<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.4-src.zip
spark.yarn.cache.visibilities=PRIVATE,PRIVATE,PRIVATE
spark.yarn.cache.timestamps=1553513892305,1498864159000,1498864159000
spark.executor.id=driver
spark.driver.host=10.0.75.1
spark.yarn.cache.confArchive=file\:/C\:/Users/gerard/.sparkStaging/application_1553465137181_5377/__spark_conf__.zip
spark.yarn.isPython=true
spark.yarn.cache.sizes=208833138,480115,74096
spark.sql.catalogImplementation=hive
spark.submit.deployMode=client
spark.master=yarn
spark.yarn.cache.filenames=file\:/C\:/Users/gerard/AppData/Local/Temp/spark-c5350af3-fabd-469e-bfc3-565eb0f6ed4b/__spark_libs__2786045563156883095.zip\#__spark_libs__,file\:/C\:/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip\#pyspark.zip,file\:/C\:/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip\#py4j-0.10.4-src.zip
spark.serializer.objectStreamReset=100
spark.yarn.cache.types=ARCHIVE,FILE,FILE
spark.driver.appUIAddress=http\://10.0.75.1\:4040
spark.rdd.compress=True
spark.app.name=PySparkShell
spark.driver.port=6067
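The conf files above use Java-properties-style escaping (a backslash before : and #, doubled backslashes). As an illustration only (this helper is not part of Spark), a minimal Python sketch that reads such key=value lines back into a dict:

```python
def parse_spark_conf(text):
    """Parse key=value lines with Java-properties-style escapes."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Undo the common escapes (\: \# \\); a full properties parser
        # would also handle unicode escapes and line continuations.
        for esc, plain in (("\\:", ":"), ("\\#", "#"), ("\\\\", "\\")):
            value = value.replace(esc, plain)
        conf[key.strip()] = value
    return conf

sample = "spark.master=yarn\nspark.driver.appUIAddress=http\\://10.0.75.1\\:4040"
print(parse_spark_conf(sample)["spark.driver.appUIAddress"])  # http://10.0.75.1:4040
```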