YARN is a cluster manager supported by Spark.
The deploy mode determines where the driver runs:
Mode | Client | Cluster |
---|---|---|
Interactive coding | Yes | No |
Driver machine | The client machine | The cluster |
Process | Synchronous (foreground) | Asynchronous (background) |
Example:
./bin/spark-shell --master yarn --deploy-mode client
./bin/spark-submit --master yarn --deploy-mode cluster
The HADOOP_CONF_DIR or YARN_CONF_DIR environment variable points to the directory that contains the (client-side) configuration files for the Hadoop cluster.
These files are how Spark discovers the cluster configuration:
Example: Set the HADOOP_CONF_DIR or YARN_CONF_DIR
set YARN_CONF_DIR=C:\Users\gerardn\Downloads\YARN_CLIENT
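The Linux/macOS equivalent is an `export`; a minimal sketch, where the path is hypothetical and would hold the core-site.xml / yarn-site.xml files copied from the cluster:

```shell
# Point Spark at the directory holding the client-side Hadoop config files
# (the path below is a hypothetical example)
export YARN_CONF_DIR=$HOME/yarn-client-conf
echo "$YARN_CONF_DIR"
```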
The master value is `yarn`, not a cluster URL: the ResourceManager's address is picked up from the Hadoop configuration.
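Concretely, the address comes from an entry such as the following in the client-side yarn-site.xml (host and port below are hypothetical; 8032 is the default ResourceManager port):

```xml
<!-- sketch of a client-side yarn-site.xml entry; host is a placeholder -->
<property>
    <name>yarn.resourcemanager.address</name>
    <value>rm-host.example.com:8032</value>
</property>
```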
With spark-submit:
./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
examples/jars/spark-examples*.jar \
10
Spark on YARN from a Windows client:
:: To locate winutils
set HADOOP_HOME=C:\spark-2.2.0-bin-hadoop2.7
REM remove the files under HADOOP_HOME\conf if you don't want them to be used
REM Then
set HADOOP_CONF_DIR=%HADOOP_HOME%\confAap
set YARN_CONF_DIR=%HADOOP_HOME%\confAap
set HADOOP_BIN=%HADOOP_HOME%\bin
REM the user
set HADOOP_USER_NAME=gnicolas
cd %HADOOP_BIN%
spark-shell.cmd --master yarn --deploy-mode client
REM or
pyspark.cmd --master yarn --deploy-mode client
Example of client-side Hadoop configuration properties (here, Azure storage key-provider settings taken from an HDInsight cluster's core-site.xml):
<property>
    <name>fs.azure.account.keyprovider.basisinfrasharedrgp122.blob.core.windows.net</name>
    <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>
<property>
    <name>fs.azure.shellkeyprovider.script</name>
    <value>/usr/lib/hdinsight-common/scripts/decrypt.sh</value>
</property>
FYI: example of the conf file sent by Spark in client deploy mode, where 10.0.75.1 is the IP of the host machine (the client):
spark.yarn.cache.visibilities=PRIVATE
spark.yarn.cache.timestamps=1553518131341
spark.executor.id=driver
spark.driver.host=10.0.75.1
spark.yarn.cache.confArchive=file\:/C\:/Users/gerard/.sparkStaging/application_1553465137181_5816/__spark_conf__.zip
spark.yarn.cache.sizes=208833138
spark.jars=
spark.sql.catalogImplementation=hive
spark.home=C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..
spark.submit.deployMode=client
spark.yarn.queue=root.development
spark.master=yarn
spark.yarn.cache.filenames=file\:/C\:/Users/gerard/AppData/Local/Temp/spark-3a55ab80-2afe-4de2-be7b-0f5cc792c168/__spark_libs__9157723267130265104.zip\#__spark_libs__
spark.yarn.cache.types=ARCHIVE
spark.driver.appUIAddress=http\://10.0.75.1\:4040
spark.repl.class.outputDir=C\:\\Users\\gerard\\AppData\\Local\\Temp\\spark-3a55ab80-2afe-4de2-be7b-0f5cc792c168\\repl-66e09de6-41c3-47ab-9589-f8f95578432c
spark.app.name=Spark shell
spark.repl.class.uri=spark\://10.0.75.1\:10361/classes
spark.driver.port=10361
spark.executorEnv.PYTHONPATH=C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..\\python\\lib\\py4j-0.10.4-src.zip;C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..\\python;<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.4-src.zip
And the same kind of conf file, sent for a pyspark shell:
spark.yarn.cache.visibilities=PRIVATE,PRIVATE,PRIVATE
spark.yarn.cache.timestamps=1553513892305,1498864159000,1498864159000
spark.executor.id=driver
spark.driver.host=10.0.75.1
spark.yarn.cache.confArchive=file\:/C\:/Users/gerard/.sparkStaging/application_1553465137181_5377/__spark_conf__.zip
spark.yarn.isPython=true
spark.yarn.cache.sizes=208833138,480115,74096
spark.sql.catalogImplementation=hive
spark.submit.deployMode=client
spark.master=yarn
spark.yarn.cache.filenames=file\:/C\:/Users/gerard/AppData/Local/Temp/spark-c5350af3-fabd-469e-bfc3-565eb0f6ed4b/__spark_libs__2786045563156883095.zip\#__spark_libs__,file\:/C\:/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip\#pyspark.zip,file\:/C\:/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip\#py4j-0.10.4-src.zip
spark.serializer.objectStreamReset=100
spark.yarn.cache.types=ARCHIVE,FILE,FILE
spark.driver.appUIAddress=http\://10.0.75.1\:4040
spark.rdd.compress=True
spark.app.name=PySparkShell
spark.driver.port=6067
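To check which master and deploy mode an application actually ran with, the properties dump above can simply be grepped. A minimal sketch, where the sample file only mimics the dump (the real file is the __spark_conf__ archive Spark ships under .sparkStaging):

```shell
# Write a tiny sample file in the style of the conf dump above
# (hypothetical content, standing in for the real __spark_conf__ properties)
cat > /tmp/spark_conf_sample.properties <<'EOF'
spark.master=yarn
spark.submit.deployMode=client
spark.app.name=PySparkShell
EOF

# Keep only the master / deploy-mode lines
grep -E '^spark\.(master|submit\.deployMode)=' /tmp/spark_conf_sample.properties
# prints:
#   spark.master=yarn
#   spark.submit.deployMode=client
```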