Spark - Yarn


About

YARN is a cluster manager supported by Spark.

Mode

The deployment mode sets where the driver will run:

Mode                 Client               Cluster
Interactive coding   Yes                  No
Driver machine       The client machine   The cluster
Process              Synchronous          Asynchronous (background)

Example:

./bin/spark-shell --master yarn --deploy-mode client
./bin/spark-submit --master yarn --deploy-mode cluster

Steps

Configuration

The HADOOP_CONF_DIR or YARN_CONF_DIR environment variable points to the directory which contains the (client side) configuration files for the Hadoop cluster.

These configuration files allow Spark to read:

  • the HDFS configuration, in order to connect to HDFS (HADOOP_CONF_DIR)
  • the YARN configuration, in order to connect to the YARN ResourceManager (YARN_CONF_DIR). The ResourceManager’s address is picked up from this Hadoop configuration.

Example: Set the HADOOP_CONF_DIR or YARN_CONF_DIR

set YARN_CONF_DIR=C:\Users\gerardn\Downloads\YARN_CLIENT
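On Linux, the equivalent uses export (the path below is a placeholder for wherever the client configuration files live):

export YARN_CONF_DIR=/path/to/YARN_CLIENT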
  • Copy the yarn-site.xml file into the conf directory. If you start from the default file, change at a minimum (see the sketch after this list):
    • yarn.resourcemanager.hostname to the hostname of the ResourceManager
    • yarn.client.nodemanager-connect.max-wait-ms to 10000 (10 sec)
    • yarn.resourcemanager.connect.max-wait.ms to 10000 (10 sec, total time to retry before failing)
    • yarn.resourcemanager.connect.retry-interval.ms to 10000 (10 sec, interval between two retries)
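A minimal yarn-site.xml sketch with these four properties (rm-host.example.com is a placeholder for your ResourceManager hostname):

<configuration>
  <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>rm-host.example.com</value> <!-- placeholder: your ResourceManager host -->
  </property>
  <property>
      <name>yarn.client.nodemanager-connect.max-wait-ms</name>
      <value>10000</value> <!-- 10 sec -->
  </property>
  <property>
      <name>yarn.resourcemanager.connect.max-wait.ms</name>
      <value>10000</value> <!-- 10 sec: total time before failing -->
  </property>
  <property>
      <name>yarn.resourcemanager.connect.retry-interval.ms</name>
      <value>10000</value> <!-- 10 sec: interval between retries -->
  </property>
</configuration>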

Deployment mode

Cluster

The master value is yarn, not a cluster URL; the ResourceManager’s address is picked up from the Hadoop configuration.

With spark-submit

./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar \
    10

where:

  • --class is the entry point of the application (here the SparkPi example)
  • --master yarn connects to YARN; the ResourceManager address is read from the Hadoop configuration
  • --deploy-mode cluster runs the driver inside the cluster
  • --driver-memory, --executor-memory and --executor-cores size the driver and each executor
  • --queue is the YARN queue to submit to
  • the last two arguments are the application jar and its application option (here 10, the number of partitions used by SparkPi)

Client

Shell feedback

  • A YARN client program is started along with an Application Master (in the example above, the default one).
  • The client periodically polls the Application Master for status updates and displays them in the console.
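From another terminal, the YARN command-line client can show the same application (a hedged example, assuming the yarn CLI is on the PATH and, for the logs command, that log aggregation is enabled; the application id is a placeholder):

yarn application -list
yarn logs -applicationId <application id>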

This example uses a Windows Spark installation connecting to a YARN cluster.

  • Start a shell (the client machine needs to be on the same network and reachable from all nodes)
:: To locate winutils
set HADOOP_HOME=C:\spark-2.2.0-bin-hadoop2.7
REM suppress the HADOOP_HOME\conf files if you don't want them to be used

REM Then point to the directory that holds the client-side configuration files
set HADOOP_CONF_DIR=%HADOOP_HOME%\conf
set YARN_CONF_DIR=%HADOOP_HOME%\conf
set HADOOP_BIN=%HADOOP_HOME%\bin

REM the user
set HADOOP_USER_NAME=gnicolas

cd %HADOOP_BIN%

spark-shell.cmd --master yarn --deploy-mode client
REM or
pyspark.cmd --master yarn --deploy-mode client
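Once the shell is up, a quick smoke test (a sketch, not from the original page) confirms that work actually runs on YARN; in pyspark:

sc.master                            # should print 'yarn'
sc.parallelize(range(1000)).count()  # should return 1000, computed by the YARN executors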

Note

Azure Conf

  • Remove the decryption properties from core-site.xml:
<property>
      <name>fs.azure.account.keyprovider.basisinfrasharedrgp122.blob.core.windows.net</name>
      <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>
<property>
      <name>fs.azure.shellkeyprovider.script</name>
      <value>/usr/lib/hdinsight-common/scripts/decrypt.sh</value>
</property>
  • Add the Azure storage JAR files (hadoop-azure and azure-storage) to the classpath, for example as shown below.
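An example via the --jars option (the paths and version numbers below are placeholders; use the versions that match your Hadoop build):

spark-shell.cmd --master yarn --deploy-mode client ^
  --jars C:\jars\hadoop-azure-2.7.3.jar,C:\jars\azure-storage-2.0.0.jar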

FYI: Conf file sent to the cluster

Example of the conf file sent by Spark in client deploy mode, where 10.0.75.1 is the IP of the host machine (the client).

  • Sent by the Spark shell:
spark.yarn.cache.visibilities=PRIVATE
spark.yarn.cache.timestamps=1553518131341
spark.executor.id=driver
spark.driver.host=10.0.75.1
spark.yarn.cache.confArchive=file\:/C\:/Users/gerard/.sparkStaging/application_1553465137181_5816/__spark_conf__.zip
spark.yarn.cache.sizes=208833138
spark.jars=
spark.sql.catalogImplementation=hive
spark.home=C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..
spark.submit.deployMode=client
spark.yarn.queue=root.development
spark.master=yarn
spark.yarn.cache.filenames=file\:/C\:/Users/gerard/AppData/Local/Temp/spark-3a55ab80-2afe-4de2-be7b-0f5cc792c168/__spark_libs__9157723267130265104.zip\#__spark_libs__
spark.yarn.cache.types=ARCHIVE
spark.driver.appUIAddress=http\://10.0.75.1\:4040
spark.repl.class.outputDir=C\:\\Users\\gerard\\AppData\\Local\\Temp\\spark-3a55ab80-2afe-4de2-be7b-0f5cc792c168\\repl-66e09de6-41c3-47ab-9589-f8f95578432c
spark.app.name=Spark shell
spark.repl.class.uri=spark\://10.0.75.1\:10361/classes
spark.driver.port=10361
  • Sent by PySpark:
spark.executorEnv.PYTHONPATH=C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..\\python\\lib\\py4j-0.10.4-src.zip;C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..\\python;<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.4-src.zip
spark.yarn.cache.visibilities=PRIVATE,PRIVATE,PRIVATE
spark.yarn.cache.timestamps=1553513892305,1498864159000,1498864159000
spark.executor.id=driver
spark.driver.host=10.0.75.1
spark.yarn.cache.confArchive=file\:/C\:/Users/gerard/.sparkStaging/application_1553465137181_5377/__spark_conf__.zip
spark.yarn.isPython=true
spark.yarn.cache.sizes=208833138,480115,74096
spark.sql.catalogImplementation=hive
spark.submit.deployMode=client
spark.master=yarn
spark.yarn.cache.filenames=file\:/C\:/Users/gerard/AppData/Local/Temp/spark-c5350af3-fabd-469e-bfc3-565eb0f6ed4b/__spark_libs__2786045563156883095.zip\#__spark_libs__,file\:/C\:/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip\#pyspark.zip,file\:/C\:/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip\#py4j-0.10.4-src.zip
spark.serializer.objectStreamReset=100
spark.yarn.cache.types=ARCHIVE,FILE,FILE
spark.driver.appUIAddress=http\://10.0.75.1\:4040
spark.rdd.compress=True
spark.app.name=PySparkShell
spark.driver.port=6067
