Spark - Local Installation


About

A local installation is a Spark installation on a single machine (generally a development machine).

Connecting with the local master URL starts a standalone Spark instance on your machine.

These steps were written for a Windows laptop.

Connection URL

The master connection URL local starts the standalone Spark cluster manager locally:

  • local: with one worker thread
  • local[N]: with N worker threads

Example with sparklyr:

sc <- sparklyr::spark_connect(master = "local")

where master is the Spark master connection URL.
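
The same master URL can be passed to the shells that ship with the distribution. A minimal sketch on Windows (the installation path is an assumption, see the installation steps below):

REM Start the Scala shell with two local worker threads
C:\spark-2.2.0-bin-hadoop2.7\bin\spark-shell --master local[2]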

Installation Steps

This is a manual installation; you may also want to check the semi-automatic sparklyr installation described below.


Pre-built unarchive

Download the “Pre-built for Hadoop X.X and later” package of the latest release of Spark and simply unpack it.

The packages are hosted at https://d3kbcqa49mib13.cloudfront.net/. For example, to download the version spark-2.2.0-bin-hadoop2.7.tgz you would use: https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz

Once it is unpacked, you should be able to run the spark-shell script from the package's bin directory.
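
For example, to unpack the archive and start the shell on Windows (a sketch; the download location is an assumption, and tar is only built into recent Windows versions, otherwise use 7-Zip):

REM Unpack the downloaded archive to C:\
cd C:\
tar -xzf %USERPROFILE%\Downloads\spark-2.2.0-bin-hadoop2.7.tgz
REM Start the interactive Scala shell from the package's bin directory
C:\spark-2.2.0-bin-hadoop2.7\bin\spark-shell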

Env

SPARK_HOME

The SPARK_HOME environment variable points to the Spark installation directory.

Set the SPARK_HOME environment variable. It is used to locate:

  • winutils (on Windows), in the bin directory
  • the conf files, first in SPARK_HOME/conf, then in SPARK_HOME/hadoop/conf
 
SET SPARK_HOME=C:\spark-2.2.0-bin-hadoop2.7

HADOOP_HOME

Set the HADOOP_HOME environment variable. It is used to locate:

  • winutils (on Windows), in HADOOP_HOME/bin
  • the conf files, in HADOOP_HOME/conf
 
SET HADOOP_HOME=%SPARK_HOME%\hadoop
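
The vanilla pre-built package does not ship a hadoop subdirectory, so you may have to create the layout that HADOOP_HOME points to (a sketch, assuming the directory structure used on this page):

REM Create the bin and conf directories under %HADOOP_HOME%
mkdir %HADOOP_HOME%\bin
mkdir %HADOOP_HOME%\conf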

Classpath

The conf files are searched within the classpath in this order:

  • SPARK_HOME/conf
  • HADOOP_HOME/conf

Example command line when starting the Spark SQL CLI, where you can see that the classpath (-cp) contains the two conf locations:

java 
-cp "C:\spark-2.2.0-bin-hadoop2.7\bin\..\conf\;C:\spark-2.2.0-bin-hadoop2.7\bin\..\jars\*;C:\spark-2.2.0-bin-hadoop2.7\hadoop\conf" 
-Xmx1g org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver spark-internal

In your IDE, be sure to add these two directories to your classpath.

Example with IDEA:

[Screenshot: IDEA classpath configuration with the two Spark conf directories]

Winutils

For Windows only:

  • Download winutils.exe and put it under %HADOOP_HOME%\bin (see the sketch below)
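
For example (a sketch; the download location is an assumption, winutils builds for Hadoop 2.7 are available in community repositories such as github.com/steveloughran/winutils):

REM Copy the downloaded winutils.exe into %HADOOP_HOME%\bin
copy %USERPROFILE%\Downloads\winutils.exe %HADOOP_HOME%\bin\winutils.exe
REM Smoke test: list the Spark home directory through winutils
%HADOOP_HOME%\bin\winutils.exe ls %SPARK_HOME%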

Hive

See also: Spark - Hive.

The Hive configuration goes in %HADOOP_HOME%\conf\hive-site.xml.

Example configuration file for a test environment where the base directory for Hive is C:\spark-2.2.0-hive\:

<configuration>
    <property>
        <name>hive.exec.scratchdir</name>
        <value>C:\spark-2.2.0-hive\scratchdir</value>
        <description>Scratch space for Hive jobs</description>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>C:\spark-2.2.0-hive\spark-warehouse</value>
        <description>Spark Warehouse</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:derby:c:/spark-2.2.0-metastore/metastore_db;create=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.apache.derby.jdbc.EmbeddedDriver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
</configuration>

Directory

The Hive configuration has two important directories that must be writable:

  • the scratch dir: a temporary runtime directory (default value c:\tmp\hive, common location %TEMP%\hive)
  • the warehouse dir: the directory where the internal Hive data is saved

Steps:

  • Make the directories and make them writable:
set SPARK-SCRATCHDIR=C:\spark-2.2.0-hive\scratchdir
set SPARK-WAREHOUSE=C:\spark-2.2.0-hive\warehouse
mkdir %SPARK-SCRATCHDIR%
mkdir %SPARK-WAREHOUSE%
%HADOOP_HOME%\bin\winutils.exe chmod -R 777 %SPARK-SCRATCHDIR%
%HADOOP_HOME%\bin\winutils.exe chmod -R 777 %SPARK-WAREHOUSE%
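
You can verify the permissions with winutils (a sketch, assuming the directories above):

REM The listing should show drwxrwxrwx for both directories
%HADOOP_HOME%\bin\winutils.exe ls %SPARK-SCRATCHDIR%
%HADOOP_HOME%\bin\winutils.exe ls %SPARK-WAREHOUSE%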

Metastore

The metastore is a local Derby metastore because the Derby jar is already located in SPARK_HOME/jars.

If, at startup, you see an error saying that the driver cannot be found, this is caused by a faulty JDBC URL. Verify your URL:

<property>
	<name>javax.jdo.option.ConnectionURL</name>
	<value>jdbc:derby:c:/spark-2.2.0-metastore/metastore_db;create=true</value>
	<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
	<name>javax.jdo.option.ConnectionDriverName</name>
	<value>org.apache.derby.jdbc.EmbeddedDriver</value>
	<description>Driver class name for a JDBC metastore</description>
</property>

You may install and configure a local SQL server if you want to access the metastore while Spark is running, because the default embedded Derby installation allows only one connection to the database at a time.
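
A quick way to check that the metastore configuration is picked up (a sketch; the spark.sql statement is typed at the scala> prompt of the shell):

REM Start the Scala shell; the Derby metastore_db directory is created on first use
%SPARK_HOME%\bin\spark-shell
scala> spark.sql("show databases").show()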

Sparklyr

sparklyr has a function to install a local Spark instance.

  • Install a local Spark version:
# Check the available versions
spark_available_versions()
# Install the one that you want locally
spark_install(version = "1.6.2")
Installing Spark 1.6.2 for Hadoop 2.6 or later.
Downloading from:
- 'https://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz'
Installing to:
- 'C:\Users\gerardn\AppData\Local\rstudio\spark\Cache/spark-1.6.2-bin-hadoop2.6'
trying URL 'https://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz'
Content type 'application/x-tar' length 278057117 bytes (265.2 MB)
downloaded 265.2 MB

Installation complete.

  • Restart RStudio and verify that HADOOP_HOME is set:
Sys.getenv("HADOOP_HOME")
[1] "C:\\Users\\gerardn\\AppData\\Local\\rstudio\\spark\\Cache\\spark-1.6.2-bin-hadoop2.6\\tmp\\hadoop"
