A local installation is a Spark installation on a single machine (generally a dev machine).
These steps were written for a Windows laptop.
The master connection URL local starts Spark locally for you: the driver and the executors run in a single JVM on your machine, so no separate cluster has to be set up.
Example with sparklyr:
sc <- sparklyr::spark_connect(master = "local")
where master is the Spark master connection URL.
This is a manual installation; you may also want to check the semi-automatic sparklyr installation.
Download the “Pre-built for Hadoop X.X and later” package of the latest release of Spark and simply unpack it.
The packages are located at https://d3kbcqa49mib13.cloudfront.net/. For example, to download the version spark-2.2.0-bin-hadoop2.7.tgz, you would use the URL https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz
Once it is unpacked, you should be able to run the spark-shell script from the package’s bin directory.
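The download URL follows a simple pattern built from the Spark version and the Hadoop version. A minimal Python sketch of that pattern is shown here; the function names are hypothetical and only the base URL and the file name layout come from the text above. It only builds the URL and does not download anything.

```python
# Base location of the pre-built packages, as given in the text above.
BASE_URL = "https://d3kbcqa49mib13.cloudfront.net/"


def spark_package_name(spark_version, hadoop_version):
    """Return the archive name, e.g. spark-2.2.0-bin-hadoop2.7.tgz."""
    return f"spark-{spark_version}-bin-hadoop{hadoop_version}.tgz"


def spark_download_url(spark_version, hadoop_version):
    """Return the full download URL for a pre-built Spark package."""
    return BASE_URL + spark_package_name(spark_version, hadoop_version)


if __name__ == "__main__":
    print(spark_download_url("2.2.0", "2.7"))
```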
The SPARK_HOME environment variable gives the installation directory.
Set the SPARK_HOME environment variable. This environment variable is used to locate
SET SPARK_HOME=/pathToSpark
Set the HADOOP_HOME environment variable. The environment variable is used to locate
SET HADOOP_HOME=%SPARK_HOME%\hadoop
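Before starting Spark, it can help to verify that both variables are set and point to existing directories. The following Python sketch is one possible check; the function name check_spark_env is hypothetical and the check is deliberately limited to "is set" and "is a directory".

```python
import os
from pathlib import Path


def check_spark_env(env=None):
    """Return a list of problems found; an empty list means both variables look fine."""
    env = os.environ if env is None else env
    problems = []
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} does not point to a directory: {value}")
    return problems


if __name__ == "__main__":
    # Print every problem found in the current environment (prints nothing if all is well).
    for problem in check_spark_env():
        print(problem)
```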
The conf files are searched within the classpath in this order: first %SPARK_HOME%\conf, then %HADOOP_HOME%\conf.
Example of the command line used when starting the Spark SQL CLI, where you can see that the classpath (cp) has two conf locations:
java
-cp "C:\spark-2.2.0-bin-hadoop2.7\bin\..\conf\;C:\spark-2.2.0-bin-hadoop2.7\bin\..\jars\*;C:\spark-2.2.0-bin-hadoop2.7\hadoop\conf"
-Xmx1g org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver spark-internal
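To reproduce the same classpath elsewhere (for instance in an IDE run configuration), the three entries can be derived from the two home directories. This is a minimal Python sketch under that assumption; build_classpath is a hypothetical helper, and it emits the normalized paths (conf instead of bin\..\conf). It uses ntpath so that Windows-style paths are produced on any platform.

```python
import ntpath  # Windows path rules, available on every platform


def build_classpath(spark_home, hadoop_home):
    """Return the classpath entries in the same order as the command line above."""
    entries = [
        ntpath.join(spark_home, "conf"),       # Spark conf directory
        ntpath.join(spark_home, "jars", "*"),  # all Spark jars
        ntpath.join(hadoop_home, "conf"),      # Hadoop/Hive conf directory
    ]
    # Windows uses ';' as the classpath separator.
    return ";".join(entries)


if __name__ == "__main__":
    spark_home = r"C:\spark-2.2.0-bin-hadoop2.7"
    print(build_classpath(spark_home, ntpath.join(spark_home, "hadoop")))
```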
In your IDE (for example IDEA), be sure to add these two directories to your classpath.
For Windows only:
The Hive configuration is read from %HADOOP_HOME%\conf\hive-site.xml
Example of a configuration file for a test environment where the base directory for Hive is C:\spark-2.2.0-hive\
<configuration>
<property>
<name>hive.exec.scratchdir</name>
<value>C:\spark-2.2.0-hive\scratchdir</value>
<description>Scratch space for Hive jobs</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>C:\spark-2.2.0-hive\spark-warehouse</value>
<description>Spark Warehouse</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:c:/spark-2.2.0-metastore/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
</configuration>
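If the base directories change, the same four properties can be generated instead of edited by hand. The Python sketch below is one way to do it, assuming the directory layout of the example above; hive_site_xml is a hypothetical helper and the description elements are left out for brevity.

```python
import ntpath
import xml.etree.ElementTree as ET


def hive_site_xml(hive_base_dir, metastore_dir):
    """Build a hive-site.xml document with the four properties shown above."""
    # Derby accepts forward slashes in the database path, as in the example URL.
    derby_path = metastore_dir.replace("\\", "/").rstrip("/") + "/metastore_db"
    properties = {
        "hive.exec.scratchdir": ntpath.join(hive_base_dir, "scratchdir"),
        "hive.metastore.warehouse.dir": ntpath.join(hive_base_dir, "spark-warehouse"),
        "javax.jdo.option.ConnectionURL": f"jdbc:derby:{derby_path};create=true",
        "javax.jdo.option.ConnectionDriverName": "org.apache.derby.jdbc.EmbeddedDriver",
    }
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(hive_site_xml(r"C:\spark-2.2.0-hive", "c:/spark-2.2.0-metastore"))
```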
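The Hive configuration has two important directories that must be writable: the scratch directory (hive.exec.scratchdir) and the warehouse directory (hive.metastore.warehouse.dir).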
Steps:
rem The two paths must match the values in hive-site.xml
set SPARK_SCRATCHDIR=C:\spark-2.2.0-hive\scratchdir
set SPARK_WAREHOUSE=C:\spark-2.2.0-hive\spark-warehouse
mkdir %SPARK_SCRATCHDIR%
mkdir %SPARK_WAREHOUSE%
winutils.exe chmod -R 777 %SPARK_SCRATCHDIR%
winutils.exe chmod -R 777 %SPARK_WAREHOUSE%
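As a quick sanity check after these steps, a short Python sketch can create the directories if they are missing and report whether the current user can write to them. The helper name ensure_writable_dirs is hypothetical. It is not a replacement for the winutils.exe chmod commands: os.access is only a coarse check, especially on Windows.

```python
import os


def ensure_writable_dirs(paths):
    """Create each directory if needed and report whether it looks writable."""
    status = {}
    for path in paths:
        os.makedirs(path, exist_ok=True)            # no error if it already exists
        status[path] = os.access(path, os.W_OK)     # coarse writability check
    return status
```

For example, calling ensure_writable_dirs with the scratch directory and the warehouse directory from hive-site.xml returns a dictionary that maps each path to True or False.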
The metastore is a local Derby metastore because the Derby jar is already located in SPARK_HOME/jars.
If, when starting, you see an error saying that it cannot find a driver, this is caused by a faulty JDBC URL. Verify your URL:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:c:/spark-2.2.0-metastore/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
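A faulty URL is often a matter of a wrong prefix or a malformed attribute. The following Python sketch performs a basic syntactic check of an embedded Derby URL of the form jdbc:derby:path;attribute=value. The function parse_derby_url is a hypothetical helper; it only checks the shape of the string and does not contact Derby.

```python
def parse_derby_url(url):
    """Split an embedded Derby JDBC URL into (database path, attributes)."""
    prefix = "jdbc:derby:"
    if not url.startswith(prefix):
        raise ValueError(f"not a Derby JDBC URL: {url}")
    # Attributes follow the database path, separated by ';'.
    database, *attribute_parts = url[len(prefix):].split(";")
    if not database:
        raise ValueError("the database path is empty")
    attributes = {}
    for part in attribute_parts:
        if not part:
            continue  # tolerate a trailing ';'
        key, separator, value = part.partition("=")
        if not separator or not key:
            raise ValueError(f"malformed attribute: {part}")
        attributes[key] = value
    return database, attributes


if __name__ == "__main__":
    print(parse_derby_url("jdbc:derby:c:/spark-2.2.0-metastore/metastore_db;create=true"))
```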
You may install and configure a SQL Server locally if you want to access the metastore while Spark is running, because the default Derby installation allows only one connection to the database.
sparklyr has a function to install a local Spark instance.
library(sparklyr)
# check the available versions
spark_available_versions()
# Install the one that you want locally
spark_install(version = "1.6.2")
Installing Spark 1.6.2 for Hadoop 2.6 or later.
Downloading from:
- 'https://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz'
Installing to:
- 'C:\Users\gerardn\AppData\Local\rstudio\spark\Cache/spark-1.6.2-bin-hadoop2.6'
trying URL 'https://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz'
Content type 'application/x-tar' length 278057117 bytes (265.2 MB)
downloaded 265.2 MB
Installation complete.
Sys.getenv("HADOOP_HOME")
[1] "C:\\Users\\gerardn\\AppData\\Local\\rstudio\\spark\\Cache\\spark-1.6.2-bin-hadoop2.6\\tmp\\hadoop"