Spark - Hive

About

Enable

  • The SparkSession must be instantiated with Hive support
String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport() // Hive support
  .getOrCreate();

Default

Users who do not have an existing Hive deployment can still enable Hive support.

When not configured by the hive-site.xml, the context automatically:

  • creates metastore_db in the current directory
  • creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started.

Management

Configuration

Dependency

Since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution.

On all of the worker nodes, the following must be installed on the classpath:

File

Configuration of Hive is done by placing:

in:

Options

Server

Metastore

Example of configuration file for a local installation in a test environment.

<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>C:\spark-2.2.0-metastore\scratchdir</value>
    <description>Scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>C:\spark-2.2.0-metastore\spark-warehouse</value>
    <description>Spark Warehouse</description>
  </property>
  <property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:c:/spark-2.2.0-metastore/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
 
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
  
</configuration>

Database

The default metastore is created with Derby if not set.

Warehouse

Hive - Warehouse

Search Algorithm:

  • If the setting value of spark.sql.warehouse.dir is not null, spark.sql.warehouse.dir
  • If the setting value of hive.metastore.warehouse.dir is not null, hive.metastore.warehouse.dir
  • otherwise workingDir/spark-warehouse/

The doc said that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0.

Example of log output:

18/07/01 00:10:50 INFO SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('C:\spark-2.2.0-metastore\spark-warehouse').
18/07/01 00:10:50 INFO SharedState: Warehouse path is 'C:\spark-2.2.0-metastore\spark-warehouse'.

Normally, it will be found in the hive-site.xml file but you can change it via code:

Path warehouseLocation = Paths.get("target","spark-warehouse");
SparkSession spark = SparkSession
      .builder()
      .appName("Java Spark Hive Example")
      .config("spark.sql.warehouse.dir", warehouseLocation.toAbsolutePath().toString())
      .enableHiveSupport()
      .getOrCreate();

Table

From saving-to-persistent-tables

DataFrames can also be saved as persistent tables into Hive metastore using the saveAsTable command.

A DataFrame for a persistent table can be created by calling the table method on a SparkSession with the name of the table.

Internal

Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore.

Partition Hive DDLs are supported

ALTER TABLE PARTITION ... SET LOCATION 

External

For an external table, you can specify the table path via the path option

df.write.option("path", "/some/path").saveAsTable("t")

The partition information is not gathered by default when creating external datasource tables (those with a path option). To sync the partition information in the metastore, you can invoke MSCK REPAIR TABLE.

Documentation / Reference


Powered by ComboStrap