Spark - Hive


About

Hive is the default Spark catalog.

Since Spark 2.0, Spark SQL supports built-in Hive features such as the HiveQL syntax, Hive SerDes, and Hive UDFs.


(Figure: Spark SQL / Hive architecture)

Enable

  • The SparkSession must be instantiated with Hive support:

import java.io.File;
import org.apache.spark.sql.SparkSession;

// The warehouse location is where managed databases and tables are stored
String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport() // Hive support
  .getOrCreate();
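
With Hive support enabled, HiveQL statements can be run through spark.sql(). A minimal sketch, assuming the src table and the kv1.txt sample file from the standard Spark examples:

// Create a Hive-managed table and load data into it (HiveQL syntax)
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive");
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src");

// Query it like any other Spark SQL table
spark.sql("SELECT key, value FROM src").show();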

Default

Users who do not have an existing Hive deployment can still enable Hive support.

When not configured via hive-site.xml, the context automatically (as sketched in the example below):

  • creates metastore_db in the current directory
  • creates the directory configured by spark.sql.warehouse.dir, which defaults to spark-warehouse in the current directory from which the Spark application is started.
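
A minimal sketch of this fallback behaviour, assuming no hive-site.xml on the classpath, no explicit spark.sql.warehouse.dir, and a placeholder table named demo:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
  .builder()
  .appName("Default Hive Metastore")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate();

// Creating a table now produces ./metastore_db (the embedded Derby metastore)
// and ./spark-warehouse/demo in the directory where the application was started
spark.sql("CREATE TABLE IF NOT EXISTS demo (id INT) USING hive");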

Management

Configuration

Dependency

Since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution.

On all of the worker nodes, the Hive dependencies must also be present on the classpath, as the workers need the Hive serialization and deserialization libraries (SerDes) to access data stored in Hive.

File

Configuration of Hive is done by placing the hive-site.xml, core-site.xml (for security configuration) and hdfs-site.xml (for HDFS configuration) files in the conf/ directory of the Spark installation.

Options

Spark SQL - Conf (Set)

Server

See Spark SQL - Server (Thrift) (STS)

Metastore

Example of a hive-site.xml configuration file for a local installation in a test environment:

<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>C:\spark-2.2.0-metastore\scratchdir</value>
    <description>Scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>C:\spark-2.2.0-metastore\spark-warehouse</value>
    <description>Spark Warehouse</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:c:/spark-2.2.0-metastore/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>

Database

If no metastore database is configured, the metastore is created with an embedded Derby database.

Warehouse

Hive - Warehouse

Search Algorithm:

  • If spark.sql.warehouse.dir is set, its value is used.
  • Otherwise, if hive.metastore.warehouse.dir is set, its value is used.
  • Otherwise, the warehouse defaults to workingDir/spark-warehouse/.

Note that the hive.metastore.warehouse.dir property in hive-site.xml has been deprecated since Spark 2.0.0; use spark.sql.warehouse.dir instead.

Example of log output:

18/07/01 00:10:50 INFO SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('C:\spark-2.2.0-metastore\spark-warehouse').
18/07/01 00:10:50 INFO SharedState: Warehouse path is 'C:\spark-2.2.0-metastore\spark-warehouse'.

Normally, the warehouse location is configured in the hive-site.xml file, but you can also set it in code:

import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.spark.sql.SparkSession;

// Override the warehouse location programmatically
Path warehouseLocation = Paths.get("target", "spark-warehouse");
SparkSession spark = SparkSession
      .builder()
      .appName("Java Spark Hive Example")
      .config("spark.sql.warehouse.dir", warehouseLocation.toAbsolutePath().toString())
      .enableHiveSupport()
      .getOrCreate();

Table

From saving-to-persistent-tables

DataFrames can also be saved as persistent tables into the Hive metastore using the saveAsTable command.

A DataFrame for a persistent table can be created by calling the table method on a SparkSession with the name of the table.
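
A short sketch of both operations, assuming the people.json sample file shipped with the Spark distribution and a table named people:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Persist a DataFrame as a table in the Hive metastore
Dataset<Row> people = spark.read().json("examples/src/main/resources/people.json");
people.write().saveAsTable("people");

// Re-create a DataFrame from the persistent table by name
Dataset<Row> fromTable = spark.table("people");
fromTable.show();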

Internal

Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore.

Partition-related Hive DDLs such as the following are supported:

ALTER TABLE PARTITION ... SET LOCATION 
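
As a sketch, such a statement can be issued through spark.sql(); the table logs, the partition column dt and the location below are hypothetical:

// Point one partition of a hypothetical "logs" table at a new location
spark.sql("ALTER TABLE logs PARTITION (dt='2018-07-01') "
    + "SET LOCATION 'hdfs:///data/logs/dt=2018-07-01'");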

External

For an external table, you can specify the table path via the path option:

df.write.option("path", "/some/path").saveAsTable("t")

The partition information is not gathered by default when creating external datasource tables (those with a path option). To sync the partition information in the metastore, you can invoke MSCK REPAIR TABLE.
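
For example, for the table t created above (a minimal sketch):

// Scan the table path and register any missing partitions in the metastore
spark.sql("MSCK REPAIR TABLE t");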
