Hive - External Table

About

external table are external because the data is stored outside the data warehouse.

Articles Related

Usage

Use external tables when:

The data is also used outside of Hive. For example, the data files are updated by another process (that does not lock the files.)
Data needs to remain in the underlying location, even after dropping the table.
You need a custom location, such as a non-default storage account.
A program other than hive manages the data format, location, etc.

Structure

a carriage return/line feed at the end of all rows including the last one
.

Syntax

CREATE EXTERNAL TABLE log4jLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE 
LOCATION '/example/data/';

where:

STORED AS: defined the storage format
LOCATION If the location is a directory, the query will be on all files in the directory

Example location:

Azure with:
- a Blob storage: wasb://{container}@{storageaccount}.blob.core.windows.net/path/pth
- a Data Lake adl://home/hive/warehouse/myDatabase.db/tableName

Management

Partition Load

There is multiple way to load data in a external table

Sql Statement (Insert overwrite, …)
Direct File system operation using:
- Load data local and Load data statement
- or file system client

Example with the following external table,

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';

Insert overwrite

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';

Hdfs put

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

Load data

Load data from the local file system with one thread (The path argument can take a directory)

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')

Load data from Hdfs - Loaded in parallel previously with an external tool

LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')

Documentation / Reference

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables