Hive - External Table

About

External tables are called external because their data is stored outside of the data warehouse directory that Hive manages.

Usage

Use external tables when:

  • The data is also used outside of Hive. For example, the data files are updated by another process (one that does not lock the files).
  • The data needs to remain in the underlying location, even after the table is dropped (see the sketch after this list).
  • You need a custom location, such as a non-default storage account.
  • A program other than Hive manages the data format, location, etc.
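
To illustrate the second point, here is a minimal sketch (the table name and path are hypothetical): dropping an external table removes only the Hive metadata, while the files at its location are left untouched.

CREATE EXTERNAL TABLE demo_ext (id INT, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/example/demo_ext';

-- Deletes only the table definition from the metastore;
-- the files under /example/demo_ext remain on the file system.
DROP TABLE demo_ext;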

Structure

  • Each row must end with a carriage return/line feed, including the last one.
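
For example, a data file for the log4jLogs table defined below (contents hypothetical, space-delimited) would look like this, with a line ending after every row, including the final one:

2012-02-03 18:35:34 SampleClass6 [INFO] polling id 577725851
2012-02-03 18:35:34 SampleClass4 [FATAL] polling id 1991790109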

Syntax

CREATE EXTERNAL TABLE log4jLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE 
LOCATION '/example/data/';

where:

  • STORED AS: defines the storage format
  • LOCATION: if the location is a directory, queries run over all the files in that directory
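
Once declared, the table can be queried like any other. A minimal sketch (grouping on t4 assumes that column holds the log level, which the table definition above does not state):

SELECT t4 AS loglevel, COUNT(*) AS cnt
FROM log4jLogs
GROUP BY t4;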

Example locations:

  • Azure:
    • Blob storage: wasb://{container}@{storageaccount}.blob.core.windows.net/path/pth
    • Data Lake Store: adl://home/hive/warehouse/myDatabase.db/tableName

Management

Partition Load

There are multiple ways to load data into an external table:

  • SQL statements (INSERT OVERWRITE, …)
  • Direct file system operations using:
    • the LOAD DATA LOCAL and LOAD DATA statements
    • or a file system client

The examples below use the following external table:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';
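
To confirm that the table was registered as external, inspect its metadata; DESCRIBE FORMATTED reports "Table Type: EXTERNAL_TABLE" for such a table:

DESCRIBE FORMATTED page_view_stg;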

Insert overwrite

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';
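
The target page_view table is not defined on this page. A plausible definition, assumed here for illustration (the column names friendly_url and search_url are hypothetical; the two NULLs in the SELECT above would feed them), is:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                friendly_url STRING, search_url STRING,
                ip STRING COMMENT 'IP Address of the User')
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;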

HDFS put

hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
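
Because the file lands directly in the table's LOCATION, it is immediately visible to Hive queries; no LOAD or ALTER statement is needed. (Note that hadoop dfs is deprecated; hdfs dfs -put is the modern equivalent.) A quick sanity check:

SELECT COUNT(*) FROM page_view_stg;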

Load data

  • Load data from the local file system with one thread (the path argument can also be a directory):
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');
  • Load data from HDFS (e.g., for data previously loaded in parallel by an external tool):
LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');
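
Note that LOAD DATA moves the files into the table's directory. For a partitioned external table whose files should stay where they are, the usual alternative is to register the partition with an explicit location (directory name hypothetical):

ALTER TABLE page_view ADD IF NOT EXISTS
PARTITION (dt='2008-06-08', country='US')
LOCATION '/user/data/pv_2008-06-08_us';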
