About
external table are external because the data is stored outside the data warehouse.
Articles Related
Usage
Use external tables when:
- The data is also used outside of Hive. For example, the data files are updated by another process (that does not lock the files.)
- Data needs to remain in the underlying location, even after dropping the table.
- You need a custom location, such as a non-default storage account.
- A program other than hive manages the data format, location, etc.
Structure
- a carriage return/line feed at the end of all rows including the last one
.
Syntax
CREATE EXTERNAL TABLE log4jLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/example/data/';
where:
- STORED AS: defined the storage format
- LOCATION If the location is a directory, the query will be on all files in the directory
Example location:
- Azure with:
- a Blob storage: wasb://{container}@{storageaccount}.blob.core.windows.net/path/pth
- a Data Lake adl://home/hive/warehouse/myDatabase.db/tableName
Management
Partition Load
There is multiple way to load data in a external table
- Sql Statement (Insert overwrite, …)
- Direct File system operation using:
- Load data local and Load data statement
- or file system client
Example with the following external table,
CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';
Insert overwrite
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';
Hdfs put
hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
Load data
- Load data from the local file system with one thread (The path argument can take a directory)
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
- Load data from Hdfs - Loaded in parallel previously with an external tool
LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')