About
Data Partitions (Clustering of data) in Hive
Each Table can have one or more partition.
Data in each partition may be furthermore divided into Buckets.
Articles Related
Column
Directory Hierarchy
The partition columns determine how the data is stored. A separate data directory is created for each distinct value combination in the partition columns. The partition column order determines the directory hierarchy.
Example:
- For a table T with a date partition column, the data will be stored for a particular date in the following directory
<table directory location>/ds=<date>
- partitioned by (ds string, dept int) implies a directory structure of ds=2009-02-26/dept=2.
Data Type and Value
The partition column is a virtual column. Its value is not stored with the data; it is derived from the directory where the data is saved.
Their data type and value has to be able to be converted to:
- a string
- and thereafter to a valid directory name.
Then no special characters that is reserved in FS path names (such as '/' or '..').
Type
Dynamic
Dynamic Partition (DP) columns: columns whose values are only known at EXECUTION TIME.
Dynamic partitioning means that you want Hive to create partitions automatically for you. Since you have already created the partitioning table from the staging table, all you need to do is insert data to the partitioned table
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE lineitem_part
PARTITION (L_SHIPDATE)
SELECT L_ORDERKEY as L_ORDERKEY, L_PARTKEY as L_PARTKEY ,
L_SUPPKEY as L_SUPPKEY, L_LINENUMBER as L_LINENUMBER,
L_QUANTITY as L_QUANTITY, L_EXTENDEDPRICE as L_EXTENDEDPRICE,
L_DISCOUNT as L_DISCOUNT, L_TAX as L_TAX, L_RETURNFLAG as L_RETURNFLAG,
L_LINESTATUS as L_LINESTATUS, L_SHIPDATE as L_SHIPDATE_PS,
L_COMMITDATE as L_COMMITDATE, L_RECEIPTDATE as L_RECEIPTDATE,
L_SHIPINSTRUCT as L_SHIPINSTRUCT, L_SHIPMODE as L_SHIPMODE,
L_COMMENT as L_COMMENT, L_SHIPDATE as L_SHIPDATE FROM lineitem;
Static
Static partitioning means that you have already sharded data in the appropriate directories. With static partitions, you add Hive partitions manually based on the directory location.
- Add the partition
ALTER TABLE lineitem_part ADD PARTITION (L_SHIPDATE = ‘5/23/1996 12:00:00 AM’))
LOCATION ‘/data/partitions/5_23_1996/'
- Insert data
INSERT OVERWRITE TABLE lineitem_part
PARTITION (L_SHIPDATE = ‘5/23/1996 12:00:00 AM’)
SELECT * FROM lineitem
WHERE lineitem.L_SHIPDATE = ‘5/23/1996 12:00:00 AM’
Partitioned Views
Syntax
Partition
Partitioned tables can be created using the PARTITIONED BY clause.
CREATE TABLE lineitem_part
(L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT,L_LINENUMBER INT,
L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE,
L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING,
L_SHIPDATE_PS STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING,
L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING)
PARTITIONED BY(L_SHIPDATE STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Bucket
Tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns.
Management
Describe
SHOW PARTITIONS page_view;
-- To list columns and all other properties of table. This prints lot of information and that too not in a pretty format. Usually used for debugging.
DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');
Rebuild
- rebuild the partitions all at once. MSCK REPAIR TABLE
Drop
ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08')
Note that any data for this table or partitions will be dropped and may not be recoverable. *
Load
External table
Exchange
https://cwiki.apache.org/confluence/display/Hive/Exchange+Partition
Insert overwrite
Since the partitioned column is derived from the directory, you can't load directly partitioned column from a file. You have to do it in two steps:
- Create a staging table
CREATE EXTERNAL TABLE my_table_staging (id INT, fname STRING, dt STRING, lname STRING)
LOCATION '/usr/hive/warehouse/my_table_staging';
- Load it with dynamic partition.
INSERT OVERWRITE TABLE my_table PARTITION (dt)
SELECT id, fname, lname, dt FROM my_table_staging;
Pruning
Data Partition - Partition Pruning (Elimination)
hive.partition.pruning A strict value for this variable indicates that an error is thrown by the compiler in case no partition predicate is provided on a partitioned table. This is used to protect against a user inadvertently issuing a query against all the partitions of the table. Default nonstrict
Statistic
List:
- Number of rows
- Number of files
- Size in Bytes
See Stats