About
Table implementation in Hive.
Hive also supports custom serializer/deserializers (SerDe) for complex or irregularly structured data.
Articles Related
Property
Qualified Name
The fully qualified name in Hive for a table is:
db_name.table_name
where:
- db_name is the database name
Default Storage
By default, tables are assumed to be of:
- and the delimiters are assumed to be ^A(ctrl-a).
Type
The type is based on the location of data.
- External: The Data is stored outside the Hive data warehouse.
- Internal: The Data is stored in the Hive data warehouse.
External
Use external tables when:
- The data is also used outside of Hive. For example, the data files are updated by another process (that does not lock the files.)
- Data needs to remain in the underlying location, even after dropping the table.
- You need a custom location, such as a non-default storage account.
- A program other than hive manages the data format, location, etc.
Internal
Use internal tables when:
- Data is temporary.
- You want Hive to manage the lifecycle of the table and data.
Management
Create
With transform
- Regex
CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
ROW FORMAT
row format specifies how the rows are stored in the hive table.
You can use:
STORED AS
Stored as design the file format
Show
List
show tables;
-- The pattern follows Java regular expression syntax (so the period is a wildcard).
SHOW TABLES 'Pattern';
-- Example
SHOW TABLES 'page.*';
Ddl
SHOW CREATE TABLE tableName;
Describe
DESCRIBE FORMATTED tableName
-- To list columns and column types of table.
DESCRIBE EXTENDED page_view;
# col_name data_type comment
clientid string
querytime string
market string
deviceplatform string
devicemake string
devicemodel string
state string
country string
querydwelltime double
sessionid bigint
sessionpagevieworder bigint
# Detailed Table Information
Database: default
Owner: root
CreateTime: Thu Jan 25 13:59:12 UTC 2018
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: wasb://hi-clus-hadoop-01-2018-01-25t11-58-44-894z@hiinformaticasawe.blob.core.windows.net/hive/warehouse/hivesampletable
Table Type: MANAGED_TABLE
Table Parameters:
numFiles 1
numRows 0
rawDataSize 0
totalSize 4955715
transient_lastDdlTime 1516888756
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.449 seconds, Fetched: 41 row(s)
where:
Drop
DROP TABLE IF EXISTS tableName;
Partition
Each Table can have one or more partition.
Load
with beeline
LOAD DATA INPATH '/user/gerardn/myTable.csv' into TABLE myTable;
Table default.myTable stats: [numFiles=1, numRows=0, totalSize=266712, rawDataSize=0]
OK
Time taken: 3.502 seconds
where:
CREATE EXTERNAL TABLE IF NOT EXISTS myTable
(
col1 string,
col2 int,
col3 double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
TBLPROPERTIES("skip.header.line.count"="1");
Storage
All the data of a table is stored in a directory in HDFS
- location: the base directory location is specified within the create statement.
A partitioned table will create sub-directory by partition columns. See Partitioned column
Statistics
Table and partition statistics are stored in the Hive Metastore
- Number of rows
- Number of files
- Size in Bytes
- Number of partition
Constraint
- PRIMARY KEY
- FOREIGN KEY
- UNIQUE
- NOT NULL
- DEFAULT (to come)