Hive - Open Csv Serde

Card Puncher Data Processing

About

The Csv Serde is a serde that is applied above a text file. It's one way of reading a CSV / TSV format.

Architecture

  • The CSVSerde is available in Hive 0.14 and greater.
  • It was added to the Hive distribution in HIVE-7777.
  • The CSVSerde has been built and tested against Hive 0.14 and later, and uses Open-CSV 2.3 which is bundled with the Hive distribution.

Limitations

This SerDe:

  • does not handle embedded newlines.
  • treats all columns to be of type String. Even if you create a table with non-string column types using this SerDe, the DESCRIBE TABLE output would show string column type. To convert columns to the desired type in a table, you can create a view over the table that does the CAST to the desired type.

Syntax

The following example creates a TSV (Tab-separated) file.

CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = "\t",
   "quoteChar"     = "'",
   "escapeChar"    = "\\"
)  
STORED AS TEXTFILE;

where:

  • the fully qualified class name org.apache.hadoop.hive.serde2.OpenCSVSerde must be specified.
  • the default properties value are
DEFAULT_ESCAPE_CHARACTER \
DEFAULT_QUOTE_CHARACTER  "
DEFAULT_SEPARATOR        ,

Example

With the tpcds customer table

Step 1 - Create the staging external table

CREATE EXTERNAL  TABLE `customer_dat`(
  `c_customer_sk` int, 
  `c_customer_id` char(16), 
  `c_current_cdemo_sk` int, 
  `c_current_hdemo_sk` int, 
  `c_current_addr_sk` int, 
  `c_first_shipto_date_sk` int, 
  `c_first_sales_date_sk` int, 
  `c_salutation` char(10), 
  `c_first_name` char(20), 
  `c_last_name` char(30), 
  `c_preferred_cust_flag` char(1), 
  `c_birth_day` int, 
  `c_birth_month` int, 
  `c_birth_year` int, 
  `c_birth_country` varchar(20), 
  `c_login` char(13), 
  `c_email_address` char(50), 
  `c_last_review_date_sk` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = "|"
)
LOCATION 'hdfs://locationToADirectory';

where:

  • location is a directory where the tpcds customer data can be found

Et voila:

Tpcds Customer Data

Step 2 - Load the data into the target table with data type

FROM `customer_stg`
INSERT OVERWRITE TABLE `customer` 
SELECT *;

Documentation / Reference





Discover More
Card Puncher Data Processing
Hive - CSV

CSV / TSV format in Hive. You can create a external table with: the Open Csv Serde or with the default TEXTFILE. See Example with the customer table of the TPCDS schema STORED AS TEXTFILE...
Card Puncher Data Processing
Hive - SerDe

Serde in Hive permits to define how a storage format should be processed to produce records. orc Compressed...



Share this page:
Follow us:
Task Runner