Relation - ORC format (Optimized RC)


About

ORC (Optimized Row Columnar) files, the successor to RC (Record Columnar) files, were created to improve performance in Hive and are primarily backed by Hortonworks.

Data in ORC files is fast to load because data stripes can be read in parallel. The rows in each data stripe are loaded sequentially. To optimize load time, use a data stripe size of approximately 256 MB or less.
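As a rough illustration of the stripe-size advice above, the stripe size can be set per table through ORC table properties. This is a minimal sketch, not a tuned configuration: the table name lineitem_orc_tuned and its shortened column list are hypothetical, and ZLIB is only an assumed compression choice. The value of orc.stripe.size is expressed in bytes.

```sql
-- Sketch only: lineitem_orc_tuned is a hypothetical table name.
CREATE TABLE lineitem_orc_tuned
    (L_ORDERKEY INT, L_PARTKEY INT, L_QUANTITY DOUBLE)
STORED AS ORC
TBLPROPERTIES (
    'orc.stripe.size' = '268435456',  -- 256 MB in bytes (256 * 1024 * 1024)
    'orc.compress'    = 'ZLIB'        -- compression codec (assumed choice)
);
```

Smaller stripes generally give more units to read in parallel, while larger stripes mean longer sequential reads, so the right value depends on the workload.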

Example Hive

CREATE TABLE lineitem_orc_part
    (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT,
     L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE,
     L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING,
     L_SHIPDATE_PS STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING,
     L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING)
PARTITIONED BY (L_SHIPDATE STRING)
STORED AS ORC;
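
The table above could then be populated with a dynamic-partition insert. The following is a sketch under stated assumptions: the source table lineitem is hypothetical and is assumed to hold the same columns as text, with a single L_SHIPDATE column that is copied into both L_SHIPDATE_PS and the partition column.

```sql
-- Allow Hive to create partitions from the values found in the data.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Assumption: lineitem is a hypothetical source table with these columns.
INSERT OVERWRITE TABLE lineitem_orc_part PARTITION (L_SHIPDATE)
SELECT L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER,
       L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT,
       L_TAX, L_RETURNFLAG, L_LINESTATUS,
       L_SHIPDATE AS L_SHIPDATE_PS, L_COMMITDATE, L_RECEIPTDATE,
       L_SHIPINSTRUCT, L_SHIPMODE, L_COMMENT,
       L_SHIPDATE   -- the dynamic partition column must come last
FROM lineitem;
```

Each distinct L_SHIPDATE value becomes its own partition directory containing ORC files.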
