About

dsdgen generates the data sets for the benchmark (both the initial data and the refresh data).

Prerequisites

dsdgen always needs and reads the tpcds.idx file from the current directory.

Syntax

Full:

# Windows
dsdgen.exe /help
# Linux
dsdgen -h

dsdgen Population Generator (Version 2.8.0)
Copyright Transaction Processing Performance Council (TPC) 2001 - 2018


USAGE: dsdgen [options]

Note: When defined in a parameter file (using -p), parmeters should
use the form below. Each option can also be set from the command
line, using a form of '/param [optional argument]'
Unique anchored substrings of options are also recognized, and
case is ignored, so '/sc' is equivalent to '/SCALE'

General Options
===============
ABREVIATION =  <s>       -- build table with abreviation <s>
DIR =  <s>               -- generate tables in directory <s>
HELP =  <n>              -- display this message
PARAMS =  <s>            -- read parameters from file <s>
QUIET =  [Y|N]           -- disable all output to stdout/stderr
SCALE =  <n>             -- volume of data to generate in GB
TABLE =  <s>             -- build only table <s>
UPDATE =  <n>            -- generate update data set <n>
VERBOSE =  [Y|N]         -- enable verbose output
PARALLEL =  <n>          -- build data in <n> separate chunks
CHILD =  <n>             -- generate <n>th chunk of the parallelized data
RELEASE =  [Y|N]         -- display the release information
_FILTER =  [Y|N]         -- output data to stdout
VALIDATE =  [Y|N]        -- produce rows for data validation

Advanced Options
===============
DELIMITER =  <s>         -- use <s> as output field separator
DISTRIBUTIONS =  <s>     -- read distributions from file <s>
FORCE =  [Y|N]           -- over-write data files without prompting
SUFFIX =  <s>            -- use <s> as output file suffix
TERMINATE =  [Y|N]       -- end each record with a field delimiter
VCOUNT =  <n>            -- set number of validation rows to be produced
VSUFFIX =  <s>           -- set file suffix for data validation
RNGSEED =  <n>           -- set RNG seed

where:

The default field delimiter is |

Example
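
The PARALLEL and CHILD options listed in the help output can be combined to split generation into chunks. The following is a minimal sketch, not an official recipe: it only prints one command per chunk instead of running dsdgen. The function name dsdgen_chunk_commands, the ./dsdgen path, the /tmp/tpcds-data directory and the Linux-style dash prefix on each option are assumptions made for illustration.

```shell
#!/bin/sh
# Print (do not run) one dsdgen command per chunk.
# Arguments: scale in GB, output directory, number of chunks.
# dsdgen_chunk_commands is a hypothetical helper, not part of the TPC-DS kit.
dsdgen_chunk_commands() {
    scale=$1
    out_dir=$2
    chunks=$3
    child=1
    while [ "$child" -le "$chunks" ]; do
        # -scale, -dir, -parallel and -child correspond to the SCALE, DIR,
        # PARALLEL and CHILD options (dash prefix assumed on Linux).
        echo "./dsdgen -scale $scale -dir $out_dir -parallel $chunks -child $child"
        child=$((child + 1))
    done
}

# dsdgen reads tpcds.idx from the current directory, so warn if it is missing.
if [ ! -f tpcds.idx ]; then
    echo "warning: tpcds.idx not found in $(pwd)" >&2
fi

# Scale 1 GB, 4 chunks.
dsdgen_chunk_commands 1 /tmp/tpcds-data 4
```

Once the printed commands look right, they can be executed from the directory that contains tpcds.idx, for example by piping them to sh.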

File Structure

The output of dsdgen is text.

By default, each field is terminated by '|'. (The delimiter can be changed with the DELIMITER option.)
A '|' in the first position of a row indicates that the first column of the row is empty.
Two consecutive '|' indicate that the given column value is empty. Empty column values, as generated by dsdgen, must be treated as NULL values in the data processing system, i.e. the data processing system must be able to retrieve NULL-able columns using 'is null' predicates.
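
The empty-field rules above can be checked with a short awk filter. This is a sketch under stated assumptions: the three sample rows are invented for illustration and are not real dsdgen output, and the helper names sample_rows and null_fields are hypothetical. It assumes the default layout in which every field, including the last, ends with '|'.

```shell
#!/bin/sh
# Hypothetical three-column rows in the default layout: every field,
# including the last one, is terminated by '|'.
sample_rows() {
    printf '%s\n' '1|AAAA|10.5|' '|BBBB|20.0|' '3||30.0|'
}

# Print "row_number:column_number" for every empty (NULL) field.
# The terminator after the last field gives awk one extra empty field,
# so only fields 1 to NF-1 are inspected.
null_fields() {
    awk -F'|' '{ for (i = 1; i < NF; i++) if ($i == "") print NR ":" i }'
}

sample_rows | null_fields
```

Row 2 starts with '|', so its first column is reported as empty; row 3 contains two consecutive '|', so its second column is reported. A loader would map these positions to NULL.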

The data generated by dsdgen includes some international characters.

See also https://github.com/teradata/tpcds - a Java implementation of dsdgen

Data Validation

The test database must be verified for correct data content. This must be done after the initial database load and prior to any performance tests. A validation data set is produced using dsdgen with the “-validate” and “-vcount” options. The minimum value for “-vcount” is 50, which produces 50 rows of validation data for most tables. The exceptions are the “returns” fact tables, which have only 5 rows each on average, and the dimension tables with fewer than 50 total rows.

Note: the validation options do not work in dsdgen version 2.8.0.

Hadoop

https://github.com/cloudera/impala-tpcds-kit/tree/master/tpcds-gen - Data generation is done via a MapReduce wrapper around TPC-DS dsdgen
