The modeling of Transformation and Load (the TL in ETL) is known as Data Maintenance (DM) or Data Refresh. A Data Maintenance Test consists of the execution of a series of refresh streams. This process tracks, possibly with some delay, the state of an operational database through data maintenance functions.
The dimension tables can be classified into one of the following types:
Type 1 (non-history keeping): the dimension holds only the current row for a business key; the surrogate key is obtained by matching on the business key alone.
Type 2 (history keeping): the dimension retains historical rows for a business key; the current row is the one whose rec_end_date is NULL.
The TPC-DS data maintenance process starts from generated flat files that are assumed to be the output of the external Extraction process.
Reading the refresh data is a timed part of the data maintenance process. The data set for a specific refresh run must be loaded and timed as part of the execution of the refresh run. The loading of data must be performed via generic processes inherent to the data processing system or by the loader utility the database software provides and supports for general data loading. It is explicitly prohibited to use a loader tool that has been specifically developed for TPC-DS.
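As a non-normative illustration, one acceptable pattern is to load a refresh flat file with the data processing system's generic loader. The sketch below assumes a PostgreSQL-based system; the database name, staging table, and file path are placeholders, and dsdgen's default '|' field delimiter is assumed:

# Load one refresh flat file with PostgreSQL's general-purpose COPY
# loader; any loader the database provides for general data loading
# would be equally acceptable.
psql -d tpcds -c "\copy s_purchase_lineitem FROM '/tmp/s_purchase_lineitem_3.csv' WITH (FORMAT text, DELIMITER '|')"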
Data maintenance functions perform insert and delete operations that are defined in pseudocode.
Each data maintenance function inserting or updating rows in dimension and fact tables is defined by the following components:
A table with the column mapping between its source view V and its data warehouse table T. The primary key of V is denoted in bold letters on the left side of the mapping table.
The view V itself. V is a logical table that does not need to be instantiated; the primary key of V is defined in the data maintenance function.
Each data maintenance function deleting rows from fact tables is defined by the fact table(s) it deletes from and by a date range [Date1, Date2] that bounds the rows to be deleted.
Comment: In the flat files generated by dsdgen for data maintenance, there are two files that relate to deletes: one supplying the [Date1, Date2] ranges used by the sales and returns delete functions, and one supplying the ranges used by the inventory delete function.
The delete functions are:
DF_SS: deletes store sales and returns rows (store_sales, store_returns)
DF_CS: deletes catalog sales and returns rows (catalog_sales, catalog_returns)
DF_WS: deletes web sales and returns rows (web_sales, web_returns)
DF_I: deletes inventory rows (inventory)
Depending on which operation they perform and on which type of table they target, the data maintenance functions are categorized as follows:
Fact Table Load
This function reads rows from a source view V and inserts rows into a target data warehouse table T:
for every row v in view V corresponding to fact table F:
    get row v into local variable lv
    for every type 1 business key column bkc in v:
        get row d from dimension table D corresponding to bkc
            where the business keys of v and d are equal
        update bkc of lv with surrogate key of d
    end for
    for every type 2 business key column bkc in v:
        get row d from dimension table D corresponding to bkc
            where the business keys of v and d are equal
            and rec_end_date is NULL
        update bkc of lv with surrogate key of d
    end for
    insert lv into F
end for
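As a non-normative illustration, the type 2 surrogate key substitution above can be expressed in SQL. The sketch assumes a PostgreSQL-based system; the view v_store_sales and its columns sold_date_sk, item_id, and quantity are hypothetical stand-ins for a mapping view V, while item and store_sales columns are the standard TPC-DS names:

psql -d tpcds <<'SQL'
-- Replace the item business key with the surrogate key of the current
-- (rec_end_date IS NULL) type 2 dimension row, then insert into the fact.
INSERT INTO store_sales (ss_sold_date_sk, ss_item_sk, ss_quantity)
SELECT v.sold_date_sk,
       i.i_item_sk,                   -- surrogate key of matching row
       v.quantity
FROM   v_store_sales v                -- hypothetical mapping view V
JOIN   item i
  ON   i.i_item_id = v.item_id        -- business keys are equal
 AND   i.i_rec_end_date IS NULL;      -- current history-keeping row
SQL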
Sales and Returns Fact Table Delete
Delete rows from the returns table R whose corresponding rows in the sales table S have d_date between Date1 and Date2.
Delete rows from the sales table S where d_date is between Date1 and Date2.
Comment: d_date is a column of the date_dim dimension; it has to be obtained by joining to date_dim on the sales date surrogate key: ss_sold_date_sk for store sales, cs_sold_date_sk for catalog sales, and ws_sold_date_sk for web sales.
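For illustration only, DF_SS might be implemented in SQL as below, again assuming a PostgreSQL-based system; the literal dates are placeholders for Date1 and Date2 read from the delete flat file:

psql -d tpcds <<'SQL'
-- Step 1: delete returns whose corresponding sales fall in the range.
DELETE FROM store_returns
WHERE  (sr_item_sk, sr_ticket_number) IN
       (SELECT ss_item_sk, ss_ticket_number
        FROM   store_sales
        JOIN   date_dim ON d_date_sk = ss_sold_date_sk
        WHERE  d_date BETWEEN DATE '2000-01-01' AND DATE '2000-01-14');
-- Step 2: delete the sales rows themselves.
DELETE FROM store_sales
WHERE  ss_sold_date_sk IN
       (SELECT d_date_sk FROM date_dim
        WHERE  d_date BETWEEN DATE '2000-01-01' AND DATE '2000-01-14');
SQL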
Inventory Fact Table Delete
Delete rows from the inventory table I where d_date is between Date1 and Date2.
Comment: d_date is a column of the date_dim dimension; it has to be obtained by joining to date_dim on inv_date_sk.
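A corresponding non-normative sketch for DF_I, under the same PostgreSQL assumption and with placeholder dates:

psql -d tpcds <<'SQL'
-- Delete inventory rows whose inv_date_sk maps to a d_date in range.
DELETE FROM inventory
WHERE  inv_date_sk IN
       (SELECT d_date_sk FROM date_dim
        WHERE  d_date BETWEEN DATE '2000-01-01' AND DATE '2000-01-14');
SQL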
The refresh data sets of each data maintenance function must be generated using dsdgen.
The output file for each table of the refresh data set can be split into n files where each file contains approximately 1/n of the total number of rows of the original output file. The order of the rows in the original output file must be preserved, such that the concatenation of all n files is identical to the original file.
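For example, with GNU coreutils split the pieces can be produced without breaking or reordering rows (the file name is a placeholder):

# Split one flat file into 4 line-based pieces whose concatenation
# reproduces the original; GNU split's -n l/4 never splits a line.
f=/tmp/s_purchase_lineitem_3.csv
split -d -n l/4 "$f" "$f.part"
cat "$f".part* | cmp - "$f" && echo "row order preserved"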
The refresh data consists of a series of refresh data sets, numbered 1, 2, 3, …, <n>, where <n> is identical to the number of streams used in the Throughput Tests of the benchmark. Each refresh data set consists of <N> flat files, one per table of the refresh data set.
Example to generate the refresh data files with dsdgen in /tmp for the 3rd update stream:
# Linux
dsdgen -scale 100 -dir /tmp -update 3
# Windows
dsdgen /scale 100 /dir C:\tmp /update 3
The output files will be of the form s_<table>_<stream>.csv.
Since dsdgen only generates 200-300GB/hour (on a 2-3GHz x86 processor), it is useful to run multiple parallel streams when generating large amounts of data. Here’s an example for the 3rd stream/child of 10 parallel streams:
dsdgen -scale 100 -dir /tmp -update 3 -parallel 10 -child 3
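All children of a stream can be generated concurrently by running each invocation in the background, for example:

# Generate all 10 children of update stream 3 in parallel.
for child in 1 2 3 4 5 6 7 8 9 10
do
    dsdgen -scale 100 -dir /tmp -update 3 -parallel 10 -child $child &
done
wait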
See tpcds_home/tests/gen_update_data.sh:
#!/bin/sh
# $id:$
# $log:$
cd temp_build
# Abort if an earlier build step left a failure marker.
if [ -f FAILED ]
then
    exit
fi
# Remove flat files from any previous run, then generate the refresh
# data set for update stream 1.
rm -rf /data/s_*.csv
./dbgen2 -f -dir /data -update 1 > gen_update_data.out 2>&1 || exit -1
# Expected per-file line counts for this scale and update stream.
cat - << _EOF_ > /tmp/goal
100 s_brand.csv
100 s_business_address.csv
100 s_call_center.csv
100 s_catalog.csv
100 s_catalog_order.csv
900 s_catalog_order_lineitem.csv
100 s_catalog_page.csv
100 s_catalog_promotional_item.csv
84 s_catalog_returns.csv
100 s_category.csv
100 s_class.csv
100 s_company.csv
100 s_customer.csv
100 s_division.csv
100 s_inventory.csv
100 s_item.csv
100 s_manager.csv
100 s_manufacturer.csv
100 s_market.csv
100 s_product.csv
100 s_promotion.csv
3 s_purchase.csv
15 s_purchase_lineitem.csv
100 s_reason.csv
100 s_store.csv
100 s_store_promotional_item.csv
2 s_store_returns.csv
100 s_subcategory.csv
100 s_subclass.csv
100 s_warehouse.csv
3 s_web_order.csv
27 s_web_order_lineitem.csv
100 s_web_page.csv
100 s_web_promotional_item.csv
3 s_web_returns.csv
100 s_web_site.csv
99400 s_zip_to_gmt.csv
103237 total
_EOF_
# Count lines from within /data so the file names in the wc output
# match the goal list, then compare against the expected counts.
(cd /data && wc -l *.csv) > /tmp/results
diff -w /tmp/goal /tmp/results || exit -1
rm /tmp/goal /tmp/results gen_update_data.out