Data Warehouse

What’s a data warehouse?

A data warehouse is a large central data repository of current, history and summarised data coming from operational and external sources used primarily for analysis.

Data Warehouses is large historical databases for decision-support that are loaded with new data on a periodic basis.

“Single Version of the truth”

Data warehousing is a component of BI.

This category covers general subject over applications such as Business Intelligence and Decision Support Systems.

The structure of a data warehouse is very different from that of an operational system:

  • Data is not only current and atomic, but also historical and summarised
  • Data is organised by subject area such as customer or product, not by business transaction
  • Each transaction accesses many data records (batch/bulk process), not one record at a time
  • Data is static until refreshed, not dynamic

A data warehouse supports on-line analytical processing (OLAP), as against the on-line transaction processing (OLTP) supported by the operational systems.

In the past, DataWarehouse had to do three things:

  • load data
  • manage the data for the long term
  • and provide access to the data typically to a limited number and senior business user

Data Warehouses were invented first to improve the performance of large queries.

A DataWarehouse must be commutative throughout all this layers. A data set in the source must be retrieved in all layers (Staging Area, DataWarehouse, Data Mart, …)

Why a DataWarehouse ?

  • performance: the data is not in a suitable form for reporting (data model and physical implementation),
  • performance: the BI System processing degraded business transaction performance (OLTP),
  • data quality: the data has quality issues,
  • data integration: data is dispersed across many different systems,
  • data historical: there is a general lack of historical information.

De-coupling the data warehouse from the source system makes the entire architecture more flexible in the long-term.

Table in a datawarehouse

  • Reference and lookup tables store master, reference, and dimensional data,
  • while base, derived, and aggregate tables store transaction and fact data at different granularities. Base tables store the transaction data at the lowest level of granularity, while derived and aggregate tables store consolidated and summary transaction data.

A derived table is based on the content of the base, reference, and lookup tables and is the stored result of a query. The aim is to have a set of data in a day level generally in order to analyze a business process. An aggregate table is an aggregation which follow the hierarchy level based on the derived table in order to speed up the performance.

A lookup table is at a high level of a hierarchy. They are tables such as type, category, status, reason, role, group whereas the reference table is a the lowest level of the hierarchy.

Populate the tables in the following order:

  • Lookup tables
  • Reference tables
  • Base tables

Analyze the tables in one category before loading the tables in the next category (for example, analyze the reference tables before loading the lookup tables). Additionally, you can analyze all of the tables loaded by the source-ETL process before executing the intra-ETL processes).

The size of a warehouse

To plan your data warehouse accurately, you should do some experiments by loading some sample data and measure the actual storage size.

Development

System

Slowly Changing

ELT, ETL (Refresh, Load)

Scan (Access)

Storage

Glossary

DataWarehouse Specialized Database

Documentation, Tutorial

Task Runner