About
This page is about how to capture the changes of state in order to create a stream. The stream processing will then perform an incremental computation
Capturing data changes is far from a trivial task, and this page tries to help you understand the pro and cons of every strategy.
Articles Related
Usage
DataWarehouse
Most data warehouse tables are so large that they cannot be refreshed during every load cycle. You must have a capability to transfer only the relevant changes to the source data since the last update. Isolating the latest source data is called change data capture (CDC). The idea behind change data capture is simple enough: just transfer the data that has been changed since the last load.
During the data warehouse's initial historic load, capturing source data content changes is not important since you are loading all data.
Replication
Incremental Processing
Characteristic
Model
- data-based (state) - incremental column
- event-based (changes) - log miner, trigger, …
Real-time
- Real time
- Schedule
Source
- Log
- App
If updates to source system tables occur outside the application itself, each external update should take into account the incremental functionalities (and update the incremental column accordingly. Because the application strategy has a weak consistency level, a data reconciliation strategy should be put in place in order to identify any difference culprit.
CDC mechanisms
Log Miner
Log miner app
Materialized View
Incremental column
Also known as the tracability column, in order to identify the change as modifications, insertion and deletion by date
Timestamp
A timestamp (Date and time stamps) that track the last time data was changed (ie the equivalent of the file modification time but for the row)
- Might require modification in the program
- The programs may fail to update this field (ie eventually consistency level)
- Deletions are not included since the entire record including the time stamps are gone.
Id
A single column containing a ascendant unique ID where every new row has an new id bigger than the previous one
Use case: Table that are insert-only such as a fact table
Data Set Diff
The conventional approach to data integration involves extracting all data from the source system and then integrating the entire set. This phase uses possibly also an incremental strategy by detecting the changes between the target and the source via a data set comparison based on a hash function: OWB - How to implement a type 2 slowly changing dimension with a hash function ?
- Not real time (impossible in the given timeframe)
Database Triggers
A trigger-based approaches support incremental iteration with only weak consistency guarantees
In concept, the code that the trigger launches can be written to write a record of the transaction to a database table and the ETL tool can then poll those tables on a periodic basis.
Pro:
- Easy to setup: The trigger-based data capture system is easy to understand and widely supported by database systems.
- No admin right needed: Table synchronization can be setup by users and application developers without requiring a database administrator to modify the server.
Cons:
- The capture may lost change.. For example triggers can regularly be deleted/disabled then re-added/re-enabled in the normal course of a business operation.
- The main database server process is used
CDC vs Full reload
- Long running time
- More storage (Flip Mechanism needed)
- Problem in the workflow let data inconsistent - The mechanism is not resilient.