This page describes how to capture changes of state in order to create a stream; a stream-processing engine can then consume that stream to perform incremental computation.
Capturing data changes is far from a trivial task, and this page tries to help you understand the pros and cons of each strategy.
Most data warehouse tables are so large that they cannot be fully refreshed during every load cycle. You therefore need a way to transfer only the relevant changes to the source data since the last update. Isolating the latest source data is called change data capture (CDC). The idea behind change data capture is simple: transfer only the data that has changed since the last load.
During the data warehouse's initial historic load, capturing changes to the source data is not important, since you are loading everything.
See Data Warehousing - Data Loading (Fast Refresh)
If updates to the source system tables occur outside the application itself (for example, through a manual fix or a batch script), each external update must take the incremental mechanism into account and update the incremental column accordingly. Because this application-level strategy offers only a weak consistency level, a data reconciliation strategy should be put in place to identify the source of any discrepancy.
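As a minimal reconciliation sketch (assuming a hypothetical source database and warehouse database that both contain an orders table keyed by order_id; all names are illustrative, and sqlite3 stands in for real connections), the job below compares a row count and a checksum computed over the rows in key order; any mismatch flags the table for a full comparison:

```python
import hashlib
import sqlite3

# Hypothetical connections: in practice these would point at the source
# system and the data warehouse rather than two local SQLite files, and
# the "orders" table is assumed to already exist on both sides.
src = sqlite3.connect("source.db")
dwh = sqlite3.connect("warehouse.db")

def table_fingerprint(conn, table, key_column):
    """Return (row_count, checksum) for a table, reading rows in key
    order so that both sides hash them in the same sequence."""
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    digest = hashlib.sha256()
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}"):
        digest.update(repr(row).encode())
    return count, digest.hexdigest()

if table_fingerprint(src, "orders", "order_id") != table_fingerprint(dwh, "orders", "order_id"):
    # A mismatch means some update bypassed the incremental column:
    # schedule a full comparison (or reload) of this table.
    print("orders: reconciliation mismatch, full comparison needed")
```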
Also known as the traceability column, this column identifies each change as an insertion, modification, or deletion, together with the date on which it occurred. Two common implementations are:
A timestamp (date and time stamp) that tracks the last time the data was changed (i.e. the equivalent of a file modification time, but for the row)
A single column containing an ascending unique ID, where every new row receives an ID greater than the previous one
Use case: tables that are insert-only, such as fact tables. A minimal extraction sketch based on such a column follows.
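The sketch below shows watermark-based extraction, assuming a hypothetical insert-only sales fact table with an ascending sale_id in the source, plus a small bookkeeping table in the warehouse that remembers the last ID loaded (all table and column names are illustrative):

```python
import sqlite3

# Illustrative setup: a "sales" fact table with an ascending "sale_id"
# in the source, and a one-row bookkeeping table in the warehouse that
# stores the highest ID already loaded (the watermark).
src = sqlite3.connect("source.db")
dwh = sqlite3.connect("warehouse.db")

dwh.execute("CREATE TABLE IF NOT EXISTS etl_watermark (last_id INTEGER)")
row = dwh.execute("SELECT last_id FROM etl_watermark").fetchone()
last_id = row[0] if row else 0  # 0 = nothing loaded yet (initial load)

# Pull only the rows inserted since the previous load cycle.
new_rows = src.execute(
    "SELECT sale_id, amount FROM sales WHERE sale_id > ? ORDER BY sale_id",
    (last_id,),
).fetchall()

if new_rows:
    dwh.executemany("INSERT INTO sales (sale_id, amount) VALUES (?, ?)", new_rows)
    # Advance the watermark to the highest ID just loaded.
    dwh.execute("DELETE FROM etl_watermark")
    dwh.execute("INSERT INTO etl_watermark VALUES (?)", (new_rows[-1][0],))
dwh.commit()
```

Because sale_id only ever grows, the watermark is all the state the ETL needs; updates and deletes are invisible to this scheme, which is why it suits insert-only tables.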
The conventional approach to data integration involves extracting all data from the source system and then integrating the entire set. This phase can still apply an incremental strategy by detecting the differences between the target and the source through a data set comparison based on a hash function: see OWB - How to implement a type 2 slowly changing dimension with a hash function?
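The following is a minimal sketch of such a hash-based comparison (not the OWB implementation itself): the non-key attributes of each row are hashed, and rows whose key exists on both sides but whose hashes differ are treated as changed. Names such as customer_id are illustrative:

```python
import hashlib

def row_hash(row: dict, key: str) -> str:
    """Hash the non-key attributes of a row; rows with identical
    attribute values produce identical digests."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row) if k != key)
    return hashlib.sha256(payload.encode()).hexdigest()

def detect_changes(source_rows, target_rows, key="customer_id"):
    """Compare full source and target extracts by key + attribute hash."""
    target_hashes = {r[key]: row_hash(r, key) for r in target_rows}
    source_keys = {r[key] for r in source_rows}
    inserts = [r for r in source_rows if r[key] not in target_hashes]
    # Same key but a different hash: the attributes changed, so for a
    # type 2 dimension the current version is expired and a new one inserted.
    updates = [r for r in source_rows
               if r[key] in target_hashes and target_hashes[r[key]] != row_hash(r, key)]
    deletes = [k for k in target_hashes if k not in source_keys]
    return inserts, updates, deletes

# Example: one unchanged row, one changed row, one new row.
src = [{"customer_id": 1, "city": "Paris"},
       {"customer_id": 2, "city": "Lyon"},
       {"customer_id": 3, "city": "Nice"}]
tgt = [{"customer_id": 1, "city": "Paris"},
       {"customer_id": 2, "city": "Marseille"}]
print(detect_changes(src, tgt))  # inserts: id 3, updates: id 2, deletes: []
```

Sorting the attribute names before hashing makes the digest independent of the order in which the columns are read.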
Trigger-based approaches also support incremental iteration, but with only weak consistency guarantees.
Conceptually, the code that the trigger fires can write a record of the transaction to a database table, and the ETL tool can then poll those tables on a periodic basis (see the sketch after the pros and cons below).
Pro: changes are captured at the database level, so every insert, update, and deletion is recorded regardless of which application or user performs it.
Cons: triggers add processing overhead to every transaction on the source system, must be maintained whenever the schema changes, and, as noted above, provide only weak consistency guarantees.
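As a minimal illustration of the trigger-based approach, the sketch below uses SQLite (via Python's standard sqlite3 module) purely for demonstration: three triggers write a record of each transaction to a change-log table, and the periodic poll reads and then purges that table. All object names are illustrative:

```python
import sqlite3

# Single-file demo: a real deployment would create the triggers in the
# source database and poll the change-log table from the ETL tool.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL);

-- Change-log table that the ETL tool polls.
CREATE TABLE orders_changelog (
    change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    order_id   INTEGER,
    operation  TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- One trigger per operation; each writes a record of the transaction.
CREATE TRIGGER orders_ins AFTER INSERT ON orders
BEGIN INSERT INTO orders_changelog (order_id, operation) VALUES (NEW.order_id, 'I'); END;

CREATE TRIGGER orders_upd AFTER UPDATE ON orders
BEGIN INSERT INTO orders_changelog (order_id, operation) VALUES (NEW.order_id, 'U'); END;

CREATE TRIGGER orders_del AFTER DELETE ON orders
BEGIN INSERT INTO orders_changelog (order_id, operation) VALUES (OLD.order_id, 'D'); END;
""")

# Simulate activity on the source table.
conn.execute("INSERT INTO orders VALUES (1, 10.0)")
conn.execute("UPDATE orders SET amount = 12.5 WHERE order_id = 1")
conn.execute("DELETE FROM orders WHERE order_id = 1")

# The periodic poll: read the captured changes, then purge them.
for change in conn.execute("SELECT * FROM orders_changelog ORDER BY change_id"):
    print(change)  # (1, 1, 'I', ...), (2, 1, 'U', ...), (3, 1, 'D', ...)
conn.execute("DELETE FROM orders_changelog")
```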