Changed Data Capture

Introduction

Changed Data Capture (CDC) allows Oracle Data Integrator to track changes in source data caused by other applications. When running integration interfaces, Oracle Data Integrator can avoid processing unchanged data in the flow.

Reducing the source data flow to only changed data is useful in many contexts, such as data synchronization and replication. It is essential when setting up an event-oriented architecture for integration. In such an architecture, applications make changes in the data ("Customer Deletion", "New Purchase Order") during a business process. These changes are captured by Oracle Data Integrator and transformed into events that are propagated throughout the information system.

Changed Data Capture is performed by journalizing models. Journalizing a model consists of setting up the infrastructure to capture the changes (inserts, updates and deletes) made to the records of this model's datastores.

Oracle Data Integrator supports two journalizing modes:

The Journalizing Components

The journalizing components are:

These components are implemented in the journalizing infrastructure.

Simple vs. Consistent Set Journalizing

Simple Journalizing enables you to journalize one or more datastores. Each journalized datastore is treated separately when capturing the changes.

This approach has a limitation, illustrated in the following example: Say you need to process changes in the ORDER and ORDER_LINE datastores (with a referential integrity constraint based on the fact that an ORDER_LINE record should have an associated ORDER record). If you have captured insertions into ORDER_LINE, you have no guarantee that the associated new records in ORDERS have also been captured. Processing ORDER_LINE records with no associated ORDER records may cause referential constraint violations in the integration process.

Consistent Set Journalizing provides the guarantee that when you have an ORDER_LINE change captured, the associated ORDER change has been also captured, and vice versa. Note that consistent set journalizing guarantees the consistency of the captured changes. The set of available changes for which consistency is guaranteed is called the Consistency Window. Changes in this window should be processed in the correct sequence (ORDER followed by ORDER_LINE) by designing and sequencing integration interfaces into packages.

Although consistent set journalizing is more powerful, it is also more difficult to set up. It should be used when referential integrity constraints need to be ensured when capturing the data changes. For performance reasons, consistent set journalizing is also recommended when a large number of subscribers are required.

Note: It is not possible to journalize a model (or datastores within a model) using both consistent set and simple journalizing.

Setting up Journalizing

This is the basic process for setting up CDC on an Oracle Data Integrator data model. Each of these steps is described in more detail below.

  1. Set the CDC parameters

  2. Add the datastores to the CDC

  3. For consistent set journalizing, arrange the datastores in order

  4. Add subscribers

  5. Start the journals

To set the data model CDC parameters:

This includes  selecting or changing the journalizing mode and journalizing knowledge module used for the model. If the model is already being journalized, it is recommended that you stop journalizing with the existing configuration before modifying the data model journalizing parameters.

  1. Edit the data model you want to journalize, and then select the Journalizing tab.

  2. Select the journalizing mode you want to set up: Consistent Set or Simple.

  3. Select the Journalizing KM you want to use for this model. Only knowledge modules suitable for the data model's technology and journalizing mode, and that have been previously imported into at least one of your projects will appear in the list.

  4. Set the Options for this KM. Refer to the knowledge module's description for more information on the options.

  5. Click OK to save the changes.

To add or remove datastores to or from the CDC:

You should now flag the datastores that you want to journalize. A change in the datastore flag is taken into account the next time the journals are (re)started. When flagging a model or a sub-model, all of the datastores contained in the model or sub-model are flagged.

  1. Select the datastore, the model or the sub-model you want to add/remove to/from CDC.

  2. Right-click then select Changed Data Capture > Add to CDC to add the datastore, model or sub-model to CDC, or select Changed Data Capture > Remove from CDC to remove it.

  3. Refresh the tree view. The datastores added to CDC should now have a marker icon. The journal icon represents a small clock. It should be yellow, indicating that the journal infrastructure is not yet in place.

It is possible to add datastores to the CDC after the journal creation phase. In this case, the journals should be re-started.

Note: If a datastore with journals running is removed from the CDC in simple mode, the journals should be stopped for this individual datastore. If a datastore is removed from CDC in Consistent Set mode, the journals should be restarted for the model (Journalizing information is preserved for the other datastores).

To arrange the datastores in order (consistent set journalizing only):

You only need to arrange the datastores in order when using consistent set journalizing. You should arrange the datastores in the consistent set into an order which preserves referential integrity when using their changed data. For example, if an ORDER table has references imported from an ORDER_LINE datastore (i.e. ORDER_LINE has a foreign key constraint that references ORDER), and both are added to the CDC, the ORDER datastore should come before ORDER_LINE. If the PRODUCT datastore has references imported from both ORDER and ORDER_LINE (i.e. both ORDER and ORDER_LINE have foreign key constraints to the ORDER table), its order should be lower still.

  1. Edit the data model you want to journalize, then select the Journalized Tables tab.

  2. If the datastores are not currently in any particular order, click the Reorganize button. This feature suggests an order for the journalized datastores based on the data models' foreign keys. Note that this automatic reorganization is not error-free, so you should review the suggested order afterwards.

  3. Select a datastore from the list, then use the Up and Down buttons to move it within the list. You can also directly edit the Order value for this datastore.

  4. Repeat step 3 until the datastores are ordered correctly, then click OK to save the changes.

Changes to the order of datastores are taken into account the next time the journals are (re)started.

Note: If existing scenarios consume changes from this CDC set, you should regenerate them to take into account the new organization of the CDC set.

Note: From this tab, you can remove datastores from CDC using the Remove from CDC button

To add or remove subscribers:

This adds or removes a list of entities that will use the captured changes.

  1. Select the journalized data model if using Consistent Set Journalizing or select a data model or individual datastore if using Simple Journalizing.

  2. Right-click, then select Changed Data Capture > Subscriber > Subscribe. A window appears which lets you select your subscribers.

  3. Type a subscriber name into the field, then click the Add Subscriber button.

  4. Repeat the operation for each subscriber you want to add. Then click OK.

A session to add the subscribers to the CDC is launched. You can track this session from the Operator.

To remove a subscriber is very similar. Select the Changed Data Capture > Subscriber > Unsubscribe option instead.

You can also add subscribers after starting the journals. Subscribers added after journal startup will only retrieve changes captured since they were added to the subscribers list.

To start/stop the journals:

Starting the journals creates the CDC infrastructure if it does not exist yet. It also validates the addition, removal and order changes for journalized datastores.

Note: Stopping the journals deletes the entire the journalizing infrastructure and all captured changes are lost. Restarting a journal does not remove or alter any changed data that has already been captured.

  1. Select the data model or datastore you want to journalize.

  2. Right-click, then select Changed Data Capture > Start Journal if you want to start the journals, or Changed Data Capture > Drop Journal if you want to stop them.

A session begins to start or drops the journals. You can track this session from the Operator.

The journalizing infrastructure is implemented by the journalizing KM at the physical level. Consequently, Add Subscribers and Start Journals operations should be performed in each context where journalizing is required for the data model. It is possible to automate these operations using Oracle Data Integrator packages. Automating these operations is recommended to deploy a journalized infrastructure across different contexts.

Typical situation: the developer manually configures CDC in the Development context. After this is working well, CDC is automatically deployed in the Test context by using a package. Eventually the same package is used to deploy CDC in the Production context.

To automate journalizing setup:

  1. Create a new package in Designer

  2. Drag and drop the model or datastore you want to journalize. A new step appears.

  3. Double-Click the step icon in the diagram. The properties panel opens.

  4. In the Type list, select Journalizing Model/Datastore.

  5. Check the Start box to start the journals.

  6. Check the Add Subscribers box, then enter the list of subscribers into the Subscribers group.

  7. Click OK to save.

  8. Generate a scenario for this package.

When this scenario is executed in a context, it starts the journals according to the model configuration and creates the specified subscribers using this context.

It is possible to split subscriber and journal management into different steps and packages. Deleting subscribers and stopping journals can be automated in the same manner. See the Packages section for more information.

Journalizing Infrastructure Details

When the journals are started, the journalizing infrastructure (if not installed yet) is deployed or updated in the following locations:

Note: All components (except the triggers) of the journalizing infrastructure (like all Data Integrator temporary objects, such as integration, error and loading tables) are installed in the Work Schema for the Oracle Data Integrator physical schemas of the data server. These work schemas should be kept separate from the schema containing the application data (Data Schema).

Important Note: The journalizing triggers are the only components for journalizing that must be installed, when needed, in the same schema as the journalized data. Before creating triggers on tables belonging to a third-party software package, please check that this operation is not a violation of the software agreement or maintenance contract. Also ensure that installing and running triggers is technically feasible without interfering with the general behavior of the software package.

Journalizing Status

Datastores in models or interfaces have an icon marker indicating their journalizing status in Designer's current context:

Using Changed Data

Once journalizing is started and changes are tracked for subscribers, it is possible to view the changes captured.

To view the changed data:

  1. Select the journalized datastore

  2. Right-click, then select Changed Data Capture > Journal Data.

A window containing the changed data appears.

Note: This window selects data using the journalizing view.

The changed data displays three extra columns for the changes details:

Journalized data is mostly used within integration processes. Changed data can be used as the source of integration interfaces. The way it is used depends on the journalizing mode.

Using Changed Data: Simple Journalizing

Using changed data from simple journalizing consists of designing interfaces using journalized datastores as sources.

Designing Interfaces

Journalizing Filter

When a journalized datastore is inserted into an interface diagram, a Journalized Data Only check box appears in this datastore's property panel.

When this box is checked:

Note: In simple journalizing mode all the changes taken into account by the interface (after the journalizing filter is applied) are automatically considered consumed at the end of the interface and removed from the journal. They cannot be used by a subsequent interface.

Knowledge Module Options

When processing journalized data, the SYNC_JRN_DELETE option of the integration knowledge module should be set carefully. It invokes the deletion from the target datastore of the records marked as deleted (D) in the journals and that are not excluded by the journalizing filter. If this option is set to No, integration will only process inserts and updates.

Using Changed Data: Consistent Set Journalizing

Using Changed data in Consistent journalizing is similar to simple journalizing regarding interface design.  It requires extra steps before and after processing the changed data in the interfaces, in order to enforce changes consistently within the set.

Operations Before Using the Changed Data

The following operations should be undertaken before using the changed data when using consistent set journalizing:

Designing Interfaces

The changed data in consistent set journalizing are also processed using interfaces sequenced into packages.

Designing interfaces when using consistent set journalizing is similar to simple journalizing, except for the following differences:

Operations after Using the Changed Data

After using the changed data, the following operations should be performed:

To create an Extend Window, Lock/Unlock Subscriber or Purge Journal step in a package:

  1. Open the package where the operations will be performed.

  2. Drag and drop the model for which you want to perform the operation.

  3. In the Type list, select Journalizing Model.

  4. Check the option boxes corresponding to the operations you want to perform.

  5. Enter the list of subscribers into the Subscribers group if performing lock/unlock subscribers operations.

  6. Click OK.

Note: It is possible to perform an Extend Window or Purge Journal on a datastore. This operation is provided to process changes for tables that are in the same consistency set at different frequencies. This option should be used carefully, as consistency for the changes may be no longer maintained at the consistency set level

Journalizing Tools

Oracle Data Integrator provides a set of tools that can be used in journalizing to refresh information on the captured changes or trigger other processes:

See the Oracle Data Integrator Tools Reference for more information on these functions.

Package Templates for Using Journalizing

A number of templates may be used when designing packages to use journalized data. Below are some typical templates.

Template 1: One Simple Package (Consistent Set)

This package is scheduled to process all changes every minutes. This template is relevant if changes are made regularly in the journalized tables.

Template 2: One Simple Package (Simple Journalizing)

This package is scheduled to process all changes every minutes. This template is relevant if changes are made regularly in the journalized tables.

Template 3: Using SnpsWaitForLogData (Consistent Set or Simple)

This package is scheduled regularly. Changed data will only be processed if new changes have been detected. This avoids useless processing if changes occur sporadically to the journalized tables (i.e. to avoid running interfaces that would process no data).

Template 4: Separate Processes (Consistent Set)

This template dissociates the consistency window,  the purge, and the changes consumption (for two different subscribers) in different packages.

Package 1: Extend Window

This package is scheduled every minute. Extend Window may be resource consuming. It is better to have this operation triggered only when new data appears.

Package 2: Purge Journal (at the end of week)

This package is scheduled once every Friday. We will keep track of the journals for the entire week.

Package 3: Process the Changes for Subscriber A

This package is scheduled every minute. Such a package is used for instance to generate events in a MOM.

Package 4: Process the Changes for Subscriber B

This package is scheduled every day. Such a package is used for instance to load a data warehouse during the night with the changed data.