Data Quality - Data Correction


About

Data correction is the second step in a data cleansing process, after the detection of values that do not meet the business rules (data rules).

For data values that are not accepted by a data rule, you must choose one of the following actions:

  • Ignore: The data rule is ignored and, therefore, no values are rejected based on this data rule.
  • Report: The data rule is run after the data has been loaded, for reporting purposes only. It is like the Ignore option, except that a report is created that contains the values that did not adhere to the data rule.
  • Cleanse: The values rejected by this data rule are moved to an error table where cleansing strategies are applied. When you choose this option, you must specify a cleansing strategy or correction rule. See the following section for details about cleansing strategies.
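The three actions can be illustrated with a minimal Python sketch. The names `RuleResult` and `apply_rule` are hypothetical and do not come from any particular data quality tool; the in-memory lists stand in for the target table, the report, and the error table.

```python
from dataclasses import dataclass, field
from typing import Callable, List

ACTIONS = ("ignore", "report", "cleanse")


@dataclass
class RuleResult:
    loaded: List[int] = field(default_factory=list)    # values that reach the target
    reported: List[int] = field(default_factory=list)  # violations listed in a report
    errors: List[int] = field(default_factory=list)    # violations moved to the error table


def apply_rule(values: List[int], rule: Callable[[int], bool], action: str) -> RuleResult:
    """Route each value according to the action chosen for the data rule."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    result = RuleResult()
    for value in values:
        if action == "ignore" or rule(value):
            # Ignore: the rule rejects nothing. Otherwise the value passed the rule.
            result.loaded.append(value)
        elif action == "report":
            # Report: the value is still loaded, but the violation is recorded.
            result.loaded.append(value)
            result.reported.append(value)
        else:
            # Cleanse: the value goes to the error table for a cleansing strategy.
            result.errors.append(value)
    return result


def in_range(value: int) -> bool:
    """Example data rule (an assumption for illustration): value between 0 and 100."""
    return 0 <= value <= 100


cleansed = apply_rule([5, 150, -3], in_range, "cleanse")
print(cleansed.loaded, cleansed.errors)
```

With the Cleanse action, only the value that satisfies the rule is loaded; the two violations wait in the error list until a cleansing strategy is applied.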

Cleansing Strategy (Data Cleansing Rule)

A cleansing (or correction) rule identifies a violation of some expectation and a way to modify the data so that it meets the business needs. For example, while there are many ways that people provide telephone numbers, an application may require that each telephone number be separated into its area code, exchange, and line components. This is a cleansing rule, as shown in the figure below, which can be implemented and tracked using data cleansing tools.

Figure: Data Quality Correction Rule
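The telephone number rule described above can be sketched in Python as follows. This is only an illustration: the function name `split_phone` is hypothetical, and the code assumes 10-digit North American numbers with an optional leading country code 1, which the text does not specify.

```python
import re
from typing import Dict, Optional


def split_phone(raw: str) -> Optional[Dict[str, str]]:
    """Split a free-form phone number into area code, exchange, and line.

    Returns None when the input cannot be corrected, so the caller can
    leave the record in the error table.
    """
    digits = re.sub(r"\D", "", raw)  # drop spaces, dots, dashes, parentheses
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # assumed: a leading 1 is the country code
    if len(digits) != 10:
        return None
    return {
        "area_code": digits[:3],
        "exchange": digits[3:6],
        "line": digits[6:],
    }


for raw in ["(555) 123-4567", "555.123.4567", "1-555-123-4567", "12345"]:
    print(raw, "->", split_phone(raw))
```

The first three inputs yield the same three components, while the last one cannot be corrected and returns None.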

Cleansing Strategy    Description
------------------    -----------
Remove                Does not populate the target table with error records.
Custom                Uses a custom function in the target table to correct the error records.
Set to Min            Sets the attribute value of the error record to the minimum value defined in the data rule.
Set to Max            Sets the attribute value of the error record to the maximum value defined in the data rule.
Similarity            Uses a similarity algorithm based on permitted domain values to find a value that is similar to the error record.
Soundex               Uses a soundex algorithm based on permitted domain values to find a value that is similar to the error record.
Merge                 Merges duplicate records into a single row.
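
Several of these strategies can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the implementation of any specific tool: the function names are hypothetical, the similarity strategy is approximated with the standard library's `difflib.get_close_matches`, and the Soundex function follows the common American Soundex rules.

```python
import difflib
from typing import List, Optional

# Letter-to-digit table of American Soundex; vowels, H, W and Y have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def cleanse_range(value: int, minimum: int, maximum: int, strategy: str) -> Optional[int]:
    """Apply Remove, Set to Min, or Set to Max to a value checked against a range rule."""
    if minimum <= value <= maximum:
        return value                 # not an error record
    if strategy == "remove":
        return None                  # the record is not loaded into the target
    if strategy == "set_to_min":
        return minimum
    if strategy == "set_to_max":
        return maximum
    raise ValueError(f"unknown strategy: {strategy!r}")


def similarity_match(value: str, domain: List[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the permitted domain value most similar to the error value, if any."""
    matches = difflib.get_close_matches(value, domain, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def soundex(word: str) -> str:
    """Return the four-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "HW":           # H and W do not separate letters with equal codes
            prev = code
    return (encoded + "000")[:4]


def soundex_match(value: str, domain: List[str]) -> Optional[str]:
    """Return the first permitted domain value that sounds like the error value."""
    target = soundex(value)
    return next((d for d in domain if soundex(d) == target), None)


print(cleanse_range(150, 0, 100, "set_to_max"))
print(similarity_match("Calfornia", ["California", "Colorado", "Connecticut"]))
print(soundex_match("Jonson", ["Johnson", "Jackson", "Smith"]))
```

Similarity and Soundex both depend on the list of permitted domain values: the first compares spelling, the second compares pronunciation, so they can correct different kinds of entry errors.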