Data Quality - Data Profiling

Description

Data profiling is a set of algorithms for statistically analyzing and assessing the quality of the data values within a data set, and for exploring the relationships that exist between value collections within and across data sets.

On this page, you can see a demo of such a tool in OWB.

For each column in a table, a data profiling tool provides a frequency distribution of the different values, which gives insight into the type and use of each column. Cross-column analysis can expose embedded value dependencies, while inter-table analysis explores overlapping value sets that may represent foreign key relationships between entities. In this way, profiling can be used for anomaly analysis and assessment, which feeds the process of defining data quality metrics.
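The column and inter-table analyses described above can be illustrated with a minimal sketch. It assumes that tables are plain lists of dictionaries; the function names `column_profile` and `overlap_ratio` and the sample `customers` and `orders` data are hypothetical and are not taken from any profiling tool.

```python
from collections import Counter


def column_profile(rows, column):
    """Frequency distribution and basic counts for one column."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(freq),
        # Most frequent values first
        "frequencies": freq.most_common(),
    }


def overlap_ratio(child_rows, child_col, parent_rows, parent_col):
    """Share of distinct child values that also occur in the parent column.

    A ratio close to 1.0 hints at a possible foreign key relationship;
    values below 1.0 point to orphaned child values.
    """
    child = {r[child_col] for r in child_rows if r.get(child_col) is not None}
    parent = {r[parent_col] for r in parent_rows if r.get(parent_col) is not None}
    if not child:
        return 0.0
    return len(child & parent) / len(child)


# Hypothetical sample data
customers = [
    {"id": 1, "country": "FR"},
    {"id": 2, "country": "NL"},
    {"id": 3, "country": "FR"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
    {"order_id": 12, "customer_id": 4},  # no matching customer
]

profile = column_profile(customers, "country")
ratio = overlap_ratio(orders, "customer_id", customers, "id")
print(profile)
print(ratio)
```

In this sample, two of the three distinct `customer_id` values exist in `customers.id`, so the overlap is incomplete. That is the kind of anomaly that can later be turned into a data rule.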

The analysis performed by data profiling tools exposes:

Data profiling gives a lot of insight into the data in a data set, but its most important goal is to derive data rules. To achieve this, the set of algorithms is normally classified by data rule type in order to find:

Refer to data rule type for more details.

Data profiling analysis type

The data rules may be discovered or classified through three types of data profiling analysis:

Request

Performance

Much of the profiling effort is spent analyzing columns to find relationships. So don’t just profile every column and test for relationships between all of them, because that uses resources inefficiently.
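One possible way to limit this cost is to prune the columns before testing pairs, as in the sketch below. It assumes the relationship of interest is a functional dependency (each value of one column maps to exactly one value of another) and that columns which are constant or fully unique can be skipped as trivial. The names `holds_dependency` and `candidate_pairs` and the sample rows are hypothetical.

```python
from itertools import permutations


def holds_dependency(rows, lhs, rhs):
    """True if each value of lhs maps to exactly one value of rhs."""
    seen = {}
    for row in rows:
        key, val = row[lhs], row[rhs]
        # setdefault stores the first rhs value seen for this lhs value
        if seen.setdefault(key, val) != val:
            return False
    return True


def candidate_pairs(rows, columns):
    """Ordered column pairs worth testing, skipping trivial columns.

    Assumption: a constant column or a fully unique column yields only
    trivial dependencies, so both are left out.
    """
    n = len(rows)
    distinct = {c: len({row[c] for row in rows}) for c in columns}
    useful = [c for c in columns if 1 < distinct[c] < n]
    return list(permutations(useful, 2))


# Hypothetical sample data
rows = [
    {"id": 1, "zip": "75001", "city": "Paris", "source": "web"},
    {"id": 2, "zip": "75001", "city": "Paris", "source": "web"},
    {"id": 3, "zip": "1012", "city": "Amsterdam", "source": "web"},
    {"id": 4, "zip": "1013", "city": "Amsterdam", "source": "web"},
]

pairs = candidate_pairs(rows, ["id", "zip", "city", "source"])
found = [(a, b) for a, b in pairs if holds_dependency(rows, a, b)]
print(pairs)
print(found)
```

With four columns there are twelve ordered pairs, but only two remain after pruning. In practice, the choice of columns to profile together is better guided by knowledge of the data than by a blanket rule like this one.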

Documentation / Reference