Data Quality - Data Profiling

Description

Data profiling is a set of algorithms for statistical analysis and assessment of the quality of data values within a data set, as well as exploring relationships that exists between value collections within and across data sets.

On this page, you can see a demo of such tool in OWB

For each column in a table, a data profiling tool will provide a frequency distribution of the different values, providing insight into the type and use of each column. Cross-column analysis can expose embedded value dependencies, while inter-table analysis explores overlapping values sets that may represent foreign key relationships between entities, and it is in this way that profiling can be used for anomaly analysis and assessment, which feeds the process of defining data quality metrics.

The analysis performed by data profiling tools exposes :

data rule (For instance, identifies dependencies that represent business rules embedded within the data)
anomalies that exist within the data sets

The data profiling give a lot of insight about data in a set of data but the most important goal is to derive data rule and to achieve it, the set of algorithm is normally classified by data rule type to find :

Domain List
Domain Pattern List
Domain Range
…

Refer to data rule type for more details on data rule type.

Articles Related

Data profiling analysis type

The data rules may be discover or classify through three type of data profiling analysis:

Attribute Analysis. Analysis between values of the same column.
Functional dependency. Analysis between columns of the same row.
Referential Analysis. Analysis between columns from different tables to discover Jo0ins, Redundant Attributes, Orphans, Childless.

Request

Candidate Key: Determines whether a column can be used as a table key
Column Length Distribution reports the range of lengths for string values in a column.
Column Null Ratio reports the percentage of null values in a column.
Column Pattern identifies regular expressions that are applicable to the values in a column.
Column Statistics reports statistics such as minimum, maximum, and average values for a column.
Column Value Distribution reports the groupings of distinct values in a column.
Functional Dependency determines whether the value of a column is dependent on the value of other columns in the same table.
Value Inclusion reports the percentage of time that a column value in one table matches a column in another table.

Performance

A lot of the profiling resources are spent on working on column to find relationship. So don’t just profile every column and try to see if there are relationships between all of them, because you are not efficiently using the resources.

Documentation / Reference

https://www.slideshare.net/julianhyde/data-profiling-with-apache-calcite