About
Data preparation is the step that gets your data ready for further analysis.
It's a key factor in any data project:
- data mining, AI
- analytics
Data preparation consists of several steps, which are explained below.
Output
Data Mining
For data mining, the data must exist within a single table or view where each case (record) is a row.
If you want to mine data stored in a star schema, you still need to create a single table with one record per case and the variables of interest.
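As a rough illustration of flattening a star schema into one row per case, here is a minimal sketch in plain Python. The tables `sales_fact` and `customer_dim`, their columns, and the helper `build_case_table` are hypothetical and only serve the example; in practice this step is usually done with a SQL view or a data-frame library.

```python
# Hypothetical star schema: one fact table and one dimension table.
sales_fact = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 12.5},
]
customer_dim = {
    1: {"country": "FR", "segment": "retail"},
    2: {"country": "NL", "segment": "business"},
}


def build_case_table(facts, dimension):
    """Return one row per customer (the case) with joined and aggregated variables."""
    totals = {}
    for row in facts:
        agg = totals.setdefault(
            row["customer_id"], {"order_count": 0, "total_amount": 0.0}
        )
        agg["order_count"] += 1
        agg["total_amount"] += row["amount"]

    cases = []
    for customer_id in sorted(totals):
        case = {"customer_id": customer_id}
        case.update(dimension.get(customer_id, {}))  # descriptive variables
        case.update(totals[customer_id])             # aggregated measures
        cases.append(case)
    return cases


case_table = build_case_table(sales_fact, customer_dim)
```

Here the customer is assumed to be the case, so several fact rows collapse into a single record.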
The data mining development process may require several data sets.
A data set may be:
- needed for building (training) the model;
- used for scoring;
- used for testing.
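The build and test sets are often carved out of the same prepared table. The sketch below shows one simple way to do that with a seeded random split; the function name `split_cases`, the 25% test fraction, and the seed are assumptions chosen for the example, not a recommendation. Data used for scoring is typically new data and is not produced by this split.

```python
import random


def split_cases(cases, test_fraction=0.25, seed=42):
    """Shuffle a copy of the cases and split it into a build set and a test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    test_size = int(round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


cases = [{"case_id": i} for i in range(20)]
build_set, test_set = split_cases(cases)
```

Fixing the seed keeps the two sets stable between runs, which makes model comparisons easier to reproduce.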
Type Preparation / Data Transformation
Data transformations may be required by algorithms.
Data Cleansing
The data must be properly cleansed to eliminate inconsistencies and support the needs of the mining application.
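What counts as an inconsistency depends on the data set. As one possible illustration, the sketch below unifies different spellings of the same value and then drops duplicate records. The alias table `COUNTRY_ALIASES`, the record layout, and the function `cleanse` are hypothetical.

```python
# Hypothetical mapping from inconsistent spellings to one canonical value.
COUNTRY_ALIASES = {"fr": "FR", "france": "FR", "nl": "NL", "netherlands": "NL"}


def cleanse(records):
    """Unify country spellings, then drop records that become duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        raw = record["country"].strip()
        # Unknown spellings are kept as-is rather than guessed.
        country = COUNTRY_ALIASES.get(raw.lower(), raw)
        key = (record["customer_id"], country)
        if key not in seen:
            seen.add(key)
            cleaned.append({"customer_id": record["customer_id"], "country": country})
    return cleaned


raw_records = [
    {"customer_id": 1, "country": "France"},
    {"customer_id": 1, "country": "FR "},
    {"customer_id": 2, "country": "netherlands"},
]
clean_records = cleanse(raw_records)
```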
Data Discretization
Binning (discretization) groups values into a limited number of bins (intervals).
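There are several binning strategies. The sketch below shows equal-width binning only, as a minimal example; the function name `equal_width_bins` and the sample ages are assumptions made for illustration.

```python
def equal_width_bins(values, bin_count):
    """Assign each value a bin index from 0 to bin_count - 1 using equal-width bins."""
    if bin_count < 1:
        raise ValueError("bin_count must be at least 1")
    low, high = min(values), max(values)
    if high == low:
        return [0 for _ in values]  # all values are identical: one bin
    width = (high - low) / bin_count
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - low) / width), bin_count - 1) for v in values]


ages = [18, 22, 35, 41, 58, 63, 78]
age_bins = equal_width_bins(ages, 3)  # [0, 0, 0, 1, 2, 2, 2]
```

With three bins over the range 18 to 78, each bin is 20 years wide.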
Data Correction
There is always some bad input that must be corrected.
Data Normalization
Normalize the data so that values are on a common scale and in a common format.
Examples:
- Emails and URLs should be all lowercase.
- Metrics should be expressed as a rate rather than as a raw count.
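The two examples above, plus a common rescaling technique, can be sketched as follows. The helper names are hypothetical, trimming whitespace is an extra assumption, and min-max scaling is only one of several ways to reach a common scale.

```python
def normalize_email(email):
    """Trim surrounding whitespace and lowercase an email address."""
    return email.strip().lower()


def to_rate(count, total):
    """Express a count as a rate of its total; return None when the total is zero."""
    return count / total if total else None


def min_max_scale(values):
    """Rescale values linearly onto the range 0 to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # no spread: map everything to 0
    return [(v - low) / (high - low) for v in values]


email = normalize_email("  John.Doe@Example.COM ")
click_rate = to_rate(30, 1200)        # 30 clicks out of 1200 views
scaled = min_max_scale([10, 20, 40])
```

A rate such as `click_rate` stays comparable across cases with very different totals, which a raw count does not.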
Outlier Suppression
Outlier suppression is required so that extreme values do not skew the result.
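One widely used rule of thumb flags values that lie more than 1.5 interquartile ranges outside the quartiles. The sketch below applies that rule with the standard library; the function name `suppress_outliers`, the factor `k=1.5`, and the sample data are assumptions, and other treatments (such as clipping instead of dropping) may suit a given data set better.

```python
import statistics


def suppress_outliers(values, k=1.5):
    """Drop values outside the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


response_times = [10, 12, 11, 13, 12, 11, 95]
kept = suppress_outliers(response_times)
```

In this sample the single value 95 pulls the mean far above the typical value, and removing it brings the mean back in line with the rest of the data.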
Tools
- Google Cloud Dataprep, an intelligent, fully-managed cloud service (built in collaboration with Trifacta) that visually explores, cleans and prepares structured and unstructured data for analysis or training machine-learning models.
- In Oracle, DBMS_DATA_MINING_TRANSFORM is a data transformation package that includes a variety of missing value and outlier treatments, as well as binning and normalization capabilities.