(Statistics|Machine Learning|Data Mining) - (Unit|Individual|Case|Subject|Observation|Instance|Input)
Data contains values grouped into variables and observations.
An observation is the piece of input data for which an algorithm must generate an output value.
Observations are formally composed of attributes (features), which together constitute a description of their characteristics.
This piece of input data goes by several names:
- a unit
- an individual
- a case
- a subject
- an instance
- an observation
- input data
Oracle Data Mining
The data that you wish to mine must be defined within a single table or view. The information for each record must be stored in a separate row. The data records are commonly called cases. Each case can be identified by a unique case ID. The table or view itself is referred to as a case table.
Oracle Data Mining requires that the data be presented as a case table in single-record case format. All the data for each record (case) must be contained within a row.
Most typically, the case table is a view that presents the data in the required format for mining.
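The idea of a view acting as the case table can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in for an Oracle database; the table names (customers, demographics), the column names, and the sample values are all hypothetical and do not come from Oracle Data Mining.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for an Oracle schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE demographics (cust_id INTEGER PRIMARY KEY, age INTEGER, income REAL);
    INSERT INTO customers VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray');
    INSERT INTO demographics VALUES (1, 34, 52000.0), (2, 51, 61000.0);

    -- The view plays the role of the case table: one row per case,
    -- identified by the unique case ID cust_id.
    CREATE VIEW case_table AS
    SELECT c.cust_id, c.first_name, c.last_name, d.age, d.income
    FROM customers c
    JOIN demographics d ON d.cust_id = c.cust_id;
""")

rows = conn.execute("SELECT * FROM case_table ORDER BY cust_id").fetchall()
case_ids = [row[0] for row in rows]
```

Each tuple in rows holds all the data for one case, and case_ids contains no duplicates, which is what single-record case format asks for.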
Model details reveal information about model attributes and their treatment by the algorithm. There is a separate GET_MODEL_DETAILS routine for each algorithm.
Oracle Data Mining supports dimensioned/transactional data through nested columns. Each row in the nested column consists of an attribute name/value pair. Oracle Data Mining internally processes each nested row as a separate attribute.
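To make "each nested row is a separate attribute" concrete, here is a small sketch in plain Python. It is not Oracle's internal implementation; the function name flatten_case, the "column.name" naming scheme, and the sample case are assumptions chosen for illustration.

```python
def flatten_case(case, nested_column):
    """Return a copy of `case` in which every name/value pair of the
    nested column has been promoted to an attribute of its own."""
    # Keep the ordinary (non-nested) attributes unchanged.
    flat = {key: value for key, value in case.items() if key != nested_column}
    # Promote each nested pair; the "column.name" key format is an assumption.
    for name, value in case.get(nested_column, []):
        flat[f"{nested_column}.{name}"] = value
    return flat


# Hypothetical case: two simple attributes plus one nested column.
case = {"cust_id": 1, "age": 34, "purchases": [("milk", 2), ("bread", 1)]}
flat = flatten_case(case, "purchases")
```

After the call, flat has one entry per simple attribute and one per nested row, so a case with two purchases contributes two additional attributes.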
The algorithms that support nested data cover the following mining functions:
- classification and regression
- classification, regression, and anomaly detection
Note on data format
Previous versions of Oracle Data Mining allowed two distinct data formats:
- Single row per record, in which all the information about an individual resides in a single row of the table or view.
- Multiple rows per record (sometimes called “transactional” format), in which information for a given individual may be spread over several rows (for example, when each row represents an item purchased).
In ODM 10g Release 2 and ODM 11g Release 1, only Single Row per Record format is acceptable (except in the case of Association Rules).
The database feature called Nested Column is used to accommodate the use case previously handled by Transactional format.
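The move from multiple rows per record to a single row with a nested column can be sketched in plain Python. This only illustrates the reshaping, not how the database does it; the function name to_single_row, the field names, and the sample purchases are hypothetical.

```python
from collections import defaultdict


def to_single_row(transactions):
    """Group (record_id, item, quantity) rows into one record per ID,
    with the item/quantity pairs collected in a nested list."""
    nested = defaultdict(list)
    for record_id, item, quantity in transactions:
        nested[record_id].append((item, quantity))
    # One output row per record, ordered by record ID.
    return [{"record_id": rid, "items": pairs} for rid, pairs in sorted(nested.items())]


# Hypothetical transactional input: record 1 appears in two rows.
transactions = [(1, "milk", 2), (2, "eggs", 12), (1, "bread", 1)]
records = to_single_row(transactions)
```

The three input rows collapse into two records, one per individual, which is the shape that single-row-per-record format expects.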
The possibilities for gathering data are:
- The case table or view contains all the data to be mined.
- Other tables or views contain additional simple attributes of an individual, such as FIRST_NAME, LAST_NAME, etc.
- Other tables or views contain complex attributes of an individual such as a list of products purchased or a list of telephone calls for a given period (sometimes called “transactional” data).
- The data to be mined consists of transactional data only; in this case, the case table must be constructed from the transactional data, and might consist only of a column containing the unique identifiers for the individuals and a target column.
Transactional Data Only
In special situations, such as Life Sciences problems in which each individual may have a very high number (perhaps thousands) of attributes, all the data is contained in a transactional-format table. This table must contain at least three columns: the unique case ID, the attribute name, and the attribute value. For example, the attribute names may be gene expression names and the attribute values gene expression values. Typically, the attribute values have been normalized and binned to obtain binary values of 0 and 1 (representing, for example, that the gene expression for a particular case is above (1) or below (0) the average value for that gene). For each case, one attribute name/value pair represents the target value; for example, Target=1 means “responds to treatment” and Target=0 means “does not respond to treatment”.
Suppose that we have a transactional table LYMPH_OUTCOME_BINNED with 5591 gene expressions for each of 58 patients and the binary target OUTCOME (0/1) indicating the success in treating lymphoma patients. The business problem is to predict the likely success in treating a particular patient based only on the values of gene expressions for that patient. The first step is to separate the case table information (ID, OUTCOME) from the gene information, which is then joined in as a nested column.
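The binning and the first separation step can be sketched in plain Python on a toy table. This is an illustration under stated assumptions, not the Oracle procedure: the function names bin_by_gene_mean and split_target, the gene names G1 and G2, the patient IDs, and all values are made up, and "above the average" is implemented as strictly greater than the per-gene mean.

```python
from collections import defaultdict
from statistics import mean


def bin_by_gene_mean(rows):
    """Replace each expression value with 1 if it is above the average
    for that gene across all cases, and 0 otherwise."""
    by_gene = defaultdict(list)
    for _, gene, value in rows:
        by_gene[gene].append(value)
    averages = {gene: mean(values) for gene, values in by_gene.items()}
    return [(cid, gene, 1 if value > averages[gene] else 0) for cid, gene, value in rows]


def split_target(rows, target_name="OUTCOME"):
    """Split transactional rows into a case table (case ID -> target)
    and the remaining name/value pairs grouped per case."""
    case_table = {}
    nested = defaultdict(list)
    for cid, name, value in rows:
        if name == target_name:
            case_table[cid] = value
        else:
            nested[cid].append((name, value))
    return case_table, dict(nested)


# Hypothetical raw expression values for two patients and two genes.
raw = [(1, "G1", 5.0), (1, "G2", 1.0), (2, "G1", 3.0), (2, "G2", 4.0)]
binned = bin_by_gene_mean(raw)

# Transactional table: binned gene rows plus one target row per patient.
transactional = binned + [(1, "OUTCOME", 1), (2, "OUTCOME", 0)]
case_table, genes = split_target(transactional)
```

Here case_table corresponds to the (ID, OUTCOME) information, and genes holds the per-patient name/value pairs that would be joined back in as the nested column.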