Data Mining - (Life cycle|Project|Data Pipeline)

About

Data mining is an experimental science.

Data mining reveals correlation, not causation.

Data mining goes from data to information: the patterns, or expectations, that underlie the data.

Any data scientist worth their salary will say you should start with a question, NOT the data.

Most #bigdata problems can be addressed by proper sampling/filtering and running models on a single (perhaps large) machine …

Observation versus Perturbation

The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively.

Fred Mosteller and John Tukey, paraphrasing George Box

In other words, to make a causal statement about how a predictor affects an outcome, you must be able to take the system and perturb that particular predictor while keeping the other ones fixed.

Only that allows a causal statement about a predictor variable and its effect on the outcome. Passively collecting observations from the system is not enough: observational data alone cannot establish causality.

So to know what happens when a complex system is perturbed, it must be perturbed, not only observed.
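
The difference between observing and perturbing can be illustrated with a small simulation. This is a minimal sketch under made-up assumptions: a hidden variable z drives both the predictor x and the outcome y, while x itself has no effect on y. The function names (pearson, simulate) and all parameters are hypothetical and chosen only for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=5000, seed=0):
    """Return (observational correlation, experimental correlation) of x and y."""
    rng = random.Random(seed)

    # Passive observation: the hidden z drives both x and y (assumed setup).
    obs_x, obs_y = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        obs_x.append(z + rng.gauss(0, 0.5))
        obs_y.append(z + rng.gauss(0, 0.5))

    # Perturbation: we set x ourselves, independently of z.
    exp_x, exp_y = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        exp_x.append(rng.gauss(0, 1))
        exp_y.append(z + rng.gauss(0, 0.5))

    return pearson(obs_x, obs_y), pearson(exp_x, exp_y)


observed_r, perturbed_r = simulate()
print(f"observed: {observed_r:.2f}, perturbed: {perturbed_r:.2f}")
```

In the observed data, x and y are strongly correlated although x does not cause y. Once x is set by the experimenter, the correlation essentially vanishes, which is the point of the quote above.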

Lifecycle

The following paragraphs must be merged into one.

P Value Pipeline

Data Preparation

See Data Mining - Data (Preparation | Wrangling | Munging)

Null

  1. Define the question of interest and identify the problem
  2. (Get|Collect) the data
  3. Data Preparation: prepare the data (integrate, transform, clean, filter, aggregate). See What is Data Processing (Data Integration)?
  4. (Explore|Interact) with the data. Always visualize the data to understand its distribution; see Anscombe's quartet to understand why.
  5. ? Train a model to distinguish between your training set and the unlabeled data. If it works, your training data may be incomplete! (Jake VanderPlas)
  6. (Build|Fit) a model
  7. Evaluate the model: determine whether the classifier is a good representation
  8. Communicate the results
  9. Make the analysis reproducible
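
Step 4 points to Anscombe's quartet. The sketch below, using only the standard library, computes the usual summary statistics for the four published datasets to show why summary numbers alone are not enough. The helper names (mean, pearson, summarize) are illustrative choices, not part of any library.

```python
def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Anscombe's quartet: datasets I to III share the same x values.
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
X_4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

ANSCOMBE = {
    "I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X_4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize():
    """Return {name: (mean of x, mean of y, correlation)} for each dataset."""
    return {
        name: (mean(xs), mean(ys), pearson(xs, ys))
        for name, (xs, ys) in ANSCOMBE.items()
    }


for name, (mx, my, r) in summarize().items():
    print(f"{name}: mean_x={mx:.2f} mean_y={my:.2f} r={r:.2f}")
```

The four datasets have nearly identical means and correlations, yet they look completely different when plotted (a line with noise, a curve, a line with one outlier, and a vertical cluster with one extreme point). Only a plot reveals that.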

Classifier

  1. Choose a classifier with a knowledge representation (how the data is classified - decision tree, rule, …)
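
As an illustration of a classifier whose knowledge representation is a single rule, here is a minimal sketch of a one-threshold rule ("x >= t means class 1"), fitted on a training set and evaluated on held-out data. The data is made up for the example, and the function names (predict, accuracy, fit_rule) are hypothetical.

```python
def predict(threshold, x):
    """The rule itself: class 1 when x reaches the threshold, else class 0."""
    return 1 if x >= threshold else 0


def accuracy(threshold, xs, labels):
    """Fraction of examples the rule classifies correctly."""
    hits = sum(predict(threshold, x) == y for x, y in zip(xs, labels))
    return hits / len(xs)


def fit_rule(xs, labels):
    """Pick the observed value that gives the best training accuracy as threshold."""
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted(set(xs)):
        acc = accuracy(candidate, xs, labels)
        if acc > best_accuracy:
            best_threshold, best_accuracy = candidate, acc
    return best_threshold


# Toy data, invented for this sketch.
train_x = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_x = [2.5, 3.5, 6.5, 8.5]
test_y = [0, 0, 1, 1]

threshold = fit_rule(train_x, train_y)
print("learned rule: x >=", threshold)
print("held-out accuracy:", accuracy(threshold, test_x, test_y))
```

The learned model is readable as a rule, which is what choosing a knowledge representation buys you. Evaluating on data that was not used for fitting is what tells you whether the rule generalizes.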

Learning is iterative:

Second

Three

The phases of solving a business problem using Data Mining are as follows:

Supervised

For a Supervised problem:

Cross Industry Standard Process Model for Data Mining

The Cross Industry Standard Process Model for Data Mining (CRISP-DM), from: An Oracle White Paper - February 2013 - Information Management and Big Data: A Reference Architecture.

Uber

Michelangelo (https://eng.uber.com/michelangelo/) describes 6 steps:

* Evaluate models

A Model is dynamic

When Google rolled out flu stories in Google News, people started reading about flu in the news and searching on those stories, which skewed the results of Google Flu Trends. During the period from 2011 to 2013, it overestimated the prevalence of flu (by a factor of two in 2012 and 2013). The model needed to take this new factor into account.

Google Flu Trends teaches us that the modelling process cannot be static: we must periodically revisit the process and understand what underlying factors, if any, may have changed.
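
One simple way to revisit a model periodically is to compare its recent error with the error measured when it was validated. The sketch below is only an illustration of that idea: the function names, the tolerance of 1.5 and all numbers are assumptions made for the example, not values from Google Flu Trends.

```python
def mean_abs_error(predicted, actual):
    """Average absolute difference between predictions and observed values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def needs_retraining(baseline_error, recent_predicted, recent_actual, tolerance=1.5):
    """Flag the model when its recent error exceeds tolerance x the baseline error.

    baseline_error is the error measured at validation time; tolerance is an
    arbitrary illustrative choice.
    """
    recent_error = mean_abs_error(recent_predicted, recent_actual)
    return recent_error > tolerance * baseline_error


# Made-up numbers: the model was validated with a mean absolute error of 0.5.
baseline = 0.5

# A stable period: predictions stay close to what was later observed.
stable = needs_retraining(baseline, [2.1, 3.2], [2.0, 3.0])

# A drifted period: predictions are about twice the observed values.
drifted = needs_retraining(baseline, [4.0, 6.0, 8.0], [2.0, 3.0, 4.0])

print("stable period needs retraining:", stable)
print("drifted period needs retraining:", drifted)
```

A check like this only signals that something changed. Finding out which underlying factor changed, such as news coverage driving searches, still requires going back through the modelling process.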

Pitfall / Pratfall

Software

Documentation / Reference