Data mining is an experimental science.
Data mining reveals correlation, not causation.
From data to information: the patterns, or expectations, that underlie the data.
Any data scientist worth their salary will say you should start with a question, NOT the data.
Most #bigdata problems can be addressed by proper sampling/filtering and running models on a single (perhaps large) machine …
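As a rough illustration of that idea, here is a minimal Python sketch; the file path, the 1% sampling fraction, and the "label" column are assumptions for the example, not details from the original text.

```python
# A minimal sketch of "sample/filter first, then model on one machine".
# The file path, column names, and 1% fraction are invented for illustration.
import pandas as pd
from sklearn.linear_model import LogisticRegression

FRACTION = 0.01  # keep 1% of rows; tune to what fits in memory

# Stream the file in chunks so the full dataset never has to fit in RAM,
# sampling each chunk down before concatenating the pieces.
chunks = pd.read_csv("data/events.csv", chunksize=1_000_000)
sample = pd.concat(chunk.sample(frac=FRACTION, random_state=42)
                   for chunk in chunks)

# Fit an ordinary single-machine model on the sampled data
# (assumes numeric feature columns and a binary "label" column).
X, y = sample.drop(columns="label"), sample["label"]
model = LogisticRegression(max_iter=1000).fit(X, y)
```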
The only way to find out what will happen when a complex system is disturbed is to disturb the system, not merely to observe it passively.
In other words, to make a causal statement about a predictor's effect on an outcome, you have to be able to take the system and perturb that particular predictor while keeping the others fixed.
That is what allows you to make a causal statement about a predictor variable and its effect on the outcome. It is not enough simply to collect observations from the system: observational data alone cannot establish causality.
So to know what happens when a complex system is perturbed, the system must be perturbed, not merely observed.
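This can be made concrete with a toy simulation (a sketch with an invented data-generating process): a hidden confounder drives both the predictor and the outcome, so passive observation suggests a strong effect where an intervention reveals none.

```python
# A toy simulation of confounding: the data-generating process is invented.
# A hidden variable z drives both the predictor x and the outcome y, while
# x itself has NO causal effect on y.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)           # hidden confounder
x = z + rng.normal(size=n)       # predictor, correlated with z
y = 2 * z + rng.normal(size=n)   # outcome depends on z only, not on x

# Passive observation: regressing y on x shows a strong (spurious) slope.
observed_slope = np.cov(x, y)[0, 1] / np.var(x)          # about 1.0

# Intervention do(x): we set x ourselves, breaking its link to z.
x_do = rng.normal(size=n)
y_do = 2 * z + rng.normal(size=n)                        # y still ignores x
causal_slope = np.cov(x_do, y_do)[0, 1] / np.var(x_do)   # about 0.0

print(f"slope from observation:   {observed_slope:.2f}")
print(f"slope under intervention: {causal_slope:.2f}")
```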
See Data Mining - Data (Preparation | Wrangling | Munging)
Learning is iterative:
The phases of solving a business problem using Data Mining are as follows:
For a Supervised problem, they follow the CRISP-DM cycle: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
The Cross-Industry Standard Process for Data Mining (CRISP-DM). From: An Oracle White Paper, February 2013, "Information Management and Big Data: A Reference Architecture".
https://eng.uber.com/michelangelo/ describes the workflow in 6 steps (a toy sketch in code follows the list):
* Manage data
* Train models
* Evaluate models
* Deploy models
* Make predictions
* Monitor predictions
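As a toy, end-to-end sketch of those six steps, assuming scikit-learn and one of its bundled demo datasets (the 0.9 acceptance gate is invented for the example, not part of the Michelangelo post):

```python
# A toy walk through the six steps with scikit-learn. The demo dataset and
# the 0.9 acceptance gate are assumptions for the example.
import pickle

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Manage data: acquire and split (real platforms add cleaning/features).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Train models.
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# 3. Evaluate models against a held-out set and an acceptance gate.
accuracy = accuracy_score(y_test, model.predict(X_test))
assert accuracy > 0.9, "loop back to steps 1-2: more data or better features"

# 4. Deploy models: serialize here; a real platform ships to a serving tier.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# 5. Make predictions on incoming traffic (stand-in: the test set).
predictions = model.predict(X_test)

# 6. Monitor predictions: compare them to outcomes as they arrive, and
#    restart the loop from step 1 when quality drifts.
```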
When Google rolled out flu stories in Google News, people started reading about the flu in the news and searching on those stories, which skewed the results of Google Flu Trends. From 2011 to 2013 it overestimated the prevalence of flu (by a factor of two in 2012 and 2013); the model needed to take this new factor into account.
Google Flu Trends teaches us that the modelling process cannot be static: we must periodically revisit it and understand what underlying factors, if any, may have changed.
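One minimal sketch of such a periodic revisit: compare recent predictions against observed ground truth and flag the model when the error drifts past a tolerance (the threshold and the numbers below are invented for illustration).

```python
# A sketch of periodic monitoring in the spirit of the Flu Trends lesson.
# The 0.5 tolerance and the example numbers are invented for illustration.
import numpy as np

def needs_retraining(predicted, observed, tolerance=0.5):
    """Flag the model when its predictions drift away from ground truth."""
    relative_error = np.abs(predicted - observed) / np.maximum(observed, 1e-9)
    return relative_error.mean() > tolerance

# A model overestimating flu prevalence by a factor of two, as in 2012-2013:
observed = np.array([1.0, 1.2, 0.9])
predicted = 2 * observed
print(needs_retraining(predicted, observed))  # True: time to revisit the model
```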