# Statistics - (Data Set|Sample)

Because it is difficult to obtain information about every unit in a population, it is common to work with a sample instead.

A sample is a smaller, random and representative subset (also called a group or data set) of the population.

Whenever a sample is used instead of the entire population, we have to accept that our results are only estimates and therefore have some chance of being incorrect. The difference between an estimate computed from the sample and the true population value is called sampling error.

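No single random sample is a perfect copy of the population: its statistics differ somewhat from the population values, and they vary from one sample to the next.

A larger sample should not systematically shift the sample mean, but it reduces the standard error of the mean, that is, the standard deviation of the sample mean across repeated samples. The standard deviation of the individual values does not shrink with sample size; it settles toward the population standard deviation.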
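As a rough illustration of the effect of sample size, the sketch below draws repeated random samples of several sizes from a simulated population and compares the centre and spread of the resulting sample means. The population (normal, mean 50, standard deviation 10), the sample sizes, the number of trials and the helper name `sample_means` are all assumptions made for this example.

```python
import random
import statistics

def sample_means(population, n, trials, rng):
    """Draw `trials` random samples of size n (without replacement)
    and return the mean of each sample."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(trials)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100,000 values from a normal distribution.
population = [rng.gauss(50, 10) for _ in range(100_000)]
pop_mean = statistics.fmean(population)

centre = {}  # average of the sample means, per sample size
spread = {}  # standard deviation of the sample means (empirical standard error)

for n in (10, 100, 1000):
    means = sample_means(population, n, trials=500, rng=rng)
    centre[n] = statistics.fmean(means)
    spread[n] = statistics.stdev(means)
    print(f"n={n:5d}  mean of sample means={centre[n]:.2f}  "
          f"spread of sample means={spread[n]:.3f}")

print(f"population mean={pop_mean:.2f}")
```

The average of the sample means should stay close to the population mean at every sample size, while their spread should shrink roughly in proportion to 1/√n, which is the usual standard error formula σ/√n.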

## Importance of Sampling

While data mining can be used to uncover patterns in data samples, it is important to be aware that:

• the use of non-representative samples of data may produce results that are not indicative of the domain.
• data mining will not find patterns that may be present in the domain if those patterns are not present in the sample being “mined”. Data mining only works with indicative and representative data.
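The two points above can be sketched with a small simulation. Everything in it is hypothetical: a made-up domain of "regular" and "premium" customers, their spending figures, and the helper names `mean_spend` and `segments`. It compares a random sample with a convenience sample that only ever reaches one part of the domain.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical domain: 9,000 regular customers and 1,000 premium customers.
regular = [("regular", rng.gauss(20, 5)) for _ in range(9_000)]
premium = [("premium", rng.gauss(80, 5)) for _ in range(1_000)]
population = regular + premium

# Random sample: every unit has the same chance of being selected.
random_sample = rng.sample(population, 500)

# Non-representative sample: only regular customers are ever reached.
convenience_sample = regular[:500]

def mean_spend(rows):
    """Average spend over (segment, spend) rows."""
    return statistics.fmean(spend for _, spend in rows)

def segments(rows):
    """Set of customer segments that appear in the rows."""
    return {segment for segment, _ in rows}

for name, rows in [("population", population),
                   ("random sample", random_sample),
                   ("convenience sample", convenience_sample)]:
    print(f"{name:18s} mean spend={mean_spend(rows):6.2f}  "
          f"segments={sorted(segments(rows))}")
```

In this sketch the convenience sample contains no premium customers at all, so no analysis of it could reveal that segment, and its mean spend is further from the population value than the random sample's.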
