Data Science - Big Data

About

Big Data describes data defined in terms of the 3Vs: volume, velocity, and variety.

Doug Laney of Gartner originally defined the 3Vs in a 2001 paper.

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…

Dan Ariely, 2013

A big data architecture is well suited to processing large amounts of data (often unstructured).

Designing a schema and loading all the data into it, in order to get the benefits of querying it afterwards, is a big (parallel) job for, say, 20 terabytes of text files.

This is where the Hadoop ecosystem (Pig, Hive, …) comes into play.

A data science application is typically designed so that Hadoop and its extensions do the initial processing, parsing, and loading. The result is then loaded into a more conventional database for ad hoc querying.
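The two-stage design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a Hadoop job: the sample lines in RAW_LINES, the log-like line format, the hits table, and the function names are all hypothetical, a plain in-process map/reduce stands in for Hadoop, and an in-memory SQLite database stands in for the conventional database.

```python
import sqlite3
from collections import Counter
from functools import reduce

# Hypothetical raw input: stands in for terabytes of unstructured text files.
RAW_LINES = [
    "2013-01-01 GET /home 200",
    "2013-01-01 GET /donate 200",
    "malformed line",
    "2013-01-02 GET /home 404",
    "2013-01-02 GET /home 200",
]


def map_line(line):
    """Map step: parse one raw line into ((day, path), 1), or None if unparseable."""
    parts = line.split()
    if len(parts) != 4:
        return None
    day, _method, path, _status = parts
    return ((day, path), 1)


def reduce_counts(acc, pair):
    """Reduce step: sum the counts emitted for each (day, path) key."""
    key, value = pair
    acc[key] += value
    return acc


def batch_process(lines):
    """Stage 1 (the role Hadoop plays): parse and aggregate the raw data."""
    pairs = (p for p in map(map_line, lines) if p is not None)
    return reduce(reduce_counts, pairs, Counter())


def load_into_database(counts):
    """Stage 2: load the much smaller result into a conventional SQL database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (day TEXT, path TEXT, n INTEGER)")
    conn.executemany(
        "INSERT INTO hits VALUES (?, ?, ?)",
        [(day, path, n) for (day, path), n in counts.items()],
    )
    conn.commit()
    return conn


counts = batch_process(RAW_LINES)
conn = load_into_database(counts)

# Ad hoc query against the aggregated result rather than the raw files.
rows = conn.execute(
    "SELECT path, SUM(n) FROM hits GROUP BY path ORDER BY path"
).fetchall()
print(rows)  # [('/donate', 1), ('/home', 3)]
```

The point of the split is that the schema only has to fit the aggregated result, so the expensive parsing runs once as a batch job while the queries stay cheap and flexible.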

The Obama campaign was using this architecture. They have mentioned:

This architecture is becoming a standard for designing this kind of data analysis application.

Word Cloud

Apache Cassandra, Machine Learning, Hadoop, NoSQL, Apache Hive, Map/Reduce and HDFS, Data Visualization, ZooKeeper, Distributed Search and Real Time Analytics, Avro, Visualizing Your Graph, Analytics Maturity Model, R

Sources

Counter

Monitoring

Many sources of Big Data come from online recording:

Anything that occurs could potentially be recorded.

A lot of it is recorded, but very little of it gets analyzed. Hence the picture of an iceberg: a phenomenal amount of data is collected, but only a tiny fraction of it is analyzed.

Internet of things

User-generated content

Health and scientific computing

Cost of Genome Sequencing vs. Moore's Law

Graphs

Graphs include things like:

Some of these graphs can be absolutely enormous (for example, Facebook's user graph).

Log files

Log - Logging

Framework

Documentation / Reference