Data Science - Big Data


Big Data describes data defined in terms of the 3Vs:

  • volume (a lot of data: Internet-scale data sets),
  • velocity (data that arrives quickly),
  • and variety (data in many different structures).

Doug Laney (then at META Group, which Gartner later acquired) originally defined the 3Vs in a 2001 paper.

Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…

Dan Ariely, 2013

A big data architecture is well suited to processing large amounts of data (often unstructured).

Designing a schema and loading all the data into it before getting the benefit of querying it is a big (parallel) job when the input is, for instance, 20 terabytes of text files.

This is where the Hadoop ecosystem (Pig, Hive, …) comes into play.

In a data science application, Hadoop and its extensions do the initial processing, parsing, and loading. The result is then loaded into a more conventional database for ad hoc querying.

The Obama campaign was using this architecture. They have mentioned:

This architecture is becoming a standard for designing this kind of data analysis application.
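
The two-stage architecture described above can be sketched in miniature. This is only an illustration under stated assumptions, not how Hadoop itself is programmed: plain Python functions stand in for the map and reduce steps, an in-memory SQLite database stands in for the conventional database, and the log format, the table name `daily_events`, and all sample lines are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical raw lines standing in for the large unstructured text files.
# Assumed format: "<day> <user> <event> <page>".
RAW_LINES = [
    "2013-01-05 alice click /home",
    "2013-01-05 bob click /pricing",
    "2013-01-06 alice view_ad /home",
    "malformed line",
    "2013-01-06 bob click /home",
]


def map_line(line):
    """Map step: parse one raw line into ((day, event), 1) pairs.

    Lines that do not match the assumed format are skipped.
    """
    parts = line.split()
    if len(parts) != 4:
        return []
    day, _user, event, _page = parts
    return [((day, event), 1)]


def reduce_counts(pairs):
    """Reduce step: sum the counts for each (day, event) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def run_pipeline(lines):
    """Parse and aggregate the raw lines, then load the result into SQLite."""
    pairs = [pair for line in lines for pair in map_line(line)]
    totals = reduce_counts(pairs)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_events (day TEXT, event TEXT, total INTEGER)"
    )
    conn.executemany(
        "INSERT INTO daily_events VALUES (?, ?, ?)",
        [(day, event, total) for (day, event), total in totals.items()],
    )
    conn.commit()
    return conn


conn = run_pipeline(RAW_LINES)

# Ad hoc query against the small, structured result rather than the raw files.
rows = conn.execute(
    "SELECT event, SUM(total) FROM daily_events GROUP BY event ORDER BY event"
).fetchall()
print(rows)
```

The point of the split is that the expensive pass over the raw text happens once, in the parallel stage, while later questions are asked in SQL against a much smaller table.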

Word Cloud

Apache Cassandra, Machine Learning, Hadoop, NoSQL, Apache Hive, Map/Reduce and HDFS, Data Visualization, ZooKeeper, NoSQL, Distributed Search and Real Time Analytics, Avro, Visualizing Your Graph, Analytics Maturity Model, R




Many sources of Big Data come from online recording:

  • every click on a website,
  • every ad viewed,
  • every billing event,
  • every fast-forward or pause while you're watching a video,
  • every request that's made from a client to a server,
  • every transaction,
  • every network message,
  • and every fault.

Anything that occurs could potentially be recorded.

A lot of it is recorded, but very little of it gets analyzed. That is why the picture of an iceberg is often used: a phenomenal amount of data is collected, but only a tiny part of it is analyzed.

Internet of things

  • sensors
  • RFID tags (for example, the FasTrak electronic toll collection transponder in California, which is used to pay highway tolls but also to collect data for traffic reporting)

User-generated content

  • post on Facebook
  • picture on Instagram
  • review on Yelp or TripAdvisor
  • tweet on Twitter
  • video on YouTube.

Health and scientific computing

  • the Large Hadron Collider. It generates more data in a year than all the other data sources combined.
  • genome sequencing data. The cost of sequencing is dropping exponentially, much faster than Moore's Law, so as a result we are collecting more sequencing data than ever before.

Cost of Genome Sequencing vs. Moore's Law
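
To make "dropping exponentially, much faster than Moore's Law" concrete, here is a small arithmetic sketch. It assumes the common simplification that Moore's Law corresponds to a cost that halves roughly every two years. The sequencing halving period used below is purely hypothetical and is not a measured value; it is chosen only to show how quickly two exponential curves diverge.

```python
def relative_cost(years, halving_period_years):
    """Fraction of the starting cost left after `years` of exponential decline."""
    return 0.5 ** (years / halving_period_years)


# Assumption: Moore's Law read as "cost halves about every two years".
MOORE_HALVING_YEARS = 2.0

# Hypothetical value for illustration only, not real sequencing data.
SEQUENCING_HALVING_YEARS = 0.5

YEARS = 6
moore = relative_cost(YEARS, MOORE_HALVING_YEARS)
sequencing = relative_cost(YEARS, SEQUENCING_HALVING_YEARS)

print(f"Moore-style decline after {YEARS} years: {moore:.6f} of the starting cost")
print(f"Hypothetical sequencing decline:      {sequencing:.6f} of the starting cost")
print(f"Gap between the two curves:           {moore / sequencing:.0f}x")
```

Under these assumed numbers, the two-year halving leaves one eighth of the starting cost after six years, while the faster curve leaves 1/4096 of it, a gap of 512 times.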



Graphs include things like:

  • social networks,
  • telecommunication networks,
  • computer networks,
  • road networks,
  • and collaborations or relationships.

Some of these graphs can be absolutely enormous (Facebook's user graph, for example).
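
As a minimal sketch of what such graph data looks like in a program, the snippet below stores a tiny social network as an adjacency list and answers a typical "friends of friends" question. The names and edges are hypothetical, and a graph of the enormous kind mentioned above would need a distributed system rather than an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical friendship edges; each pair is an undirected relationship.
EDGES = [("ann", "bob"), ("bob", "cara"), ("cara", "dan"), ("ann", "cara")]


def build_graph(edges):
    """Build an undirected adjacency list: person -> set of direct friends."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def friends_of_friends(graph, person):
    """People exactly two hops away: friends of friends who are not already friends."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    return two_hops - direct - {person}


graph = build_graph(EDGES)
print(sorted(graph["cara"]))                    # direct friends of cara
print(sorted(friends_of_friends(graph, "ann")))  # suggestions for ann
```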

Log files

Log - Logging


  • Cloudera
  • MapR
