Data Science - Big Data
Big Data describes data defined in terms of the 3Vs:
- volume (a lot of data, at Internet scale),
- velocity (data arriving quickly),
- and variety (data in many different structures, or none at all).
Doug Laney (then at META Group, later Gartner) originally defined the 3Vs in a 2001 research note.
Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…
A big data architecture is well suited to processing large amounts of (often unstructured) data.
Designing a schema and loading all the data into it before you get the benefits of querying is a big (parallel) job for, say, 20 terabytes of text files.
This is where the Hadoop ecosystem (Pig, Hive, …) comes into play.
A typical data science application uses Hadoop and its extensions for the initial processing, parsing, and loading; the result is then loaded into a more conventional database for ad hoc querying.
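The Hadoop-style initial processing follows the MapReduce pattern, which can be sketched in plain Python with the classic word-count example (the function names here are illustrative, not Hadoop's actual API):

```python
from collections import defaultdict

# Map phase: emit (word, 1) pairs from each line of raw, unstructured text.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

# Shuffle phase: group all values by key, as Hadoop does between map and reduce.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts for each word.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big jobs", "data pipelines"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In a real cluster, the map and reduce phases run in parallel across many machines, which is what makes this pattern work at the 20-terabyte scale mentioned above.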
The Obama campaign used this architecture. They mentioned:
- Hadoop for the ETL work (extract, transform, load)
- and a vertical (column-oriented) database for the slicing and dicing.
This architecture is becoming a standard for designing this kind of data analysis application.
Related topics: Apache Cassandra, Machine Learning, Hadoop, NoSQL, Apache Hive, Map/Reduce and HDFS, Data Visualization, ZooKeeper, Distributed Search and Real Time Analytics, Avro, Visualizing Your Graph, Analytics Maturity Model, R
Much of the source data for Big Data comes from online recording:
- every click on a website,
- every ad viewed,
- every billing event,
- every fast-forward or pause while you're watching a video,
- every request that's made from a client to a server,
- every transaction,
- every network message,
- and every fault.
Anything that occurs can potentially be recorded.
A lot of it is recorded, but very little of it gets analyzed; hence the iceberg picture: a phenomenal amount of data is collected, but only a tiny fraction of it is ever analyzed.
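Each of the events listed above is typically recorded as a small structured log record. A minimal sketch (the field names are illustrative, not any particular product's schema) of logging a click event as one JSON line in an append-only log:

```python
import json
import time

# Record one clickstream event as a JSON line, a common append-only log format.
def record_event(event_type, user_id, **details):
    event = {
        "ts": time.time(),   # when the event occurred
        "type": event_type,  # click, ad_view, billing, pause, ...
        "user": user_id,
        **details,
    }
    return json.dumps(event)

line = record_event("click", "u42", page="/pricing")
print(line)
```

Records like this pile up far faster than anyone queries them, which is exactly the iceberg effect described above.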
Internet of things
- RFID tags (e.g. California's FasTrak electronic toll collection transponder, used to pay tolls on the highways but also to collect data for traffic reporting)
- post on Facebook
- picture on Instagram
- review on Yelp or TripAdvisor
- tweet on Twitter
- video on YouTube.
Health and scientific computing
- the Large Hadron Collider, which generates more data in a year than all the other data sources combined.
- genome sequencing data. The cost of sequencing is dropping exponentially, much faster than Moore's Law, and as a result we're collecting more sequencing data than ever before.
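To make "faster than Moore's Law" concrete: using the rough, widely cited NHGRI figures of about $100M per genome around 2001 and about $1,000 around 2014 (approximate numbers, used here only for illustration), the implied cost-halving time comes out well under Moore's Law's roughly two years:

```python
import math

# Approximate cost-per-genome figures (illustrative, rounded).
cost_2001 = 100_000_000  # dollars, ~2001
cost_2014 = 1_000        # dollars, ~2014
years = 2014 - 2001

halvings = math.log2(cost_2001 / cost_2014)  # number of times the cost halved
halving_time = years / halvings              # years per halving

print(f"{halvings:.1f} halvings in {years} years")
print(f"cost halves every {halving_time:.2f} years (Moore's Law: ~2 years)")
```

A halving time well under a year versus Moore's roughly two years is why sequencing output is outpacing the hardware that stores and processes it.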
Graphs include things like:
- social networks,
- telecommunication networks,
- computer networks,
- road networks,
- and collaborations or relationships.
Some of these graphs can be absolutely enormous (e.g. Facebook's user graph).
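A sketch of the usual in-memory representation for such graphs, an adjacency list mapping each node to its set of neighbors, using a tiny hypothetical social network (the names are made up for illustration):

```python
from collections import defaultdict

# Adjacency list: each node maps to the set of its neighbors.
graph = defaultdict(set)

def add_edge(a, b):
    # Undirected edge, as in a friendship graph.
    graph[a].add(b)
    graph[b].add(a)

add_edge("alice", "bob")
add_edge("alice", "carol")
add_edge("bob", "carol")
add_edge("carol", "dave")

# Degree = number of connections; one of the most basic graph analytics.
degree = {node: len(neighbors) for node, neighbors in graph.items()}
print(degree["carol"])  # 3
```

At Facebook scale the same structure is sharded across many machines, but the adjacency-list idea is unchanged.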