Table of Contents

(Apache) Hadoop

About

Apache Hadoop is basically 3 things:

Image comes from Hortonworks

Hadoop:

scales, runs on commodity hardware where you setup a Hadoop cluster (a group of machines connected together over a network)
is fault tolerant.
is good for “embarassingly parallel” problems, such as building an index for the web.

Steps:

setting up a Hadoop cluster and then store huge amounts of data spread across all the machines into the HDFS
processing the scattered data by writing MapReduce programs (or jobs) that goes locally on each machine in order to not relocate the data.

Single threaded writes.

Use cases

Data Warehouse/Analytics Offload
Data Lake/Enterprise Data Hub

Building

The BUILDING.txt in the code repository has all information to build it (also for windows)

Proposition

Oracle: Big Data Appliance

In Oct 2011, Oracle announced the Big Data Appliance, which integrates:

Cloudera's Distribution Including Hadoop (CDH),
Oracle Enterprise Linux,
the R programming language,
and a NoSQL database

with the Exadata hardware.

Azure

Azure - HDInsight (Microsoft's Hadoop)

Documentation / Reference

wiki/Apache Hadoop