MapReduce - Shuffling (Combine)

Mapreduce Pipeline

About

Distributed SQL query processing in Hadoop differs from a conventional relational query engine in how it handles intermediate result sets. Query processing often requires sorting and reassembling intermediate result sets; in Hadoop parlance this is called shuffling.

Most of the existing query optimizations in Hive are about minimizing shuffling cost.
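To make the idea concrete, the shuffling described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Hadoop's implementation: the function name `shuffle`, the `num_reducers` parameter, and the byte-sum partitioner are all assumptions made for the example. It routes each intermediate (key, value) pair to a reducer, then sorts and groups the pairs by key.

```python
from collections import defaultdict


def shuffle(map_outputs, num_reducers):
    """Route (key, value) pairs to reducers, then sort and group them by key.

    Hypothetical helper for illustration only; a real framework also
    spills to disk and moves the data across the network.
    """
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        # Assumed partitioner: a deterministic byte sum, so that the same
        # key always lands on the same reducer.
        index = sum(key.encode("utf-8")) % num_reducers
        partitions[index][key].append(value)
    # Each reducer sees its keys in sorted order, with the values grouped.
    return [sorted(partition.items()) for partition in partitions]


if __name__ == "__main__":
    pairs = [("b", 1), ("a", 1), ("b", 1), ("c", 1)]
    for reducer_id, groups in enumerate(shuffle(pairs, num_reducers=2)):
        print(reducer_id, groups)
```

Every pair emitted by a map task passes through this step, which is why the number of intermediate pairs largely determines the shuffling cost.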




