Data Partition - Horizontal


Horizontal partitions cuts the data by row.

Vertical partition cuts the data by column


Partition is what enable parallelism.

  • Do not under partition - Partitioning on columns with only a few values can cause few partitions and therefore few parallel process. For example, partitioning on gender only creates two partitions to be created (male and female), thus only reduce the latency by a maximum of half.
  • Do not over partition - On the other extreme, creating a partition on a column with a unique value (for example, userid) causes multiple partitions. Over partition causes much stress as it has to handle the large number of partitions (implemented as directory, buffer, block)
  • Avoid data skew - Choose your partitioning key wisely or use a hash functions so that all partitions are even size. Otherwise, the parallel thread with the most data will determine the total latency.


Powered by ComboStrap