Spark DataSet - Partition
About
This page is about partitions of a Spark Dataset. See also: Spark Engine - Partition in Spark.
Partitions:
- may be subdivided into buckets
- follow the same SQL rules as Hive Partitions
The number of partitions dictates the number of tasks that are launched.
If the number of partitions is one, the computation takes place on a single node.
Number of partitions
On read
The number of partitions is determined by the spark.sql.files.maxPartitionBytes parameter, which is set to 128 MB by default. This configuration determines the maximum number of bytes to pack into a single partition when reading files.
134217728 bytes = 128 MB
So if you read a file of 1 GB with a maxPartitionBytes of 128 MB, Spark creates roughly 8 partitions (1024 MB / 128 MB).
The value of 128 MB is also the default HDFS block size; files should be larger than it (aim for around 1 GB per file).
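A minimal sketch of changing the setting before a read (the application name and input path are only illustrative):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-partitions").getOrCreate()

// pack at most 64 MB (instead of the default 134217728 bytes = 128 MB) into one input partition
spark.conf.set("spark.sql.files.maxPartitionBytes", "67108864")

val df = spark.read.parquet("/path/to/input")
println(df.rdd.getNumPartitions)   // roughly fileSize / 64 MB partitions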
In Memory
The number of partitions can be changed afterwards with the following methods (see the comparison sketch after this list):
- repartition
  - creates equally distributed chunks thanks to hashing
  - performs a shuffle of the data to create the partitions
- coalesce
  - can only reduce the number of partitions: data is moved partition-wise into other existing partitions. Unlike repartition, coalesce does not perform a shuffle to create the partitions; the data from a partition is removed and appended to another partition.
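A minimal sketch comparing the two, assuming an existing Dataset named df:
// full shuffle into 8 roughly equal partitions
val byHash = df.repartition(8)
// no shuffle: the existing partitions are merged down to 2
val merged = df.coalesce(2)

println(byHash.rdd.getNumPartitions)  // 8
println(merged.rdd.getNumPartitions)  // 2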
Management
Write
partitionBy - Partitions the output by the given columns on the file system.
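For example (the column names and output path are only illustrative):
df.write
  .partitionBy("year", "month")
  .parquet("/path/to/output")   // creates sub-directories such as year=2023/month=01/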
Apply function
- foreachPartition(func) - Runs func on each partition of this Dataset.
- mapPartitions(func) - Returns a new Dataset that contains the result of applying func to each partition (Experimental).
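A sketch of both methods, assuming a SparkSession named spark and a Dataset[String] named ds:
// foreachPartition: an action, often used to open one resource (e.g. a connection) per partition
ds.foreachPartition { rows: Iterator[String] =>
  rows.foreach(println)
}

// mapPartitions: a transformation that returns a new Dataset
import spark.implicits._   // provides the Encoder needed by mapPartitions
val upper = ds.mapPartitions(rows => rows.map(_.toUpperCase))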
Sort
- sortWithinPartitions - Returns a new Dataset with each partition sorted by the given expressions. This is the same operation as “SORT BY” in SQL (Hive QL).
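A sketch of the classic DISTRIBUTE BY / SORT BY pattern (the column names are only illustrative):
import org.apache.spark.sql.functions.col

val sorted = df
  .repartition(col("deviceId"))        // DISTRIBUTE BY deviceId
  .sortWithinPartitions("eventTime")   // SORT BY eventTime within each partition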
Coalesce
Example:
// coalesce the data to a single partition so that the output is a single file
// (the output path below is only illustrative)
data.coalesce(1).write.csv("/path/to/output")
Repartition
repartition creates partitions based on the user's input by performing a full shuffle of the data (after the read); see the sketch below.
- repartition by column (the number of partitions is taken from spark.sql.shuffle.partitions)
- repartitionByRange - range partitioning
If the number of partitions is not specified, the number is taken from spark.sql.shuffle.partitions.
Returns a new Dataset partitioned by the given partitioning expressions.
Same as DISTRIBUTE BY in SQL.
All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.
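A sketch of both variants, assuming a DataFrame df with columns country and age:
import org.apache.spark.sql.functions.col

// hash partitioning by column; the count defaults to spark.sql.shuffle.partitions
val byCountry = df.repartition(col("country"))
// explicit partition count plus column
val byCountry16 = df.repartition(16, col("country"))
// range partitioning: contiguous, sorted ranges of the age column
val byAgeRange = df.repartitionByRange(8, col("age"))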
Support
When too many dynamic partitions are created, the write fails with an error such as:
Number of dynamic partitions created is 2100, which is more than 1000.
To solve this try to set hive.exec.max.dynamic.partitions to at least 2100.;
The configuration below must be set before starting the Spark application:
spark.hadoop.hive.exec.max.dynamic.partitions
A SET statement issued against a running Spark SQL server will not work; the configuration needs to be set when the server starts.
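A sketch of setting it when the SparkSession is built (the application name is illustrative; the value 2100 comes from the error above):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-partitions-app")
  .config("spark.hadoop.hive.exec.max.dynamic.partitions", "2100")
  .enableHiveSupport()
  .getOrCreate()

With spark-submit, the same property can be passed as --conf spark.hadoop.hive.exec.max.dynamic.partitions=2100.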