Spark SQL - Conf (Set)


In Spark SQL, configuration properties are set with the SET statement. The usage is the same as the Hive SET statement.




One by one

set hive.exec.dynamic.partition.mode=nonstrict;
set spark.executor.cores=1; -- the number of cores to use on each executor
set spark.dynamicAllocation.enabled=false;
set spark.executor.instances=1; -- The number of executors for static allocation. With spark.dynamicAllocation.enabled, the initial set of executors will be at least this large.
set spark.cores.max=1; -- The maximum number of CPU cores to request for the application from across the cluster (not from each machine). If not set, the default is spark.deploy.defaultCores.
-- The degree of parallelism after a shuffle is controlled with "SET spark.sql.shuffle.partitions=[num_tasks];".
set spark.sql.shuffle.partitions=1;
set spark.default.parallelism=1;
set spark.sql.files.maxPartitionBytes=1073741824; -- The maximum number of bytes to pack into a single partition when reading files.
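
When the same properties have to be applied from a program rather than typed one by one, the statements can be generated from a dictionary. The following is only a minimal sketch: render_set_statements and apply_settings are hypothetical helper names (not part of Spark), and session is assumed to be any object with a sql() method, such as a SparkSession.

```python
from typing import Any, List, Mapping


def render_set_statements(settings: Mapping[str, Any]) -> List[str]:
    """Render each key/value pair as a Spark SQL SET statement (hypothetical helper)."""
    statements = []
    for key, value in settings.items():
        # Match the lowercase true/false spelling used in the SET statements of this page.
        if isinstance(value, bool):
            value = str(value).lower()
        # No trailing semicolon: each statement is meant to be passed to sql() on its own.
        statements.append(f"SET {key}={value}")
    return statements


def apply_settings(session: Any, settings: Mapping[str, Any]) -> None:
    """Run one SET statement per property through session.sql().

    Assumption: `session` exposes sql(str), as a SparkSession does.
    """
    for statement in render_set_statements(settings):
        session.sql(statement)


# Example values taken from the statements on this page.
example_settings = {
    "spark.sql.shuffle.partitions": 1,
    "spark.dynamicAllocation.enabled": False,
    "spark.sql.files.maxPartitionBytes": 1073741824,
}

for line in render_set_statements(example_settings):
    print(line)
```

With a live session, apply_settings(sparkSession, example_settings) would issue the three statements in order.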


See the Spark configuration documentation.

SET -v
  • Scala
sparkSession.sql("SET -v").show(numRows = 200, truncate = false)
  • Java
sparkSession.sql("SET -v").show(200, false);
  • Python
sparkSession.sql("SET -v").show(n=200, truncate=False)
  • R
properties <- sql("SET -v")
showDF(properties, numRows = 200, truncate = FALSE)
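
The full SET -v listing is long, so it can help to keep only the properties under one prefix. The sketch below is an illustration, not an official API: settings_with_prefix is a hypothetical helper, session is assumed to behave like a SparkSession, and the result of SET -v is assumed to expose "key", "value" and "meaning" columns.

```python
from typing import Any, Dict, List


def settings_with_prefix(session: Any, prefix: str) -> List[Dict[str, str]]:
    """Return the SET -v rows whose key starts with `prefix`, sorted by key."""
    rows = session.sql("SET -v").collect()
    matches = []
    for row in rows:
        # Assumption: each row can be indexed by the column names
        # "key", "value" and "meaning".
        if row["key"].startswith(prefix):
            matches.append(
                {"key": row["key"], "value": row["value"], "meaning": row["meaning"]}
            )
    return sorted(matches, key=lambda item: item["key"])


# Usage with a live session (not executed here):
# for item in settings_with_prefix(sparkSession, "spark.sql.hive."):
#     print(item["key"], item["value"])
```

Filtering on the driver after collect() is acceptable here because the listing holds a few hundred rows at most.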


Conf key | Value | Description
spark.sql.hive.caseSensitiveInferenceMode | INFER_AND_SAVE | Sets the action to take when a case-sensitive schema cannot be read from a Hive table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file formats such as Parquet are. Spark SQL must use a case-preserving schema when querying any table backed by files containing case-sensitive field names or queries may not return accurate results. Valid options include INFER_AND_SAVE (the default mode: infer the case-sensitive schema from the underlying data files and write it back to the table properties), INFER_ONLY (infer the schema but don't attempt to write it to the table properties) and NEVER_INFER (fall back to using the case-insensitive metastore schema instead of inferring).
spark.sql.hive.convertMetastoreParquet | true | When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in support.
spark.sql.hive.convertMetastoreParquet.mergeSchema | false | When true, also tries to merge possibly different but compatible Parquet schemas in different Parquet data files. This configuration is only effective when "spark.sql.hive.convertMetastoreParquet" is true.
spark.sql.hive.filesourcePartitionFileCacheSize | 262144000 | When nonzero, enable caching of partition file metadata in memory. All tables share a cache that can use up to the specified number of bytes for file metadata. This conf only has an effect when Hive filesource partition management is enabled.
spark.sql.hive.manageFilesourcePartitions | true | When true, enable metastore partition management for file source tables as well. This includes both datasource and converted Hive tables. When partition management is enabled, datasource tables store partitions in the Hive metastore, and use the metastore to prune partitions during query planning.
spark.sql.hive.metastore.barrierPrefixes | | A comma separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a prefix that typically would be shared (i.e. org.apache.spark.*).
spark.sql.hive.metastore.sharedPrefixes | com.mysql.jdbc, | A comma separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive. An example of classes that should be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need to be shared are those that interact with classes that are already shared. For example, custom appenders that are used by log4j.
spark.sql.hive.metastore.version | 1.2.1 | Version of the Hive metastore. Available options are 0.12.0 through 1.2.1.
spark.sql.hive.metastorePartitionPruning | true | When true, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. This only affects Hive tables not converted to filesource relations (see HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC for more information).
spark.sql.hive.thriftServer.async | true | When set to true, Hive Thrift server executes SQL queries in an asynchronous way.
spark.sql.hive.thriftServer.singleSession | false | When set to true, Hive Thrift server is running in a single session mode. All the JDBC/ODBC connections share the temporary views, function registries, SQL configuration and the current database.
spark.sql.hive.verifyPartitionPath | false | When true, check all the partition paths under the table's root directory when reading data stored in HDFS.
spark.sql.hive.version | 1.2.1 | Version of Hive used internally by Spark SQL.
