About
The Spark metastore is generally based on the Hive metastore.
Management
Remote connection
// Import the Spark SQL entry point
import org.apache.spark.sql.SparkSession;

// If using Kerberos, set the Hive security properties before creating the session
// (hivePrincipal holds the Kerberos principal of the Hive metastore service)
System.setProperty("hive.metastore.sasl.enabled", "true");
System.setProperty("hive.security.authorization.enabled", "false");
System.setProperty("hive.metastore.kerberos.principal", hivePrincipal);
System.setProperty("hive.metastore.execute.setugi", "true");

// Configuration pointing to a remote Hive metastore
SparkSession spark = SparkSession
    .builder()
    .appName("RemoteConnection Example")
    .config("hive.metastore.uris", "thrift://METASTORE:9083")
    .enableHiveSupport()
    .getOrCreate();
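Once the session is created with Hive support enabled, Spark SQL resolves table definitions through the remote metastore. A minimal sanity check could look like this (the database and table names are placeholders, not part of the example above):

// List the databases registered in the remote metastore
spark.sql("SHOW DATABASES").show();

// Query a table whose definition lives in the metastore (names are placeholders)
spark.sql("SELECT * FROM my_database.my_table LIMIT 10").show();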
Conf
Conf key | Default value | Description |
---|---|---|
spark.sql.hive.caseSensitiveInferenceMode | INFER_AND_SAVE | Sets the action to take when a case-sensitive schema cannot be read from a Hive table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file formats such as Parquet are. Spark SQL must use a case-preserving schema when querying any table backed by files containing case-sensitive field names, or queries may not return accurate results. Valid options include INFER_AND_SAVE (the default mode: infer the case-sensitive schema from the underlying data files and write it back to the table properties), INFER_ONLY (infer the schema but don't attempt to write it to the table properties) and NEVER_INFER (fall back to using the case-insensitive metastore schema instead of inferring). |
spark.sql.hive.convertMetastoreParquet | true | When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in support. |
spark.sql.hive.convertMetastoreParquet.mergeSchema | false | When true, also tries to merge possibly different but compatible Parquet schemas in different Parquet data files. This configuration is only effective when "spark.sql.hive.convertMetastoreParquet" is true. |
spark.sql.hive.filesourcePartitionFileCacheSize | 262144000 | When nonzero, enable caching of partition file metadata in memory. All tables share a cache that can use up to specified num bytes for file metadata. This conf only has an effect when hive filesource partition management is enabled. |
spark.sql.hive.manageFilesourcePartitions | true | When true, enable metastore partition management for file source tables as well. This includes both datasource and converted Hive tables. When partition management is enabled, datasource tables store partition metadata in the Hive metastore and use the metastore to prune partitions during query planning. |
spark.sql.hive.metastore.barrierPrefixes | (empty) | A comma separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a prefix that typically would be shared (i.e. org.apache.spark.*). |
spark.sql.hive.metastore.sharedPrefixes | com.mysql.jdbc, org.postgresql, com.microsoft.sqlserver, oracle.jdbc | A comma separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive. An example of classes that should be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need to be shared are those that interact with classes that are already shared. For example, custom appenders that are used by log4j. |
spark.sql.hive.metastore.version | 1.2.1 | Version of the Hive metastore. Available options are 0.12.0 through 1.2.1. |
spark.sql.hive.metastorePartitionPruning | true | When true, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. This only affects Hive tables not converted to filesource relations (see HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC for more information). |
spark.sql.hive.thriftServer.async | true | When set to true, Hive Thrift server executes SQL queries in an asynchronous way. |
spark.sql.hive.thriftServer.singleSession | false | When set to true, Hive Thrift server is running in a single session mode. All the JDBC/ODBC connections share the temporary views, function registries, SQL configuration and the current database. |
spark.sql.hive.verifyPartitionPath | false | When true, check all the partition paths under the table's root directory when reading data stored in HDFS. |
spark.sql.hive.version | 1.2.1 | Version of Hive used internally by Spark SQL. |
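These keys are regular Spark SQL configuration properties, so they can be set on the SparkSession builder before the session is created or passed with --conf to spark-submit. A short sketch (the chosen values are illustrative, not recommendations):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .appName("Metastore Conf Example")
    // Push partition predicates down to the metastore during planning
    .config("spark.sql.hive.metastorePartitionPruning", "true")
    // Keep the built-in Parquet support instead of the Hive SerDe
    .config("spark.sql.hive.convertMetastoreParquet", "true")
    // Do not infer a case-sensitive schema from the data files
    .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
    .enableHiveSupport()
    .getOrCreate();

The same keys can also be passed on the command line, e.g. --conf spark.sql.hive.metastorePartitionPruning=true.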