Spark - Metastore


About

The Spark Metastore is largely based on the Hive - Metastore.

Management

Remote connection

// If using Kerberos, enable SASL and set the principal of the Hive metastore service
// (hivePrincipal is a String holding that Kerberos principal)
System.setProperty("hive.metastore.sasl.enabled", "true");
System.setProperty("hive.security.authorization.enabled", "false");
System.setProperty("hive.metastore.kerberos.principal", hivePrincipal);
System.setProperty("hive.metastore.execute.setugi", "true");

// Configure a SparkSession that connects to a remote Hive metastore
SparkSession spark = SparkSession
      .builder()
      .appName("RemoteConnection Example")
      .config("hive.metastore.uris", "thrift://METASTORE:9083")
      .enableHiveSupport()
      .getOrCreate();
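
Once the session is created with enableHiveSupport, the remote metastore can be queried through plain SQL or the catalog API. A minimal verification sketch (the database name "default" is only an example):

// List the databases known to the remote metastore
spark.sql("SHOW DATABASES").show();

// List the tables of one database through the catalog API
spark.catalog().listTables("default").show();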

Conf

Spark - Configuration

Conf key | Default value | Description
spark.sql.hive.caseSensitiveInferenceMode | INFER_AND_SAVE | Sets the action to take when a case-sensitive schema cannot be read from a Hive table's properties. Although Spark SQL itself is not case-sensitive, Hive-compatible file formats such as Parquet are. Spark SQL must use a case-preserving schema when querying any table backed by files containing case-sensitive field names, or queries may not return accurate results. Valid options are INFER_AND_SAVE (the default mode: infer the case-sensitive schema from the underlying data files and write it back to the table properties), INFER_ONLY (infer the schema but don't attempt to write it to the table properties) and NEVER_INFER (fall back to using the case-insensitive metastore schema instead of inferring).
spark.sql.hive.convertMetastoreParquet | true | When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in support.
spark.sql.hive.convertMetastoreParquet.mergeSchema | false | When true, also tries to merge possibly different but compatible Parquet schemas in different Parquet data files. This configuration is only effective when spark.sql.hive.convertMetastoreParquet is true.
spark.sql.hive.filesourcePartitionFileCacheSize | 262144000 | When nonzero, enables in-memory caching of partition file metadata. All tables share a cache that can use up to the specified number of bytes for file metadata. This conf only has an effect when Hive filesource partition management is enabled.
spark.sql.hive.manageFilesourcePartitions | true | When true, enables metastore partition management for file source tables as well. This includes both datasource and converted Hive tables. When partition management is enabled, datasource tables store partitions in the Hive metastore and use the metastore to prune partitions during query planning.
spark.sql.hive.metastore.barrierPrefixes | (empty) | A comma-separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a prefix that typically would be shared (i.e. org.apache.spark.*).
spark.sql.hive.metastore.sharedPrefixes | com.mysql.jdbc, org.postgresql, com.microsoft.sqlserver, oracle.jdbc | A comma-separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive. An example of classes that should be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need to be shared are those that interact with classes that are already shared, for example custom appenders used by log4j.
spark.sql.hive.metastore.version | 1.2.1 | Version of the Hive metastore. Available options are 0.12.0 through 1.2.1.
spark.sql.hive.metastorePartitionPruning | true | When true, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. This only affects Hive tables not converted to filesource relations (see HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC for more information).
spark.sql.hive.thriftServer.async | true | When set to true, the Hive Thrift server executes SQL queries asynchronously.
spark.sql.hive.thriftServer.singleSession | false | When set to true, the Hive Thrift server runs in single-session mode: all JDBC/ODBC connections share the temporary views, function registries, SQL configuration and the current database.
spark.sql.hive.verifyPartitionPath | false | When true, check all the partition paths under the table's root directory when reading data stored in HDFS.
spark.sql.hive.version | 1.2.1 | Version of Hive used internally by Spark SQL.
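
These properties can be set like any other Spark - Configuration entry, for instance on the SparkSession builder before the session is created. A minimal sketch with illustrative values only (the chosen values are assumptions, not a recommended set):

// Illustrative values: pick the metastore version that matches your Hive deployment
SparkSession spark = SparkSession
      .builder()
      .appName("Metastore Conf Example")
      .config("spark.sql.hive.metastore.version", "1.2.1")
      .config("spark.sql.hive.metastorePartitionPruning", "true")
      .config("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
      .enableHiveSupport()
      .getOrCreate();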
