Big Data

Hive Metastore

Apache Hive

Dec 13, 2018

Overview

Hive Metastore is a repository that holds metadata (column names, data types, comments, etc.) about the objects we create in Hive. When we create a Hive table, for example, its definition (column names, data types, comments, location, etc.) is stored in the Hive Metastore. The metastore itself is implemented as a set of tables in a relational database.

By default, Hive stores its metadata in an embedded Apache Derby database. We can switch to another supported relational database (for example MySQL, PostgreSQL, or Oracle) by specifying in hive-site.xml where Hive should store its metadata. Hive Metastore acts as a central schema repository that other tools, such as Spark and Pig, can also use.
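As a rough illustration of the hive-site.xml settings involved, the Python sketch below renders the JDBC properties that point the metastore at a MySQL database. The javax.jdo.option.* property names are Hive's own; the host, database name, user, password, and the build_hive_site helper are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET


def build_hive_site(properties):
    """Render a dict of property names and values as hive-site.xml text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical connection details; replace with values for your own database.
metastore_db = {
    "javax.jdo.option.ConnectionURL": "jdbc:mysql://db.example.com:3306/hive_metastore",
    # Driver class for older MySQL Connector/J releases; newer ones use com.mysql.cj.jdbc.Driver.
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "change-me",
}

hive_site_xml = build_hive_site(metastore_db)
print(hive_site_xml)
```

The printed XML is only a fragment; a real hive-site.xml usually carries many other properties, and the matching JDBC driver jar has to be on Hive's classpath.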

Hive Metastore service

The Hive Metastore service stores the metadata and gives clients (including Hive itself) access to it through the metastore service API.

Metastore deployment modes

There are three modes for Hive Metastore deployment:

Embedded mode
Local mode
Remote mode

Embedded Mode (Not recommended for production use)

Embedded mode is the default metastore deployment mode in the Cloudera distribution. In this mode, the metastore uses a Derby database, and both the database and the metastore service are embedded in the main HiveServer process, i.e. they run in the same JVM as the Hive service and start when the HiveServer process starts. This mode supports only one active user at a time, so only one Hive session can be open at any moment. Note that this mode is not certified for production use.

Local Mode

In Local mode, the Hive Metastore service runs in the same process as the main HiveServer process, but the metastore database runs in a separate process and can be on a separate host. The embedded metastore service communicates with the metastore database over JDBC. This mode allows us to have multiple Hive sessions, i.e. multiple users can use the metastore at the same time.

Remote Mode (Recommended & Required by HCatalog)

In Remote mode, the Hive Metastore service runs in its own JVM process. HiveServer2, HCatalog, Impala, and other processes communicate with it over the Thrift network API (configured with the hive.metastore.uris property), and the metastore service communicates with the metastore database over JDBC (configured with the javax.jdo.option.ConnectionURL property). The database, the HiveServer process, and the metastore service can all run on the same host, but running the HiveServer process on a separate host provides better availability and scalability. Remote mode also improves manageability and security, because the database tier can be completely firewalled. HCatalog requires this mode.
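To make the difference between the three modes concrete, here is a small Python sketch that guesses the deployment mode from two hive-site.xml properties. It is a simplified heuristic rather than Hive's own logic: the metastore_mode function and the example host names are assumptions for illustration, while the property names, the default embedded Derby URL, and the default Thrift port 9083 come from Hive.

```python
# Hive's default JDBC URL, which points at an embedded Derby database.
DEFAULT_DERBY_URL = "jdbc:derby:;databaseName=metastore_db;create=true"


def metastore_mode(props):
    """Guess the metastore deployment mode from hive-site.xml style properties."""
    uris = props.get("hive.metastore.uris", "").strip()
    if uris:
        # A Thrift URI means clients talk to a metastore service in its own process.
        return "remote"
    jdbc_url = props.get("javax.jdo.option.ConnectionURL", DEFAULT_DERBY_URL)
    if jdbc_url.startswith("jdbc:derby:") and not jdbc_url.startswith("jdbc:derby://"):
        # A Derby URL without a network host: the database lives in the same JVM.
        return "embedded"
    # No Thrift URI, but the database is a separate process reached over JDBC.
    return "local"


# Hypothetical host names, used only to exercise the three branches.
print(metastore_mode({}))  # embedded
print(metastore_mode({"javax.jdo.option.ConnectionURL": "jdbc:mysql://db.example.com:3306/hive_metastore"}))  # local
print(metastore_mode({"hive.metastore.uris": "thrift://metastore.example.com:9083"}))  # remote
```

The check on hive.metastore.uris comes first because a client that has a Thrift URI never opens the JDBC connection itself; only the metastore service does.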

Accessing Hive Metastore

Through HiveServer2, we can reach the schemas held in the Hive Metastore over ODBC and JDBC connections, which opens them up to visualization tools such as Power BI or Tableau.
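For JDBC clients, the connection string typically has the form jdbc:hive2://host:port/database. The sketch below only assembles such a string; the hiveserver2_jdbc_url helper and the host name are assumptions for illustration, and 10000 is HiveServer2's default port.

```python
def hiveserver2_jdbc_url(host, port=10000, database="default"):
    """Build a basic HiveServer2 JDBC URL without authentication or session options."""
    return f"jdbc:hive2://{host}:{port}/{database}"


# Hypothetical host name; a BI tool or JDBC client would be given this URL.
url = hiveserver2_jdbc_url("hiveserver.example.com")
print(url)  # jdbc:hive2://hiveserver.example.com:10000/default
```

Secured clusters usually need extra parameters appended to the URL (for example for Kerberos or SSL), which this sketch leaves out.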