HCatalog in Hadoop

Hadoop Tutorial

Dec 13, 2018

Overview

HCatalog is a metadata and table storage management tool for Hadoop. It exposes the tabular data of the Hive metastore to other Hadoop tools such as Pig, MapReduce and Sqoop.

It lets users of different Hadoop tools share data with one another.

It also means users need not worry about where or in what format their data is stored.

HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written.

Primary Problems Solved by HCatalog

Problems

In a Hadoop cluster, people use all kinds of different tools, and those tools don’t tend to agree on:

What the schema is
What the data types are
Where and how the data is stored
Where in the file system the data lives
What format the data is in

HCatalog Solutions

Provides one consistent data model for various Hadoop tools
Provides a shared schema for all these tools
Allows users of the tools to see when each other’s shared data is available
Presents a data abstraction, so users do not deal with storage location or format directly
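
To make the idea of a shared schema concrete, here is a toy sketch in Python. It is not the HCatalog API: the `ToyCatalog` and `TableInfo` names, the `web_logs` table and its path are all made up for illustration. It only shows the pattern of one tool registering a table once and another tool resolving schema, location and format by table name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    """What a shared catalog records about one table (illustrative fields)."""
    columns: tuple      # (name, type) pairs: the shared schema
    location: str       # where in the file system the data lives
    file_format: str    # what format the data is in


class ToyCatalog:
    """In-memory stand-in for a shared metastore; NOT the HCatalog API."""

    def __init__(self):
        self._tables = {}

    def register(self, name, info):
        self._tables[name] = info

    def lookup(self, name):
        return self._tables[name]


catalog = ToyCatalog()

# One tool registers the table once...
catalog.register(
    "web_logs",  # hypothetical table name
    TableInfo(
        columns=(("ip", "string"), ("hits", "int")),
        location="/warehouse/web_logs",  # hypothetical path
        file_format="RCFile",
    ),
)

# ...and any other tool resolves it by name alone,
# without knowing the path or the format in advance.
info = catalog.lookup("web_logs")
column_names = [name for name, _ in info.columns]
print(column_names, info.file_format)
```

The point of the sketch is that the consumer only knows the table name; everything else comes from the catalog, which is the role the Hive metastore plays behind HCatalog.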

Key Benefits

If all Hadoop tools share one metastore, the users of each tool have immediate access to data created with another tool, with no loading or transfer steps required.

Required Improvements

HCatalog’s data model needs to expand to support both structured and semi-structured data. It is currently based on Hive’s metastore, which is stored in a relational database.

HCatalog Architecture

HCatalog is built on top of the Hive metastore and incorporates Hive’s DDL.

WebHCat

WebHCat is the REST API for HCatalog.
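
As a rough sketch of what a WebHCat call looks like, the Python snippet below builds the URL for describing a table. It assumes the default WebHCat port (50111) and the `/templeton/v1` base path; the host, database, table and user names are placeholders, and the helper `webhcat_table_url` is a made-up name for this example. It only builds the URL and does not contact a server.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, database, table, user, port=50111):
    """Build the WebHCat URL that describes one table (used with a GET request).

    Assumes the default WebHCat port and the /templeton/v1 base path.
    """
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(database, safe=""), quote(table, safe="")
    )
    # WebHCat expects the calling user in the user.name query parameter.
    query = urlencode({"user.name": user})
    return "http://{}:{}{}?{}".format(host, port, path, query)


# Placeholder values; replace with your own cluster details.
url = webhcat_table_url("localhost", "default", "web_logs", "hadoop")
print(url)
```

Sending a GET request to a URL of this shape on a cluster running WebHCat should return the table description as JSON.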