HCatalog in Hadoop
Hadoop Tutorial
Overview
HCatalog is a metadata and table storage management layer for Hadoop that exposes the tabular data of the Hive metastore to Hadoop tools such as Hive, Pig, MapReduce and Sqoop.
Because these tools read the same table definitions, their users can share data across tools.
Users also need not worry about where or in what format their data is stored.
HCatalog supports reading and writing files in any format for which a SerDe (serializer-deserializer) can be written.
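To make the SerDe idea concrete, here is a minimal sketch of the concept in Python. It is not Hive's SerDe interface (which is a Java API); the column list, the tab delimiter, and the function names are all assumptions made for illustration. The point is only that a deserializer turns stored bytes into a row and a serializer does the reverse.

```python
from typing import Dict, List

# Hypothetical column list for an illustrative table; not read from any metastore.
COLUMNS: List[str] = ["id", "name", "city"]


def deserialize(line: str, delimiter: str = "\t") -> Dict[str, str]:
    """Turn one stored text line into a row keyed by column name."""
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))


def serialize(row: Dict[str, str], delimiter: str = "\t") -> str:
    """Turn a row back into the stored text representation."""
    return delimiter.join(row[column] for column in COLUMNS)


if __name__ == "__main__":
    row = deserialize("1\tAsha\tPune")
    print(row)
    print(serialize(row))
```

A different file format would need a different pair of functions, but code that works with rows would not have to change.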
Primary Problems Solved by HCatalog
Problems
In a Hadoop cluster, people use many different tools, and those tools tend not to agree on:
- What the schema is
- What the data types are
- Where and how the data is stored
- Where in the file system the data lives
- What format the data is in
HCatalog Solutions
- Provides one consistent data model for various Hadoop tools
- Provides a shared schema for all these tools
- Allows users of the tools to see when each other's shared data is available
- Presents a data abstraction: users work with tables rather than with file locations and formats
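The shared-schema idea in the list above can be sketched with a toy in-memory catalog. This is not HCatalog's API; `TableInfo`, `ToyCatalog`, the `web_logs` table, and its path are hypothetical names chosen for illustration. The sketch shows two tools asking for a table by name and receiving the same schema, without either one knowing the location or file format in advance.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TableInfo:
    """Everything a tool needs to know about a table (hypothetical structure)."""
    columns: Tuple[Tuple[str, str], ...]  # (column name, type) pairs
    location: str                         # where the files live
    file_format: str                      # how the files are encoded


class ToyCatalog:
    """In-memory stand-in for a shared metastore; illustration only."""

    def __init__(self) -> None:
        self._tables: Dict[str, TableInfo] = {}

    def register(self, name: str, info: TableInfo) -> None:
        self._tables[name] = info

    def lookup(self, name: str) -> TableInfo:
        return self._tables[name]


def column_names(catalog: ToyCatalog, table: str) -> Tuple[str, ...]:
    """What any tool would do: ask the catalog using only the table name."""
    return tuple(name for name, _type in catalog.lookup(table).columns)


catalog = ToyCatalog()
catalog.register(
    "web_logs",
    TableInfo(
        columns=(("ip", "string"), ("url", "string")),
        location="/data/web_logs",  # assumed path
        file_format="rcfile",
    ),
)

# Two different "tools" resolve the same table through the same catalog.
pig_view = column_names(catalog, "web_logs")
mapreduce_view = column_names(catalog, "web_logs")
```

Because both lookups go through one catalog, changing the table's location or format means updating one entry rather than every tool.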
Key Benefits
If all Hadoop tools share one metastore, the users of each tool have immediate access to data created with another tool; no loading or transfer steps are required.
Required Improvements
The data model needs to be expanded to support both structured and semi-structured data. It is currently based on Hive's metastore, which is stored in a relational database.
HCatalog Architecture
HCatalog is built on top of the Hive metastore and incorporates Hive’s DDL.
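Since HCatalog reuses Hive's DDL, table definitions can be submitted through the `hcat` command-line tool with its `-e` option. The sketch below only builds and prints such a command; it does not run it. The `web_logs` table and its columns are hypothetical, and the helper name `build_hcat_command` is an assumption for illustration.

```python
import shlex
from typing import List


def build_hcat_command(ddl: str) -> List[str]:
    """Return the argument list for running one DDL statement via the hcat CLI."""
    return ["hcat", "-e", ddl]


# Hypothetical table definition written in Hive DDL.
DDL = (
    "CREATE TABLE web_logs (ip STRING, url STRING) "
    "PARTITIONED BY (dt STRING) STORED AS RCFILE;"
)

if __name__ == "__main__":
    # Print a shell-safe version of the command instead of executing it,
    # since running it requires a configured HCatalog installation.
    print(shlex.join(build_hcat_command(DDL)))
```

On a machine where `hcat` is installed and configured, the same argument list could be passed to `subprocess.run`.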
WebHCat
WebHCat is the REST API for HCatalog.
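As a rough illustration, WebHCat serves its resources under a `/templeton/v1` base path, by default on port 50111, and identifies the caller through a `user.name` query parameter. The sketch below builds the URL for listing the tables in a database. The host and user names are placeholders, and the helper `list_tables_url` is a hypothetical name; check the port and security settings against your own cluster.

```python
from urllib.parse import quote, urlencode


def list_tables_url(host: str, user: str, database: str = "default",
                    port: int = 50111) -> str:
    """Build the WebHCat URL that lists the tables in one database."""
    base = f"http://{host}:{port}/templeton/v1"
    path = f"/ddl/database/{quote(database, safe='')}/table"
    query = urlencode({"user.name": user})
    return f"{base}{path}?{query}"


if __name__ == "__main__":
    # Placeholder host and user; a GET request to this URL returns JSON
    # when a WebHCat server is actually running there.
    print(list_tables_url("localhost", "hadoop"))
```

The URL can then be fetched with any HTTP client, for example `urllib.request.urlopen`.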