
Hive Catalog

Overview

The Hive Catalog is a widely used metadata catalog for Spark, Hadoop, and big data environments. At its core, it stores table schemas, locations, and other metadata in a central database called the Hive Metastore. This makes it possible for Spark and other compute engines to consistently find and access tables across multiple jobs and sessions.

In simpler terms, Hive Catalog is like a registry or "table of contents" for your data lake. It keeps track of which tables exist, their schemas, partitions, and where their data physically resides (for example, on HDFS, S3, or MinIO).
Ilum deeply integrates Hive Catalog, making it the default catalog for all SQL queries, jobs, and groups unless another is specified.

Unlike Git-like catalogs (e.g., Nessie), Hive only tracks the latest state of each table; it does not support branching, commit history, or time travel across the entire catalog. However, it is reliable, mature, and universally compatible with a huge ecosystem.

Hive Catalog Interface

Hive vs. Other Data Catalogs

Here’s how Hive Catalog compares with modern alternatives like Nessie or AWS Glue:

  • No Version Control: Hive keeps only the most recent version of each table. It does not support branching, tagging, or commit history at the catalog level. To track historical states, you must rely on table-format-specific features (like Iceberg’s or Delta’s time travel), not Hive itself.

  • Centralized Metadata: Table schemas, locations, and partitioning are stored in the Hive Metastore database. This ensures consistent metadata across all Spark jobs and engines using the catalog.

  • Universal Compatibility: Hive Metastore is supported by nearly all big data engines (Spark, Hive, Trino, Flink, etc.), making it a safe default for mixed-technology environments.

  • No Multi-table Transactions: Catalog-level atomic transactions (covering multiple tables at once) are not supported. Each DDL/DML operation is handled separately.

  • No Branch Isolation: To isolate dev/staging/prod environments, you must maintain multiple catalogs or databases, or physically copy data. There is no "branching" mechanism built in.
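Without built-in branching, a common workaround is to mirror environments as separate databases and "promote" data by copying it. A minimal SQL sketch (all names here are illustrative):

```sql
-- No catalog-level branching: isolate environments as separate databases
CREATE DATABASE IF NOT EXISTS sales_dev;
CREATE DATABASE IF NOT EXISTS sales_prod;

-- Develop against the dev namespace first
CREATE TABLE IF NOT EXISTS sales_dev.orders (id INT, amount DOUBLE);

-- "Promote" by copying, since Hive has no merge or branch operation
CREATE TABLE IF NOT EXISTS sales_prod.orders (id INT, amount DOUBLE);
INSERT INTO sales_prod.orders SELECT * FROM sales_dev.orders;
```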

Core Concepts in Hive Catalog

Hive Metastore

The Hive Metastore is a service and a backing database (often PostgreSQL or MySQL) where all metadata about tables, views, and partitions is stored.
Whenever Spark or another engine queries a table, it looks up the details in the Hive Metastore.

Tables, Databases, and Storage

  • Tables define the schema and storage location of your datasets.
  • Databases in Hive are namespaces for grouping related tables.
  • Warehouse Location is the root folder (on HDFS, S3, or other storage) where table data files reside.
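These concepts can be inspected directly with SQL. Assuming a database `mydb` and table `mydb.sales` already exist:

```sql
-- Prints the database's location under the warehouse root
DESCRIBE DATABASE EXTENDED mydb;

-- Prints the table's schema plus its storage location and format,
-- as recorded in the Hive Metastore
DESCRIBE FORMATTED mydb.sales;
```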

Using Hive Catalog in Ilum

Ilum automatically configures the Hive Catalog as the default for Spark jobs, SQL Viewer queries, and pipeline groups.
You can run standard SQL commands such as:

CREATE DATABASE IF NOT EXISTS mydb;
CREATE TABLE IF NOT EXISTS mydb.sales (date STRING, amount INT);
INSERT INTO mydb.sales VALUES ('2025-06-01', 1000);
SELECT * FROM mydb.sales;
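Partitions are also tracked as metastore entries. A sketch building on the same database (table and column names are illustrative):

```sql
-- Each partition value becomes its own metastore entry and storage subfolder
CREATE TABLE IF NOT EXISTS mydb.sales_by_day (amount INT)
PARTITIONED BY (sale_date STRING);

INSERT INTO mydb.sales_by_day PARTITION (sale_date = '2025-06-01') VALUES (1000);

-- Answered from metastore metadata alone, without scanning data files
SHOW PARTITIONS mydb.sales_by_day;
```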

Spark Configuration for Hive

If you run Spark manually, set these parameters to enable Hive support:

spark.sql.catalogImplementation=hive
spark.hadoop.hive.metastore.uris=thrift://ilum-hive-metastore:9083

However, Ilum handles this for you in all standard workflows. No manual configuration is required.
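For reference, the same settings can also be passed on the command line when submitting a job manually. The metastore host below assumes the in-cluster service name shown above, and `your_job.py` is a placeholder:

```shell
spark-submit \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://ilum-hive-metastore:9083 \
  your_job.py
```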

Setting up the Hive Metastore in Ilum

Normally, using Hive Catalog requires:

  • Installing the Hive Metastore service.
  • Configuring a backing database (like PostgreSQL or MySQL) for metadata.
  • Connecting the service to your object storage (HDFS, S3, MinIO, GCS, WASBS).
  • Setting up security, network, and storage options.

Ilum automates all of these steps!
When you deploy Ilum via Helm, it provisions the Hive Metastore, database, and object storage integration for you.

Enabling Hive Metastore

To enable Hive Metastore in Ilum, add these flags to your Helm upgrade/install:

helm upgrade \
--set ilum-hive-metastore.enabled=true \
--set ilum-core.metastore.enabled=true \
--set ilum-core.metastore.type=hive \
--reuse-values ilum ilum/ilum

After running the helm upgrade command, navigate to the Edit Cluster tab for the cluster where you want to use the catalog and select it in the General metastore dropdown:

Catalog Selection
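To verify that the metastore came up after the upgrade, you can check the pod and its Thrift port. The pod name pattern, namespace, and service name below are assumptions that may differ in your deployment:

```shell
# Find the metastore pod (name pattern is an assumption)
kubectl get pods | grep hive-metastore

# Check the Thrift port from inside the cluster
kubectl run hms-check --rm -i --restart=Never --image=busybox -- \
  nc -z ilum-hive-metastore 9083 && echo "metastore reachable"
```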

Using Custom PostgreSQL Credentials

If you want to use a custom PostgreSQL database for the Hive Metastore:

helm upgrade \
--set postgresql.auth.username=customuser \
--set postgresql.auth.password="CHOOSE PASSWORD" \
--reuse-values ilum ilum/ilum

Configure Hive Metastore to use those credentials:

helm upgrade \
--set ilum-hive-metastore.postgresql.auth.password="CHOOSE PASSWORD" \
--set ilum-hive-metastore.postgresql.auth.username=customuser \
--reuse-values ilum ilum/ilum

Setting up Hive Metastore: Storage

Storage, also referred to as a Warehouse, is the location where the actual data is stored. Hive supports various storage backends, including:

  • HDFS (Hadoop Distributed File System)
  • Amazon S3 Buckets and MinIO
  • Google Cloud Storage (GCS)
  • Windows Azure Storage Blob (WASBS)

Typically, you would need to set up one of these storage options and configure Hive's metastore connection accordingly within an XML file.

However, with Ilum, the S3 MinIO storage is pre-configured for you, and the Hive Metastore is already set up to use it by default.

Configuring Other Storage Backends

If you prefer to use an alternative storage backend, you can configure Hive to work with it by reconfiguring your helm values:

For S3 storage or MinIO:

helm upgrade \
--set ilum-hive-metastore.storage.type="s3" \
--set ilum-hive-metastore.storage.metastore.warehouse="s3a://yourbucket/yourfolder" \
--set ilum-hive-metastore.storage.s3.accessKey="your_access_key" \
--set ilum-hive-metastore.storage.s3.secretKey="your_secret_key" \
--set ilum-hive-metastore.storage.s3.host="yourhost" \
--set ilum-hive-metastore.storage.s3.port=yourport \
--reuse-values ilum ilum/ilum

For GCS:

helm upgrade \
--set ilum-hive-metastore.storage.type="gcs" \
--set ilum-hive-metastore.storage.metastore.warehouse="gs://my-gcs-bucket/path/to/folder/" \
--set ilum-hive-metastore.storage.gcs.clientEmail="your@email" \
--set ilum-hive-metastore.storage.gcs.privateKey="yourprivatekey" \
--set ilum-hive-metastore.storage.gcs.privateKeyId="privatekeyid" \
--reuse-values ilum ilum/ilum

For WASBS:

helm upgrade \
--set ilum-hive-metastore.storage.type="wasbs" \
--set ilum-hive-metastore.storage.metastore.warehouse="wasbs://yourcontainer@youraccountname.blob.core.windows.net/path/to/folder/" \
--set ilum-hive-metastore.storage.wasbs.accountName="youraccountname" \
--set ilum-hive-metastore.storage.wasbs.accessKey="youraccesskey" \
--reuse-values ilum ilum/ilum

For HDFS:

Here you will need to specify your HDFS configuration in

ilum-hive-metastore.storage.hdfs.config

You can provide them in hdfs-config.yaml:

helm upgrade \
--set ilum-hive-metastore.storage.type="hdfs" \
--set ilum-hive-metastore.storage.metastore.warehouse="hdfs://node:port/path/to/folder" \
--set ilum-hive-metastore.storage.hdfs.hadoopUsername="yourusername" \
--reuse-values ilum ilum/ilum \
-f hdfs-config.yaml

Best Practices and Recommendations

  • Use Hive for Maximum Compatibility: Hive Metastore is the universal “common denominator” for big data engines.
  • For Version Control, Use Iceberg or Nessie: If you need branching, time travel, or commit history, combine Hive Catalog with a table format (like Iceberg) that supports these features, or use Nessie as your catalog.
  • Secure Your Metastore: Always use strong credentials and network restrictions for your Hive Metastore database and service.
  • Monitor Warehouse Storage: Make sure your warehouse (MinIO, S3, HDFS, etc.) is backed up and monitored for health and available space.
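As a concrete example of the backup advice above, the metastore's PostgreSQL database can be dumped on a schedule. The host, user, and database names below are assumptions to adapt to your deployment:

```shell
# Dump the metastore database in PostgreSQL custom format
# (host/user/db names are illustrative)
pg_dump -h ilum-postgresql -U customuser -d hive_metastore \
  -F c -f "hive_metastore_$(date +%F).dump"
```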

Learn More

For more on using Hive Catalog, see the Hive Metastore documentation.

For detailed Ilum configuration and Helm reference, visit the Ilum Getting Started guide.

Hive Catalog in Ilum combines ease of use, automation, and broad compatibility, giving you a robust foundation for SQL analytics and data engineering at scale.