Hive Catalog
Overview
The Hive Catalog is a widely used metadata catalog for Spark, Hadoop, and big data environments. At its core, it stores table schemas, locations, and other metadata in a central database called the Hive Metastore. This makes it possible for Spark and other compute engines to consistently find and access tables across multiple jobs and sessions.
In simpler terms, Hive Catalog is like a registry or "table of contents" for your data lake. It keeps track of which tables exist, their schemas, partitions, and where their data physically resides (for example, on HDFS, S3, or MinIO).
Ilum integrates deeply with the Hive Catalog, making it the default catalog for all SQL queries, jobs, and groups unless another one is specified.
Unlike Git-like catalogs (e.g., Nessie), Hive only tracks the latest state of each table; it does not support branching, commit history, or time travel across the entire catalog. However, it is reliable, mature, and universally compatible with a huge ecosystem.

Hive vs. Other Data Catalogs
Here’s how Hive Catalog compares with modern alternatives like Nessie or AWS Glue:
- No Version Control: Hive keeps only the most recent version of each table. It does not support branching, tagging, or commit history at the catalog level. To track historical states, you must rely on table-format-specific features (like Iceberg’s or Delta’s time travel), not Hive itself.
- Centralized Metadata: Table schemas, locations, and partitioning are stored in the Hive Metastore database. This ensures consistent metadata across all Spark jobs and engines using the catalog.
- Universal Compatibility: Hive Metastore is supported by nearly all big data engines (Spark, Hive, Trino, Flink, etc.), making it a safe default for mixed-technology environments.
- No Multi-table Transactions: Catalog-level atomic transactions (covering multiple tables at once) are not supported. Each DDL/DML operation is handled separately.
- No Branch Isolation: To isolate dev/staging/prod environments, you must maintain multiple catalogs or databases, or physically copy data. There is no "branching" mechanism built in.
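To make the version-control point concrete: when Hive Catalog is paired with a table format such as Iceberg, time travel is expressed per table and served from the format's own snapshot metadata, not from the Hive Metastore. A sketch (table name and timestamp are illustrative, and the syntax assumes the Iceberg Spark runtime is on the classpath):

```sql
-- Time travel on an Iceberg table registered in the Hive Catalog.
-- The snapshot history lives in Iceberg's metadata, not in Hive.
SELECT * FROM mydb.sales TIMESTAMP AS OF '2025-06-01 00:00:00';

-- Hive itself can only answer "what does the table look like now":
SELECT * FROM mydb.sales;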
Core Concepts in Hive Catalog
Hive Metastore
The Hive Metastore is a service and a backing database (often PostgreSQL or MySQL) where all metadata about tables, views, and partitions is stored.
Whenever Spark or another engine queries a table, it looks up the details in the Hive Metastore.
Tables, Databases, and Storage
- Tables define the schema and storage location of your datasets.
- Databases in Hive are namespaces for grouping related tables.
- Warehouse Location is the root folder (on HDFS, S3, or other storage) where table data files reside.
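These pieces fit together on storage: by default, a database maps to a folder under the warehouse root and a table to a folder inside it, and both can be overridden with an explicit LOCATION. A sketch in Spark SQL (the s3a paths are illustrative):

```sql
-- Database stored at an explicit location instead of the warehouse root
CREATE DATABASE IF NOT EXISTS analytics
LOCATION 's3a://yourbucket/warehouse/analytics.db';

-- Table whose data files live at a chosen path in object storage
CREATE TABLE IF NOT EXISTS analytics.events (ts STRING, user_id INT)
USING parquet
LOCATION 's3a://yourbucket/raw/events/';
```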
Using Hive Catalog in Ilum
Ilum automatically configures the Hive Catalog as the default for Spark jobs, SQL Viewer queries, and pipeline groups.
You can run standard SQL commands such as:
CREATE DATABASE IF NOT EXISTS mydb;
CREATE TABLE IF NOT EXISTS mydb.sales (date STRING, amount INT);
INSERT INTO mydb.sales VALUES ('2025-06-01', 1000);
SELECT * FROM mydb.sales;
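Because all of this metadata lives in the Hive Metastore, you can also inspect it with standard SQL:

```sql
SHOW DATABASES;
SHOW TABLES IN mydb;
-- Shows the schema plus metastore details such as location and provider
DESCRIBE TABLE EXTENDED mydb.sales;
```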
Spark Configuration for Hive
If you run Spark manually, set these parameters to enable Hive support:
spark.sql.catalogImplementation=hive
spark.hadoop.hive.metastore.uris=thrift://ilum-hive-metastore:9083
However, Ilum handles this for you in all standard workflows. No manual configuration is required.
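For reference, if you do launch Spark yourself outside Ilum's managed workflows, the same settings can be passed on the command line; a sketch with spark-submit (the application file name is a placeholder):

```shell
spark-submit \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hadoop.hive.metastore.uris=thrift://ilum-hive-metastore:9083 \
  your_job.py
```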
Setting up the Hive Metastore in Ilum
Normally, using Hive Catalog requires:
- Installing the Hive Metastore service.
- Configuring a backing database (like PostgreSQL or MySQL) for metadata.
- Connecting the service to your object storage (HDFS, S3, MinIO, GCS, WASBS).
- Setting up security, network, and storage options.
Ilum automates all of these steps!
When you deploy Ilum via Helm, it provisions the Hive Metastore, database, and object storage integration for you.
Enabling Hive Metastore
To enable Hive Metastore in Ilum, add these flags to your Helm upgrade/install:
helm upgrade \
--set ilum-hive-metastore.enabled=true \
--set ilum-core.metastore.enabled=true \
--set ilum-core.metastore.type=hive \
--reuse-values ilum ilum/ilum
After running the helm upgrade command, navigate to the Edit Cluster tab for the cluster where you want to use the catalog and select it in the General metastore dropdown.

Using Custom PostgreSQL Credentials
If you want to use a custom PostgreSQL database for the Hive Metastore:
helm upgrade \
--set postgresql.auth.username=customuser \
--set postgresql.auth.password="CHOOSE PASSWORD" \
--reuse-values ilum ilum/ilum
Then configure the Hive Metastore to use the same credentials:
helm upgrade \
--set ilum-hive-metastore.postgresql.auth.password="CHOOSE PASSWORD" \
--set ilum-hive-metastore.postgresql.auth.username=customuser \
--reuse-values ilum ilum/ilum
Setting up Hive Metastore: Storage
Storage, also referred to as a Warehouse, is the location where the actual data is stored. Hive supports various storage backends, including:
- HDFS (Hadoop Distributed File System)
- Amazon S3 Buckets and MinIO
- Google Cloud Storage (GCS)
- Windows Azure Storage Blob (WASBS)
Typically, you would need to set up one of these storage options and configure Hive's metastore connection accordingly within an XML file (typically hive-site.xml).
However, with Ilum, the S3 MinIO storage is pre-configured for you, and the Hive Metastore is already set up to use it by default.
Configuring Other Storage Backends
If you prefer to use an alternative storage backend, you can configure Hive to work with it by reconfiguring your helm values:
For S3 storage or MinIO:
helm upgrade \
--set ilum-hive-metastore.storage.type="s3" \
--set ilum-hive-metastore.storage.metastore.warehouse="s3a://yourbucket/yourfolder" \
--set ilum-hive-metastore.storage.s3.accessKey="your_access_key" \
--set ilum-hive-metastore.storage.s3.secretKey="your_secret_key" \
--set ilum-hive-metastore.storage.s3.host="yourhost" \
--set ilum-hive-metastore.storage.s3.port=yourport \
--reuse-values ilum ilum/ilum
For GCS:
helm upgrade \
--set ilum-hive-metastore.storage.type="gcs" \
--set ilum-hive-metastore.storage.metastore.warehouse="gs://my-gcs-bucket/path/to/folder/" \
--set ilum-hive-metastore.storage.gcs.clientEmail="your@email" \
--set ilum-hive-metastore.storage.gcs.privateKey="yourprivatekey" \
--set ilum-hive-metastore.storage.gcs.privateKeyId="privatekeyid" \
--reuse-values ilum ilum/ilum
For WASBS:
helm upgrade \
--set ilum-hive-metastore.storage.type="wasbs" \
--set ilum-hive-metastore.storage.metastore.warehouse="wasbs://yourcontainer@youraccount.blob.core.windows.net/path/to/folder/" \
--set ilum-hive-metastore.storage.wasbs.accountName="youraccountname" \
--set ilum-hive-metastore.storage.wasbs.accessKey="youraccesskey" \
--reuse-values ilum ilum/ilum
For HDFS:
HDFS requires additional client configuration, which you specify under:
ilum-hive-metastore.storage.hdfs.config
You can provide it in a separate hdfs-config.yaml file:
helm upgrade \
--set ilum-hive-metastore.storage.type="hdfs" \
--set ilum-hive-metastore.storage.metastore.warehouse="hdfs://node:port/path/to/folder" \
--set ilum-hive-metastore.storage.hdfs.hadoopUsername="yourusername" \
--reuse-values ilum ilum/ilum \
-f hdfs-config.yaml
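The exact keys accepted under ilum-hive-metastore.storage.hdfs.config depend on your Hadoop setup and Ilum chart version; as an assumption-laden sketch, hdfs-config.yaml might carry standard Hadoop client properties like so:

```yaml
# hdfs-config.yaml -- illustrative only. The property names below are
# standard Hadoop client settings, but the surrounding value structure
# is an assumption and must match your Ilum chart version.
ilum-hive-metastore:
  storage:
    hdfs:
      config:
        fs.defaultFS: hdfs://namenode:8020
        dfs.client.use.datanode.hostname: "true"
```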
Best Practices and Recommendations
- Use Hive for Maximum Compatibility: Hive Metastore is the universal “common denominator” for big data engines.
- For Version Control, Use Iceberg or Nessie: If you need branching, time travel, or commit history, combine Hive Catalog with a table format (like Iceberg) that supports these features, or use Nessie as your catalog.
- Secure Your Metastore: Always use strong credentials and network restrictions for your Hive Metastore database and service.
- Monitor Warehouse Storage: Make sure your warehouse (MinIO, S3, HDFS, etc.) is backed up and monitored for health and available space.
Learn More
For more on using Hive Catalog, see the Hive Metastore documentation.
For detailed Ilum configuration and Helm reference, visit the Ilum Getting Started guide.
Hive Catalog in Ilum combines ease of use, automation, and broad compatibility, giving you a robust foundation for SQL analytics and data engineering at scale.