# Data Catalogs
Catalogs provide persistent metadata layers that allow Spark to store, retrieve, and organize table definitions across sessions and jobs. Ilum supports multiple catalog options—Hive, Nessie, and Unity Catalog—each integrated with Spark workloads, data storage, and session management, enabling reliable and collaborative data workflows.
## Supported Catalogs
### Hive Catalog (Default)
- Hive Catalog leverages the Apache Hive Metastore to store table metadata persistently.
- It is enabled by default in Ilum and used automatically for all Spark SQL operations.
- No additional setup is required—Ilum configures Hive Metastore and integrates it with Spark, storage, and the Table Explorer UI.
- Ideal for classic data lakehouse scenarios where metadata durability, schema discovery, and compatibility with tools like Trino, Hive, or Superset are critical.
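Because the Hive catalog is the default, Spark SQL statements work against it without any extra configuration. A minimal sketch (the database and table names below are illustrative, not part of Ilum's defaults):

```sql
-- Metadata for this table persists in the Hive Metastore,
-- so it remains visible across Spark sessions and jobs.
CREATE DATABASE IF NOT EXISTS analytics;

CREATE TABLE IF NOT EXISTS analytics.events (
  event_id   BIGINT,
  event_type STRING,
  ts         TIMESTAMP
) USING parquet;

INSERT INTO analytics.events VALUES (1, 'click', current_timestamp());

SELECT event_type, count(*) FROM analytics.events GROUP BY event_type;
```

Tables created this way also appear automatically in the Table Explorer UI.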
### Nessie Catalog (Optional)
- Nessie Catalog integrates Project Nessie, enabling Git-like version control for your data lake.
- It allows you to branch, tag, and merge changes in your data catalog, bringing collaborative and auditable workflows to your tables.
- Nessie is optional in Ilum and requires additional setup. Once configured, it integrates with Spark via Apache Iceberg.
- Best suited for versioned analytics, data experimentation, CI/CD pipelines, and scenarios needing full catalog-level rollback or isolated environments.
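Connecting Spark to Nessie is done through Iceberg catalog properties. A hedged sketch of the relevant Spark configuration; the service URL, branch name, and warehouse path are placeholders to be adapted to your deployment:

```properties
# Enable the Iceberg and Nessie SQL extensions.
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions

# Register a catalog named "nessie" backed by the Nessie service.
spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.uri=http://nessie:19120/api/v2
spark.sql.catalog.nessie.ref=main
spark.sql.catalog.nessie.warehouse=s3a://warehouse/
```

Once configured, the Nessie SQL extensions add Git-style commands such as `CREATE BRANCH dev IN nessie FROM main` and `USE REFERENCE dev IN nessie` for working on isolated branches of the catalog.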
### Unity Catalog (Optional)
- Unity Catalog OSS is an open-source data catalog for lakehouse architectures, providing a unified metadata layer with built-in governance features.
- It uses a three-level namespace (catalog → schema → table) for better data organization and provides comprehensive audit logging and data lineage.
- Unity Catalog is optional in Ilum and requires configuration. It offers fine-grained access control and centralized governance across workspaces.
- ⚠️ Known Limitation: Unity Catalog OSS currently has compatibility issues with MinIO. For production use, AWS S3, GCS, or ADLS are recommended.
- Best suited for organizations needing centralized governance, detailed audit trails, and fine-grained access control for their data lakehouse.
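Spark connects to Unity Catalog OSS through its Spark connector. A hedged sketch of the configuration; the catalog name, endpoint, and token below are placeholders for your own deployment:

```properties
# Register a catalog named "unity" backed by the Unity Catalog service.
spark.sql.catalog.unity=io.unitycatalog.spark.UCSingleCatalog
spark.sql.catalog.unity.uri=http://unity-catalog:8080
spark.sql.catalog.unity.token=<personal-access-token>

# Optionally make it the default catalog for unqualified table names.
spark.sql.defaultCatalog=unity
```

With this in place, tables are addressed through the three-level namespace, e.g. `SELECT * FROM unity.my_schema.my_table`.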
## Feature Comparison
| Feature / Aspect | Hive Catalog (Default) | Nessie Catalog (Optional) | Unity Catalog (Optional) |
|---|---|---|---|
| Persistence | Yes – table metadata stored in Hive Metastore | Yes – versioned metadata via Nessie service | Yes – metadata stored in Unity Catalog metastore |
| Version Control | No | Yes – supports branches, tags, merges | No – relies on table format features |
| Multi-table Transactions | No | Yes – atomic changes across multiple tables | No |
| Enabled by Default | Yes | No (optional, must be configured manually) | No (optional, must be configured manually) |
| Format Support | Parquet, ORC, Delta, Iceberg (via external catalogs) | Iceberg | Iceberg, Delta Lake, Parquet |
| MinIO Support | ✅ Yes | ✅ Yes | ⚠️ Limited (known compatibility issues) |
| Best For | Stable schemas, SQL analytics, traditional data lakehouse | Experimental branches, governance, data promotion workflows | Centralized governance, audit trails, multi-workspace control |
| Compatibility | Widely supported by Spark, Hive, Trino, Superset, etc. | Supported in Spark, Flink, Trino | Supported in Spark, Delta Lake, various cloud platforms |
| Integration in Ilum | Fully automated, configured out-of-the-box | Requires external Nessie service and Spark config | Requires Unity Catalog service and Spark config |
## Catalog Selection Guide
| Use Case | Hive Catalog | Nessie Catalog | Unity Catalog |
|---|---|---|---|
| Long-term table storage | ✅ | ✅ | ✅ |
| Git-like branching/tagging of data | | ✅ | |
| Multi-table transactional pipelines | | ✅ | |
| Traditional SQL analytics + BI | ✅ | | ✅ |
| Development → Staging → Prod workflows | | ✅ | ✅ |
| Auditability and rollback | | ✅ | ✅ |
| Easy compatibility with Trino/Superset | ✅ | (indirect) | (indirect) |
## How Catalogs Work in Ilum
- Hive Catalog is integrated directly in Ilum Core. It comes with a pre-deployed Hive Metastore and is connected to the default Ilum storage (e.g., MinIO or S3).
- Nessie Catalog is user-configured. You must deploy a Nessie service and configure your Spark sessions to connect to it using Iceberg catalog properties.
- Unity Catalog is user-configured. You must enable the Unity Catalog service via Helm and configure your metastore type. Note the MinIO compatibility limitation for production deployments.
- Table Explorer in Ilum automatically lists Hive tables. Nessie and Unity Catalog tables can also be made visible with proper catalog configuration in Spark jobs.
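Once the optional catalogs are configured, Spark addresses each one through its catalog prefix. A sketch, assuming catalogs registered under the names `nessie` and `unity` and illustrative table names:

```sql
-- Default Hive catalog: two-level names (database.table).
SELECT * FROM analytics.events LIMIT 10;

-- Nessie-backed Iceberg catalog: prefixed with the catalog name.
SELECT * FROM nessie.analytics.events LIMIT 10;

-- Unity Catalog: three-level namespace (catalog.schema.table).
SELECT * FROM unity.analytics.events LIMIT 10;
```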
For details on deploying and enabling optional components like Nessie and Unity Catalog, visit:
👉 Ilum Deployment Guide