# Data Catalogs
Catalogs provide persistent metadata layers that let Ilum's execution engines (Spark, Trino, DuckDB, and Flink) share table definitions across sessions, jobs, and engines. Ilum supports four catalogs: Hive Metastore, Project Nessie, Unity Catalog, and DuckLake. Each is integrated with workloads, data storage, and session management, enabling reliable and collaborative data workflows. Tables defined once are queryable from every engine that can connect to the same catalog.
## Supported Catalogs
### Hive Catalog (Default)
- Hive Catalog leverages the Apache Hive Metastore to store table metadata persistently.
- It is enabled by default in Ilum and used automatically for all Spark SQL operations.
- No additional setup is required. Ilum configures Hive Metastore and integrates it with Spark, Trino, storage, and the Table Explorer UI.
- Ideal for classic data lakehouse scenarios where metadata durability, schema discovery, and compatibility with tools like Trino, Hive, or Superset are critical.
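As an illustrative sketch (the table and column names here are hypothetical), a table registered in the Hive Metastore from Spark SQL becomes queryable by any other engine that shares the same metastore:

```sql
-- Register a Parquet-backed table in the default Hive Metastore catalog
CREATE TABLE sales (id BIGINT, amount DOUBLE, sold_at DATE)
USING parquet;

-- The same table is then visible from Trino via its Hive connector, e.g.:
-- SELECT sum(amount) FROM hive.default.sales;
```

Because the metadata lives in the metastore rather than in the Spark session, the table survives session restarts and is discoverable in the Table Explorer UI.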
### Nessie Catalog (Optional)
- Nessie Catalog integrates Project Nessie, enabling Git-like version control for your data lake.
- It allows you to branch, tag, and merge changes in your data catalog, bringing collaborative and auditable workflows to your tables.
- Nessie is optional in Ilum and requires additional setup. Once configured, it integrates with Spark via Apache Iceberg.
- Best suited for versioned analytics, data experimentation, CI/CD pipelines, and scenarios needing full catalog-level rollback or isolated environments.
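As a sketch of the Spark configuration this setup typically involves (the catalog name, service hostname, port, and warehouse path below are assumptions — use the values from your Nessie deployment), an Iceberg catalog backed by Nessie is registered with properties along these lines:

```properties
# Register a Nessie-backed Iceberg catalog named "nessie" (endpoint and warehouse are assumptions)
spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.uri=http://nessie:19120/api/v2
spark.sql.catalog.nessie.ref=main
spark.sql.catalog.nessie.warehouse=s3a://warehouse/
```

With Nessie's Spark SQL extensions enabled, branches can then be created and switched with statements such as `CREATE BRANCH dev IN nessie` and `USE REFERENCE dev IN nessie`, giving each experiment an isolated view of the catalog.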
### Unity Catalog (Optional)
- Unity Catalog OSS is an open-source data catalog for lakehouse architectures, providing a unified metadata layer with built-in governance features.
- It uses a three-level namespace (catalog → schema → table) for better data organization and provides comprehensive audit logging and data lineage.
- Unity Catalog is optional in Ilum and requires configuration. It offers fine-grained access control and centralized governance across workspaces.
- ⚠️ Known Limitation: Unity Catalog OSS currently has compatibility issues with MinIO. For production use, AWS S3, GCS, or ADLS are recommended.
- Best suited for organizations needing centralized governance, detailed audit trails, and fine-grained access control for their data lakehouse.
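As a configuration sketch (the catalog name, endpoint, and token below are assumptions — check the Unity Catalog OSS documentation and your deployment for the actual connector class and values), attaching Unity Catalog OSS to Spark generally looks like:

```properties
# Attach Unity Catalog OSS as a Spark catalog (values are placeholders)
spark.sql.catalog.unity=io.unitycatalog.spark.UCSingleCatalog
spark.sql.catalog.unity.uri=http://unity-catalog:8080
spark.sql.catalog.unity.token=<access-token>
```

Tables are then addressed through the three-level namespace, e.g. `SELECT * FROM unity.default.my_table`.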
### DuckLake (Default for DuckDB)
- DuckLake is a lakehouse catalog designed to bring concurrent, multi-user access to DuckDB.
- It stores table data as Parquet files, keeps catalog metadata in a SQL database, and supports multiple storage backends.
- It supports time travel queries, schema evolution, and partitioning.
- It is enabled by default in Ilum and used automatically for all DuckDB operations, but it cannot be used with any engine other than DuckDB.
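The capabilities above can be sketched in DuckDB SQL (the metadata path, table, and snapshot version are hypothetical; the `ducklake:` attach syntax follows the DuckLake documentation):

```sql
-- Load DuckLake and attach a lake whose metadata lives in a local DuckDB file
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:metadata.ducklake' AS lake;
USE lake;

CREATE TABLE events (id INTEGER, ts TIMESTAMP);
INSERT INTO events VALUES (1, now());

-- Time travel: read the table as of an earlier snapshot version
SELECT * FROM events AT (VERSION => 1);
```

In Ilum this attachment happens automatically for DuckDB sessions, so tables persist across sessions without any manual setup.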
## Feature Comparison
| Feature / Aspect | Hive Catalog (Default) | Nessie Catalog (Optional) | Unity Catalog (Optional) | DuckLake (Default for DuckDB) |
|---|---|---|---|---|
| Persistence | Yes – table metadata stored in Hive Metastore | Yes – versioned metadata via Nessie service | Yes – metadata stored in Unity Catalog metastore | Yes – metadata stored in attached metadata DB |
| Version Control | No | Yes – supports branches, tags, merges | No – relies on table format features | Yes – supports time travel and schema evolution |
| Multi-table Transactions | No | Yes – atomic changes across multiple tables | No | Yes – atomic transactions across multiple tables with concurrent access |
| Enabled by Default | Yes | No (optional, must be configured manually) | No (optional, must be configured manually) | Yes (for DuckDB) |
| Format Support | Parquet, ORC, Delta, Iceberg (via external catalogs) | Iceberg | Iceberg, Delta Lake, Parquet | Parquet |
| MinIO Support | ✅ Yes | ✅ Yes | ⚠️ Limited (known compatibility issues) | ✅ Yes |
| Best For | Stable schemas, SQL analytics, traditional data lakehouse | Experimental branches, governance, data promotion workflows | Centralized governance, audit trails, multi-workspace control | Multi-user workloads on DuckDB |
| Compatibility | Widely supported by Spark, Hive, Trino, Superset, etc. | Supported in Spark, Flink, Trino | Supported in Spark, Delta Lake, various cloud platforms | DuckDB only |
| Integration in Ilum | Fully automated, configured out-of-the-box | Requires external Nessie service and Spark config | Requires Unity Catalog service and Spark config | Configured automatically |
## Catalog Selection Guide
| Use Case | Hive Catalog | Nessie Catalog | Unity Catalog | DuckLake |
|---|---|---|---|---|
| Long-term table storage | ✅ | ✅ | ✅ | ✅ (if using DuckDB) |
| Git-like branching/tagging of data | | ✅ | | |
| Multi-table transactional pipelines | | ✅ | | ✅ |