# Data Catalogs
Catalogs provide persistent metadata layers that let Ilum's execution engines (Spark, Trino, DuckDB, and Flink) share table definitions across sessions, jobs, and engines. Ilum supports four catalogs: Hive Metastore, Project Nessie, Unity Catalog, and DuckLake. Each is integrated with workloads, data storage, and session management, enabling reliable and collaborative data workflows. Tables defined once are queryable from every engine that can connect to the same catalog.
## Supported Catalogs
### Hive Catalog (Default)
- Hive Catalog leverages the Apache Hive Metastore to store table metadata persistently.
- It is enabled by default in Ilum and used automatically for all Spark SQL operations.
- No additional setup is required. Ilum configures Hive Metastore and integrates it with Spark, Trino, storage, and the Table Explorer UI.
- Ideal for classic data lakehouse scenarios where metadata durability, schema discovery, and compatibility with tools like Trino, Hive, or Superset are critical.
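As an illustrative sketch (the table and column names here are hypothetical), a table registered in the Hive Metastore from Spark SQL becomes queryable by any other engine that shares the same metastore:

```sql
-- Register a Parquet-backed table in the default Hive Metastore catalog
CREATE TABLE sales (id BIGINT, amount DOUBLE, sold_at DATE)
USING parquet;

-- The same table is then visible from Trino via its Hive connector, e.g.:
-- SELECT sum(amount) FROM hive.default.sales;
```

Because the metadata lives in the metastore rather than in the Spark session, the table survives session restarts and is discoverable in the Table Explorer UI.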
### Nessie Catalog (Optional)
- Nessie Catalog integrates Project Nessie, enabling Git-like version control for your data lake.
- It allows you to branch, tag, and merge changes in your data catalog, bringing collaborative and auditable workflows to your tables.
- Nessie is optional in Ilum and requires additional setup. Once configured, it integrates with Spark via Apache Iceberg.
- Best suited for versioned analytics, data experimentation, CI/CD pipelines, and scenarios needing full catalog-level rollback or isolated environments.
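As a sketch of the Spark configuration this setup typically involves (the catalog name, service hostname, port, and warehouse path below are assumptions — use the values from your Nessie deployment), an Iceberg catalog backed by Nessie is registered with properties along these lines:

```properties
# Register a Nessie-backed Iceberg catalog named "nessie" (endpoint and warehouse are assumptions)
spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.uri=http://nessie:19120/api/v2
spark.sql.catalog.nessie.ref=main
spark.sql.catalog.nessie.warehouse=s3a://warehouse/
```

With Nessie's Spark SQL extensions enabled, branches can then be created and switched with statements such as `CREATE BRANCH dev IN nessie` and `USE REFERENCE dev IN nessie`, giving each experiment an isolated view of the catalog.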
### Unity Catalog (Optional)
- Unity Catalog OSS is an open-source data catalog for lakehouse architectures, providing a unified metadata layer with built-in governance features.
- It uses a three-level namespace (catalog → schema → table) for better data organization and provides comprehensive audit logging and data lineage.
- Unity Catalog is optional in Ilum and requires configuration. It offers fine-grained access control and centralized governance across workspaces.
- ⚠️ Known Limitation: Unity Catalog OSS currently has compatibility issues with MinIO. For production use, AWS S3, GCS, or ADLS are recommended.
- Best suited for organizations needing centralized governance, detailed audit trails, and fine-grained access control for their data lakehouse.
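As a configuration sketch (the catalog name, endpoint, and token below are assumptions — check the Unity Catalog OSS documentation and your deployment for the actual connector class and values), attaching Unity Catalog OSS to Spark generally looks like:

```properties
# Attach Unity Catalog OSS as a Spark catalog (values are placeholders)
spark.sql.catalog.unity=io.unitycatalog.spark.UCSingleCatalog
spark.sql.catalog.unity.uri=http://unity-catalog:8080
spark.sql.catalog.unity.token=<access-token>
```

Tables are then addressed through the three-level namespace, e.g. `SELECT * FROM unity.default.my_table`.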
### DuckLake (Default for DuckDB)
- DuckLake is a lakehouse catalog designed to bring concurrent, multi-user access to DuckDB.
- It stores table data as Parquet files, keeps catalog metadata in a SQL database, and supports multiple storage backends.
- It supports time travel queries, schema evolution, and partitioning.
- It is enabled by default in Ilum and used automatically for all DuckDB operations, but it cannot be used with any engine other than DuckDB.
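The capabilities above can be sketched in DuckDB SQL (the metadata path, table, and snapshot version are hypothetical; the `ducklake:` attach syntax follows the DuckLake documentation):

```sql
-- Load DuckLake and attach a lake whose metadata lives in a local DuckDB file
INSTALL ducklake;
LOAD ducklake;
ATTACH 'ducklake:metadata.ducklake' AS lake;
USE lake;

CREATE TABLE events (id INTEGER, ts TIMESTAMP);
INSERT INTO events VALUES (1, now());

-- Time travel: read the table as of an earlier snapshot version
SELECT * FROM events AT (VERSION => 1);
```

In Ilum this attachment happens automatically for DuckDB sessions, so tables persist across sessions without any manual setup.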
## Feature Comparison
| Feature / Aspect | Hive Catalog (Default) | Nessie Catalog (Optional) | Unity Catalog (Optional) | DuckLake (Default for DuckDB) |
|---|---|---|---|---|
| Persistence | Yes – table metadata stored in Hive Metastore | Yes – versioned metadata via Nessie service | Yes – metadata stored in Unity Catalog metastore | Yes – metadata stored in attached metadata DB |
| Version Control | No | Yes – supports branches, tags, merges | No – relies on table format features | Yes – supports time travel and schema evolution |
| Multi-table Transactions | No | Yes – atomic changes across multiple tables | No | Yes – atomic transactions across multiple tables with concurrent access |
| Enabled by Default | Yes | No (optional, must be configured manually) | No (optional, must be configured manually) | Yes (for DuckDB) |
| Format Support | Parquet, ORC, Delta, Iceberg (via external catalogs) | Iceberg | Iceberg, Delta Lake, Parquet | Parquet |
| MinIO Support | ✅ Yes | ✅ Yes | ⚠️ Limited (known compatibility issues) | ✅ Yes |
| Best For | Stable schemas, SQL analytics, traditional data lakehouse | Experimental branches, governance, data promotion workflows | Centralized governance, audit trails, multi-workspace control | Multi-user workloads on DuckDB |
| Compatibility | Widely supported by Spark, Hive, Trino, Superset, etc. | Supported in Spark, Flink, Trino | Supported in Spark, Delta Lake, various cloud platforms | DuckDB only |
| Integration in Ilum | Fully automated, configured out-of-the-box | Requires external Nessie service and Spark config | Requires Unity Catalog service and Spark config | Configured automatically |
## Catalog Selection Guide
| Use Case | Hive Catalog | Nessie Catalog | Unity Catalog | DuckLake |
|---|---|---|---|---|
| Long-term table storage | ✅ | ✅ | ✅ | ✅ (if using DuckDB) |
| Git-like branching/tagging of data | | ✅ | | |
| Multi-table transactional pipelines | | ✅ | | ✅ |