Data Catalogs

Catalogs provide persistent metadata layers that let Ilum's execution engines (Spark, Trino, DuckDB, and Flink) share table definitions across sessions, jobs, and engines. Ilum supports four catalogs: Hive Metastore, Project Nessie, Unity Catalog, and DuckLake. Each is integrated with workloads, data storage, and session management, enabling reliable and collaborative data workflows. Tables defined once are queryable from every engine that can connect to the same catalog.

Supported Catalogs

Hive Catalog (Default)

  • Hive Catalog leverages the Apache Hive Metastore to store table metadata persistently.
  • It is enabled by default in Ilum and used automatically for all Spark SQL operations.
  • No additional setup is required. Ilum configures Hive Metastore and integrates it with Spark, Trino, storage, and the Table Explorer UI.
  • Ideal for classic data lakehouse scenarios where metadata durability, schema discovery, and compatibility with tools like Trino, Hive, or Superset are critical.
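Because the metastore is shared, a table created from one engine is immediately visible to the others. A minimal sketch (the table name and the Trino catalog name `hive` are hypothetical — use whatever your deployment exposes):

```sql
-- In a Spark SQL session (Hive Metastore is the default catalog in Ilum):
CREATE TABLE sales (id BIGINT, amount DOUBLE, ts TIMESTAMP) USING parquet;
INSERT INTO sales VALUES (1, 9.99, current_timestamp());

-- Later, from Trino connected to the same metastore:
-- SELECT * FROM hive.default.sales;
```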

Nessie Catalog (Optional)

  • Nessie Catalog integrates Project Nessie, enabling Git-like version control for your data lake.
  • It allows you to branch, tag, and merge changes in your data catalog, bringing collaborative and auditable workflows to your tables.
  • Nessie is optional in Ilum and requires additional setup. Once configured, it integrates with Spark via Apache Iceberg.
  • Best suited for versioned analytics, data experimentation, CI/CD pipelines, and scenarios needing full catalog-level rollback or isolated environments.
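Once a Nessie service is running, a Spark session connects to it through Iceberg catalog properties. A rough sketch of the relevant Spark configuration, assuming a catalog named `nessie`, a service reachable at `http://nessie:19120`, and an S3-compatible warehouse path — the hostnames, paths, and `<version>` placeholders must be adapted to your deployment:

```properties
spark.jars.packages          org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:<version>,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.5_2.12:<version>
spark.sql.extensions         org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions
spark.sql.catalog.nessie              org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.uri          http://nessie:19120/api/v1
spark.sql.catalog.nessie.ref          main
spark.sql.catalog.nessie.warehouse    s3a://warehouse/
```

With this in place, tables are addressed as `nessie.<namespace>.<table>` and the `ref` property selects the branch the session reads from and writes to.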

Unity Catalog (Optional)

  • Unity Catalog OSS is an open-source data catalog for lakehouse architectures, providing a unified metadata layer with built-in governance features.
  • It uses a three-level namespace (catalog → schema → table) for better data organization and provides comprehensive audit logging and data lineage.
  • Unity Catalog is optional in Ilum and requires configuration. It offers fine-grained access control and centralized governance across workspaces.
  • ⚠️ Known Limitation: Unity Catalog OSS currently has compatibility issues with MinIO. For production use, AWS S3, GCS, or ADLS are recommended.
  • Best suited for organizations needing centralized governance, detailed audit trails, and fine-grained access control for their data lakehouse.
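A Spark session attaches to a Unity Catalog OSS server via its catalog connector, per the Unity Catalog OSS documentation. A hedged sketch, assuming a catalog named `unity`, a server at `http://unity-catalog:8080`, and Delta as the table format — the URL, token, and `<version>` placeholders are deployment-specific:

```properties
spark.jars.packages           io.unitycatalog:unitycatalog-spark_2.12:<version>,io.delta:delta-spark_2.12:<version>
spark.sql.extensions          io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.unity       io.unitycatalog.spark.UCSingleCatalog
spark.sql.catalog.unity.uri   http://unity-catalog:8080
spark.sql.catalog.unity.token <access-token>
```

Tables are then addressed with the three-level namespace, e.g. `unity.<schema>.<table>`.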

DuckLake (Default for DuckDB)

  • DuckLake is a catalog built to bring concurrent, multi-user capabilities to DuckDB.
  • It stores table data as Parquet files and supports multiple backends for its metadata database.
  • It supports time travel queries, schema evolution, and partitioning.
  • It is enabled by default in Ilum and used automatically for all DuckDB operations, but it cannot be used with engines other than DuckDB.
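For reference, this is roughly what Ilum's automatic setup does under the hood in plain DuckDB — the file paths and table name here are illustrative assumptions:

```sql
INSTALL ducklake;
LOAD ducklake;

-- Attach a DuckLake catalog: table metadata lives in the attached
-- metadata database, table data is written as Parquet under DATA_PATH.
ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'data/');
USE lake;

-- Time travel over earlier snapshots (assuming the table has history):
-- SELECT * FROM my_table AT (VERSION => 1);
```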

Feature Comparison

| Feature / Aspect | Hive Catalog (Default) | Nessie Catalog (Optional) | Unity Catalog (Optional) | DuckLake (Default for DuckDB) |
|---|---|---|---|---|
| Persistence | Yes – table metadata stored in Hive Metastore | Yes – versioned metadata via Nessie service | Yes – metadata stored in Unity Catalog metastore | Yes – metadata stored in attached metadata DB |
| Version Control | No | Yes – supports branches, tags, merges | No – relies on table format features | Yes – supports time travel and schema evolution |
| Multi-table Transactions | No | Yes – atomic changes across multiple tables | No | Yes – concurrent access over multi-table operations |
| Enabled by Default | Yes | No (optional, must be configured manually) | No (optional, must be configured manually) | Yes (for DuckDB) |
| Format Support | Parquet, ORC, Delta, Iceberg (via external catalogs) | Iceberg | Iceberg, Delta Lake, Parquet | Parquet |
| MinIO Support | ✅ Yes | ✅ Yes | ⚠️ Limited (known compatibility issues) | ✅ Yes |
| Best For | Stable schemas, SQL analytics, traditional data lakehouse | Experimental branches, governance, data promotion workflows | Centralized governance, audit trails, multi-workspace control | Multi-user workloads on DuckDB |
| Compatibility | Widely supported by Spark, Hive, Trino, Superset, etc. | Supported in Spark, Flink, Trino | Supported in Spark, Delta Lake, various cloud platforms | DuckDB only |
| Integration in Ilum | Fully automated, configured out-of-the-box | Requires external Nessie service and Spark config | Requires Unity Catalog service and Spark config | Configured automatically |

Catalog Selection Guide

| Use Case | Hive Catalog | Nessie Catalog | Unity Catalog | DuckLake |
|---|---|---|---|---|
| Long-term table storage | ✅ | ✅ | ✅ | ✅ (if using DuckDB) |
| Git-like branching/tagging of data | – | ✅ | – | – |
| Multi-table transactional pipelines | – | ✅ | – | ✅ |
| Traditional SQL analytics + BI | ✅ | – | – | – |
| Development → Staging → Prod workflows | – | ✅ | – | – |
| Auditability and rollback | – | ✅ | ✅ | – |
| Easy compatibility with Trino/Superset | ✅ | ✅ (indirect) | ✅ (indirect) | – |

How Catalogs Work in Ilum

  • Hive Catalog is integrated directly in Ilum Core. It comes with a pre-deployed Hive Metastore and is connected to the default Ilum storage (e.g., MinIO or S3).
  • Nessie Catalog is user-configured. You must deploy a Nessie service and configure your Spark sessions to connect to it using Iceberg catalog properties.
  • Unity Catalog is user-configured. You must enable the Unity Catalog service via Helm and configure your metastore type. Note the MinIO compatibility limitation for production deployments.
  • DuckLake is automatically set up in DuckDB workloads.
  • Table Explorer in Ilum automatically lists Hive tables. Nessie and Unity Catalog tables can also be made visible with proper catalog configuration in Spark jobs.

For details on deploying and enabling optional components like Nessie, visit:
👉 Ilum Deployment Guide


Next Steps