Data Catalogs

Catalogs provide a persistent metadata layer that lets Spark store, retrieve, and organize table definitions across sessions and jobs. Ilum supports three catalog options (Hive, Nessie, and Unity Catalog), each integrated with Spark workloads, data storage, and session management, so that table metadata survives beyond a single job and can be shared across teams.

Supported Catalogs

Hive Catalog (Default)

  • Hive Catalog leverages the Apache Hive Metastore to store table metadata persistently.
  • It is enabled by default in Ilum and used automatically for all Spark SQL operations.
  • No additional setup is required—Ilum configures Hive Metastore and integrates it with Spark, storage, and the Table Explorer UI.
  • Ideal for classic data lakehouse scenarios where metadata durability, schema discovery, and compatibility with tools like Trino, Hive, or Superset are critical.

Nessie Catalog (Optional)

  • Nessie Catalog integrates Project Nessie, enabling Git-like version control for your data lake.
  • It allows you to branch, tag, and merge changes in your data catalog, bringing collaborative and auditable workflows to your tables.
  • Nessie is optional in Ilum and requires additional setup. Once configured, it integrates with Spark via Apache Iceberg.
  • Best suited for versioned analytics, data experimentation, CI/CD pipelines, and scenarios needing full catalog-level rollback or isolated environments.
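The branch, change, and merge cycle described above can be sketched as a sequence of Spark SQL statements. This is a minimal illustration, not Ilum's own tooling: it assumes the Nessie Spark SQL extensions are enabled and that the catalog is registered under the name `nessie`. The helper `nessie_branch_workflow`, the `demo` namespace, and the `events` table are hypothetical names chosen for the example.

```python
from typing import List


def nessie_branch_workflow(
    catalog: str, branch: str, namespace: str, table: str, base: str = "main"
) -> List[str]:
    """Build Spark SQL statements for an isolated change on a Nessie branch.

    Assumes the Nessie Spark SQL extensions are loaded; all identifiers
    passed in are illustrative, not Ilum defaults.
    """
    full_table = f"{catalog}.{namespace}.{table}"
    return [
        # 1. Create an isolated branch from the base reference.
        f"CREATE BRANCH IF NOT EXISTS {branch} IN {catalog} FROM {base}",
        # 2. Point the session at the new branch.
        f"USE REFERENCE {branch} IN {catalog}",
        # 3. Make changes that are only visible on the branch.
        f"CREATE NAMESPACE IF NOT EXISTS {catalog}.{namespace}",
        f"CREATE TABLE IF NOT EXISTS {full_table} (id BIGINT, name STRING) USING iceberg",
        f"INSERT INTO {full_table} VALUES (1, 'example')",
        # 4. Promote the changes back to the base reference.
        f"MERGE BRANCH {branch} INTO {base} IN {catalog}",
    ]


if __name__ == "__main__":
    for statement in nessie_branch_workflow("nessie", "dev", "demo", "events"):
        print(statement)
```

In a Spark session each statement would be passed to `spark.sql(...)` in order. Until the final merge, readers on `main` do not see the new table.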

Unity Catalog (Optional)

  • Unity Catalog OSS is an open-source data catalog for lakehouse architectures, providing a unified metadata layer with built-in governance features.
  • It uses a three-level namespace (catalog → schema → table) for better data organization and provides comprehensive audit logging and data lineage.
  • Unity Catalog is optional in Ilum and requires configuration. It offers fine-grained access control and centralized governance across workspaces.
  • ⚠️ Known Limitation: Unity Catalog OSS currently has compatibility issues with MinIO. For production use, AWS S3, GCS, or ADLS are recommended.
  • Best suited for organizations needing centralized governance, detailed audit trails, and fine-grained access control for their data lakehouse.
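To make the three-level namespace concrete, the sketch below builds and parses fully qualified table names of the form `catalog.schema.table`. The helper functions and the example names (`unity`, `default`, `orders`) are hypothetical and are not part of Unity Catalog or Ilum; they only illustrate how the three levels compose.

```python
from typing import Tuple


def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Join the three namespace levels into a backtick-quoted SQL identifier."""
    parts = (catalog, schema, table)
    for part in parts:
        # Reject empty levels and embedded backticks to keep quoting simple.
        if not part or "`" in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return ".".join(f"`{part}`" for part in parts)


def split_qualified_name(name: str) -> Tuple[str, str, str]:
    """Split an unquoted 'catalog.schema.table' string into its three levels."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return parts[0], parts[1], parts[2]


if __name__ == "__main__":
    fq = qualified_name("unity", "default", "orders")
    print(f"SELECT * FROM {fq} LIMIT 10")
    print(split_qualified_name("unity.default.orders"))
```

A query such as `SELECT * FROM unity.default.orders` then resolves the catalog first, the schema second, and the table last, which is what allows access rules to be attached at any of the three levels.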

Feature Comparison

| Feature / Aspect | Hive Catalog (Default) | Nessie Catalog (Optional) | Unity Catalog (Optional) |
| --- | --- | --- | --- |
| Persistence | Yes – table metadata stored in Hive Metastore | Yes – versioned metadata via Nessie service | Yes – metadata stored in Unity Catalog metastore |
| Version Control | No | Yes – supports branches, tags, merges | No – rely on table format features |
| Multi-table Transactions | No | Yes – atomic changes across multiple tables | No |
| Enabled by Default | Yes | No (optional, must be configured manually) | No (optional, must be configured manually) |
| Format Support | Parquet, ORC, Delta, Iceberg (via external catalogs) | Iceberg | Iceberg, Delta Lake, Parquet |
| MinIO Support | ✅ Yes | ✅ Yes | ⚠️ Limited (known compatibility issues) |
| Best For | Stable schemas, SQL analytics, traditional data lakehouse | Experimental branches, governance, data promotion workflows | Centralized governance, audit trails, multi-workspace control |
| Compatibility | Widely supported by Spark, Hive, Trino, Superset, etc. | Supported in Spark, Flink, Trino | Supported in Spark, Delta Lake, various cloud platforms |
| Integration in Ilum | Fully automated, configured out-of-the-box | Requires external Nessie service and Spark config | Requires Unity Catalog service and Spark config |

Catalog Selection Guide

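| Use Case | Hive Catalog | Nessie Catalog | Unity Catalog |
| --- | --- | --- | --- |
| Long-term table storage | Yes | Yes | Yes |
| Git-like branching/tagging of data | No | Yes | No |
| Multi-table transactional pipelines | No | Yes | No |
| Traditional SQL analytics + BI | Yes | | |
| Development → Staging → Prod workflows | | Yes | |
| Auditability and rollback | | Yes | |
| Easy compatibility with Trino/Superset | Yes | (indirect) | (indirect) |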

How Catalogs Work in Ilum

  • Hive Catalog is integrated directly in Ilum Core. It comes with a pre-deployed Hive Metastore and is connected to the default Ilum storage (e.g., MinIO or S3).
  • Nessie Catalog is user-configured. You must deploy a Nessie service and configure your Spark sessions to connect to it using Iceberg catalog properties.
  • Unity Catalog is user-configured. You must enable the Unity Catalog service via Helm and configure your metastore type. Note the MinIO compatibility limitation for production deployments.
  • Table Explorer in Ilum automatically lists Hive tables. Nessie and Unity Catalog tables can also be made visible with proper catalog configuration in Spark jobs.
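As an illustration of the Iceberg catalog properties mentioned for Nessie, the sketch below assembles a plausible set of Spark settings as a plain dictionary. The property keys follow the standard Iceberg and Nessie Spark integration; the function name, catalog name, service URI, and warehouse path are assumptions for this example and must be replaced with the values of your own deployment. The matching Iceberg and Nessie runtime jars also need to be available to the Spark job.

```python
from typing import Dict


def nessie_spark_conf(
    catalog_name: str = "nessie",
    uri: str = "http://nessie:19120/api/v1",     # assumed Nessie endpoint
    ref: str = "main",                           # branch the session starts on
    warehouse: str = "s3a://my-bucket/nessie",   # hypothetical warehouse path
) -> Dict[str, str]:
    """Return Spark properties that register an Iceberg catalog backed by Nessie."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Iceberg SQL extensions plus Nessie's branch/tag/merge statements.
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions"
        ),
        # Register the catalog as an Iceberg catalog implemented by Nessie.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,
        f"{prefix}.ref": ref,
        f"{prefix}.warehouse": warehouse,
    }


if __name__ == "__main__":
    for key, value in nessie_spark_conf().items():
        print(f"{key}={value}")
```

These key/value pairs can be supplied as Spark parameters when submitting a job, or applied one by one with `SparkSession.builder.config(key, value)` before the session is created. Tables are then addressed through the catalog name, for example `nessie.demo.events`.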

For details on deploying and enabling optional components like Nessie, visit:
👉 Ilum Deployment Guide


Next Steps