What is Ilum?
Modular Data Lakehouse Platform with Multi-Engine Execution on Kubernetes
Ilum is an open, modular data lakehouse platform designed for Kubernetes and Apache Hadoop Yarn environments. It unifies multi-engine SQL execution (Apache Spark, Trino, DuckDB, Apache Flink), open table formats (Delta Lake, Apache Iceberg, Apache Hudi), and open catalogs (Hive Metastore, Project Nessie, Unity Catalog, DuckLake) behind a single control plane, with column-level lineage, multi-cluster orchestration, and an extensible module system that lets each deployment include only the components it needs.
Key Capabilities
- Multi-engine SQL execution with Apache Spark, Trino, DuckDB, and Apache Flink, unified behind the Apache Kyuubi SQL gateway
- Automatic engine routing that selects the right engine for each workload: Spark for large transformations, Trino for interactive analytics, DuckDB for small-data and local execution, Flink for streaming
- Open lakehouse architecture with first-class support for Delta Lake, Apache Iceberg, and Apache Hudi, accessible through Hive Metastore, Project Nessie, Unity Catalog, and DuckLake
- Column-level data lineage powered by OpenLineage and Marquez, with an interactive graph view that toggles between lineage and ERD perspectives
- Multi-cluster control plane for Kubernetes (GKE, EKS, AKS, on-premise) and Yarn clusters, with namespace-scoped resource quotas and a single point of policy
- Modular extensibility: optional components (notebooks, BI, orchestration, ML, observability) install and upgrade at runtime via the dedicated `ilum-api` module-management microservice
- Cloud-native and open: Kubernetes-first, Helm-managed, OpenAPI-defined, no proprietary lock-in
Get started with Ilum → | View architecture documentation →
Ilum - A Modular Data Lakehouse on Kubernetes
Ilum is built around the principle that a modern data platform should be composed, not bought. Each layer of the lakehouse (execution engine, table format, catalog, orchestration, notebooks, observability) is a swappable module rather than a vertically integrated black box. This composition is what makes Ilum a credible alternative to proprietary platforms while remaining production-ready out of the box.
Architecture
Ilum is composed of clearly separated layers, each independently scalable and replaceable:
- Web UI (`ilum-ui`): React-based control plane for clusters, jobs, SQL execution, table exploration, lineage, security, and module management
- Platform service (`ilum-core`): the main backend, providing the public REST API, job orchestration, SQL execution, security, lineage, and gRPC/Kafka communication with running jobs
- Module-management microservice (`ilum-api`): a dedicated service that installs, upgrades, and disables optional Ilum modules at runtime via Helm. Future releases will extend `ilum-api` with Model Context Protocol (MCP) capabilities and open APIs for third-party integration.
- Engine layer: Spark, Trino, DuckDB, and Flink, fronted by the Kyuubi SQL gateway
- Catalog layer: Hive Metastore, Nessie, Unity Catalog, DuckLake (selectable per workload)
- Data layer: PostgreSQL (primary metadata store) and MongoDB (legacy, still supported), with object storage on MinIO, S3, GCS, Azure Blob, or HDFS
The platform supports Python (PySpark) and Scala for batch and interactive workloads, plus first-class SQL across every supported engine.
Execution Engines
Ilum exposes execution as a multi-engine surface rather than tying the platform to a single processing framework. Each engine has a clear sweet spot:
- Apache Spark: large-scale ETL, machine learning pipelines, and any workload that benefits from distributed processing across many executors
- Trino: interactive analytics across federated data sources, with fast response times on medium-to-large datasets
- DuckDB: single-node analytics on small-to-medium data, ideal for ad-hoc exploration and DuckLake-managed tables
- Apache Flink: low-latency stream processing
The automatic engine router selects the appropriate engine based on data size, workload type, and locality. Manual override remains available for every query and job.
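The routing policy above can be sketched as a simple size-and-type heuristic. The function, thresholds, and engine names below are illustrative assumptions for intuition only, not Ilum's actual router logic:

```python
# Illustrative sketch of workload-based engine routing.
# Thresholds and field names are assumptions, not Ilum's implementation.

def route(workload_type: str, data_size_gb: float) -> str:
    """Pick an engine for a workload, mirroring the documented sweet spots."""
    if workload_type == "streaming":
        return "flink"       # low-latency stream processing
    if workload_type == "etl" or data_size_gb > 500:
        return "spark"       # large distributed transformations
    if data_size_gb < 10:
        return "duckdb"      # small-data, local execution
    return "trino"           # interactive analytics in between
```

In a real deployment the documented manual override would take precedence over any such heuristic.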
Open Lakehouse
Ilum supports the three major open table formats with full ACID guarantees:
- Delta Lake: ACID transactions, time travel, schema evolution
- Apache Iceberg: partition evolution, hidden partitioning, large-scale analytics
- Apache Hudi: record-level upserts, incremental processing
Tables are addressable through any of four catalog backends:
- Hive Metastore: traditional, broadly compatible
- Project Nessie: Git-style branching and version control for Iceberg tables
- Unity Catalog: Databricks-compatible governance and access control
- DuckLake: DuckDB-native catalog enabled by default for local execution
The same table definitions are usable from every engine, making it trivial to shift a workload from Spark to Trino to DuckDB without rewriting queries or copying data.
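To make the engine-portability claim concrete, the sketch below builds two requests for the documented `POST /api/v1/sql/execute` endpoint that run the same statement on different engines. The payload schema (`engine`, `statement`) and the table name are illustrative assumptions, not Ilum's documented request body:

```python
import json

# One statement, two engines: only the routing target changes.
# Payload field names are hypothetical; check the OpenAPI spec for the
# real schema of POST /api/v1/sql/execute.
QUERY = "SELECT region, SUM(amount) FROM sales.orders GROUP BY region"

def sql_payload(engine: str, statement: str = QUERY) -> str:
    """Serialize a SQL-execution request body for the given engine."""
    return json.dumps({"engine": engine, "statement": statement})

spark_req = sql_payload("spark")
trino_req = sql_payload("trino")
```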
REST API and Programmatic Access
Ilum exposes its full surface area through an OpenAPI 3.0 specification (currently 1.5.x):
```
# Submit a Spark job
POST /api/v1/job

# Open or interact with an interactive group
POST /api/v1/group

# Execute a SQL query against any registered engine
POST /api/v1/sql/execute

# Schedule a recurring job
POST /api/v1/schedule
```
Every UI action is backed by a documented REST endpoint, which makes Ilum straightforward to drive from CI pipelines, custom orchestrators, or downstream services. Use cases include:
- Triggering Spark transformations from API gateways
- Running on-demand Trino or DuckDB queries from BI tools
- Submitting and monitoring streaming Flink jobs from external workflows
- Driving Jupyter notebook kernels through HTTP
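A minimal sketch of driving the documented `POST /api/v1/job` endpoint from Python. The base URL and every payload field are illustrative assumptions; the request is built but deliberately not sent, and the real schema lives in the OpenAPI specification:

```python
import json
import urllib.request

# Hypothetical in-cluster address; substitute your own deployment's URL.
BASE_URL = "http://ilum-core:9888"

# Assumed field names for a Spark job submission -- not the verified schema.
payload = {
    "clusterName": "default",
    "name": "daily-aggregation",
    "jobClass": "com.example.DailyAggregation",
    "jars": ["s3a://jobs/daily-aggregation.jar"],
}

request = urllib.request.Request(
    url=f"{BASE_URL}/api/v1/job",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) would actually submit it; omitted here.
```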
Multi-Cluster Orchestration
Ilum manages heterogeneous clusters from a single control plane:
- Cloud Kubernetes: GKE, EKS, AKS with auto-scaling node pools
- On-premise Kubernetes: bare metal, OpenShift, Rancher
- Hadoop Yarn: hybrid architectures alongside Kubernetes deployments
- Local: in-process execution for development and testing
Each cluster maintains independent resource quotas, storage backends, and security policies, while sharing centralized monitoring, lineage, and scheduling.
Modular by Design
Ilum ships as a small core with a curated set of optional modules. Each module is a Helm sub-chart enabled or disabled per deployment, and managed at runtime by the `ilum-api` microservice:
- Engines: Spark, DuckDB (default-on), Trino, Flink
- Catalogs: Hive Metastore (default-on), Nessie, Unity Catalog, DuckLake
- Notebooks: Jupyter (default-on), JupyterHub (Enterprise), Zeppelin
- Orchestration: Apache Airflow, Kestra, Mage, n8n, Apache NiFi
- BI and visualization: Apache Superset, Streamlit
- AI and ML: MLflow, LangFuse, AI Data Analyst
- Observability: Prometheus, Grafana, Loki, Marquez (default-on)
- Identity: OAuth2, OIDC, LDAP, Active Directory, Ory Hydra (Ilum as IdP)
- Storage: MinIO (default-on), S3, GCS, Azure Blob, HDFS
Future releases will extend `ilum-api` with MCP and additional open APIs, opening the module system to third-party extension.
Comparison with Alternative Solutions
| Capability | Ilum | Databricks | Cloudera |
|---|---|---|---|
| Multi-engine SQL (Spark + Trino + DuckDB + Flink) | ✓ | Spark only | Limited |
| Automatic engine routing | ✓ | ✗ | ✗ |
| Open table formats (Delta + Iceberg + Hudi) | ✓ | Delta-first | Iceberg + Hudi |
| Open catalogs (Hive + Nessie + Unity + DuckLake) | ✓ | Unity only | Hive only |
| Column-level lineage (OpenLineage) | ✓ | Proprietary | Limited |
| Kubernetes-native | ✓ | Partial | Partial |
| Yarn integration | ✓ | ✗ | ✓ |
| On-premise deployment | ✓ | Limited | ✓ |
| Multi-cluster control plane | ✓ | Limited | ✓ |
| REST API for interactive sessions | ✓ | ✓ | Limited |
| Vendor lock-in | None | High | High |
| Built on open standards | ✓ | Partial | Partial |
Video Overview
Prefer a guided path? Build your first data product on Ilum in hours. Official course →.
Features
Multi-Engine SQL and Execution
- Kyuubi SQL Gateway: Unified JDBC/ODBC entry point for Spark, Trino, and Flink
- Engine selector and lifecycle: Start, stop, and restart engines from the UI; live engine status indicators
- Dialect transpilation: Translate queries between Spark SQL, Trino SQL, DuckDB SQL, and Flink SQL via the built-in transpiler
- In-app SQL notebooks: Persistent multi-cell notebooks with per-cell execution, profiling, and visualization
- Saved queries: Folder-organized query library with bulk operations and move support
Open Lakehouse and Catalogs
- Hive Metastore (default): Centralized table metadata compatible with every engine
- Project Nessie: Git-style branches and tags for Iceberg tables, enabling reproducible analytics
- Unity Catalog: Databricks-compatible governance, access control, and lineage
- DuckLake (default): DuckDB-native catalog for local-first analytics
- Unified Tables abstraction: Read and write Delta, Iceberg, and Hudi using the same Spark API
- Table descriptions: Editable metadata directly in the Table Explorer
Spark Cluster Management
- Kubernetes Operator integration: Native CRD-based Spark application deployment with pod lifecycle management
- Multi-cluster control plane: Centralized management for GKE, EKS, AKS, on-premise Kubernetes, and Yarn
- Horizontal pod autoscaling: Dynamic executor scaling based on CPU, memory, and queue depth
- Resource quotas and LimitRanges: Namespace-scoped limits enforced through Kubernetes-native primitives
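Because quotas are enforced through Kubernetes-native primitives, they take the form of ordinary `ResourceQuota` objects. A minimal sketch, with placeholder names and limits:

```yaml
# Hypothetical per-team namespace quota; names and values are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: spark-team-quota
  namespace: spark-team
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 160Gi
    limits.cpu: "80"
    limits.memory: 320Gi
    pods: "100"
```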
Workloads
Ilum manages five workload types as first-class concepts:
- Clusters: Compute targets (Kubernetes, Yarn, Local)
- Jobs: One-shot batch executions
- Services: Long-running interactive sessions
- Schedules: Cron-driven recurring executions
- Requests: Ad-hoc query and batch submissions
Every workload exposes Status, Logs, Metrics, and Description tabs, with URL-persisted filters and bulk actions.
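As a concrete example of the Schedules workload type, the sketch below builds a body for the documented `POST /api/v1/schedule` endpoint. The cron expression uses standard five-field syntax; all field names are illustrative assumptions rather than Ilum's verified schema:

```python
import json

# Sketch of a recurring-job definition for POST /api/v1/schedule.
# Field names are hypothetical; the cron string is standard five-field
# syntax (minute hour day-of-month month day-of-week).
schedule = {
    "name": "nightly-compaction",
    "cron": "0 2 * * *",   # every day at 02:00
    "clusterName": "default",
    "jobConfig": {"jobClass": "com.example.Compaction"},
}
body = json.dumps(schedule)
```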
Interactive Computing and Notebooks
- Jupyter and JupyterHub (Enterprise): SparkMagic kernels with automatic session binding
- Apache Zeppelin: Multi-language interpreters with paragraph-level execution
- In-app SQL notebooks: Multi-engine cells inside the Ilum SQL Editor
- Spark Connect: Client-server Spark with Kubernetes-aware proxy for remote execution
Data Exploration and Lineage
- Table Explorer: Browse Hive, Nessie (with branch switching), Unity Catalog, and DuckLake tables; preview data, edit descriptions, inspect partitions
- File Explorer: Direct browsing of MinIO, S3, GCS, Azure Blob, and HDFS storage
- Column-level lineage: Powered by OpenLineage and Marquez, with an interactive React Flow graph
- ERD ↔ lineage toggle: Switch between schema and runtime perspectives on the same dataset graph
Orchestration and Workflows
- Built-in scheduler: Cron-based job scheduling with dependency management
- Apache Airflow: DAG-based workflows with pre-configured Spark operators
- Kestra: Event-driven pipelines with Spark task execution
- Mage, n8n, Apache NiFi: Visual and code-based pipeline orchestrators
- dbt: SQL transformations executed on any registered engine
Observability
- Spark History Server: Job timeline, stage metrics, executor utilization
- Prometheus + Grafana: Pre-configured dashboards via the kube-prometheus-stack
- Loki + Promtail: Centralized log aggregation
- Graphite exporter: Push-based metrics for multi-cluster environments
- OpenLineage events: Captured automatically for every job
AI, ML, and BI
- MLflow: Experiment tracking and model registry
- LangFuse: LLM observability for AI-driven workloads
- AI Data Analyst: Assistant tooling for SQL exploration. See the AI Data Analyst page for current capabilities.
- Apache Superset: Open-source BI dashboards
- Streamlit: Lightweight Python apps for analytics and ML demos
- Tableau and PowerBI: External BI connectivity via Kyuubi JDBC
Security and Access Control
- RBAC: Fine-grained permissions enforced through the `RequiresPermission` framework
- OAuth2 / OIDC: Integration with Keycloak, Okta, Azure AD, Google, GitLab
- LDAP / Active Directory: Enterprise directory integration
- Ilum as Identity Provider: Embedded Ory Hydra lets Ilum issue OAuth2 tokens for embedded tools (Airflow, Superset, Grafana, Gitea, MinIO, etc.)
- API tokens: Long-lived credentials for programmatic access
- TLS / mTLS: Certificate-based encryption for inter-service traffic
- Network policies: Kubernetes-native pod-to-pod restrictions
Explore full feature documentation → | Request new features →
Advantages
Cloud-Native and Composable
Ilum is built as a set of containerized services with declarative configuration and GitOps-compatible deployment:
- Helm charts: Parameterized Kubernetes manifests for reproducible deployments
- Modular Helm sub-charts: Each engine, catalog, and integration installs independently
- Custom Resource Definitions: Kubernetes API extensions for Spark application management
- Service mesh ready: Compatible with Istio and Linkerd for advanced traffic management
- Runtime module management: The `ilum-api` microservice installs, upgrades, and disables modules without redeploying the platform
No Vendor Lock-In
Unlike proprietary platforms, Ilum provides:
- Open APIs: REST and gRPC interfaces defined by OpenAPI 3.0 specifications
- Standard protocols: JDBC/ODBC connectivity, S3 API compatibility, Kafka integration, OpenLineage events
- Portable workloads: Spark, Trino, DuckDB, and Flink jobs run on any Kubernetes cluster without modification
- Open table formats and catalogs: Delta, Iceberg, Hudi over Hive, Nessie, Unity, or DuckLake. Pick what fits the team.
- Multi-cloud deployment: AWS, GCP, Azure, and on-premise without platform-specific dependencies
Hadoop and Cloudera Migration Path
For organizations migrating from legacy big-data platforms:
- Yarn compatibility: Run existing Yarn-based Spark jobs without code changes
- HDFS connector: Direct access to HDFS clusters during migration phases
- Hive Metastore reuse: Preserve existing table metadata and partitioning schemes
- Incremental migration: Phased transition with hybrid Yarn/Kubernetes deployment
- Bifrost (Enterprise): A dedicated migration automation tool for Hadoop and Cloudera CDP estates, covering discovery, phased execution, data validation, and rollback
Performance and Resource Control
- Engine specialisation: The automatic router moves workloads to the engine that processes them most efficiently
- Dynamic allocation: Spark executor scaling based on shuffle data and pending tasks
- Adaptive Query Execution: Runtime optimisation for join strategies and partition coalescing
- DuckLake local execution: Sub-second response times for small-data analytics
- Kubernetes resource quotas and LimitRanges: Predictable per-namespace resource isolation
Enterprise Integration
Ilum is built for enterprise data platforms:
- Apache Kafka: Native Spark Structured Streaming integration with exactly-once semantics
- Apache Airflow: Managed Airflow with Spark operators pre-configured
- MLflow: Model registry and experiment tracking for ML pipelines
- LangFuse: Trace and audit LLM-driven workflows alongside data pipelines
- OpenLineage: Standards-based data lineage emitted by every job
- Superset, Tableau, PowerBI: BI connectivity through Kyuubi JDBC
Read architecture documentation → | View use cases →
Project Roadmap
Active areas of development include:
- Apache Flink GA: Promoting Flink from Enterprise Beta to general availability for streaming workloads
- Automatic engine router enhancements: Additional heuristics (cardinality estimates, locality awareness) and per-team policies
- `ilum-api` MCP and open APIs: Extending the module-management microservice with Model Context Protocol support and a public extension API for third-party modules
- GPU scheduling: CUDA-enabled executors for deep-learning workloads
- Expanded Unity Catalog integration: Bidirectional governance sync with Databricks deployments
View full roadmap → | See changelog →
Additional Resources
- API Reference: REST API documentation for programmatic access
- Architecture: Layered platform architecture
- Security Guide: Authentication, authorization, and network policies
- Production Deployment: Best practices for production clusters
- User Guides: Step-by-step tutorials for common workflows