Target Architecture for Hadoop Migration: Iceberg, Trino, Airflow
The Modernize and Direct paths land a Kubernetes-native lakehouse built on Ilum. This page describes that target platform from the perspective of someone planning a Bifrost migration. For the general-purpose description of the platform itself, see Ilum architecture.
The intent of this page is to make four things concrete:
- Which services exist on the target platform and how they relate.
- How a query executes against a migrated table.
- How data moves from the source cluster into the target platform during migration.
- How the dual-read bridge lets legacy and migrated tables coexist while the estate is transitioning.
High-Level Topology
The target platform is a set of interoperating Kubernetes services. Identity, query routing, orchestration, catalog, and storage each have a dedicated layer. Ilum is the Spark management plane.
+-----------------+
| KEYCLOAK |
| (OIDC IdP) |
+--------+--------+
|
+------------------------+------------------------+
| | |
+---------v--------+ +----------v--------+ +----------v--------+
| SUPERSET | | TRINO GATEWAY | | AIRFLOW 3 |
| (BI / SQL Lab) | | (L7 Router) | | (Orchestration) |
+---------+--------+ +----+----------+---+ +----------+--------+
| | | |
| +-------v---+ +---v--------+ |
| | TRINO | | TRINO | |
| | Interactive| | ETL | |
| | (no retry)| | (FT exec) | |
| +-------+---+ +---+--------+ |
| | | |
+------------------+---------+--------------------+
|
+----------v----------+
| APACHE POLARIS |
| (Iceberg REST |
| Catalog) |
+----------+----------+
|
+------------------+------------------+
| |
+---------v--------+ +---------v--------+
| CEPH OBJECT | | OPENMETADATA |
| STORAGE (S3A) | | (Governance) |
+------------------+ +------------------+
+------------------+------------------+
| | |
+---------v--------+ +------v-------+ +--------v--------+
| SPARK OPERATOR | | YUNIKORN | | CELEBORN |
| (Job CRDs) | | (Scheduler) | | (Shuffle Svc) |
+------------------+ +--------------+ +-----------------+
+------------------+------------------+
| | |
+---------v--------+ +------v-------+ +--------v--------+
| ILUM CORE | | KYUUBI | | dbt + COSMOS |
| (Spark Mgmt) | | (SQL GW) | | (Transform) |
+------------------+ +--------------+ +-----------------+
Each block is deployed as a Kubernetes workload, typically via Helm. Most components are configured centrally by Bifrost during the bifrost modernize land (or bifrost direct land) phase.
Query Data Flow
When a user runs a query against a migrated Iceberg table, the following chain executes:
User Browser
|
v
[1] Superset (authenticates user via Keycloak OIDC)
| SQLAlchemy connection to Trino
v
[2] Trino Gateway (routes to interactive or ETL cluster based on query properties)
|
v
[3] Trino Coordinator (authenticated via OAuth2 token from Keycloak)
| Parses SQL, resolves table via the catalog
v
[4] Polaris REST Catalog (authenticates Trino via OAuth2 client_credentials)
| Returns Iceberg metadata: manifest list location in object storage
| Optionally vends scoped S3 credentials
v
[5] Trino Workers read Parquet / ORC files directly from object storage via S3A
|
v
[6] Results returned to Superset, rendered to the user
The user never interacts with the object store directly. Authorization and row-level or column-level policies are enforced by Trino in collaboration with an OPA sidecar (see Identity and access).
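The routing decision in step [2] can be sketched as a small heuristic. This is an illustrative assumption, not the shipped Trino Gateway rule set: real deployments express routing in the Gateway's own rules configuration, and the header-based condition below is just one plausible signal.

```python
# Illustrative sketch of a gateway-style routing decision.
# The tag value "etl" and the two backend names are assumptions
# for this example, not the actual Gateway configuration.

def route_query(headers: dict) -> str:
    """Return the backend cluster group for an incoming query."""
    # Scheduled/batch clients tag their requests; send them to the
    # fault-tolerant ETL cluster so task retries survive worker loss.
    tags = headers.get("X-Trino-Client-Tags", "")
    if "etl" in tags.split(","):
        return "trino-etl"
    # Everything else goes to the low-latency interactive cluster,
    # which runs without fault-tolerant execution (no retry overhead).
    return "trino-interactive"

print(route_query({"X-Trino-Client-Tags": "etl,nightly"}))  # trino-etl
print(route_query({}))                                      # trino-interactive
```

The `X-Trino-Client-Tags` header is a standard Trino client header, which is why query properties like client tags are a natural routing input here.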
Migration Data Flow
Table migration during Modernize or Direct follows a consistent pattern:
[1] bifrost modernize migrate-table --table db.customers --strategy snapshot
|
v
[2] Migration driver invokes a Spark job on the target cluster via Ilum
| Spark reads Hive Metastore metadata from the legacy catalog
| Spark reads data file locations from the legacy storage
| Spark creates Iceberg metadata pointing to the existing files (no data copy for snapshot)
| Spark registers the table in the Polaris REST catalog
|
v
[3] bifrost modernize validate runs data-diff
| Row-count comparison: source vs. target
| Value-level hash sampling on a configurable percentage of partitions
|
v
[4] The decision engine evaluates the results
| PROCEED: the table is marked migrated and table redirection is enabled
| ABORT: the table is reverted (Iceberg metadata deleted), an alert is sent
|
v
[5] Trino table redirection resolves queries to the Iceberg location
| Queries to db.customers on the legacy catalog transparently hit the migrated copy
Three strategies are available for table migration: snapshot, migrate, and add_files. Each is described in Per-component migration reference — catalog.
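The PROCEED/ABORT evaluation in step [4] can be sketched as follows. The function name, field names, and zero-tolerance default are assumptions for illustration; the actual decision engine and its thresholds are configurable.

```python
# Hedged sketch of the validation decision in step [4].
# Names and the mismatch tolerance are illustrative assumptions.

def evaluate(row_count_source: int, row_count_target: int,
             hash_mismatches: int, partitions_sampled: int,
             tolerance: float = 0.0) -> str:
    """Return PROCEED or ABORT from data-diff results."""
    # Row counts must match exactly before sampling results matter.
    if row_count_source != row_count_target:
        return "ABORT"
    # Compare the sampled-partition hash mismatch rate to the tolerance.
    mismatch_rate = hash_mismatches / max(partitions_sampled, 1)
    return "PROCEED" if mismatch_rate <= tolerance else "ABORT"

print(evaluate(1_000_000, 1_000_000, 0, 50))  # PROCEED
print(evaluate(1_000_000, 999_998, 0, 50))    # ABORT
```

On ABORT, as described above, the revert path only deletes the Iceberg metadata: with the snapshot strategy no data files were copied, so there is nothing else to clean up.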
Dual-Read Bridge
During Modernize and Direct, the dual-read bridge is the mechanism that lets legacy and migrated tables coexist. Trino is configured with two catalogs: one pointing at the legacy Hive Metastore and one pointing at the Iceberg REST catalog. Table redirection routes queries to the correct catalog per table, transparently.
+-------------------+ +-------------------+
| LEGACY HADOOP | | KUBERNETES |
| (bare-metal) | | PLATFORM |
| | | |
| +-------------+ | Sync jobs | +-------------+ |
| | HDFS |--+------------------->| | Ceph RGW | |
| | NameNode | | | | (S3A) | |
| +-------------+ | | +------+------+ |
| | | | |
| +-------------+ | Catalog feed | +------v------+ |
| | Hive |--+------------------->| | Polaris | |
| | Metastore | | | | (Iceberg) | |
| +-------------+ | | +------+------+ |
| | | | |
| +-------------+ | | +------v------+ |
| | YARN | | Dual-mode Spark | | Spark on | |
| | (legacy |<-+- - - - - - - - - -| | K8s (Ilum) | |
| | workloads) | | | +-------------+ |
| +-------------+ | | |
+-------------------+ +-------------------+
| |
+-------- Trino (table redirection) -----+
| reads from BOTH catalogs |
v v
hive_hdfs catalog iceberg_s3 catalog
(legacy tables) (migrated tables)
The consequence is that BI tools, dbt models, and application queries continue to work unchanged during the transition. Table redirection makes the migration transparent at the query layer.
During this phase, the legacy HDFS cluster remains authoritative. The object storage side is kept in sync by periodic DistCp or continuous replication jobs. Only after validation passes and a silence period elapses is the legacy service decommissioned.
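The two catalogs in the diagram correspond to two Trino catalog files. The property names below follow the Trino Hive and Iceberg connector conventions, but all endpoint values are placeholders for this sketch; in particular, `hive.iceberg-catalog-name` is the property that makes the Hive connector redirect Iceberg-format tables to the named Iceberg catalog.

```properties
# hive_hdfs.properties — legacy tables (sketch; endpoints are placeholders)
connector.name=hive
hive.metastore.uri=thrift://legacy-metastore.example.internal:9083
# Redirect tables already converted to Iceberg to the iceberg_s3 catalog
hive.iceberg-catalog-name=iceberg_s3

# iceberg_s3.properties — migrated tables (sketch)
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=https://polaris.lakehouse.internal/api/catalog
```

With this pairing in place, a query against the legacy name continues to work before, during, and after the table's migration.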
Identity and Access
Authentication across the platform is centralized through Keycloak using OAuth2 and OpenID Connect.
User (browser)
|
| [OIDC authorization code flow]
v
Keycloak --> issues JWT access token with claims:
- sub: [email protected]
- groups: [data-engineers, team-analytics]
- realm_roles: [lakehouse-user]
|
+---> Superset: validates JWT, maps groups to Superset roles
|
+---> Trino Coordinator: validates JWT via OIDC discovery
|
+---> Trino -> Polaris: OAuth2 client_credentials (service-to-service)
|
+---> Polaris -> Object storage: static S3 credentials on-prem,
| or vended scoped credentials on cloud
|
+---> OPA: Trino coordinator calls an OPA sidecar for every query
through four possible endpoints:
* opa.policy.uri — allow / deny decisions
* opa.policy.row-filters-uri — row-level filter expressions
* opa.policy.column-masking-uri — per-column mask expressions
* opa.policy.batch-column-masking-uri — batched column masks;
overrides the per-column
endpoint when set
Row-filter and column-mask endpoints are consulted only for the
tables and columns involved in the query, and only after the main
allow check returns allow. OPA evaluates policies against the full
input structure (input.action, input.context, input.context.identity)
and returns one response per endpoint call.
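The four endpoints above map directly onto the coordinator's access-control configuration. A sketch, using the property names listed above; the sidecar URL and OPA package paths are deployment-specific placeholders:

```properties
# access-control.properties on the Trino coordinator (sketch;
# the localhost URL and /trino/... policy paths are placeholders)
access-control.name=opa
opa.policy.uri=http://localhost:8181/v1/data/trino/allow
opa.policy.row-filters-uri=http://localhost:8181/v1/data/trino/rowFilters
opa.policy.column-masking-uri=http://localhost:8181/v1/data/trino/columnMask
opa.policy.batch-column-masking-uri=http://localhost:8181/v1/data/trino/batchColumnMasks
```

Because the sidecar is reached over localhost only, the policy round-trip stays off the pod network, which is also why the default network policies below restrict OPA to localhost traffic.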
Credential vending: cloud vs. on-prem
- Cloud deployments support credential vending end to end. The catalog generates temporary, scoped object storage credentials per table access.
- On-premises deployments typically use static credentials for object storage access combined with fine-grained OPA policies at the query layer. Ceph RGW has supported STS-based credential vending since the Reef release, but end-to-end integration with the Iceberg REST catalog on-prem still requires extra role-assumption wiring; most production on-prem deployments therefore fall back to static keys. Catalog-level authorization always controls which principals can see which namespaces and tables.
Service account matrix
| Service | Authenticates to | Method |
|---|---|---|
| Superset | Keycloak | OIDC authorization code flow |
| Trino (user-facing) | Keycloak | OIDC (JWT validation) |
| Trino to Polaris | Keycloak | OAuth2 client_credentials |
| Trino to object storage | Object storage | Static S3 access or vended credentials |
| Trino to OPA | OPA sidecar | localhost HTTP (trusted-network co-location) |
| Airflow | Keycloak | OIDC authorization code flow |
| Airflow to Trino | Trino | Service account |
| Spark (via Ilum) to object storage | Object storage | Credentials injected via Ilum cluster config |
| Spark to Polaris | Keycloak | OAuth2 client_credentials |
| OpenMetadata | Keycloak | OIDC authorization code flow |
| OpenMetadata to Trino | Trino | Service account for metadata ingestion |
For the underlying Ilum identity model, see Ilum security.
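Several rows in the matrix use the same OAuth2 client_credentials grant against Keycloak. A minimal sketch of what that token request looks like; the realm name, client id, and secret are illustrative placeholders, while the `/realms/{realm}/protocol/openid-connect/token` path is Keycloak's standard token endpoint:

```python
# Sketch of an OAuth2 client_credentials token request to Keycloak.
# Realm, client id, and secret are illustrative placeholders.
from urllib.parse import urlencode

def token_request(realm: str, client_id: str, client_secret: str):
    """Build the Keycloak token endpoint URL and form body."""
    url = (f"https://keycloak.lakehouse.internal/realms/{realm}"
           "/protocol/openid-connect/token")
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    })
    return url, body

url, body = token_request("lakehouse", "trino", "s3cr3t")
print(url)
```

The returned JWT is then presented as a bearer token on every Trino-to-Polaris (or Spark-to-Polaris) request until it expires.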
Networking and Ingress
Bifrost provisions an ingress controller (or reuses the customer's existing one) with the following default hostnames. The domain is customer-specific and comes from the global.domain value in the deployment configuration.
| Hostname | Backend service | Notes |
|---|---|---|
| superset.lakehouse.internal | Superset | WebSocket support for async SQL Lab |
| trino.lakehouse.internal | Trino Gateway | Routes to interactive or ETL cluster |
| airflow.lakehouse.internal | Airflow webserver | |
| polaris.lakehouse.internal | Polaris | REST API and OAuth2 token endpoint |
| openmetadata.lakehouse.internal | OpenMetadata | |
| ilum.lakehouse.internal | Ilum UI | |
| keycloak.lakehouse.internal | Keycloak | Must be reachable from all clients |
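The hostnames derive from global.domain. A hedged sketch of the relevant deployment values: only global.domain is named on this page, so the per-service ingress keys below are illustrative assumptions about the chart layout.

```yaml
# Sketch of deployment values. global.domain comes from this page;
# the per-service ingress key names are illustrative assumptions.
global:
  domain: lakehouse.internal
superset:
  ingress:
    host: superset.lakehouse.internal
trinoGateway:
  ingress:
    host: trino.lakehouse.internal
```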
Legacy to Kubernetes connectivity
During the dual-read bridge phase, legacy Hadoop nodes must reach Kubernetes-hosted services (the object storage endpoint for sync writes), and Kubernetes pods must reach legacy HDFS (for Spark reads during the transition).
- Kubernetes to Hadoop — Spark on Kubernetes reads HDFS by mounting the legacy Hadoop configuration as ConfigMaps. The Kubernetes pod network must reach the source NameNode and DataNodes.
- Hadoop to Kubernetes — the object storage endpoint is exposed via a NodePort or LoadBalancer service with a stable address. DistCp uses that endpoint for large data transfers.
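The ConfigMap mount mentioned above can be sketched as follows. Resource names, the namespace, and the NameNode address are placeholders; Spark on Kubernetes finds the directory through the standard HADOOP_CONF_DIR mechanism.

```yaml
# Sketch: expose the legacy core-site.xml to Spark pods so S3A-era
# jobs can still read hdfs:// paths. Names and addresses are
# illustrative placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: legacy-hadoop-conf
  namespace: spark
data:
  core-site.xml: |
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://legacy-namenode.example.internal:8020</value>
      </property>
    </configuration>
```

The same ConfigMap typically also carries hdfs-site.xml so that HA NameNode nameservices resolve correctly from inside the pod network.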
Default network policies
Bifrost ships a default set of NetworkPolicy resources for namespace isolation:
- Spark executor pods can communicate with the shuffle service, the object storage endpoint, Polaris, Ilum Core, and their own driver pod.
- Trino workers can communicate with the Trino coordinator, the object storage endpoint, and Polaris.
- The OPA sidecar accepts traffic from localhost only (it is co-located with the Trino coordinator).
- Shuffle service workers accept connections from Spark executor pods in the same namespace.
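One of the defaults above, Trino workers reaching the coordinator, can be sketched as a NetworkPolicy. The namespace and label selectors are illustrative assumptions about how the charts label pods:

```yaml
# Sketch of a default policy: Trino worker pods may open egress
# connections to the coordinator. Labels are illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: trino-worker-to-coordinator
  namespace: trino
spec:
  podSelector:
    matchLabels:
      app: trino
      component: worker
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: trino
              component: coordinator
```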
These defaults can be extended or tightened by the customer's platform team; they are a starting point, not a finished security boundary.
Custom Resources
Bifrost tracks migration progress using three Custom Resource Definitions installed under the bifrost.ilum.cloud/v1 API group. These are visible via kubectl get and reflect the state the CLI operates on.
TableMigration
Represents a single table in flight or migrated.
- Spec fields: sourceTable, sourceCatalog, targetCatalog, strategy (snapshot / migrate / add_files), wave, priority.
- Status fields: phase (one of discovered, planned, bridged, migrating, validating, migrated, decommissioned, failed), rowCountSource, rowCountTarget, dataDiffResult, lastTransitionTime (RFC 3339), revertAvailable.
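A sketch of what such an object looks like on the cluster. Field names follow the spec and status lists above; the metadata and values are illustrative.

```yaml
# Illustrative TableMigration object; values are placeholders.
apiVersion: bifrost.ilum.cloud/v1
kind: TableMigration
metadata:
  name: db-customers
  namespace: bifrost
spec:
  sourceTable: db.customers
  sourceCatalog: hive_hdfs
  targetCatalog: iceberg_s3
  strategy: snapshot
  wave: 2
  priority: high
status:
  phase: validating
  rowCountSource: 1000000
  rowCountTarget: 1000000
  revertAvailable: true
```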
WorkflowMigration
Represents an Oozie-to-Airflow conversion unit.
- Spec fields: sourceType (oozie), sourcePath, targetType (airflow), targetDagId.
- Status fields: phase (discovered, analyzing, converting, converted, testing, deployed), actionsTotal, actionsConverted, actionsManual, todoAnnotations.
ServiceMigration
Represents a bulk service-level migration (for example, HDFS-to-Ceph storage migration).
- Spec fields: sourceService, targetService, sourceCluster.
- Status fields: phase, storageMigratedTB, storageTotalTB, percentComplete, lastSyncTime.
Inspect in-flight migration state with kubectl get tablemigrations -n bifrost or bifrost modernize status, which reads the same objects.
Further Reading
- Per-component migration reference — how each legacy system maps to its replacement.
- Ilum architecture — the destination platform in general-purpose terms.
- Operations — monitoring, backup, encryption, upgrades, and multi-tenancy.
- Ilum security — security model of the destination platform.