Hadoop Component Migration: HDFS, YARN, Hive, Oozie, Ranger
Modernize and Direct migrations consist of several parallel legs. This page is a single reference for every leg, from HDFS-to-object-storage through HBase-to-Cassandra. Each section covers target choices, what Bifrost automates, configuration highlights, and known limitations.
Storage: HDFS to S3A Object Storage
Target
The target is an S3A-compatible object storage endpoint. Bifrost's reference deployment uses Ceph via its RadosGW interface, which provides block, file, and object storage from a single cluster. Cloud deployments can use the native S3 endpoint of the chosen cloud provider.
The object storage underpins three use cases on the target platform:
- Block (RBD) — stateful Kubernetes applications (Postgres, Redis, Kafka persistence).
- File (CephFS) — shared scratch space for pipelines, dbt project files, Airflow DAG storage.
- Object (S3A) — all lakehouse data: Iceberg tables, Parquet and ORC files, Spark checkpoints.
Tuned S3A configuration bundle
Bifrost ships a pre-tuned core-site.xml bundle that closes the majority of S3A-on-object-storage friction. These properties are set automatically on the target platform and on every Spark job Bifrost submits:
| Property | Value | Rationale |
|---|---|---|
| `fs.s3a.path.style.access` | `true` | Required for many on-prem S3 endpoints (no virtual-host DNS). |
| `fs.s3a.committer.name` | `magic` | Eliminates rename-as-copy on writes. |
| `fs.s3a.input.fadvise` | `random` | Optimal for Parquet and ORC column reads. The legacy alias `fs.s3a.experimental.input.fadvise` is still accepted on Hadoop 3.3 or later but emits a deprecation warning. |
| `fs.s3a.multipart.size` | `64M` | Balance between throughput and memory. |
| `fs.s3a.connection.maximum` | `200` | Sufficient for heavy parallel I/O. |
| `fs.s3a.fast.upload.buffer` | `disk` | Avoids OOM on large uploads. |
| `fs.s3a.threads.max` | `64` | Matches the connection pool. |
| `fs.s3a.endpoint` | Cluster-specific | The object storage endpoint URL. |
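On disk, the bundle is an ordinary `core-site.xml` fragment. A sketch of the equivalent XML follows; the endpoint URL is a placeholder, not a real cluster value:

```xml
<!-- Excerpt of the tuned S3A bundle. The endpoint value is a placeholder. -->
<configuration>
  <property><name>fs.s3a.path.style.access</name><value>true</value></property>
  <property><name>fs.s3a.committer.name</name><value>magic</value></property>
  <property><name>fs.s3a.input.fadvise</name><value>random</value></property>
  <property><name>fs.s3a.multipart.size</name><value>64M</value></property>
  <property><name>fs.s3a.connection.maximum</name><value>200</value></property>
  <property><name>fs.s3a.fast.upload.buffer</name><value>disk</value></property>
  <property><name>fs.s3a.threads.max</name><value>64</value></property>
  <property><name>fs.s3a.endpoint</name><value>https://s3.example.internal</value></property>
</configuration>
```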
Migration methods
| Method | Use case | Throughput |
|---|---|---|
| DistCp with magic committer | Bulk historical data | 500 MB/s to 2 GB/s per cluster |
| Continuous replication (vendor tools) | Live data that changes continuously | Near real-time |
| Incremental DistCp sync | Periodic catch-up for slowly-changing data | Same as DistCp |
| HDFS-compatible bridge (JuiceFS) | Applications requiring native HDFS SDK | HDFS-equivalent |
| Two-step via Apache Ozone | Legacy estates running Ozone | HDFS-equivalent |
Tuning tip: fewer mappers at higher bandwidth per mapper outperform many mappers against the same endpoint. Raise mapper count only until HTTP 503 (SlowDown) responses start appearing; beyond that, aggregate throughput drops.
Checksum validation caveat: HDFS uses MD5-of-CRC32C checksums that are incompatible with S3 ETags. Validate via row-count parity on Parquet or ORC files rather than byte-level CRC comparison. Bifrost's validation framework handles this automatically.
DistCp failure recovery
Bifrost runs DistCp with `-i -log /path/_distcp_logs`. The `-i` option ignores per-file failures instead of aborting the whole job; the `-log` option writes a per-file outcome log to the specified directory. The storage migration playbook then:
- Reads the failure log after the initial pass.
- Retries failed files up to 3 times with exponential back-off.
- Reports any files that still fail after the retries for manual investigation.
- Runs row-count validation on the source and target directory trees; any count mismatch is surfaced before the storage migration is considered complete.
Partial-failure output is preserved under `/var/log/bifrost/transcripts/<cluster>/<run-id>/distcp/` and includes the full per-file DistCp log.
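The retry step above can be sketched as a small loop with exponential back-off. This is illustrative only: the real playbook re-drives `hadoop distcp`, and the function names and constants here are hypothetical, not Bifrost interfaces.

```python
# Sketch of retry-with-exponential-backoff for files that failed a DistCp pass.
import time

MAX_RETRIES = 3        # retry failed files up to 3 times
BASE_BACKOFF_S = 5     # first back-off interval; doubles on each attempt

def retry_failed_files(failed_paths, copy_fn, sleep_fn=time.sleep):
    """Retry each failed path with exponential back-off.

    copy_fn(path) returns True on success. Returns the list of paths that
    still fail after MAX_RETRIES attempts, for manual investigation.
    """
    still_failing = []
    for path in failed_paths:
        for attempt in range(MAX_RETRIES):
            if copy_fn(path):
                break
            sleep_fn(BASE_BACKOFF_S * (2 ** attempt))  # 5 s, 10 s, 20 s
        else:
            still_failing.append(path)  # exhausted all retries
    return still_failing
```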
Capacity guardrails
- Target operating ceiling: 80 % pool utilization. Performance degrades sharply above 85 %.
- Production readiness requires the pool to be below 70 % before the first migration wave, to leave headroom for incoming data.
- Maintain at least 15 % free capacity as an operational buffer at steady state.
- Erasure Coding EC(4+2) on NVMe is the reference configuration for lakehouse data (50 % raw-capacity overhead compared to 200 % for triple replication).
Sizing formulas and reference tiers are in Operations — capacity planning.
Catalog: Hive Metastore to Iceberg REST Catalog
Target selection
Bifrost supports two target catalogs. Both implement the Iceberg REST catalog specification.
| Criterion | Apache Polaris | Apache Gravitino |
|---|---|---|
| Core model | Iceberg REST catalog plus Generic Tables for Hudi and Delta. | Federated catalog-of-catalogs. |
| Best fit | Primarily Hive-to-Iceberg migration. | Heterogeneous estate (Iceberg + HBase + Kafka + JDBC). |
| Engine support | Any Iceberg REST client. | Same; exposes Iceberg REST as a subset of its API. |
Decision rule: Polaris is the default when the estate is mostly Hive on HDFS. Gravitino is the default when substantial HBase, Kafka, or JDBC warehouses must be governed alongside Iceberg.
Conversion procedures
Three conversion procedures are available:
- Snapshot — safe, reversible. Creates an Iceberg table that reads source Hive files in place. No data copy. Best for validation and quick wins.
- Migrate — destructive in-place conversion. Renames the original table to `<name>_BACKUP_` (the backup suffix is appended, not prepended). Faster for large, stable tables.
- Add files — additive import into an existing Iceberg table. Best for tables needing re-partitioning.
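Assuming Spark SQL with Iceberg's stored procedures available, the three procedures correspond to calls like the following; the catalog, schema, and table names are placeholders:

```sql
-- Snapshot: reversible, reads the Hive files in place (no data copy).
CALL catalog.system.snapshot('default.web_logs', 'default.web_logs_iceberg');

-- Migrate: destructive in-place conversion; original kept as <name>_BACKUP_.
CALL catalog.system.migrate('default.web_logs');

-- Add files: additive import into an Iceberg table that already exists.
CALL catalog.system.add_files(
  table => 'default.web_logs_v2',
  source_table => 'default.web_logs');
```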
Bulk catalog migration
For thousands of tables with no data copy required, `bifrost modernize migrate-catalog` moves table references between catalogs in a single command:

```shell
bifrost modernize migrate-catalog \
  --source-type HIVE \
  --target-type REST \
  --target-uri https://polaris.example.internal/api/catalog \
  --identifier-file tables_wave3.txt
```
Known pitfalls
- Text, RC, and SequenceFile formats are blocked by `migrate`. Convert to Parquet or ORC first.
- Timezone drift: set `spark.sql.session.timeZone=UTC` to avoid inconsistent results between on-prem UTC and cloud Spark defaults.
- Hive Metastore in-flight commits must be stopped before cutover (hard-freeze window).
- AWS Glue targets cannot use `migrate` because Glue lacks RENAME; use `snapshot` plus rewrite instead.
Compute: YARN to Kubernetes
Target stack
| Component | Role |
|---|---|
| Spark Operator | Manages the SparkApplication CRD lifecycle. |
| Apache YuniKorn | Hierarchical queue scheduling with gang scheduling. |
| Apache Celeborn | Remote shuffle service that decouples shuffle from executor lifetime. |
| Ilum | Spark lifecycle management, Livy-compatible proxy for existing clients. |
| Apache Kyuubi | Multi-tenant SQL gateway speaking the HiveThrift protocol. |
Spark configuration translation
Bifrost translates Spark-on-YARN configurations to Spark-on-Kubernetes equivalents automatically. Key mappings:
| YARN setting | Kubernetes equivalent |
|---|---|
| `spark.master=yarn` | `spark.master=k8s://https://k8s-api:6443` |
| `spark.submit.deployMode=cluster` | Same (default for CRD submissions). |
| `spark.yarn.queue=production` | `spark.kubernetes.scheduler.name=yunikorn` plus `spark.kubernetes.driver.label.yunikorn.apache.org/queue=root.production.etl` and `spark.kubernetes.executor.label.yunikorn.apache.org/queue=root.production.etl`. YuniKorn propagates queue placement through Kubernetes pod labels; the `spark.kubernetes.*.label.*` Spark conf keys attach the label at submission time. Target a leaf queue (for example, `root.production.etl`), not a parent — YuniKorn rejects Spark apps submitted to a parent queue. The annotation form (`spark.kubernetes.*.annotation.yunikorn.apache.org/queue`) is retained for backward compatibility but is legacy. |
| `spark.executor.instances=10` | Same, or Dynamic Resource Allocation with `spark.dynamicAllocation.enabled=true`. |
| `spark.executor.memory=8g` | Same, plus `spark.kubernetes.executor.request.cores`. |
| `spark.yarn.dist.archives=...` | `spark.kubernetes.file.upload.path=s3a://...` |
| HDFS locality (NODE_LOCAL) | Not applicable. Use AQE, magic committer, and compute-storage co-location in the same availability zone. |
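Put together, a `SparkApplication` CRD for the Spark Operator carrying the YuniKorn queue labels might look like the following sketch; the name, image, and application path are placeholders:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-job                      # placeholder
  namespace: production
spec:
  type: Python
  mode: cluster
  image: spark:3.5.0                 # placeholder
  sparkVersion: "3.5.0"
  mainApplicationFile: s3a://jobs/etl.py   # placeholder
  sparkConf:
    spark.kubernetes.scheduler.name: "yunikorn"
  driver:
    cores: 1
    memory: 4g
    labels:
      yunikorn.apache.org/queue: root.production.etl   # leaf queue, not a parent
  executor:
    instances: 10
    cores: 4
    memory: 8g
    labels:
      yunikorn.apache.org/queue: root.production.etl
```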
Shuffle architecture
Shuffle is the single biggest architectural difference between YARN and Kubernetes. YARN has a cluster-wide External Shuffle Service built into NodeManagers. Kubernetes does not.
Decision rule: jobs shuffling more than 50 GB use the remote shuffle service (Celeborn); smaller jobs can use local-disk shuffle with PVC reuse.
The remote shuffle service decouples shuffle data from executor lifetime. This is what enables full Dynamic Resource Allocation on Kubernetes — executors can be added or removed during a job without losing shuffle data.
Data locality
Data locality disappears when moving from HDFS to S3A. Expect a 10 % to 30 % raw-scan slowdown compared to HDFS, typically offset by elastic scaling and cheaper storage. Mitigations:
- Adaptive Query Execution (AQE) — `spark.sql.adaptive.enabled=true` — handles skew and coalesces small partitions at runtime.
- Magic committer — eliminates rename-as-copy on writes.
- `spark.sql.files.maxPartitionBytes` between 128 MB and 256 MB.
- Co-locate compute and object storage in the same availability zone.
- Iceberg metadata pushdown (partition pruning, column pruning) to minimize scan width.
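As a sketch, the mitigations above map onto a `spark-defaults.conf` fragment along these lines; the partition-bytes value is one point in the suggested range, not a mandate:

```properties
# AQE: runtime skew handling and small-partition coalescing
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true

# Scan partition sizing for S3A reads (128 MB to 256 MB range)
spark.sql.files.maxPartitionBytes=256MB

# Magic committer and random-access read policy from the S3A bundle
spark.hadoop.fs.s3a.committer.name=magic
spark.hadoop.fs.s3a.input.fadvise=random
```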
Query Engine: Hive / Impala to Trino
Dual-tier architecture
Trino deploys as two clusters behind a Trino Gateway:
- Interactive tier — `retry-policy=NONE`, autoscaled on running-query JMX metrics, small JVMs (16 GB), high concurrency. Serves BI dashboards and ad-hoc SQL.
- ETL tier — `retry-policy=TASK` with an exchange manager on object storage, larger JVMs (32 GB), adaptive plan enabled. Serves long-running transformation jobs. Fault-tolerant execution adds around 34 % overhead on short queries but dramatically improves reliability for ETL.
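For illustration, the ETL tier's fault-tolerant execution comes down to a handful of Trino properties; the exchange bucket and region below are placeholders:

```properties
# config.properties (ETL tier)
retry-policy=TASK

# exchange-manager.properties — spool shuffle data to object storage
exchange-manager.name=filesystem
exchange.base-directories=s3://trino-exchange-spooling   # placeholder bucket
exchange.s3.region=us-east-1                             # placeholder region
```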
The gateway handles routing: queries are sent to the interactive or ETL cluster based on query properties and session settings, and the gateway enables seamless rolling upgrades of the clusters behind it.
Table redirection
Trino's table redirection transparently resolves queries on the legacy catalog to the Iceberg catalog for migrated tables. BI tools, dbt models, and application queries work unchanged during cutover. This is the critical migration feature — it is what lets Modernize run without breaking consumers.
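In Trino's Hive connector, redirection is enabled with a single catalog property; the catalog name `iceberg` below stands in for the deployment's actual Iceberg catalog:

```properties
# hive.properties — redirect reads of migrated tables to the Iceberg catalog
hive.iceberg-catalog-name=iceberg
```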
Authorization
OPA (Open Policy Agent) is the default authorization engine. OPA supports row-level filters and column masking, and is deployed as a sidecar in the coordinator pod.
Legacy policy systems can be preserved during migration to allow policy reuse, then sunset as OPA policies mature.
Workflows: Oozie to Airflow + dbt
Converter capabilities
Bifrost's workflow converter transforms Oozie workflows to Airflow 3 DAGs with dbt models extracted for SQL actions. Coverage by action type:
| Oozie action | Airflow equivalent | Automation |
|---|---|---|
| Hive / Hive2 | dbt model via Cosmos DbtTaskGroup | 85 to 90 % |
| Spark (Java / Scala) | SparkKubernetesOperator | 80 to 85 % |
| Shell | KubernetesPodOperator | 90 % |
| FS (mkdir, move, delete, chmod, touchz) | S3CreateObjectOperator, S3CopyObjectOperator, S3DeleteObjectsOperator, or a PythonOperator invoking boto3 for permissions | 75 % |
| Email | EmailOperator | 95 % |
| SSH | SSHOperator or KubernetesPodOperator | 80 % |
| DistCp | BashOperator invoking bifrost migrate-storage | 70 % |
| Sub-workflow | TriggerDagRunOperator | 75 % |
| Fork / Join | Airflow parallel task groups | 85 % |
| Decision | BranchPythonOperator | 70 % |
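The table above can be distilled into a lookup like the following. This is an illustrative sketch, not Bifrost's converter code: the helper name is hypothetical, and actions with several possible targets (for example, FS) are reduced to one representative operator.

```python
# Illustrative Oozie-action-to-Airflow-operator mapping (primary targets only).
OOZIE_TO_AIRFLOW = {
    "hive": "DbtTaskGroup",            # dbt model via Cosmos
    "hive2": "DbtTaskGroup",
    "spark": "SparkKubernetesOperator",
    "shell": "KubernetesPodOperator",
    "fs": "PythonOperator",            # boto3 calls; one of several S3 operators
    "email": "EmailOperator",
    "ssh": "SSHOperator",
    "distcp": "BashOperator",          # invokes bifrost migrate-storage
    "sub-workflow": "TriggerDagRunOperator",
    "decision": "BranchPythonOperator",
}

def target_operator(action_type: str) -> str:
    """Return the primary Airflow operator for an Oozie action type.

    Unknown action types fall back to KubernetesPodOperator, the most
    general container-based escape hatch.
    """
    return OOZIE_TO_AIRFLOW.get(action_type.lower(), "KubernetesPodOperator")
```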
Coordinator and bundle handling
- Simple time-driven coordinators (cron-like) — converted to the Airflow `schedule` parameter (`schedule_interval` was removed in Airflow 3). Automation: 50 to 65 %.
- Data-driven coordinators — produce TODO-annotated DAG stubs with Airflow Dataset triggers. Automation: 10 to 20 %.
- Bundles — not automated. Produce stub files with documentation. Automation: 0 %.
dbt integration
Hive and Hive2 SQL actions are extracted into dbt models under `models/legacy/<workflow>_<action>.sql` with Iceberg materialization config headers. dbt runs inside Airflow DAGs using Cosmos. The primary adapter is `dbt-trino`; `dbt-spark` covers transforms that specifically need Spark execution.
Cosmos execution mode is a deployment choice:
- Local / virtualenv — dbt runs inside the Airflow worker. Supports Airflow Dataset emission, so downstream DAGs can be triggered by model completion.
- Kubernetes — dbt runs as a `KubernetesPodOperator` task. Isolates dependencies per model but does not emit Airflow Datasets. Use time-based triggers or an explicit `TriggerDagRunOperator` for cross-DAG dependencies.
Bifrost's converter defaults to Local mode so Dataset-aware scheduling works. Customers can switch individual DAGs to Kubernetes mode when dependency isolation outweighs Dataset support.
All converter outputs target Airflow 3 (TaskFlow API, no SubDagOperator). Airflow 2 is not supported as an output target.
Airflow on Ilum
The target Airflow deployment is managed by the Ilum platform. For details on running Airflow alongside Ilum, including LivyOperator, SparkSubmitOperator, Git Sync, and OAuth2 integration, see Airflow integration.
Governance: Atlas to OpenMetadata
Two-step migration
Step 1 — Connect. Point OpenMetadata's native Atlas connector at the live Atlas instance. Ingests tables, databases, classifications, and glossary. Each ingested object carries an origin classification that marks it as Atlas-sourced; the exact classification name depends on the OpenMetadata release and the customer's classification hierarchy.
Step 2 — Rebuild lineage. Discard Atlas lineage graphs and rebuild from scratch via OpenLineage emitters on each engine:
- Spark — OpenLineage Spark listener attached to every Spark job.
- Airflow — OpenLineage Airflow provider on the Airflow deployment.
- Trino — OpenMetadata's Trino connector with query-log ingestion.
Do not attempt to preserve Atlas lineage graphs across the migration. The underlying graph model does not round-trip to OpenMetadata's relational model. Teams that try consistently spend weeks before accepting they must re-emit from the engines.
Division of labour
The catalog (Polaris or Gravitino) governs tables and data: RBAC on catalog operations, credential vending, schema registration. OpenMetadata governs metadata, people, processes, quality, and contracts: glossary, ownership, tiers, data tests, documentation. The two are complementary, not overlapping.
User Interface: HUE to Superset
What `bifrost modernize hue-import` does
The import tool reads HUE's document database, classifies documents by type (query-hive, query-impala, query-sparksql, oozie-workflow2, search-dashboard), maps users via OIDC, creates Superset saved queries via Superset's API, converts embedded Oozie workflows through the workflow converter, and produces a migration report.
| Item | Result |
|---|---|
| Saved Hive, Impala, or Spark SQL queries | Created as Superset saved queries. |
| Oozie workflow documents embedded in HUE | Converted via the workflow converter. |
| Permissions | Mapped via OIDC claims to Superset roles. |
| User mapping | Applied from a customer-provided mapping file. |
What must be rebuilt manually
- HUE's Solr-backed search dashboards (no Superset analog).
- File-browser bookmarks (no Superset analog).
- Editor autocomplete session details (session-specific).
Superset configuration
The Superset deployment on the target platform is pre-built with Trino, Kyuubi-Hive, and Postgres drivers installed. Authentication is via OIDC to Keycloak with AUTH_ROLES_MAPPING for group-to-role translation. For Trino queries, the trino-python-client SQLAlchemy dialect is used. See also Superset.
Data Store: HBase to Cassandra, ScyllaDB, or Bigtable
Honest assessment
HBase is the hardest migration leg with the lowest automation ceiling (40 to 55 %). HBase and Cassandra-family systems diverge in three fundamental ways:
- Keying. HBase has a single lexicographically-sorted row key. Cassandra hash-partitions on a compound primary key. Composite HBase row keys must be decomposed, which requires application access-pattern knowledge.
- Versioning. HBase per-cell multi-versioning (`VERSIONS => N`) has no Cassandra analog. Values must be flattened to latest-only or re-encoded as clustering rows.
- Coprocessors. HBase Observers, Endpoints, and Phoenix secondary indexes have no Cassandra equivalent. These must be reimplemented as application logic or Spark jobs.
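The keying problem can be made concrete with a small sketch. The composite key layout below (customer id, then a reversed timestamp for newest-first ordering, a common HBase idiom) is a hypothetical example chosen for illustration, not a general rule; real decomposition requires knowledge of each application's access patterns.

```python
# Decompose a composite HBase row key into Cassandra-style
# partition key and clustering column.
LONG_MAX = 2**63 - 1  # HBase keys often store Long.MAX_VALUE - ts for newest-first scans

def decompose_row_key(row_key: str):
    """Split 'customerId#reversedTs' into (partition_key, clustering_ts).

    Recovers the natural timestamp so Cassandra can cluster on it with
    ORDER BY ts DESC instead of relying on lexicographic key order.
    """
    customer_id, reversed_ts = row_key.split("#", 1)
    ts = LONG_MAX - int(reversed_ts)
    return customer_id, ts
```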
Target options
| Target | Best fit | Key advantage |
|---|---|---|
| Apache Cassandra | Multi-region active-active, cloud-neutral | Largest ecosystem, CQL standard. |
| ScyllaDB | Latency-sensitive workloads | CQL-compatible, 2 to 10x lower latency, fewer nodes. |
| Google Cloud Bigtable | Customers accepting a cloud target | HBase API-compatible, zero schema translation. |
Data migration pipeline
- Bulk export — HBase snapshot via a table-snapshot input format (bypasses region servers).
- Transform — Spark decomposes row keys into partition and clustering keys, flattens versions.
- Bulk load — `CQLSSTableWriter` with `sstableloader` for throughput, or a Spark-Cassandra connector for simpler cases.
- CDC bridge — custom HBase replication endpoint, Kafka, and Cassandra sink for cutover (preserves timestamps).
- Reconciliation — row-level diff against the source.
Automation ceilings per aspect
| Aspect | Ceiling |
|---|---|
| Schema introspection, bulk data movement | 80 to 95 % |
| CQL DDL generation, CDC scaffolding | 60 to 80 % |
| Row-key decomposition | 30 to 50 % |
| Query-pattern-correct CQL schema | 0 to 30 % |
| Coprocessor replacement | 0 % |
Because of the low ceiling, HBase migration is a separate program track with its own review gates, data-model workshops, and application-side tasks. It runs in parallel with table migration; it does not block it.
Security: Ranger to OPA
Policy translator
Bifrost provides a Ranger-to-OPA policy translator that converts Ranger JSON exports to OPA policy files. The translator handles:
- Database, table, and column-level GRANT and REVOKE policies.
- User and group mappings (translated to OIDC claims).
- Row-level filters (translated to OPA data filters).
- Column masking rules.
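As an illustration of the translation target (not the translator's actual output), a Ranger table-level SELECT grant for a group might become a Rego rule like the following. The input document shape assumed here follows the Trino OPA plugin's conventions; verify it against the deployed plugin version:

```rego
package trino.authz

import rego.v1

default allow := false

# Ranger policy: group "analysts" may SELECT from sales.orders.
allow if {
    "analysts" in input.context.identity.groups
    input.action.operation == "SelectFromColumns"
    input.action.resource.table.schemaName == "sales"
    input.action.resource.table.tableName == "orders"
}
```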
Limitations
- Tag-based policies (Ranger with Atlas integration) require manual translation.
- Ranger audit logs are not translated; the equivalent on the target platform is OPA decision logging plus the OpenMetadata audit trail.
OPA runs as a sidecar to the Trino coordinator (localhost traffic) and is the default authorization backend on the target platform. Legacy Ranger can be preserved during migration for reuse, then sunset when OPA policies reach parity.
Components Out of Scope (Discover-Only)
The following legacy components are inventoried during discover but not automatically migrated. Bifrost produces a report; porting or replacement is a manual workstream.
| Component | Bifrost does | Recommended handling |
|---|---|---|
| Sqoop | Inventories Sqoop jobs and connection configs in the discovery report. | Rewrite to Airflow with JdbcToS3Operator or Spark JDBC ingestion. |
| Flume | Inventories Flume agents, sources, sinks, and channels. | Replace with Kafka Connect, Airbyte, or Spark Structured Streaming. |
| Kudu | Inventories Kudu tables and schemas. | Choose by workload: Apache Pinot for real-time aggregation and OLAP, ScyllaDB or Cassandra for high-update primary-key workloads, Iceberg with streaming writers for append-only analytical data. Iceberg is not a drop-in for Kudu's random-update primary-key patterns. |
| Phoenix | Inventories Phoenix secondary indexes and views on HBase. | Reimplement indexes as Solr / OpenSearch sidecars or as Cassandra materialized views during the HBase track. |
| Solr (for search, outside Ranger / Atlas) | Inventories collections and schemas. | Migrate to OpenSearch; schema translation is mostly mechanical but analyzer pipelines may need review. |
| Knox | Inventories Knox topologies and service mappings. | Replace with a Kubernetes ingress and OIDC at Trino / Airflow / Superset; no direct equivalent. |
| HDFS Federation / Router-Based Federation | Inventories federated namespaces. | The target object storage is flat; federated paths collapse into bucket prefixes during migration. |
| Workload XM (CDP) | Flagged during direct discover dependency analysis. | No direct equivalent; use Grafana + Trino query-log dashboards. |
| Cloudera Data Visualization (CDP) | Flagged during direct discover. | Migrate to Superset via hue-import for compatible saved queries; dashboards require manual rebuild. |
| Shared Data Experience (SDX) (CDP) | Flagged during direct discover. | SDX overlaps with Polaris + OpenMetadata + OPA; mapping is customer-specific and manual. |
Next Steps
- Validation and rollback — how Bifrost confirms each leg is correct.
- Operations — monitoring, capacity, backup, upgrades.
- Troubleshooting — common issues per component plus version compatibility.