Hadoop Component Migration: HDFS, YARN, Hive, Oozie, Ranger
Modernize and Direct migrations consist of several parallel legs. This page is a single reference for every leg, from HDFS-to-object-storage through HBase-to-Cassandra. Each section covers target choices, what Bifrost automates, configuration highlights, and known limitations.
Storage: HDFS to S3A Object Storage
Target
The target is an S3A-compatible object storage endpoint. Bifrost's reference deployment uses Ceph via its RadosGW interface, which provides block, file, and object storage from a single cluster. Cloud deployments can use the native S3 endpoint of the chosen cloud provider.
The object storage underpins three use cases on the target platform:
- Block (RBD) — stateful Kubernetes applications (Postgres, Redis, Kafka persistence).
- File (CephFS) — shared scratch space for pipelines, dbt project files, Airflow DAG storage.
- Object (S3A) — all lakehouse data: Iceberg tables, Parquet and ORC files, Spark checkpoints.
Tuned S3A configuration bundle
Bifrost ships a pre-tuned core-site.xml bundle that closes the majority of S3A-on-object-storage friction. These properties are set automatically on the target platform and on every Spark job Bifrost submits:
| Property | Value | Rationale |
|---|---|---|
| `fs.s3a.path.style.access` | `true` | Required for many on-prem S3 endpoints (no virtual-host DNS). |
| `fs.s3a.committer.name` | `magic` | Eliminates rename-as-copy on writes. |
| `fs.s3a.input.fadvise` | `random` | Optimal for Parquet and ORC column reads. The legacy alias `fs.s3a.experimental.input.fadvise` is still accepted on Hadoop 3.3 or later but emits a deprecation warning. |
| `fs.s3a.multipart.size` | `64M` | Balance between throughput and memory. |
| `fs.s3a.connection.maximum` | `200` | Sufficient for heavy parallel I/O. |
| `fs.s3a.fast.upload.buffer` | `disk` | Avoids OOM on large uploads. |
| `fs.s3a.threads.max` | `64` | Matches the connection pool. |
| `fs.s3a.endpoint` | Cluster-specific | The object storage endpoint URL. |
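On disk, the bundle is an ordinary `core-site.xml` fragment. A sketch of the equivalent XML follows; the endpoint URL is a placeholder, not a real cluster value:

```xml
<!-- Excerpt of the tuned S3A bundle. The endpoint value is a placeholder. -->
<configuration>
  <property><name>fs.s3a.path.style.access</name><value>true</value></property>
  <property><name>fs.s3a.committer.name</name><value>magic</value></property>
  <property><name>fs.s3a.input.fadvise</name><value>random</value></property>
  <property><name>fs.s3a.multipart.size</name><value>64M</value></property>
  <property><name>fs.s3a.connection.maximum</name><value>200</value></property>
  <property><name>fs.s3a.fast.upload.buffer</name><value>disk</value></property>
  <property><name>fs.s3a.threads.max</name><value>64</value></property>
  <property><name>fs.s3a.endpoint</name><value>https://s3.example.internal</value></property>
</configuration>
```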
Migration methods
| Method | Use case | Throughput |
|---|---|---|
| DistCp with magic committer | Bulk historical data | 500 MB/s to 2 GB/s per cluster |
| Continuous replication (vendor tools) | Live data that changes continuously | Near real-time |
| Incremental DistCp sync | Periodic catch-up for slowly-changing data | Same as DistCp |
| HDFS-compatible bridge (JuiceFS) | Applications requiring native HDFS SDK | HDFS-equivalent |
| Two-step via Apache Ozone | Legacy estates running Ozone | HDFS-equivalent |
Tuning tip: fewer mappers at higher bandwidth per mapper outperform many mappers against the same endpoint. Raise mapper count only until HTTP 503 (SlowDown) responses start appearing; beyond that, aggregate throughput drops.
Checksum validation caveat: HDFS uses MD5-of-CRC32C checksums that are incompatible with S3 ETags. Validate via row-count parity on Parquet or ORC files rather than byte-level CRC comparison. Bifrost's validation framework handles this automatically.
DistCp failure recovery
Bifrost runs DistCp with `-i -log /path/_distcp_logs`. The `-i` option ignores per-file failures instead of aborting the whole job; the `-log` option writes a per-file outcome log to the specified directory. The storage migration playbook then:
- Reads the failure log after the initial pass.
- Retries failed files up to 3 times with exponential back-off.
- Reports any files that still fail after the retries for manual investigation.
- Runs row-count validation on the source and target directory trees; any count mismatch is surfaced before the storage migration is considered complete.
Partial-failure output is preserved under `/var/log/bifrost/transcripts/<cluster>/<run-id>/distcp/` and includes the full per-file DistCp log.
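The retry step above can be sketched as a small loop with exponential back-off. This is illustrative only: the real playbook re-drives `hadoop distcp`, and the function names and constants here are hypothetical, not Bifrost interfaces.

```python
# Sketch of retry-with-exponential-backoff for files that failed a DistCp pass.
import time

MAX_RETRIES = 3        # retry failed files up to 3 times
BASE_BACKOFF_S = 5     # first back-off interval; doubles on each attempt

def retry_failed_files(failed_paths, copy_fn, sleep_fn=time.sleep):
    """Retry each failed path with exponential back-off.

    copy_fn(path) returns True on success. Returns the list of paths that
    still fail after MAX_RETRIES attempts, for manual investigation.
    """
    still_failing = []
    for path in failed_paths:
        for attempt in range(MAX_RETRIES):
            if copy_fn(path):
                break
            sleep_fn(BASE_BACKOFF_S * (2 ** attempt))  # 5 s, 10 s, 20 s
        else:
            still_failing.append(path)  # exhausted all retries
    return still_failing
```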
Capacity guardrails
- Target operating ceiling: 80 % pool utilization. Performance degrades sharply above 85 %.
- Production readiness requires the pool to be below 70 % before the first migration wave, to leave headroom for incoming data.
- Maintain at least 15 % free capacity as an operational buffer at steady state.
- Erasure Coding EC(4+2) on NVMe is the reference configuration for lakehouse data (50 % raw-capacity overhead compared to 200 % for triple replication).
Sizing formulas and reference tiers are in Operations — capacity planning.
Catalog: Hive Metastore to Iceberg REST Catalog
Target selection
Bifrost supports two target catalogs. Both implement the Iceberg REST catalog specification.
| Criterion | Apache Polaris | Apache Gravitino |
|---|---|---|
| Core model | Iceberg REST catalog plus Generic Tables for Hudi and Delta. | Federated catalog-of-catalogs. |
| Best fit | Primarily Hive-to-Iceberg migration. | Heterogeneous estate (Iceberg + HBase + Kafka + JDBC). |
| Engine support | Any Iceberg REST client. | Same; exposes Iceberg REST as a subset of its API. |
Decision rule: Polaris is the default when the estate is mostly Hive on HDFS. Gravitino is the default when substantial HBase, Kafka, or JDBC warehouses must be governed alongside Iceberg.
Conversion procedures
Three conversion procedures are available:
- Snapshot — safe, reversible. Creates an Iceberg table that reads source Hive files in place. No data copy. Best for validation and quick wins.
- Migrate — destructive in-place conversion. Renames the original table to `<name>_BACKUP_` (the backup suffix is appended, not prepended). Faster for large, stable tables.
- Add files — additive import into an existing Iceberg table. Best for tables needing re-partitioning.
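Assuming Spark SQL with Iceberg's stored procedures available, the three procedures correspond to calls like the following; the catalog, schema, and table names are placeholders:

```sql
-- Snapshot: reversible, reads the Hive files in place (no data copy).
CALL catalog.system.snapshot('default.web_logs', 'default.web_logs_iceberg');

-- Migrate: destructive in-place conversion; original kept as <name>_BACKUP_.
CALL catalog.system.migrate('default.web_logs');

-- Add files: additive import into an Iceberg table that already exists.
CALL catalog.system.add_files(
  table => 'default.web_logs_v2',
  source_table => 'default.web_logs');
```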
Bulk catalog migration
For thousands of tables with no data copy required, `bifrost modernize migrate-catalog` moves table references between catalogs in a single command:

```shell
bifrost modernize migrate-catalog \
  --source-type HIVE \
  --target-type REST \
  --target-uri https://polaris.example.internal/api/catalog \
  --identifier-file tables_wave3.txt
```
Known pitfalls
- Text, RC, and SequenceFile formats are blocked by `migrate`. Convert to Parquet or ORC first.
- Timezone drift: set `spark.sql.session.timeZone=UTC` to avoid inconsistent results between on-prem UTC and cloud Spark defaults.
- Hive Metastore in-flight commits must be stopped before cutover (hard-freeze window).
- AWS Glue targets cannot use `migrate` because Glue lacks RENAME; use `snapshot` plus rewrite instead.
Compute: YARN to Kubernetes
Target stack
| Component | Role |
|---|---|
| Spark Operator | Manages the SparkApplication CRD lifecycle. |
| Apache YuniKorn | Hierarchical queue scheduling with gang scheduling. |
| Apache Celeborn | Remote shuffle service that decouples shuffle from executor lifetime. |
| Ilum | Spark lifecycle management, Livy-compatible proxy for existing clients. |
| Apache Kyuubi | Multi-tenant SQL gateway speaking the HiveThrift protocol. |
Spark configuration translation
Bifrost translates Spark-on-YARN configurations to Spark-on-Kubernetes equivalents automatically. Key mappings:
| YARN setting | Kubernetes equivalent |
|---|---|
| `spark.master=yarn` | `spark.master=k8s://https://k8s-api:6443` |
| `spark.submit.deployMode=cluster` | Same (default for CRD submissions). |
| `spark.yarn.queue=production` | `spark.kubernetes.scheduler.name=yunikorn` plus `spark.kubernetes.driver.label.yunikorn.apache.org/queue=root.production.etl` and `spark.kubernetes.executor.label.yunikorn.apache.org/queue=root.production.etl`. YuniKorn propagates queue placement through Kubernetes pod labels; the `spark.kubernetes.*.label.*` Spark conf keys attach the label at submission time. Target a leaf queue (for example, `root.production.etl`), not a parent — YuniKorn rejects Spark apps submitted to a parent queue. The annotation form (`spark.kubernetes.*.annotation.yunikorn.apache.org/queue`) is retained for backward compatibility but is legacy. |
| `spark.executor.instances=10` | Same, or Dynamic Resource Allocation with `spark.dynamicAllocation.enabled=true`. |
| `spark.executor.memory=8g` | Same, plus `spark.kubernetes.executor.request.cores`. |
| `spark.yarn.dist.archives=...` | `spark.kubernetes.file.upload.path=s3a://...` |
| HDFS locality (NODE_LOCAL) | Not applicable. Use AQE, magic committer, and compute-storage co-location in the same availability zone. |
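Put together, a `SparkApplication` CRD for the Spark Operator carrying the YuniKorn queue labels might look like the following sketch; the name, image, and application path are placeholders:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-job                      # placeholder
  namespace: production
spec:
  type: Python
  mode: cluster
  image: spark:3.5.0                 # placeholder
  sparkVersion: "3.5.0"
  mainApplicationFile: s3a://jobs/etl.py   # placeholder
  sparkConf:
    spark.kubernetes.scheduler.name: "yunikorn"
  driver:
    cores: 1
    memory: 4g
    labels:
      yunikorn.apache.org/queue: root.production.etl   # leaf queue, not a parent
  executor:
    instances: 10
    cores: 4
    memory: 8g
    labels:
      yunikorn.apache.org/queue: root.production.etl
```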
Shuffle architecture
Shuffle is the single biggest architectural difference between YARN and Kubernetes. YARN has a cluster-wide External Shuffle Service built into NodeManagers. Kubernetes does not.
Decision rule: jobs shuffling more than 50 GB use the remote shuffle service (Celeborn); smaller jobs can use local-disk shuffle with PVC reuse.
The remote shuffle service decouples shuffle data from executor lifetime. This is what enables full Dynamic Resource Allocation on Kubernetes — executors can be added or removed during a job without losing shuffle data.
Data locality
Data locality disappears when moving from HDFS to S3A. Expect a 10 % to 30 % raw-scan slowdown compared to HDFS, typically offset by elastic scaling and cheaper storage. Mitigations:
- Adaptive Query Execution (AQE) — `spark.sql.adaptive.enabled=true` — handles skew and coalesces small partitions at runtime.
- Magic committer — eliminates rename-as-copy on writes.
- `spark.sql.files.maxPartitionBytes` between 128 MB and 256 MB.
- Co-locate compute and object storage in the same availability zone.
- Iceberg metadata pushdown (partition pruning, column pruning) to minimize scan width.
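As a sketch, the mitigations above map onto a `spark-defaults.conf` fragment along these lines; the partition-bytes value is one point in the suggested range, not a mandate:

```properties
# AQE: runtime skew handling and small-partition coalescing
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true

# Scan partition sizing for S3A reads (128 MB to 256 MB range)
spark.sql.files.maxPartitionBytes=256MB

# Magic committer and random-access read policy from the S3A bundle
spark.hadoop.fs.s3a.committer.name=magic
spark.hadoop.fs.s3a.input.fadvise=random
```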
Query Engine: Hive / Impala to Trino
Dual-tier architecture
Trino deploys as two clusters behind a Trino Gateway:
- Interactive tier — `retry-policy=NONE`, autoscaled on running-query JMX metrics, small JVMs (16 GB), high concurrency. Serves BI dashboards and ad-hoc SQL.
- ETL tier — `retry-policy=TASK` with an exchange manager on object storage, larger JVMs (32 GB), adaptive plan enabled. Serves long-running transformation jobs. Fault-tolerant execution adds around 34 % overhead on short queries but dramatically improves reliability for ETL.
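For illustration, the ETL tier's fault-tolerant execution comes down to a handful of Trino properties; the exchange bucket and region below are placeholders:

```properties
# config.properties (ETL tier)
retry-policy=TASK

# exchange-manager.properties — spool shuffle data to object storage
exchange-manager.name=filesystem
exchange.base-directories=s3://trino-exchange-spooling   # placeholder bucket
exchange.s3.region=us-east-1                             # placeholder region
```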
The gateway handles routing: queries are sent to the interactive or ETL cluster based on query properties and session settings, and the gateway enables seamless rolling upgrades of the clusters behind it.
Table redirection
Trino's table redirection transparently resolves queries on the legacy catalog to the Iceberg catalog for migrated tables. BI tools, dbt models, and application queries work unchanged during cutover. This is the critical migration feature — it is what lets Modernize run without breaking consumers.
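In Trino's Hive connector, redirection is enabled with a single catalog property; the catalog name `iceberg` below stands in for the deployment's actual Iceberg catalog:

```properties
# hive.properties — redirect reads of migrated tables to the Iceberg catalog
hive.iceberg-catalog-name=iceberg
```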
Authorization
OPA (Open Policy Agent) is the default authorization engine. OPA supports row-level filters and column masking, and is deployed as a sidecar in the coordinator pod.
Legacy policy systems can be preserved during migration to allow policy reuse, then sunset as OPA policies mature.
Workflows: Oozie to Airflow + dbt
Converter capabilities
Bifrost's workflow converter transforms Oozie workflows to Airflow 3 DAGs with dbt models extracted for SQL actions. Coverage by action type:
| Oozie action | Airflow equivalent | Automation |
|---|---|---|
| Hive / Hive2 | dbt model via Cosmos DbtTaskGroup | 85 to 90 % |
| Spark (Java / Scala) | SparkKubernetesOperator | 80 to 85 % |
| Shell | KubernetesPodOperator | 90 % |
| FS (mkdir, move, delete, chmod, touchz) | S3CreateObjectOperator, S3CopyObjectOperator, S3DeleteObjectsOperator, or a PythonOperator invoking boto3 for permissions | 75 % |
| Email | EmailOperator | 95 % |
| SSH | SSHOperator or KubernetesPodOperator | 80 % |
| DistCp | BashOperator invoking bifrost migrate-storage | 70 % |
| Sub-workflow | TriggerDagRunOperator | 75 % |
| Fork / Join | Airflow parallel task groups | 85 % |
| Decision | BranchPythonOperator | 70 % |
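The table above can be distilled into a lookup like the following. This is an illustrative sketch, not Bifrost's converter code: the helper name is hypothetical, and actions with several possible targets (for example, FS) are reduced to one representative operator.

```python
# Illustrative Oozie-action-to-Airflow-operator mapping (primary targets only).
OOZIE_TO_AIRFLOW = {
    "hive": "DbtTaskGroup",            # dbt model via Cosmos
    "hive2": "DbtTaskGroup",
    "spark": "SparkKubernetesOperator",
    "shell": "KubernetesPodOperator",
    "fs": "PythonOperator",            # boto3 calls; one of several S3 operators
    "email": "EmailOperator",
    "ssh": "SSHOperator",
    "distcp": "BashOperator",          # invokes bifrost migrate-storage
    "sub-workflow": "TriggerDagRunOperator",
    "decision": "BranchPythonOperator",
}

def target_operator(action_type: str) -> str:
    """Return the primary Airflow operator for an Oozie action type.

    Unknown action types fall back to KubernetesPodOperator, the most
    general container-based escape hatch.
    """
    return OOZIE_TO_AIRFLOW.get(action_type.lower(), "KubernetesPodOperator")
```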
Coordinator and bundle handling
- Simple time-driven coordinators (cron-like) — converted to the Airflow `schedule` parameter (`schedule_interval` was removed in Airflow 3). Automation: 50 to 65 %.
- Data-driven coordinators — produce TODO-annotated DAG stubs with Airflow Dataset triggers. Automation: 10 to 20 %.
- Bundles — not automated. Produce stub files with documentation. Automation: 0 %.
dbt integration
Hive and Hive2 SQL actions are extracted into dbt models under `models/legacy/<workflow>_<action>.sql` with Iceberg materialization config headers. dbt runs inside Airflow DAGs using Cosmos. The primary adapter is `dbt-trino`; `dbt-spark` covers transforms that specifically need Spark execution.
Cosmos execution mode is a deployment choice:
- Local / virtualenv — dbt runs inside the Airflow worker. Supports Airflow Dataset emission, so downstream DAGs can be triggered by model completion.
- Kubernetes — dbt runs as a `KubernetesPodOperator` task. Isolates dependencies per model but does not emit Airflow Datasets. Use time-based triggers or an explicit `TriggerDagRunOperator` for cross-DAG dependencies.
Bifrost's converter defaults to Local mode so Dataset-aware scheduling works. Customers can switch individual DAGs to Kubernetes mode when dependency isolation outweighs Dataset support.
All converter outputs target Airflow 3 (TaskFlow API, no SubDagOperator). Airflow 2 is not supported as an output target.
Airflow on Ilum
The target Airflow deployment is managed by the Ilum platform. For details on running Airflow alongside Ilum, including LivyOperator, SparkSubmitOperator, Git Sync, and OAuth2 integration, see Airflow integration.
Governance: Atlas to OpenMetadata
Two-step migration
Step 1 — Connect. Point OpenMetadata's native Atlas connector at the live Atlas instance. Ingests tables, databases, classifications, and glossary. Each ingested object carries an origin classification that marks it as Atlas-sourced; the exact classification name depends on the OpenMetadata release and the customer's classification hierarchy.
Step 2 — Rebuild lineage. Discard Atlas lineage graphs and rebuild from scratch via OpenLineage emitters on each engine:
- Spark — OpenLineage Spark listener attached to every Spark job.
- Airflow — OpenLineage Airflow provider on the Airflow deployment.
- Trino — OpenMetadata's Trino connector with query-log ingestion.
Do not attempt to preserve Atlas lineage graphs across the migration. The underlying graph model does not round-trip to OpenMetadata's relational model. Teams that try consistently spend weeks before accepting they must re-emit from the engines.
Division of labour
The catalog (Polaris or Gravitino) governs tables and data: RBAC on catalog operations, credential vending, schema registration. OpenMetadata governs metadata, people, processes, quality, and contracts: glossary, ownership, tiers, data tests, documentation. The two are complementary, not overlapping.
User Interface: HUE to Superset
What `bifrost modernize hue-import` does
The import tool reads HUE's document database, classifies documents by type (query-hive, query-impala, query-sparksql, oozie-workflow2, search-dashboard), maps users via OIDC, creates Superset saved queries via Superset's API, converts embedded Oozie workflows through the workflow converter, and produces a migration report.
| Item | Result |
|---|---|
| Saved Hive, Impala, or Spark SQL queries | Created as Superset saved queries. |
| Oozie workflow documents embedded in HUE | Converted via the workflow converter. |
| Permissions | Mapped via OIDC claims to Superset roles. |
| User mapping | Applied from a customer-provided mapping file. |
What must be rebuilt manually
- HUE's Solr-backed search dashboards (no Superset analog).
- File-browser bookmarks (no Superset analog).
- Editor autocomplete session details (session-specific).
Superset configuration
The Superset deployment on the target platform is pre-built with Trino, Kyuubi-Hive, and Postgres drivers installed. Authentication is via OIDC to Keycloak with AUTH_ROLES_MAPPING for group-to-role translation. For Trino queries, the trino-python-client SQLAlchemy dialect is used. See also Superset.
Data Store: HBase to Cassandra, ScyllaDB, or Bigtable
Honest assessment
HBase is the hardest migration leg with the lowest automation ceiling (40 to 55 %). HBase and Cassandra-family systems diverge in three fundamental ways:
- Keying. HBase has a single lexicographically-sorted row key. Cassandra hash-partitions on a compound primary key. Composite HBase row keys must be decomposed, which requires application access-pattern knowledge.
- Versioning. HBase per-cell multi-versioning (`VERSIONS => N`) has no Cassandra analog. Values must be flattened to latest-only or re-encoded as clustering rows.
- Coprocessors. HBase Observers, Endpoints, and Phoenix secondary indexes have no Cassandra equivalent. These must be reimplemented as application logic or Spark jobs.
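The keying problem can be made concrete with a small sketch. The composite key layout below (customer id, then a reversed timestamp for newest-first ordering, a common HBase idiom) is a hypothetical example chosen for illustration, not a general rule; real decomposition requires knowledge of each application's access patterns.

```python
# Decompose a composite HBase row key into Cassandra-style
# partition key and clustering column.
LONG_MAX = 2**63 - 1  # HBase keys often store Long.MAX_VALUE - ts for newest-first scans

def decompose_row_key(row_key: str):
    """Split 'customerId#reversedTs' into (partition_key, clustering_ts).

    Recovers the natural timestamp so Cassandra can cluster on it with
    ORDER BY ts DESC instead of relying on lexicographic key order.
    """
    customer_id, reversed_ts = row_key.split("#", 1)
    ts = LONG_MAX - int(reversed_ts)
    return customer_id, ts
```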
Target options
| Target | Best fit | Key advantage |
|---|---|---|
| Apache Cassandra | Multi-region active-active, cloud-neutral | Largest ecosystem, CQL standard. |
| ScyllaDB | Latency-sensitive workloads | CQL-compatible, 2 to 10x lower latency, fewer nodes. |
| Google Cloud Bigtable | Customers accepting a cloud target | HBase API-compatible, zero schema translation. |
Data migration pipeline
- Bulk export — HBase snapshot via a table-snapshot input format (bypasses region servers).
- Transform — Spark decomposes row keys into partition and clustering keys, flattens versions.
- Bulk load — `CQLSSTableWriter` with `sstableloader` for throughput, or a Spark-Cassandra connector for simpler cases.
- CDC bridge — custom HBase replication endpoint, Kafka, and Cassandra sink for cutover (preserves timestamps).
- Reconciliation — row-level diff against the source.
Automation ceilings per aspect
| Aspect | Ceiling |
|---|---|
| Schema introspection, bulk data movement | 80 to 95 % |
| CQL DDL generation, CDC scaffolding | 60 to 80 % |
| Row-key decomposition | 30 to 50 % |
| Query-pattern-correct CQL schema | 0 to 30 % |
| Coprocessor replacement | 0 % |
Because of the low ceiling, HBase migration is a separate program track with its own review gates, data-model workshops, and application-side tasks. It runs in parallel with table migration; it does not block it.
Security: Ranger to OPA
Policy translator
Bifrost provides a Ranger-to-OPA policy translator that converts Ranger JSON exports to OPA policy files. The translator handles:
- Database, table, and column-level GRANT and REVOKE policies.
- User and group mappings (translated to OIDC claims).
- Row-level filters (translated to OPA data filters).
- Column masking rules.
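As an illustration of the translation target (not the translator's actual output), a Ranger table-level SELECT grant for a group might become a Rego rule like the following. The input document shape assumed here follows the Trino OPA plugin's conventions; verify it against the deployed plugin version:

```rego
package trino.authz

import rego.v1

default allow := false

# Ranger policy: group "analysts" may SELECT from sales.orders.
allow if {
    "analysts" in input.context.identity.groups
    input.action.operation == "SelectFromColumns"
    input.action.resource.table.schemaName == "sales"
    input.action.resource.table.tableName == "orders"
}
```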
Limitations
- Tag-based policies (Ranger with Atlas integration) require manual translation.
- Ranger audit logs are not translated; the equivalent on the target platform is OPA decision logging plus the OpenMetadata audit trail.
OPA runs as a sidecar to the Trino coordinator (localhost traffic) and is the default authorization backend on the target platform. Legacy Ranger can be preserved during migration for reuse, then sunset when OPA policies reach parity.
Components Out of Scope (Discover-Only)
The following legacy components are inventoried during discover but not automatically migrated. Bifrost produces a report; porting or replacement is a manual workstream.
| Component | Bifrost does | Recommended handling |
|---|---|---|
| Sqoop | Inventories Sqoop jobs and connection configs in the discovery report. | Rewrite to Airflow with JdbcToS3Operator or Spark JDBC ingestion. |
| Flume | Inventories Flume agents, sources, sinks, and channels. | Replace with Kafka Connect, Airbyte, or Spark Structured Streaming. |
| Kudu | Inventories Kudu tables and schemas. | Choose by workload: Apache Pinot for real-time aggregation and OLAP, ScyllaDB or Cassandra for high-update primary-key workloads, Iceberg with streaming writers for append-only analytical data. Iceberg is not a drop-in for Kudu's random-update primary-key patterns. |
| Phoenix | Inventories Phoenix secondary indexes and views on HBase. | Reimplement indexes as Solr / OpenSearch sidecars or as Cassandra materialized views during the HBase track. |
| Solr (for search, outside Ranger / Atlas) | Inventories collections and schemas. | Migrate to OpenSearch; schema translation is mostly mechanical but analyzer pipelines may need review. |
| Knox | Inventories Knox topologies and service mappings. | Replace with a Kubernetes ingress and OIDC at Trino / Airflow / Superset; no direct equivalent. |
| HDFS Federation / Router-Based Federation | Inventories federated namespaces. | The target object storage is flat; federated paths collapse into bucket prefixes during migration. |
| Workload XM (CDP) | Flagged during direct discover dependency analysis. | No direct equivalent; use Grafana + Trino query-log dashboards. |
| Cloudera Data Visualization (CDP) | Flagged during direct discover. | Migrate to Superset via hue-import for compatible saved queries; dashboards require manual rebuild. |
| Shared Data Experience (SDX) (CDP) | Flagged during direct discover. | SDX overlaps with Polaris + OpenMetadata + OPA; mapping is customer-specific and manual. |
Next Steps
- Validation and rollback — how Bifrost confirms each leg is correct.
- Operations — monitoring, capacity, backup, upgrades.
- Troubleshooting — common issues per component plus version compatibility.