Hadoop Component Migration: HDFS, YARN, Hive, Oozie, Ranger

Modernize and Direct migrations are composed of several parallel legs. This page is a single reference for every leg, from HDFS-to-object-storage through HBase-to-Cassandra. Each section covers target choices, what Bifrost automates, configuration highlights, and known limitations.

Storage: HDFS to S3A Object Storage

Target

The target is an S3A-compatible object storage endpoint. Bifrost's reference deployment uses Ceph via its RadosGW interface, which provides block, file, and object storage from a single cluster. Cloud deployments can use the native S3 endpoint of the chosen cloud provider.

The object storage underpins three use cases on the target platform:

  • Block (RBD) — stateful Kubernetes applications (Postgres, Redis, Kafka persistence).
  • File (CephFS) — shared scratch space for pipelines, dbt project files, Airflow DAG storage.
  • Object (S3A) — all lakehouse data: Iceberg tables, Parquet and ORC files, Spark checkpoints.

Tuned S3A configuration bundle

Bifrost ships a pre-tuned core-site.xml bundle that closes the majority of S3A-on-object-storage friction. These properties are set automatically on the target platform and on every Spark job Bifrost submits:

| Property | Value | Rationale |
| --- | --- | --- |
| fs.s3a.path.style.access | true | Required for many on-prem S3 endpoints (no virtual-host DNS). |
| fs.s3a.committer.name | magic | Eliminates rename-as-copy on writes. |
| fs.s3a.input.fadvise | random | Optimal for Parquet and ORC column reads. The legacy alias fs.s3a.experimental.input.fadvise is still accepted on Hadoop 3.3 or later but emits a deprecation warning. |
| fs.s3a.multipart.size | 64M | Balance between throughput and memory. |
| fs.s3a.connection.maximum | 200 | Sufficient for heavy parallel I/O. |
| fs.s3a.fast.upload.buffer | disk | Avoids OOM on large uploads. |
| fs.s3a.threads.max | 64 | Match the connection pool. |
| fs.s3a.endpoint | Cluster-specific | The object storage endpoint URL. |
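As a sketch, the bundle corresponds to a core-site.xml fragment along these lines (the endpoint value is a placeholder; Bifrost sets the real cluster-specific URL):

```xml
<!-- Tuned S3A settings for object storage (endpoint is a placeholder) -->
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.committer.name</name>
  <value>magic</value>
</property>
<property>
  <name>fs.s3a.input.fadvise</name>
  <value>random</value>
</property>
<property>
  <name>fs.s3a.multipart.size</name>
  <value>64M</value>
</property>
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>200</value>
</property>
<property>
  <name>fs.s3a.fast.upload.buffer</name>
  <value>disk</value>
</property>
<property>
  <name>fs.s3a.threads.max</name>
  <value>64</value>
</property>
<property>
  <name>fs.s3a.endpoint</name>
  <value>https://objectstore.example.internal</value>
</property>
```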

Migration methods

| Method | Use case | Throughput |
| --- | --- | --- |
| DistCp with magic committer | Bulk historical data | 500 MB/s to 2 GB/s per cluster |
| Continuous replication (vendor tools) | Live data that changes continuously | Near real-time |
| Incremental DistCp sync | Periodic catch-up for slowly-changing data | Same as DistCp |
| HDFS-compatible bridge (JuiceFS) | Applications requiring native HDFS SDK | HDFS-equivalent |
| Two-step via Apache Ozone | Legacy estates running Ozone | HDFS-equivalent |

Tuning tip: fewer mappers at higher bandwidth per mapper outperform many mappers against the same endpoint. Raise mapper count only until HTTP 503 (SlowDown) responses start appearing; beyond that, aggregate throughput drops.

Checksum validation caveat: HDFS uses MD5-of-CRC32C checksums that are incompatible with S3 ETags. Validate via row-count parity on Parquet or ORC files rather than byte-level CRC comparison. Bifrost's validation framework handles this automatically.
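Bifrost's validation framework handles this automatically; as a minimal standalone sketch of the row-count parity idea, with a caller-supplied count function (count_rows here is an assumed callable, e.g. backed by a SELECT COUNT(*) per table on each side):

```python
def validate_row_counts(tables, count_rows):
    """Compare per-table row counts between source and target.

    `tables` is an iterable of table names; `count_rows(side, table)`
    is a caller-supplied function (e.g. running SELECT COUNT(*) via
    the appropriate engine). Returns the list of mismatched tables
    as (table, source_count, target_count) tuples.
    """
    mismatches = []
    for table in tables:
        src = count_rows("source", table)
        dst = count_rows("target", table)
        if src != dst:
            mismatches.append((table, src, dst))
    return mismatches

# Example with an in-memory stand-in for the two sides:
counts = {
    ("source", "sales.orders"): 1_000_000,
    ("target", "sales.orders"): 1_000_000,
    ("source", "sales.refunds"): 52_311,
    ("target", "sales.refunds"): 52_309,  # two rows missing on target
}
bad = validate_row_counts(
    ["sales.orders", "sales.refunds"],
    lambda side, table: counts[(side, table)],
)
```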

DistCp failure recovery

Bifrost runs DistCp with -i -log /path/_distcp_logs. The -i option ignores per-file failures instead of aborting the whole job; the -log option writes a per-file outcome log to the specified directory. The storage migration playbook then:

  1. Reads the failure log after the initial pass.
  2. Retries failed files up to 3 times with exponential back-off.
  3. Reports any files that still fail after the retries for manual investigation.
  4. Runs row-count validation on the source and target directory trees; any count mismatch is surfaced before the storage migration is considered complete.

Partial-failure output is preserved under /var/log/bifrost/transcripts/<cluster>/<run-id>/distcp/ and includes the full per-file DistCp log.
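The retry portion of the playbook can be sketched as follows; copy_one is an assumed callable wrapping a single-file copy attempt, not Bifrost's actual implementation:

```python
import time

def retry_failed_files(failed_files, copy_one, max_retries=3, base_delay=1.0):
    """Retry each failed file up to `max_retries` times with exponential
    back-off (base_delay, then 2x, 4x, ...). Returns the files that
    still fail after all retries, for manual investigation.
    """
    still_failing = []
    for path in failed_files:
        for attempt in range(max_retries):
            if copy_one(path):
                break  # success: move on to the next file
            time.sleep(base_delay * (2 ** attempt))
        else:
            # exhausted all retries without success
            still_failing.append(path)
    return still_failing
```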

Capacity guardrails

  • Target operating ceiling: 80 % pool utilization. Performance degrades sharply above 85 %.
  • Production readiness requires the pool to be below 70 % before the first migration wave, to leave headroom for incoming data.
  • Maintain at least 15 % free capacity as an operational buffer at steady state.
  • Erasure Coding EC(4+2) on NVMe is the reference configuration for lakehouse data (50 % raw-capacity overhead compared to 200 % for triple replication).

Sizing formulas and reference tiers are in Operations — capacity planning.
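As a quick sketch of the arithmetic behind the guardrails (the capacity figures below are illustrative, not a sizing recommendation):

```python
def raw_capacity_needed(usable_tb, scheme):
    """Raw capacity required for a given usable capacity.

    EC(4+2) writes 6 chunks per 4 data chunks: 1.5x raw, i.e. 50% overhead.
    Triple replication writes 3 copies: 3x raw, i.e. 200% overhead.
    """
    factor = {"ec4+2": 6 / 4, "3x": 3.0}[scheme]
    return usable_tb * factor

def within_operating_ceiling(used_tb, pool_tb, ceiling=0.80):
    """True while pool utilization stays at or below the 80% ceiling."""
    return used_tb / pool_tb <= ceiling
```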

Catalog: Hive Metastore to Iceberg REST Catalog

Target selection

Bifrost supports two target catalogs. Both implement the Iceberg REST catalog specification.

| Criterion | Apache Polaris | Apache Gravitino |
| --- | --- | --- |
| Core model | Iceberg REST catalog plus Generic Tables for Hudi and Delta. | Federated catalog-of-catalogs. |
| Best fit | Primarily Hive-to-Iceberg migration. | Heterogeneous estate (Iceberg + HBase + Kafka + JDBC). |
| Engine support | Any Iceberg REST client. | Same; exposes Iceberg REST as a subset of its API. |

Decision rule: Polaris is the default when the estate is mostly Hive on HDFS. Gravitino is the default when substantial HBase, Kafka, or JDBC warehouses must be governed alongside Iceberg.

Conversion procedures

Three conversion procedures are available:

  • Snapshot — safe, reversible. Creates an Iceberg table that reads source Hive files in place. No data copy. Best for validation and quick wins.
  • Migrate — destructive in-place conversion. Renames the original table to <name>_BACKUP_ (the backup suffix is appended, not prepended). Faster for large, stable tables.
  • Add files — additive import into an existing Iceberg table. Best for tables needing re-partitioning.
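The procedure names correspond to Iceberg's Spark stored procedures (snapshot, migrate, add_files), invoked with CALL under the catalog's system namespace. A small illustrative helper that builds such statements; the catalog and table names are placeholders, and the helper itself is a sketch, not Bifrost code:

```python
def iceberg_call(catalog, procedure, **args):
    """Build a Spark SQL CALL statement for an Iceberg stored procedure.

    Iceberg exposes snapshot, migrate, and add_files under
    <catalog>.system; arguments are rendered as named arguments.
    In practice the string would be passed to spark.sql().
    """
    rendered = ", ".join(f"{k} => '{v}'" for k, v in args.items())
    return f"CALL {catalog}.system.{procedure}({rendered})"

# Reversible: Iceberg table reads the Hive files in place, no data copy.
snap = iceberg_call("spark_catalog", "snapshot",
                    source_table="db.events", table="db.events_iceberg")
# Destructive in-place conversion (original kept under a backup name).
mig = iceberg_call("spark_catalog", "migrate", table="db.events")
```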

Bulk catalog migration

For thousands of tables with no data copy required, bifrost modernize migrate-catalog moves table references between catalogs in a single command:

bifrost modernize migrate-catalog \
--source-type HIVE \
--target-type REST \
--target-uri https://polaris.example.internal/api/catalog \
--identifier-file tables_wave3.txt

Known pitfalls

  • Text, RC, and SequenceFile formats are blocked by migrate. Convert to Parquet or ORC first.
  • Timezone drift: set spark.sql.session.timeZone=UTC to avoid inconsistent results between on-prem UTC and cloud Spark defaults.
  • Hive Metastore in-flight commits must be stopped before cutover (hard-freeze window).
  • AWS Glue targets cannot use migrate because Glue lacks RENAME; use snapshot plus rewrite instead.

Compute: YARN to Kubernetes

Target stack

| Component | Role |
| --- | --- |
| Spark Operator | Manages the SparkApplication CRD lifecycle. |
| Apache YuniKorn | Hierarchical queue scheduling with gang scheduling. |
| Apache Celeborn | Remote shuffle service that decouples shuffle from executor lifetime. |
| Ilum | Spark lifecycle management, Livy-compatible proxy for existing clients. |
| Apache Kyuubi | Multi-tenant SQL gateway speaking the Hive Thrift protocol. |

Spark configuration translation

Bifrost translates Spark-on-YARN configurations to Spark-on-Kubernetes equivalents automatically. Key mappings:

| YARN setting | Kubernetes equivalent |
| --- | --- |
| spark.master=yarn | spark.master=k8s://https://k8s-api:6443 |
| spark.submit.deployMode=cluster | Same (default for CRD submissions). |
| spark.yarn.queue=production | spark.kubernetes.scheduler.name=yunikorn plus spark.kubernetes.driver.label.yunikorn.apache.org/queue=root.production.etl and spark.kubernetes.executor.label.yunikorn.apache.org/queue=root.production.etl. YuniKorn propagates queue placement through Kubernetes pod labels; the spark.kubernetes.*.label.* Spark conf keys attach the label at submission time. Target a leaf queue (for example, root.production.etl), not a parent — YuniKorn rejects Spark apps submitted to a parent queue. The annotation form (spark.kubernetes.*.annotation.yunikorn.apache.org/queue) is retained for backward compatibility but is legacy. |
| spark.executor.instances=10 | Same, or Dynamic Resource Allocation with spark.dynamicAllocation.enabled=true. |
| spark.executor.memory=8g | Same, plus spark.kubernetes.executor.request.cores. |
| spark.yarn.dist.archives=... | spark.kubernetes.file.upload.path=s3a://... |
| HDFS locality (NODE_LOCAL) | Not applicable. Use AQE, the magic committer, and compute-storage co-location in the same availability zone. |
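Put together, a translated submission looks roughly like the following SparkApplication manifest; this is a sketch assuming the Spark Operator's v1beta2 CRD, and the image, namespace, class, and queue names are placeholders:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-daily
  namespace: data-platform            # placeholder
spec:
  type: Scala
  mode: cluster
  sparkVersion: "3.5.0"
  image: registry.example.internal/spark:3.5.0   # placeholder
  mainClass: com.example.EtlJob                  # placeholder
  mainApplicationFile: s3a://artifacts/etl.jar
  sparkConf:
    spark.kubernetes.scheduler.name: yunikorn
  driver:
    cores: 1
    memory: 4g
    labels:
      yunikorn.apache.org/queue: root.production.etl   # leaf queue, not a parent
  executor:
    instances: 10
    cores: 4
    memory: 8g
    labels:
      yunikorn.apache.org/queue: root.production.etl
```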

Shuffle architecture

Shuffle is the single biggest architectural difference between YARN and Kubernetes. YARN has a cluster-wide External Shuffle Service built into NodeManagers. Kubernetes does not.

Decision rule: jobs shuffling more than 50 GB use the remote shuffle service (Celeborn); smaller jobs can use local-disk shuffle with PVC reuse.

The remote shuffle service decouples shuffle data from executor lifetime. This is what enables full Dynamic Resource Allocation on Kubernetes — executors can be added or removed during a job without losing shuffle data.
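As an illustrative fragment, routing Spark shuffle through Celeborn typically involves properties along these lines; the master endpoint is a placeholder and exact keys vary by Celeborn version:

```properties
# Route shuffle data through the Celeborn remote shuffle service
spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager
spark.celeborn.master.endpoints=celeborn-master-0.celeborn.svc:9097
# The YARN-style external shuffle service does not exist on Kubernetes
spark.shuffle.service.enabled=false
# Executors can now scale without losing shuffle data
spark.dynamicAllocation.enabled=true
```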

Data locality

Data locality disappears when moving from HDFS to S3A. Expect a 10 % to 30 % raw-scan slowdown compared to HDFS, typically offset by elastic scaling and cheaper storage. Mitigations:

  • Adaptive Query Execution (AQE) — spark.sql.adaptive.enabled=true — handles skew and coalesces small partitions at runtime.
  • Magic committer — eliminates rename-as-copy on writes.
  • spark.sql.files.maxPartitionBytes between 128 MB and 256 MB.
  • Co-locate compute and object storage in the same availability zone.
  • Iceberg metadata pushdown (partition pruning, column pruning) to minimize scan width.

Query Engine: Hive / Impala to Trino

Dual-tier architecture

Trino deploys as two clusters behind a Trino Gateway:

  • Interactive tier — retry-policy=NONE, autoscaled on running-query JMX metrics, small JVMs (16 GB), high concurrency. Serves BI dashboards and ad-hoc SQL.
  • ETL tier — retry-policy=TASK with an exchange manager on object storage, larger JVMs (32 GB), adaptive plan enabled. Serves long-running transformation jobs. Fault-tolerant execution adds around 34 % overhead on short queries but dramatically improves reliability for ETL.
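The ETL tier's fault-tolerant setup corresponds to Trino properties along these lines; the bucket name is a placeholder:

```properties
# config.properties (ETL tier): retry failed tasks instead of whole queries
retry-policy=TASK

# exchange-manager.properties: spool exchange data to object storage
exchange-manager.name=filesystem
exchange.base-directories=s3://trino-exchange-spooling
```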

The gateway handles routing: queries are sent to the interactive or ETL cluster based on query properties and session settings, and the gateway enables seamless rolling upgrades of the clusters behind it.

Table redirection

Trino's table redirection transparently resolves queries on the legacy catalog to the Iceberg catalog for migrated tables. BI tools, dbt models, and application queries work unchanged during cutover. This is the critical migration feature — it is what lets Modernize run without breaking consumers.

Authorization

OPA (Open Policy Agent) is the default authorization engine. OPA supports row-level filters and column masking, and is deployed as a sidecar in the coordinator pod.
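Trino's OPA access-control plugin is wired with a fragment like the following; the policy URI path is a deployment choice:

```properties
# access-control.properties on the Trino coordinator
access-control.name=opa
# OPA sidecar listens on localhost; the Rego decision path is a placeholder
opa.policy.uri=http://localhost:8181/v1/data/trino/allow
```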

Legacy policy systems can be preserved during migration to allow policy reuse, then sunset as OPA policies mature.

Workflows: Oozie to Airflow + dbt

Converter capabilities

Bifrost's workflow converter transforms Oozie workflows to Airflow 3 DAGs with dbt models extracted for SQL actions. Coverage by action type:

| Oozie action | Airflow equivalent | Automation |
| --- | --- | --- |
| Hive / Hive2 | dbt model via Cosmos DbtTaskGroup | 85 to 90 % |
| Spark (Java / Scala) | SparkKubernetesOperator | 80 to 85 % |
| Shell | KubernetesPodOperator | 90 % |
| FS (mkdir, move, delete, chmod, touchz) | S3CreateObjectOperator, S3CopyObjectOperator, S3DeleteObjectsOperator, or a PythonOperator invoking boto3 for permissions | 75 % |
| Email | EmailOperator | 95 % |
| SSH | SSHOperator or KubernetesPodOperator | 80 % |
| DistCp | BashOperator invoking bifrost migrate-storage | 70 % |
| Sub-workflow | TriggerDagRunOperator | 75 % |
| Fork / Join | Airflow parallel task groups | 85 % |
| Decision | BranchPythonOperator | 70 % |

Coordinator and bundle handling

  • Simple time-driven coordinators (cron-like) — converted to Airflow schedule_interval. Automation: 50 to 65 %.
  • Data-driven coordinators — produce TODO-annotated DAG stubs with Airflow Dataset triggers. Automation: 10 to 20 %.
  • Bundles — not automated. Produce stub files with documentation. Automation: 0 %.
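A simplified sketch of the time-driven conversion logic; the mapping table is illustrative, not the converter's actual implementation:

```python
def oozie_frequency_to_schedule(frequency):
    """Map simple Oozie coordinator frequencies to Airflow schedules.

    Handles common EL shortcuts and plain 5-field cron strings; anything
    else returns None, signalling a TODO-annotated stub for manual work.
    """
    shortcuts = {
        "${coord:hours(1)}": "@hourly",
        "${coord:days(1)}": "@daily",
        "${coord:months(1)}": "@monthly",
    }
    if frequency in shortcuts:
        return shortcuts[frequency]
    if frequency.count(" ") == 4:  # looks like a 5-field cron expression
        return frequency
    return None  # data-driven or complex coordinator: emit a stub
```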

dbt integration

Hive and Hive2 SQL actions are extracted into dbt models under models/legacy/<workflow>_<action>.sql with Iceberg materialization config headers. dbt runs inside Airflow DAGs using Cosmos. The primary adapter is dbt-trino; dbt-spark covers transforms that specifically need Spark execution.

Cosmos execution mode is a deployment choice:

  • Local / virtualenv — dbt runs inside the Airflow worker. Supports Airflow Dataset emission, so downstream DAGs can be triggered by model completion.
  • Kubernetes — dbt runs as a KubernetesPodOperator task. Isolates dependencies per model but does not emit Airflow Datasets. Use time-based triggers or explicit TriggerDagRunOperator for cross-DAG dependencies.

Bifrost's converter defaults to Local mode so Dataset-aware scheduling works. Customers can switch individual DAGs to Kubernetes mode when dependency isolation outweighs Dataset support.

All converter outputs target Airflow 3 (TaskFlow API, no SubDagOperator). Airflow 2 is not supported as an output target.

Airflow on Ilum

The target Airflow deployment is managed by the Ilum platform. For details on running Airflow alongside Ilum, including LivyOperator, SparkSubmitOperator, Git Sync, and OAuth2 integration, see Airflow integration.

Governance: Atlas to OpenMetadata

Two-step migration

Step 1 — Connect. Point OpenMetadata's native Atlas connector at the live Atlas instance. Ingests tables, databases, classifications, and glossary. Each ingested object carries an origin classification that marks it as Atlas-sourced; the exact classification name depends on the OpenMetadata release and the customer's classification hierarchy.

Step 2 — Rebuild lineage. Discard Atlas lineage graphs and rebuild from scratch via OpenLineage emitters on each engine:

  • Spark — OpenLineage Spark listener attached to every Spark job.
  • Airflow — OpenLineage Airflow provider on the Airflow deployment.
  • Trino — OpenMetadata's Trino connector with query-log ingestion.
Warning: do not attempt to preserve Atlas lineage graphs across the migration. The underlying graph model does not round-trip to OpenMetadata's relational model. Teams that try consistently spend weeks before accepting they must re-emit from the engines.

Division of labour

The catalog (Polaris or Gravitino) governs tables and data: RBAC on catalog operations, credential vending, schema registration. OpenMetadata governs metadata, people, processes, quality, and contracts: glossary, ownership, tiers, data tests, documentation. The two are complementary, not overlapping.

User Interface: HUE to Superset

What bifrost modernize hue-import does

The import tool reads HUE's document database, classifies documents by type (query-hive, query-impala, query-sparksql, oozie-workflow2, search-dashboard), maps users via OIDC, creates Superset saved queries via Superset's API, converts embedded Oozie workflows through the workflow converter, and produces a migration report.

| Item | Result |
| --- | --- |
| Saved Hive, Impala, or Spark SQL queries | Created as Superset saved queries. |
| Oozie workflow documents embedded in HUE | Converted via the workflow converter. |
| Permissions | Mapped via OIDC claims to Superset roles. |
| User mapping | Applied from a customer-provided mapping file. |

What must be rebuilt manually

  • HUE's Solr-backed search dashboards (no Superset analog).
  • File-browser bookmarks (no Superset analog).
  • Editor autocomplete session details (session-specific).

Superset configuration

The Superset deployment on the target platform is pre-built with Trino, Kyuubi-Hive, and Postgres drivers installed. Authentication is via OIDC to Keycloak with AUTH_ROLES_MAPPING for group-to-role translation. For Trino queries, the trino-python-client SQLAlchemy dialect is used. See also Superset.

Data Store: HBase to Cassandra, ScyllaDB, or Bigtable

Honest assessment

HBase is the hardest migration leg with the lowest automation ceiling (40 to 55 %). HBase and Cassandra-family systems diverge in three fundamental ways:

  • Keying. HBase has a single lexicographically-sorted row key. Cassandra hash-partitions on a compound primary key. Composite HBase row keys must be decomposed, which requires application access-pattern knowledge.
  • Versioning. HBase per-cell multi-versioning (VERSIONS => N) has no Cassandra analog. Values must be flattened to latest-only or re-encoded as clustering rows.
  • Coprocessors. HBase Observers, Endpoints, and Phoenix secondary indexes have no Cassandra equivalent. These must be reimplemented as application logic or Spark jobs.
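As an illustration of the keying problem, a toy decomposition of a composite row key. The separator and layout are hypothetical; real decomposition requires knowledge of the application's access patterns:

```python
def decompose_rowkey(rowkey, sep="#"):
    """Decompose a composite HBase row key such as
    'customer123#20240115#evt-9' into a Cassandra-style primary key:
    the leading component becomes the partition key, the remaining
    components become clustering columns. Roughly equivalent CQL:
    PRIMARY KEY ((customer_id), event_date, event_id).
    """
    parts = rowkey.split(sep)
    return {"partition_key": parts[0], "clustering": parts[1:]}
```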

Target options

| Target | Best fit | Key advantage |
| --- | --- | --- |
| Apache Cassandra | Multi-region active-active, cloud-neutral | Largest ecosystem, CQL standard. |
| ScyllaDB | Latency-sensitive workloads | CQL-compatible, 2 to 10x lower latency, fewer nodes. |
| Google Cloud Bigtable | Customers accepting a cloud target | HBase API-compatible, zero schema translation. |

Data migration pipeline

  1. Bulk export — HBase snapshot via a table-snapshot input format (bypasses region servers).
  2. Transform — Spark decomposes row keys into partition and clustering keys, flattens versions.
  3. Bulk load — CQLSSTableWriter with sstableloader for throughput, or a Spark-Cassandra connector for simpler cases.
  4. CDC bridge — custom HBase replication endpoint, Kafka, and Cassandra sink for cutover (preserves timestamps).
  5. Reconciliation — row-level diff against the source.

Automation ceilings per aspect

| Aspect | Ceiling |
| --- | --- |
| Schema introspection, bulk data movement | 80 to 95 % |
| CQL DDL generation, CDC scaffolding | 60 to 80 % |
| Row-key decomposition | 30 to 50 % |
| Query-pattern-correct CQL schema | 0 to 30 % |
| Coprocessor replacement | 0 % |

Because of the low ceiling, HBase migration is a separate program track with its own review gates, data-model workshops, and application-side tasks. It runs in parallel with table migration; it does not block it.

Security: Ranger to OPA

Policy translator

Bifrost provides a Ranger-to-OPA policy translator that converts Ranger JSON exports to OPA policy files. The translator handles:

  • Database, table, and column-level GRANT and REVOKE policies.
  • User and group mappings (translated to OIDC claims).
  • Row-level filters (translated to OPA data filters).
  • Column masking rules.
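As an illustration of the target format, a minimal Rego sketch of a table-level allow rule. The package path and input shape are hypothetical and depend on how the OPA integration is wired in a given deployment:

```rego
package trino.authz

default allow = false

# Allow members of the analysts group to read from the sales schema.
allow {
    input.action.operation == "SelectFromColumns"
    input.action.resource.table.schemaName == "sales"
    input.context.identity.groups[_] == "analysts"
}
```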

Limitations

  • Tag-based policies (Ranger with Atlas integration) require manual translation.
  • Ranger audit logs are not translated; the equivalent on the target platform is OPA decision logging plus the OpenMetadata audit trail.

OPA runs as a sidecar to the Trino coordinator (localhost traffic) and is the default authorization backend on the target platform. Legacy Ranger can be preserved during migration for reuse, then sunset when OPA policies reach parity.

Components Out of Scope (Discover-Only)

The following legacy components are inventoried during discover but not automatically migrated. Bifrost produces a report; porting or replacement is a manual workstream.

| Component | Bifrost does | Recommended handling |
| --- | --- | --- |
| Sqoop | Inventories Sqoop jobs and connection configs in the discovery report. | Rewrite to Airflow with JdbcToS3Operator or Spark JDBC ingestion. |
| Flume | Inventories Flume agents, sources, sinks, and channels. | Replace with Kafka Connect, Airbyte, or Spark Structured Streaming. |
| Kudu | Inventories Kudu tables and schemas. | Choose by workload: Apache Pinot for real-time aggregation and OLAP, ScyllaDB or Cassandra for high-update primary-key workloads, Iceberg with streaming writers for append-only analytical data. Iceberg is not a drop-in for Kudu's random-update primary-key patterns. |
| Phoenix | Inventories Phoenix secondary indexes and views on HBase. | Reimplement indexes as Solr / OpenSearch sidecars or as Cassandra materialized views during the HBase track. |
| Solr (for search, outside Ranger / Atlas) | Inventories collections and schemas. | Migrate to OpenSearch; schema translation is mostly mechanical but analyzer pipelines may need review. |
| Knox | Inventories Knox topologies and service mappings. | Replace with a Kubernetes ingress and OIDC at Trino / Airflow / Superset; no direct equivalent. |
| HDFS Federation / Router-Based Federation | Inventories federated namespaces. | The target object storage is flat; federated paths collapse into bucket prefixes during migration. |
| Workload XM (CDP) | Flagged during direct discover dependency analysis. | No direct equivalent; use Grafana + Trino query-log dashboards. |
| Cloudera Data Visualization (CDP) | Flagged during direct discover. | Migrate to Superset via hue-import for compatible saved queries; dashboards require manual rebuild. |
| Shared Data Experience (SDX) (CDP) | Flagged during direct discover. | SDX overlaps with Polaris + OpenMetadata + OPA; mapping is customer-specific and manual. |

Next Steps