Hadoop Migration Validation and Rollback Procedures
Every Bifrost migration is designed to be safe, repeatable, and reversible. This page describes the mechanisms that make that possible: the validation framework that confirms correctness at every step, the decision engine that turns validation results into go-or-no-go verdicts, and the rollback procedures available at each phase.
Validation Framework
Bifrost validates at multiple levels. The appropriate set of checks runs automatically for the current phase; nothing is skipped silently.
Row-count parity
For every migrated table, Bifrost compares source and target row counts per partition. The default tolerance is 0 % for immutable tables; for append-only or streaming tables the tolerance is configurable.
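A per-partition comparison of this kind can be sketched in a few lines (a minimal illustration; the function name and data shapes are assumptions, not Bifrost internals):

```
def check_row_count_parity(source_counts, target_counts, tolerance=0.0):
    """Compare per-partition row counts; tolerance is a fraction (0.0 = exact)."""
    mismatches = []
    for partition, src in source_counts.items():
        tgt = target_counts.get(partition, 0)
        if abs(src - tgt) > src * tolerance:
            mismatches.append((partition, src, tgt))
    return mismatches

# Immutable table: 0% tolerance, so any drift is flagged
src = {"dt=2024-01-01": 1000, "dt=2024-01-02": 2000}
tgt = {"dt=2024-01-01": 1000, "dt=2024-01-02": 1999}
print(check_row_count_parity(src, tgt))  # [('dt=2024-01-02', 2000, 1999)]
```

With a non-zero tolerance (the append-only case), small drift passes: a tolerance of 0.001 accepts the one-row difference above.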
Value-level data diff
Row-count parity alone does not catch value drift. Bifrost uses hash-tree sampling for value-level validation:
- Partitions are hashed in a binary tree.
- Only divergent branches are expanded.
- Locating a divergent partition therefore costs O(log n) hash comparisons, rather than the O(n) of a full-table scan.
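The branch-pruning idea can be sketched as a recursive diff over a binary hash tree (a minimal illustration: stdlib sha256 stands in for the xxhash64 default, and the helper names are ours, not Bifrost's):

```
import hashlib

def h(data: bytes) -> bytes:
    # the documented default is xxhash64; sha256 stands in to stay stdlib-only
    return hashlib.sha256(data).digest()

def node_hash(leaves, lo, hi):
    """Hash of the subtree covering partitions [lo, hi)."""
    if hi - lo == 1:
        return leaves[lo]
    mid = (lo + hi) // 2
    return h(node_hash(leaves, lo, mid) + node_hash(leaves, mid, hi))

def tree_diff(src, tgt, lo=0, hi=None):
    """Indices of divergent partitions; matching branches are never expanded."""
    if hi is None:
        hi = len(src)
    if node_hash(src, lo, hi) == node_hash(tgt, lo, hi):
        return []                # branch matches: prune the whole subtree
    if hi - lo == 1:
        return [lo]              # divergent leaf partition found
    mid = (lo + hi) // 2
    return tree_diff(src, tgt, lo, mid) + tree_diff(src, tgt, mid, hi)

src = [h(f"part-{i}".encode()) for i in range(8)]
tgt = list(src)
tgt[5] = h(b"drifted partition")
print(tree_diff(src, tgt))  # [5]
```

Only the path from the root to partition 5 is expanded; the other seven leaves are never compared individually.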
The default sample rate is 1 % of partitions. For business-critical tables, customers can raise this to 5 % or 10 % at the cost of additional run time. Sample rate is configurable per table or globally:
```
data_diff:
  row_count_tolerance: 0.0
  sample_rate: 0.01
  hash_algorithm: xxhash64
  partition_parallelism: 8
  timeout_per_partition: 300
```
Query parity
Validates that production queries return the same results, within the same time envelope, on the source engine and the target engine. Run with a representative query set:
```
bifrost modernize validate \
  --type query-parity \
  --query-file benchmark_queries.sql \
  --source-engine hive \
  --target-engine trino
```
The tolerance for latency regression is 1.3x (queries on Trino can be up to 30 % slower than on Hive before triggering a warning). Value-match tolerance is 0 % by default.
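The tolerance logic amounts to a simple comparison. A minimal sketch (the function and its result labels are illustrative, not part of the Bifrost CLI):

```
def query_parity_verdict(src_seconds, tgt_seconds, results_match,
                         latency_tolerance=1.3):
    """Latency may regress up to 1.3x before a warning; results must match exactly."""
    if not results_match:
        return "FAIL"        # value-match tolerance is 0% by default
    if tgt_seconds > src_seconds * latency_tolerance:
        return "WARN"        # slower than the allowed envelope
    return "PASS"

print(query_parity_verdict(10.0, 12.5, True))    # PASS  (1.25x regression)
print(query_parity_verdict(10.0, 14.0, True))    # WARN  (1.4x regression)
print(query_parity_verdict(10.0, 9.0, False))    # FAIL  (result mismatch)
```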
Schema comparison
Validates column names, types, nullability, partition specs, and sort orders match between source and target. Catches subtle drift that would not show up in row-count parity alone.
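A minimal version of such a comparison, assuming a flat column list (real schemas also carry partition specs and sort orders, which this sketch omits):

```
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str
    type: str
    nullable: bool

def schema_drift(source_cols, target_cols):
    """Columns whose name, type, or nullability differ between source and target."""
    src = {c.name: c for c in source_cols}
    tgt = {c.name: c for c in target_cols}
    return sorted(
        name for name in src.keys() | tgt.keys()
        if src.get(name) != tgt.get(name)
    )

source = [Column("id", "bigint", False), Column("email", "varchar", True)]
target = [Column("id", "bigint", False), Column("email", "varchar", False)]
print(schema_drift(source, target))  # ['email'] -- nullability drifted
```

A row-count check would pass this table pair; the nullability drift only surfaces at the schema level.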
Decision Engine Gates
At the end of each phase, the decision engine runs a set of checks and returns one of three verdicts:
- PROCEED — every critical check passes. The next phase is allowed. Production gates typically still require explicit human approval.
- WARN — a non-critical check failed. Results are logged and sent to the notification channels, but progression is not blocked.
- ABORT — a critical check failed, or two or more abort triggers fired. The rollback mechanism for the current phase is invoked automatically, without human intervention.
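The verdict rules above collapse to a few lines (an illustration of the documented rules, not the decision engine's actual code):

```
def gate_verdict(critical_failures, warning_failures, abort_triggers):
    """Collapse check results into one of the three gate verdicts."""
    if critical_failures or len(abort_triggers) >= 2:
        return "ABORT"   # rollback for the current phase is invoked automatically
    if warning_failures:
        return "WARN"    # logged and notified; progression not blocked
    return "PROCEED"     # next phase allowed (prod gates may still need approval)

print(gate_verdict([], [], []))                           # PROCEED
print(gate_verdict([], ["unhealthy YARN node"], []))      # WARN
print(gate_verdict([], [], ["NameNode not active",
                            "safe mode not exited"]))     # ABORT
```

Note the asymmetry: a single critical failure is enough to abort, but abort triggers must fire in pairs.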
Classic gates
Classic migration enforces the following critical checks at its phase gates:
| Check | Effect if failed |
|---|---|
| HDFS fsck reports no corrupt blocks | ABORT |
| HDFS fsck reports no missing blocks | ABORT |
| Live DataNodes at least 95 % of expected | ABORT |
| HBase reports no dead region servers | ABORT |
| Hive Metastore table count matches baseline | ABORT |
| Policy database count matches baseline | ABORT |
| Kerberos authentication succeeds | ABORT |
| TLS handshake succeeds | ABORT |
Warning checks (non-blocking):
- HDFS under-replicated blocks less than 1000.
- YARN reports no unhealthy nodes.
- Kafka reports no under-replicated partitions.
- TeraSort duration no more than 20 % slower than baseline.
Abort triggers (two or more cause automatic rollback):
- NameNode not active by the gate time.
- Safe mode not exited by the gate time.
- Two or more critical check failures.
- Any DataNode data loss detected.
Modernize and Direct gates
Critical checks at Modernize and Direct gates:
| Check | Threshold |
|---|---|
| Table row-count parity | Must match exactly |
| Data-diff sample pass rate | > 99.99 % |
| Iceberg snapshot consistent | Yes |
| Trino query parity ratio | < 1.3x latency regression |
| Object storage cluster health | HEALTH_OK |
| Spark job completion rate | >= 99 % |
Abort triggers:
- Any table row-count mismatch.
- Object storage cluster in HEALTH_ERR.
- Catalog service unreachable.
- Trino coordinator in crash loop.
Wave-level failure semantics
During Modernize and Direct wave execution, table failures are handled at the table level, not the wave level:
- The failed table is quarantined (marked FAILED in the migration record).
- The wave continues; other tables proceed normally.
- A wave is marked COMPLETE only when every non-quarantined table passes validation.
- Quarantined tables do not block the wave, but they must be resolved before the cluster can be decommissioned.
- Retry policy is 3 attempts with exponential back-off: 5 minutes, then 15 minutes, then 45 minutes.
- After 3 failures, the table requires manual intervention.
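The retry policy can be sketched as follows. Whether "3 attempts" counts the initial try is an interpretation; this sketch runs one initial attempt plus three delayed retries, and the function names are illustrative:

```
import time

RETRY_DELAYS_MINUTES = [5, 15, 45]   # exponential back-off between attempts

def migrate_with_retry(table, migrate_fn, sleep=time.sleep):
    """Run a table migration with retries; quarantine after the final failure."""
    for attempt in range(len(RETRY_DELAYS_MINUTES) + 1):
        try:
            return migrate_fn(table)
        except Exception:
            if attempt == len(RETRY_DELAYS_MINUTES):
                return "QUARANTINED"   # manual intervention required
            sleep(RETRY_DELAYS_MINUTES[attempt] * 60)

waits = []
def always_fails(table):
    raise RuntimeError("validation failed")

print(migrate_with_retry("db.orders", always_fails, sleep=waits.append))
# QUARANTINED
print(waits)  # [300, 900, 2700] seconds slept between attempts
```

Injecting the sleep function keeps the sketch testable without waiting an hour.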
Rollback — Path 1 (Classic)
Every phase is reversible until finalize. The rollback window remains open from the start of the migration through the pre-finalization soak and closes only when bifrost classic finalize --confirm-irreversible runs. Rollback reuses the same decision engine as forward migration, so the cluster is only considered healthy once post-rollback validation passes.
Four rollback windows exist, in stage order:
| Stage | When it applies | Mechanism | Typical duration |
|---|---|---|---|
| After package swap | Target distribution installed, services not yet started | Remove target packages, restore source distribution from local cache, restart cluster-manager agent | ~2 hours |
| After services started | Services running on target distribution, post-migration validation failed | Stop services, run HDFS rollback, restore HMS and policy databases from pg_dump, reinstall source distribution, restart via cluster manager | ~4 hours |
| After pre-finalization monitoring (5-day soak) | Soak period in progress; issue surfaces after initial validation passed | Same mechanism as "after services started" — the window stays open for the full soak | ~4 hours |
| After finalization | bifrost classic finalize has run | Not reversible. hdfs dfsadmin -finalizeUpgrade has deleted the previous/ directory; the on-disk predecessor blocks are gone. | N/A |
Bifrost waits at least 5 business days before finalizing specifically so operators can still choose not to finalize if the migrated cluster misbehaves during the soak. Finalize is a deliberate, manual step with a required --confirm-irreversible flag.
After package swap (services not started)
The fastest rollback window. The source-distribution packages are still in the local cache from the backup phase, and nothing has started on the target distribution yet.
- Remove target-distribution packages.
- Restore source-distribution packages from the on-node cache.
- Restart the cluster-manager agent.
- Services come up on the source distribution as if the swap had not happened.
After services started (validation failed)
Services already started on the target distribution and post-migration validation did not pass.
- Stop target-distribution services.
- Revert HDFS. For shrink-and-grow, run hdfs namenode -rollingUpgrade rollback. For stop-and-swap, restart the NameNode with the -rollback startup argument.
- Restore the Hive Metastore database from pg_dump.
- Restore the policy database from pg_dump.
- Reinstall the source-distribution packages from the local cache.
- Restart services via the cluster manager.
After pre-finalization monitoring (5-day soak)
Validation already passed but a production issue surfaces during the post-migration soak (typically 5 business days). The rollback window remains open for the entire soak and uses the same mechanism as "after services started". This is the window customers run against most often in practice, because real operational problems often appear only hours or days after cutover.
After finalization
Not reversible. Once bifrost classic finalize has run, hdfs dfsadmin -finalizeUpgrade has removed the previous/ directory on the NameNode and every DataNode. Neither the NameNode -rollback startup argument nor -rollingUpgrade rollback is available — both depend on previous/ being present. The cluster-manager database and HMS/policy database backups Bifrost captured during the backup phase have also been removed.
This is the exact reason Bifrost requires the 5-day soak before offering the finalize command. Skipping or shortening the soak trades operational safety for cleanup time.
Partial rollback — shrink-and-grow variant
In shrink-and-grow runs, some DataNodes migrate while others remain on the source distribution. Any of the four stages above can be hit with only part of the cluster on the target distribution; the revert is scoped to the migrated nodes:
- Decommission the migrated nodes.
- Wait for HDFS replication to complete.
- Revert those nodes to the source distribution using the appropriate stage-specific mechanism.
- Recommission.
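Step 2 of the sequence above (waiting out re-replication) can be automated by polling fsck output. Scraping the human-readable summary line is an assumption made for this sketch; a production check would read the NameNode's JMX metrics instead:

```
import re
import subprocess
import time

def wait_for_replication(poll_seconds=60, run=subprocess.run, sleep=time.sleep):
    """Block until `hdfs fsck /` reports zero under-replicated blocks."""
    while True:
        out = run(["hdfs", "fsck", "/"], capture_output=True, text=True).stdout
        match = re.search(r"Under-replicated blocks:\s+(\d+)", out)
        if match and int(match.group(1)) == 0:
            return
        sleep(poll_seconds)
```

The `run` and `sleep` parameters are injectable so the loop can be exercised without a live cluster.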
Rollback command
```
# Roll the whole cluster back to a specific phase
bifrost classic rollback --cluster PROD01 --to-phase backup

# Roll a single node back (shrink-and-grow)
bifrost classic rollback --cluster PROD01 --node dn-042.example.internal
```
Rollback — Paths 2 and 3 (Modernize / Direct)
Modernize and Direct rollbacks are fundamentally different from Classic rollbacks because the legacy environment is still running in parallel with the target during most of the migration.
Table-level rollback
Instantaneous. Iceberg metadata is swapped back; the table reverts to its pre-migration state in microseconds. Bifrost executes this automatically when validation fails, and the same revert can be triggered manually after the fact:
```
# Revert a single migrated table to its pre-migration state
bifrost modernize rollback --table production_db.customers

# Revert every table in a wave
bifrost modernize rollback --wave 3
```
Table redirection is disabled for the affected tables as part of the revert; queries resume against the legacy catalog until the table is migrated again.
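The revert is fast because an Iceberg table is ultimately a catalog pointer to a metadata file, and rolling back swaps that pointer. A toy illustration of the idea (this class is hypothetical, not the Iceberg or Bifrost API):

```
class ToyCatalog:
    """Toy stand-in: a catalog maps each table to its current metadata file."""
    def __init__(self):
        self.pointer = {}   # table -> current metadata location
        self.history = {}   # table -> prior locations, newest last

    def commit(self, table, metadata_location):
        if table in self.pointer:
            self.history.setdefault(table, []).append(self.pointer[table])
        self.pointer[table] = metadata_location

    def rollback(self, table):
        # an O(1) pointer swap: no data files are copied or deleted
        self.pointer[table] = self.history[table].pop()

cat = ToyCatalog()
cat.commit("production_db.customers", "s3://warehouse/meta/v1.metadata.json")
cat.commit("production_db.customers", "s3://warehouse/meta/v2.metadata.json")
cat.rollback("production_db.customers")
print(cat.pointer["production_db.customers"])
# s3://warehouse/meta/v1.metadata.json
```

Because only the pointer changes, the revert cost is independent of table size.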
Service-level rollback
Service rollbacks use Helm rollback. The previous Helm release of any Kubernetes service (Trino, Polaris, Spark Operator, Airflow, and so on) can be reinstated with a single helm rollback command.
Storage-level rollback
HDFS data is not deleted during Modernize; it remains intact until explicit decommission. During the dual-read bridge phase, the legacy path is always available as a fallback. Only bifrost modernize decommission (after the silence period) makes the legacy storage unavailable.
Per-phase rollback matrix (Modernize and Direct)
| Phase | Failure symptom | Revert command | State implication | Time to recover |
|---|---|---|---|---|
| land fails part-way | Helm release partially applied | helm rollback <release> per affected component; re-run bifrost modernize land --status | Target platform returns to the previous Helm release; no source data is touched | Minutes per component |
| bridge fails | Trino table redirection misconfigured, or DistCp warm-sync loop erroring | bifrost modernize bridge --disable issues a Helm upgrade of the Trino catalog config to drop the redirection rules; a Trino coordinator reload follows. Fix the config and re-run bridge. | Queries fall back to the legacy catalog only | 5 to 15 minutes (Helm upgrade + coordinator reload) |
| migrate-table fails validation | Data-diff or query-parity below threshold | Automatic: Bifrost reverts via rollback --table <name>. Manual: same command | Table redirection disabled; legacy table is authoritative again | Microseconds (Iceberg metadata swap) |
| migrate-wave partial failure | Some tables passed, some failed | Quarantined tables auto-revert; passed tables remain migrated | Wave marked INCOMPLETE until quarantined tables resolve | Per quarantined table |
| convert-workflow produces broken DAG | Airflow DAG import error, or runtime failure on cutover | Pause the Airflow DAG; keep the Oozie workflow active on the legacy cluster; re-run the converter with corrected rulesets | No production impact — legacy workflow still authoritative | Minutes to hours |
| hue-import partial | Some queries or dashboards failed to import | Inspect the hue-import migration report; re-run for specific documents with --user-mapping-file fixes | Legacy HUE still available | Minutes |
| decommission refused | Bifrost detects residual access during the silence period | Investigate the access source (reported in the decommission log); re-run decommission --dry-run after remediation | No state change (decommission never executed) | N/A |
All revert operations are recorded in the migration ledger and surfaced through bifrost modernize status.
Irreversible Finalize
Every migration has a single, irreversible final step that removes rollback assets and ends the ability to revert.
Classic finalize
```
bifrost classic finalize --cluster PROD01 --confirm-irreversible
```
Finalize removes:
- The source distribution package cache on every node.
- LVM snapshots of NameNode metadata volumes.
- Baseline captures and backup databases.
- Temporary rollback keytabs and certificates.
Bifrost recommends a 5-business-day soak of clean operation before running finalize. The --confirm-irreversible flag is required and cannot be bypassed.
Modernize and Direct decommission
Modernize and Direct do not have a single "finalize" step. Instead, each legacy service is decommissioned individually after its silence period passes:
```
# Decommission HDFS after 30 days of confirmed silence
bifrost modernize decommission \
  --service hdfs \
  --cluster PROD01 \
  --after-silence 30d
```
The final irreversible step for a Direct migration is the Cloudera Manager shutdown:
```
bifrost direct decommission \
  --service cloudera-manager \
  --cm-host cm.example.internal \
  --confirm-irreversible
```
This step ends the Cloudera subscription requirement and cannot be reversed.
Summary: Verdict, Scope, and Rollback
The following table captures the decision engine's granularity across paths:
| Scope | Verdict applies to | Rollback scope |
|---|---|---|
| Phase gate (Classic) | Entire cluster | Entire cluster |
| Table migration (Modernize / Direct) | Single table | Single table (Iceberg metadata swap) |
| Wave validation (Modernize / Direct) | All tables in the wave | Per-table (only failed tables revert) |
| Storage migration (Modernize / Direct) | DistCp job | Re-run from last checkpoint |
| Workflow conversion (Modernize / Direct) | Single workflow | No rollback needed (re-run converter) |
Next Steps
- Operations — monitoring, capacity planning, and the production readiness checklist.
- Troubleshooting — common issues and their resolutions.
- CLI reference — every command, every flag, every option.