Hadoop Migration Validation and Rollback Procedures
Every Bifrost migration is designed to be safe, repeatable, and reversible. This page describes the mechanisms that make that possible: the validation framework that confirms correctness at every step, the decision engine that turns validation results into go-or-no-go verdicts, and the rollback procedures available at each phase.
Validation Framework
Bifrost validates at multiple levels. The appropriate set of checks runs automatically for the current phase; nothing is skipped silently.
Row-count parity
For every migrated table, Bifrost compares source and target row counts per partition. The default tolerance is 0 % for immutable tables; for append-only or streaming tables the tolerance is configurable.
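A per-partition comparison of this kind can be sketched in a few lines (a minimal illustration; the function name and data shapes are assumptions, not Bifrost internals):

```
def check_row_count_parity(source_counts, target_counts, tolerance=0.0):
    """Compare per-partition row counts; tolerance is a fraction (0.0 = exact)."""
    mismatches = []
    for partition, src in source_counts.items():
        tgt = target_counts.get(partition, 0)
        if abs(src - tgt) > src * tolerance:
            mismatches.append((partition, src, tgt))
    return mismatches

# Immutable table: 0% tolerance, so any drift is flagged
src = {"dt=2024-01-01": 1000, "dt=2024-01-02": 2000}
tgt = {"dt=2024-01-01": 1000, "dt=2024-01-02": 1999}
print(check_row_count_parity(src, tgt))  # [('dt=2024-01-02', 2000, 1999)]
```

With a non-zero tolerance (the append-only case), small drift passes: a tolerance of 0.001 accepts the one-row difference above.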
Value-level data diff
Row-count parity alone does not catch value drift. Bifrost uses hash-tree sampling for value-level validation:
- Partitions are hashed in a binary tree.
- Only divergent branches are expanded.
- Locating a divergent partition therefore costs O(log n) hash comparisons, rather than the O(n) of a full-table scan.
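The branch-pruning idea can be sketched as a recursive diff over a binary hash tree (a minimal illustration: stdlib sha256 stands in for the xxhash64 default, and the helper names are ours, not Bifrost's):

```
import hashlib

def h(data: bytes) -> bytes:
    # the documented default is xxhash64; sha256 stands in to stay stdlib-only
    return hashlib.sha256(data).digest()

def node_hash(leaves, lo, hi):
    """Hash of the subtree covering partitions [lo, hi)."""
    if hi - lo == 1:
        return leaves[lo]
    mid = (lo + hi) // 2
    return h(node_hash(leaves, lo, mid) + node_hash(leaves, mid, hi))

def tree_diff(src, tgt, lo=0, hi=None):
    """Indices of divergent partitions; matching branches are never expanded."""
    if hi is None:
        hi = len(src)
    if node_hash(src, lo, hi) == node_hash(tgt, lo, hi):
        return []                # branch matches: prune the whole subtree
    if hi - lo == 1:
        return [lo]              # divergent leaf partition found
    mid = (lo + hi) // 2
    return tree_diff(src, tgt, lo, mid) + tree_diff(src, tgt, mid, hi)

src = [h(f"part-{i}".encode()) for i in range(8)]
tgt = list(src)
tgt[5] = h(b"drifted partition")
print(tree_diff(src, tgt))  # [5]
```

Only the path from the root to partition 5 is expanded; the other seven leaves are never compared individually.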
The default sample rate is 1 % of partitions. For business-critical tables, customers can raise this to 5 % or 10 % at the cost of additional run time. Sample rate is configurable per table or globally:
```
data_diff:
  row_count_tolerance: 0.0
  sample_rate: 0.01
  hash_algorithm: xxhash64
  partition_parallelism: 8
  timeout_per_partition: 300
```
Query parity
Validates that production queries return the same results, within the same time envelope, on the source engine and the target engine. Run with a representative query set:
```
bifrost modernize validate \
  --type query-parity \
  --query-file benchmark_queries.sql \
  --source-engine hive \
  --target-engine trino
```
The tolerance for latency regression is 1.3x (queries on Trino can be up to 30 % slower than on Hive before triggering a warning). Value-match tolerance is 0 % by default.
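The tolerance logic amounts to a simple comparison. A minimal sketch (the function and its result labels are illustrative, not part of the Bifrost CLI):

```
def query_parity_verdict(src_seconds, tgt_seconds, results_match,
                         latency_tolerance=1.3):
    """Latency may regress up to 1.3x before a warning; results must match exactly."""
    if not results_match:
        return "FAIL"        # value-match tolerance is 0% by default
    if tgt_seconds > src_seconds * latency_tolerance:
        return "WARN"        # slower than the allowed envelope
    return "PASS"

print(query_parity_verdict(10.0, 12.5, True))    # PASS  (1.25x regression)
print(query_parity_verdict(10.0, 14.0, True))    # WARN  (1.4x regression)
print(query_parity_verdict(10.0, 9.0, False))    # FAIL  (result mismatch)
```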
Schema comparison
Validates column names, types, nullability, partition specs, and sort orders match between source and target. Catches subtle drift that would not show up in row-count parity alone.
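A minimal version of such a comparison, assuming a flat column list (real schemas also carry partition specs and sort orders, which this sketch omits):

```
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str
    type: str
    nullable: bool

def schema_drift(source_cols, target_cols):
    """Columns whose name, type, or nullability differ between source and target."""
    src = {c.name: c for c in source_cols}
    tgt = {c.name: c for c in target_cols}
    return sorted(
        name for name in src.keys() | tgt.keys()
        if src.get(name) != tgt.get(name)
    )

source = [Column("id", "bigint", False), Column("email", "varchar", True)]
target = [Column("id", "bigint", False), Column("email", "varchar", False)]
print(schema_drift(source, target))  # ['email'] -- nullability drifted
```

A row-count check would pass this table pair; the nullability drift only surfaces at the schema level.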
Decision Engine Gates
At the end of each phase, the decision engine runs a set of checks and returns one of three verdicts:
- PROCEED — every critical check passes. The next phase is allowed. Production gates typically still require explicit human approval.
- WARN — a non-critical check failed. Results are logged and sent to the notification channels, but progression is not blocked.
- ABORT — a critical check failed, or two or more abort triggers fired. The rollback mechanism for the current phase is invoked automatically, without human intervention.
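The verdict rules above collapse to a few lines (an illustration of the documented rules, not the decision engine's actual code):

```
def gate_verdict(critical_failures, warning_failures, abort_triggers):
    """Collapse check results into one of the three gate verdicts."""
    if critical_failures or len(abort_triggers) >= 2:
        return "ABORT"   # rollback for the current phase is invoked automatically
    if warning_failures:
        return "WARN"    # logged and notified; progression not blocked
    return "PROCEED"     # next phase allowed (prod gates may still need approval)

print(gate_verdict([], [], []))                           # PROCEED
print(gate_verdict([], ["unhealthy YARN node"], []))      # WARN
print(gate_verdict([], [], ["NameNode not active",
                            "safe mode not exited"]))     # ABORT
```

Note the asymmetry: a single critical failure is enough to abort, but abort triggers must fire in pairs.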
Classic gates
Classic migration enforces the following critical checks at its phase gates:
| Check | Effect if failed |
|---|---|
| HDFS fsck reports no corrupt blocks | ABORT |
| HDFS fsck reports no missing blocks | ABORT |
| Live DataNodes at least 95 % of expected | ABORT |
| HBase reports no dead region servers | ABORT |
| Hive Metastore table count matches baseline | ABORT |
| Policy database count matches baseline | ABORT |
| Kerberos authentication succeeds | ABORT |
| TLS handshake succeeds | ABORT |
Warning checks (non-blocking):
- HDFS under-replicated blocks less than 1000.
- YARN reports no unhealthy nodes.
- Kafka reports no under-replicated partitions.
- TeraSort duration no more than 20 % slower than baseline.
Abort triggers (two or more cause automatic rollback):
- NameNode not active by the gate time.
- Safe mode not exited by the gate time.
- Two or more critical check failures.
- Any DataNode data loss detected.
Modernize and Direct gates
Critical checks at Modernize and Direct gates:
| Check | Threshold |
|---|---|
| Table row-count parity | Must match exactly |
| Data-diff sample pass rate | > 99.99 % |
| Iceberg snapshot consistent | Yes |
| Trino query parity ratio | < 1.3x latency regression |
| Object storage cluster health | HEALTH_OK |
| Spark job completion rate | >= 99 % |
Abort triggers:
- Any table row-count mismatch.
- Object storage cluster in HEALTH_ERR.
- Catalog service unreachable.
- Trino coordinator in crash loop.
Wave-level failure semantics
During Modernize and Direct wave execution, table failures are handled at the table level, not the wave level:
- The failed table is quarantined (marked FAILED in the migration record).
- The wave continues; other tables proceed normally.
- A wave is marked COMPLETE only when every non-quarantined table passes validation.
- Quarantined tables do not block the wave, but they must be resolved before the cluster can be decommissioned.
- Retry policy is 3 attempts with exponential back-off: 5 minutes, then 15 minutes, then 45 minutes.
- After 3 failures, the table requires manual intervention.
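The retry policy can be sketched as follows. Whether "3 attempts" counts the initial try is an interpretation; this sketch runs one initial attempt plus three delayed retries, and the function names are illustrative:

```
import time

RETRY_DELAYS_MINUTES = [5, 15, 45]   # exponential back-off between attempts

def migrate_with_retry(table, migrate_fn, sleep=time.sleep):
    """Run a table migration with retries; quarantine after the final failure."""
    for attempt in range(len(RETRY_DELAYS_MINUTES) + 1):
        try:
            return migrate_fn(table)
        except Exception:
            if attempt == len(RETRY_DELAYS_MINUTES):
                return "QUARANTINED"   # manual intervention required
            sleep(RETRY_DELAYS_MINUTES[attempt] * 60)

waits = []
def always_fails(table):
    raise RuntimeError("validation failed")

print(migrate_with_retry("db.orders", always_fails, sleep=waits.append))
# QUARANTINED
print(waits)  # [300, 900, 2700] seconds slept between attempts
```

Injecting the sleep function keeps the sketch testable without waiting an hour.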
Rollback — Path 1 (Classic)
Every phase is reversible until finalize. The rollback window remains open from the start of the migration through the pre-finalization soak and closes only when bifrost classic finalize --confirm-irreversible runs. Rollback reuses the same decision engine as forward migration, so the cluster is only considered healthy once post-rollback validation passes.
Four rollback windows exist, in stage order:
| Stage | When it applies | Mechanism | Typical duration |
|---|---|---|---|
| After package swap | Target distribution installed, services not yet started | Remove target packages, restore source distribution from local cache, restart cluster-manager agent | ~2 hours |
| After services started | Services running on target distribution, post-migration validation failed | Stop services, run HDFS rollback, restore HMS and policy databases from pg_dump, reinstall source distribution, restart via cluster manager | ~4 hours |
| After pre-finalization monitoring (5-day soak) | Soak period in progress; issue surfaces after initial validation passed | Same mechanism as "after services started" — the window stays open for the full soak | ~4 hours |
| After finalization | bifrost classic finalize has run | Not reversible. hdfs dfsadmin -finalizeUpgrade has deleted the previous/ directory; the on-disk predecessor blocks are gone. | N/A |
Bifrost waits at least 5 business days before finalizing specifically so operators can still choose not to finalize if the migrated cluster misbehaves during the soak. Finalize is a deliberate, manual step with a required --confirm-irreversible flag.
After package swap (services not started)
The fastest rollback window. The source-distribution packages are still in the local cache from the backup phase, and nothing has started on the target distribution yet.
- Remove target-distribution packages.
- Restore source-distribution packages from the on-node cache.
- Restart the cluster-manager agent.
- Services come up on the source distribution as if the swap had not happened.
After services started (validation failed)
Services already started on the target distribution and post-migration validation did not pass.
- Stop target-distribution services.
- Revert HDFS. For shrink-and-grow, run hdfs namenode -rollingUpgrade rollback. For stop-and-swap, restart the NameNode with the -rollback startup argument.
- Restore the Hive Metastore database from pg_dump.
- Restore the policy database from pg_dump.
- Reinstall the source-distribution packages from the local cache.
- Restart services via the cluster manager.
After pre-finalization monitoring (5-day soak)
Validation already passed but a production issue surfaces during the post-migration soak (typically 5 business days). The rollback window remains open for the entire soak and uses the same mechanism as "after services started". This is the window customers run against most often in practice, because real operational problems often appear only hours or days after cutover.
After finalization
Not reversible. Once bifrost classic finalize has run, hdfs dfsadmin -finalizeUpgrade has removed the previous/ directory on the NameNode and every DataNode. Neither the NameNode -rollback startup argument nor -rollingUpgrade rollback is available — both depend on previous/ being present. The cluster-manager database and HMS/policy database backups Bifrost captured during the backup phase have also been removed.
This is the exact reason Bifrost requires the 5-day soak before offering the finalize command. Skipping or shortening the soak trades operational safety for cleanup time.
Partial rollback — shrink-and-grow variant
In shrink-and-grow runs, some DataNodes migrate while others remain on the source distribution. Any of the four stages above can be hit with only part of the cluster on the target distribution; the revert is scoped to the migrated nodes:
- Decommission the migrated nodes.
- Wait for HDFS replication to complete.
- Revert those nodes to the source distribution using the appropriate stage-specific mechanism.
- Recommission.
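Step 2 of the sequence above (waiting out re-replication) can be automated by polling fsck output. Scraping the human-readable summary line is an assumption made for this sketch; a production check would read the NameNode's JMX metrics instead:

```
import re
import subprocess
import time

def wait_for_replication(poll_seconds=60, run=subprocess.run, sleep=time.sleep):
    """Block until `hdfs fsck /` reports zero under-replicated blocks."""
    while True:
        out = run(["hdfs", "fsck", "/"], capture_output=True, text=True).stdout
        match = re.search(r"Under-replicated blocks:\s+(\d+)", out)
        if match and int(match.group(1)) == 0:
            return
        sleep(poll_seconds)
```

The `run` and `sleep` parameters are injectable so the loop can be exercised without a live cluster.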
Rollback command
```
# Roll the whole cluster back to a specific phase
bifrost classic rollback --cluster PROD01 --to-phase backup

# Roll a single node back (shrink-and-grow)
bifrost classic rollback --cluster PROD01 --node dn-042.example.internal
```
Rollback — Paths 2 and 3 (Modernize / Direct)
Modernize and Direct rollbacks are fundamentally different from Classic rollbacks because the legacy environment is still running in parallel with the target during most of the migration.
Table-level rollback
Instantaneous. Iceberg metadata is swapped back; the table reverts to its pre-migration state in microseconds. Bifrost executes this automatically when validation fails, and the same revert can be triggered manually after the fact:
```
# Revert a single migrated table to its pre-migration state
bifrost modernize rollback --table production_db.customers

# Revert every table in a wave
bifrost modernize rollback --wave 3
```
Table redirection is disabled for the affected tables as part of the revert; queries resume against the legacy catalog until the table is migrated again.
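The revert is fast because an Iceberg table is ultimately a catalog pointer to a metadata file, and rolling back swaps that pointer. A toy illustration of the idea (this class is hypothetical, not the Iceberg or Bifrost API):

```
class ToyCatalog:
    """Toy stand-in: a catalog maps each table to its current metadata file."""
    def __init__(self):
        self.pointer = {}   # table -> current metadata location
        self.history = {}   # table -> prior locations, newest last

    def commit(self, table, metadata_location):
        if table in self.pointer:
            self.history.setdefault(table, []).append(self.pointer[table])
        self.pointer[table] = metadata_location

    def rollback(self, table):
        # an O(1) pointer swap: no data files are copied or deleted
        self.pointer[table] = self.history[table].pop()

cat = ToyCatalog()
cat.commit("production_db.customers", "s3://warehouse/meta/v1.metadata.json")
cat.commit("production_db.customers", "s3://warehouse/meta/v2.metadata.json")
cat.rollback("production_db.customers")
print(cat.pointer["production_db.customers"])
# s3://warehouse/meta/v1.metadata.json
```

Because only the pointer changes, the revert cost is independent of table size.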
Service-level rollback
Service rollbacks use Helm rollback. The previous Helm release of any Kubernetes service (Trino, Polaris, Spark Operator, Airflow, and so on) can be reinstated with a single helm rollback command.
Storage-level rollback
HDFS data is not deleted during Modernize; it remains intact until explicit decommission. During the dual-read bridge phase, the legacy path is always available as a fallback. Only bifrost modernize decommission (after the silence period) makes the legacy storage unavailable.
Per-phase rollback matrix (Modernize and Direct)
| Phase | Failure symptom | Revert command | State implication | Time to recover |
|---|---|---|---|---|
| land fails part-way | Helm release partially applied | helm rollback <release> per affected component; re-run bifrost modernize land --status | Target platform returns to the previous Helm release; no source data is touched | Minutes per component |
| bridge fails | Trino table redirection misconfigured, or DistCp warm-sync loop erroring | bifrost modernize bridge --disable issues a Helm upgrade of the Trino catalog config to drop the redirection rules; a Trino coordinator reload follows. Fix the config and re-run bridge. | Queries fall back to the legacy catalog only | 5 to 15 minutes (Helm upgrade + coordinator reload) |
| migrate-table fails validation | Data-diff or query-parity below threshold | Automatic: Bifrost reverts via rollback --table <name>. Manual: same command | Table redirection disabled; legacy table is authoritative again | Microseconds (Iceberg metadata swap) |
| migrate-wave partial failure | Some tables passed, some failed | Quarantined tables auto-revert; passed tables remain migrated | Wave marked INCOMPLETE until quarantined tables resolve | Per quarantined table |
| convert-workflow produces broken DAG | Airflow DAG import error, or runtime failure on cutover | Pause the Airflow DAG; keep the Oozie workflow active on the legacy cluster; re-run the converter with corrected rulesets | No production impact — legacy workflow still authoritative | Minutes to hours |
| hue-import partial | Some queries or dashboards failed to import | Inspect the hue-import migration report; re-run for specific documents with --user-mapping-file fixes | Legacy HUE still available | Minutes |
| decommission refused | Bifrost detects residual access during the silence period | Investigate the access source (reported in the decommission log); re-run decommission --dry-run after remediation | No state change (decommission never executed) | N/A |
All revert operations are recorded in the migration ledger and surfaced through bifrost modernize status.
Irreversible Finalize
Every migration has a single, irreversible final step that removes rollback assets and ends the ability to revert.
Classic finalize
```
bifrost classic finalize --cluster PROD01 --confirm-irreversible
```
Finalize removes:
- The source distribution package cache on every node.
- LVM snapshots of NameNode metadata volumes.
- Baseline captures and backup databases.
- Temporary rollback keytabs and certificates.
Bifrost recommends a 5-business-day soak of clean operation before running finalize. The --confirm-irreversible flag is required and cannot be bypassed.
Modernize and Direct decommission
Modernize and Direct do not have a single "finalize" step. Instead, each legacy service is decommissioned individually after its silence period passes:
```
# Decommission HDFS after 30 days of confirmed silence
bifrost modernize decommission \
  --service hdfs \
  --cluster PROD01 \
  --after-silence 30d
```
The final irreversible step for a Direct migration is the Cloudera Manager shutdown:
```
bifrost direct decommission \
  --service cloudera-manager \
  --cm-host cm.example.internal \
  --confirm-irreversible
```
This step ends the Cloudera subscription requirement and cannot be reversed.
Summary: Verdict, Scope, and Rollback
The following table captures the decision engine's granularity across paths:
| Scope | Verdict applies to | Rollback scope |
|---|---|---|
| Phase gate (Classic) | Entire cluster | Entire cluster |
| Table migration (Modernize / Direct) | Single table | Single table (Iceberg metadata swap) |
| Wave validation (Modernize / Direct) | All tables in the wave | Per-table (only failed tables revert) |
| Storage migration (Modernize / Direct) | DistCp job | Re-run from last checkpoint |
| Workflow conversion (Modernize / Direct) | Single workflow | No rollback needed (re-run converter) |
Next Steps
- Operations — monitoring, capacity planning, and the production readiness checklist.
- Troubleshooting — common issues and their resolutions.
- CLI reference — every command, every flag, every option.