
Hadoop Migration Validation and Rollback Procedures

Every Bifrost migration is designed to be safe, repeatable, and reversible. This page describes the mechanisms that make that possible: the validation framework that confirms correctness at every step, the decision engine that turns validation results into go-or-no-go verdicts, and the rollback procedures available at each phase.

Validation Framework

Bifrost validates at multiple levels. The appropriate set of checks runs automatically for the current phase; nothing is skipped silently.

Row-count parity

For every migrated table, Bifrost compares source and target row counts per partition. The default tolerance is 0 % for immutable tables; for append-only or streaming tables the tolerance is configurable.
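
The per-partition comparison can be sketched as follows. The dict-of-counts representation and the `row_count_parity` name are illustrative stand-ins, not Bifrost's actual API:

```python
def row_count_parity(src_counts, dst_counts, tolerance=0.0):
    """Return the partitions whose row counts diverge beyond tolerance.

    `tolerance` is a fraction of the source count: 0.0 means an exact
    match is required (the default for immutable tables).
    """
    mismatched = []
    for part, src in sorted(src_counts.items()):
        dst = dst_counts.get(part, 0)
        if abs(src - dst) > src * tolerance:
            mismatched.append(part)
    return mismatched
```

A non-zero tolerance, as configured for append-only or streaming tables, simply widens the per-partition band before a partition is reported.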

Value-level data diff

Row-count parity alone does not catch value drift. Bifrost uses hash-tree sampling for value-level validation:

  • Partitions are hashed in a binary tree.
  • Only divergent branches are expanded.
  • Verification cost is O(log n) rather than O(n) for a full-table scan.
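
The pruning behavior can be illustrated with a minimal hash tree. This sketch uses SHA-256 from the standard library in place of Bifrost's default xxhash64, and the function names are hypothetical:

```python
import hashlib

def leaf_hash(data: bytes) -> str:
    # Stand-in for the per-partition hash (Bifrost defaults to xxhash64).
    return hashlib.sha256(data).hexdigest()

def build_tree(leaves):
    """Build a binary hash tree; each level hashes pairs of children."""
    level = list(leaves)
    tree = [level]
    while len(level) > 1:
        level = [
            leaf_hash((level[i] + (level[i + 1] if i + 1 < len(level) else level[i])).encode())
            for i in range(0, len(level), 2)
        ]
        tree.append(level)
    return tree  # tree[-1][0] is the root hash

def diff(src_tree, dst_tree, level=None, idx=0):
    """Expand only divergent branches; identical subtrees are pruned."""
    if level is None:
        level = len(src_tree) - 1          # start at the root
    if src_tree[level][idx] == dst_tree[level][idx]:
        return []                          # subtree identical: prune
    if level == 0:
        return [idx]                       # divergent leaf partition
    out = []
    for child in (2 * idx, 2 * idx + 1):
        if child < len(src_tree[level - 1]):
            out += diff(src_tree, dst_tree, level - 1, child)
    return out
```

With one corrupted partition out of eight, `diff` touches only the path from the root to that leaf, which is where the O(log n) cost comes from.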

The default sample rate is 1 % of partitions. For business-critical tables, customers can raise this to 5 % or 10 % at the cost of additional run time. Sample rate is configurable per table or globally:

data_diff:
  row_count_tolerance: 0.0
  sample_rate: 0.01
  hash_algorithm: xxhash64
  partition_parallelism: 8
  timeout_per_partition: 300

Query parity

Validates that production queries return the same results, within the same time envelope, on the source engine and the target engine. Run with a representative query set:

bifrost modernize validate \
  --type query-parity \
  --query-file benchmark_queries.sql \
  --source-engine hive \
  --target-engine trino

The tolerance for latency regression is 1.3x (queries on Trino can be up to 30 % slower than on Hive before triggering a warning). Value-match tolerance is 0 % by default.
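
The threshold logic amounts to a two-part check per query. This is a simplified sketch of the rules stated above, not the `bifrost` CLI's internals, and the function name is hypothetical:

```python
def query_parity_verdict(source_result, target_result,
                         source_ms, target_ms, latency_tolerance=1.3):
    """Classify one benchmark query against both thresholds.

    Results must match exactly (0 % value tolerance); the target engine
    may be up to `latency_tolerance`x slower before a warning fires.
    """
    if source_result != target_result:
        return "FAIL"                  # value mismatch is always critical
    if target_ms > source_ms * latency_tolerance:
        return "WARN"                  # outside the 1.3x latency envelope
    return "PASS"
```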

Schema comparison

Validates column names, types, nullability, partition specs, and sort orders match between source and target. Catches subtle drift that would not show up in row-count parity alone.
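
In essence this is a set comparison over column definitions. The dict-of-tuples schema model below is a stand-in for whatever Bifrost stores internally:

```python
def schema_drift(src, dst):
    """List columns whose definition differs between source and target.

    Schemas are modeled as dicts of column -> (type, nullable). A column
    present on only one side, or differing in type or nullability, is
    reported as drift.
    """
    return sorted(col for col in src.keys() | dst.keys()
                  if src.get(col) != dst.get(col))
```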

Decision Engine Gates

At the end of each phase, the decision engine runs a set of checks and returns one of three verdicts:

  • PROCEED — every critical check passes. The next phase is allowed. Production gates typically still require explicit human approval.
  • WARN — a non-critical check failed. Results are logged and sent to the notification channels, but progression is not blocked.
  • ABORT — a critical check failed, or two or more abort triggers fired. The rollback mechanism for the current phase is invoked automatically, without human intervention.
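
The three verdicts collapse to a small decision rule. A simplified model of the rules above, not the decision engine's actual implementation:

```python
def gate_verdict(critical_failures, warning_failures, abort_triggers):
    """Collapse a phase gate's check results into one verdict.

    Any critical failure, or two or more abort triggers, aborts (and
    invokes rollback); warnings alone never block progression.
    """
    if critical_failures >= 1 or abort_triggers >= 2:
        return "ABORT"
    if warning_failures >= 1:
        return "WARN"
    return "PROCEED"
```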

Classic gates

Classic migration enforces the following critical checks at its phase gates:

Check | Effect if failed
HDFS fsck reports no corrupt blocks | ABORT
HDFS fsck reports no missing blocks | ABORT
Live DataNodes at least 95 % of expected | ABORT
HBase reports no dead region servers | ABORT
Hive Metastore table count matches baseline | ABORT
Policy database count matches baseline | ABORT
Kerberos authentication succeeds | ABORT
TLS handshake succeeds | ABORT

Warning checks (non-blocking):

  • HDFS under-replicated blocks less than 1000.
  • YARN reports no unhealthy nodes.
  • Kafka reports no under-replicated partitions.
  • TeraSort duration no more than 20 % slower than baseline.

Abort triggers (two or more cause automatic rollback):

  • NameNode not active by the gate time.
  • Safe mode not exited by the gate time.
  • Two or more critical check failures.
  • Any DataNode data loss detected.

Modernize and Direct gates

Critical checks at Modernize and Direct gates:

Check | Threshold
Table row-count parity | Must match exactly
Data-diff sample pass rate | > 99.99 %
Iceberg snapshot consistent | Yes
Trino query parity ratio | < 1.3x latency regression
Object storage cluster health | HEALTH_OK
Spark job completion rate | >= 99 %

Abort triggers:

  • Any table row-count mismatch.
  • Object storage cluster in HEALTH_ERR.
  • Catalog service unreachable.
  • Trino coordinator in crash loop.

Wave-level failure semantics

During Modernize and Direct wave execution, table failures are handled at the table level, not the wave level:

  1. The failed table is quarantined (marked FAILED in the migration record).
  2. The wave continues; other tables proceed normally.
  3. A wave is marked COMPLETE only when every non-quarantined table passes validation.
  4. Quarantined tables do not block the wave, but must be resolved before the cluster can be decommissioned.
  5. Retry policy is 3 attempts with exponential back-off: 5 minutes, then 15 minutes, then 45 minutes.
  6. After 3 failures, the table requires manual intervention.
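
The retry schedule can be sketched as follows. `migrate_fn` and the injectable `sleep` are stand-ins so the policy can be exercised without real waits; how Bifrost schedules the retries internally is not public:

```python
import time

BACKOFF_MINUTES = (5, 15, 45)   # documented back-off schedule

def retry_failed_table(migrate_fn, table, sleep=time.sleep):
    """Retry a failed table migration up to 3 times with back-off.

    Waits 5, then 15, then 45 minutes before each attempt; after the
    third failure the table stays quarantined for manual intervention.
    """
    for delay_min in BACKOFF_MINUTES:
        sleep(delay_min * 60)
        if migrate_fn(table):
            return "MIGRATED"
    return "FAILED"   # quarantined: manual intervention required
```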

Rollback — Path 1 (Classic)

Every phase is reversible until finalize. The rollback window remains open from the start of the migration through the pre-finalization soak and closes only when bifrost classic finalize --confirm-irreversible runs. Rollback reuses the same decision engine as forward migration, so the cluster is only considered healthy once post-rollback validation passes.

Four rollback windows exist, in stage order:

Stage | When it applies | Mechanism | Typical duration
After package swap | Target distribution installed, services not yet started | Remove target packages, restore source distribution from local cache, restart cluster-manager agent | ~2 hours
After services started | Services running on target distribution, post-migration validation failed | Stop services, run HDFS rollback, restore HMS and policy databases from pg_dump, reinstall source distribution, restart via cluster manager | ~4 hours
After pre-finalization monitoring (5-day soak) | Soak period in progress; issue surfaces after initial validation passed | Same mechanism as "after services started" — the window stays open for the full soak | ~4 hours
After finalization | bifrost classic finalize has run | Not reversible. hdfs dfsadmin -finalizeUpgrade has deleted the previous/ directory; the on-disk predecessor blocks are gone. | N/A

Bifrost waits at least 5 business days before finalizing specifically so operators can still choose not to finalize if the migrated cluster misbehaves during the soak. Finalize is a deliberate, manual step with a required --confirm-irreversible flag.

After package swap (services not started)

The fastest rollback window. The source distribution packages are still in the local cache from the backup phase; nothing started on the target distribution yet.

  • Remove target-distribution packages.
  • Restore source-distribution packages from the on-node cache.
  • Restart the cluster-manager agent.
  • Services come up on the source distribution as if the swap had not happened.

After services started (validation failed)

Services already started on the target distribution and post-migration validation did not pass.

  • Stop target-distribution services.
  • Revert HDFS. For shrink-and-grow, run hdfs namenode -rollingUpgrade rollback. For stop-and-swap, restart the NameNode with the -rollback startup argument.
  • Restore the Hive Metastore database from pg_dump.
  • Restore the policy database from pg_dump.
  • Reinstall the source-distribution packages from the local cache.
  • Restart services via the cluster manager.

After pre-finalization monitoring (5-day soak)

Validation already passed but a production issue surfaces during the post-migration soak (typically 5 business days). The rollback window remains open for the entire soak and uses the same mechanism as "after services started". This is the window customers run against most often in practice, because real operational problems often appear only hours or days after cutover.

After finalization

Not reversible. Once bifrost classic finalize has run, hdfs dfsadmin -finalizeUpgrade has removed the previous/ directory on the NameNode and every DataNode. Neither the NameNode -rollback startup argument nor -rollingUpgrade rollback is available — both depend on previous/ being present. The cluster-manager database and HMS/policy database backups Bifrost captured during the backup phase have also been removed.

This is the exact reason Bifrost requires the 5-day soak before offering the finalize command. Skipping or shortening the soak trades operational safety for cleanup time.

Partial rollback — shrink-and-grow variant

In shrink-and-grow runs, some DataNodes migrate while others remain on the source distribution. Any of the four stages above can be hit with only part of the cluster on the target distribution; the revert is scoped to the migrated nodes:

  • Decommission the migrated nodes.
  • Wait for HDFS replication to complete.
  • Revert those nodes to the source distribution using the appropriate stage-specific mechanism.
  • Recommission.

Rollback command

# Roll the whole cluster back to a specific phase
bifrost classic rollback --cluster PROD01 --to-phase backup

# Roll a single node back (shrink-and-grow)
bifrost classic rollback --cluster PROD01 --node dn-042.example.internal

Rollback — Paths 2 and 3 (Modernize / Direct)

Modernize and Direct rollbacks are fundamentally different from Classic rollbacks because the legacy environment is still running in parallel with the target during most of the migration.

Table-level rollback

Instantaneous. Iceberg metadata is swapped back; the table reverts to its pre-migration state in microseconds. Bifrost executes this automatically when validation fails, and the same revert can be triggered manually after the fact:

# Revert a single migrated table to its pre-migration state
bifrost modernize rollback --table production_db.customers

# Revert every table in a wave
bifrost modernize rollback --wave 3

Table redirection is disabled for the affected tables as part of the revert; queries resume against the legacy catalog until the table is migrated again.
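
Why is this near-instant? Because the revert is a pointer swap, not a data copy. In this sketch, `catalog` and `redirects` are hypothetical stand-ins for the real Iceberg catalog entry and the Trino redirection config:

```python
def rollback_table(catalog, redirects, table):
    """Illustrate table rollback as a metadata-pointer swap.

    `catalog` maps table -> {"current": ..., "pre_migration": ...}
    (metadata file locations); `redirects` flags which tables route to
    the new engine. No data files move, which is why the revert
    completes in microseconds.
    """
    entry = catalog[table]
    entry["current"] = entry["pre_migration"]   # metadata pointer swap
    redirects[table] = False                    # queries hit legacy catalog
```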

Service-level rollback

Service rollbacks use Helm rollback. The previous Helm release of any Kubernetes service (Trino, Polaris, Spark Operator, Airflow, and so on) can be reinstated with a single helm rollback command.

Storage-level rollback

HDFS data is not deleted during Modernize; it remains intact until explicit decommission. During the dual-read bridge phase, the legacy path is always available as a fallback. Only bifrost modernize decommission (after the silence period) makes the legacy storage unavailable.

Per-phase rollback matrix (Modernize and Direct)

Phase | Failure symptom | Revert command | State implication | Time to recover
land | Fails part-way; Helm release partially applied | helm rollback <release> per component that rolled; re-run bifrost modernize land --status | Target platform returns to the previous Helm release; no source data is touched | Minutes per component
bridge | Trino table redirection misconfigured, or DistCp warm-sync loop erroring | bifrost modernize bridge --disable issues a Helm upgrade of the Trino catalog config to drop the redirection rules; a Trino coordinator reload follows. Fix the config and re-run bridge. | Queries fall back to the legacy catalog only | 5 to 15 minutes (Helm upgrade + coordinator reload)
migrate-table | Fails validation: data-diff or query-parity below threshold | Automatic: Bifrost reverts via rollback --table <name>. Manual: same command | Table redirection disabled; legacy table is authoritative again | Microseconds (Iceberg metadata swap)
migrate-wave | Partial failure: some tables passed, some failed | Quarantined tables auto-revert; passed tables remain migrated | Wave marked INCOMPLETE until quarantined tables resolve | Per quarantined table
convert-workflow | Produces a broken DAG: Airflow DAG import error, or runtime failure on cutover | Pause the Airflow DAG; keep Oozie workflow active on legacy cluster; re-run converter with corrected rulesets | No production impact — legacy workflow still authoritative | Minutes to hours
hue-import | Partial import: some queries or dashboards failed to import | Inspect hue-import migration report; re-run for specific documents with --user-mapping-file fixes | Legacy HUE still available | Minutes
decommission | Refused: Bifrost detects residual access during the silence period | Investigate the access source (reported in decommission log); re-run decommission --dry-run after remediation | No state change (decommission never executed) | N/A

All revert operations are recorded in the migration ledger and surfaced through bifrost modernize status.

Irreversible Finalize

Every migration has a single, irreversible final step that removes rollback assets and ends the ability to revert.

Classic finalize

bifrost classic finalize --cluster PROD01 --confirm-irreversible

Finalize removes:

  • The source distribution package cache on every node.
  • LVM snapshots of NameNode metadata volumes.
  • Baseline captures and backup databases.
  • Temporary rollback keytabs and certificates.

Bifrost recommends a 5-business-day soak of clean operation before running finalize. The --confirm-irreversible flag is required and cannot be bypassed.

Modernize and Direct decommission

Modernize and Direct do not have a single "finalize" step. Instead, each legacy service is decommissioned individually after its silence period passes:

# Decommission HDFS after 30 days of confirmed silence
bifrost modernize decommission \
  --service hdfs \
  --cluster PROD01 \
  --after-silence 30d

The final irreversible step for a Direct migration is the Cloudera Manager shutdown:

bifrost direct decommission \
  --service cloudera-manager \
  --cm-host cm.example.internal \
  --confirm-irreversible

This step ends the Cloudera subscription requirement and cannot be reversed.

Summary: Verdict, Scope, and Rollback

The following table captures the decision engine's granularity across paths:

Scope | Verdict applies to | Rollback scope
Phase gate (Classic) | Entire cluster | Entire cluster
Table migration (Modernize / Direct) | Single table | Single table (Iceberg metadata swap)
Wave validation (Modernize / Direct) | All tables in the wave | Per-table (only failed tables revert)
Storage migration (Modernize / Direct) | DistCp job | Re-run from last checkpoint
Workflow conversion (Modernize / Direct) | Single workflow | No rollback needed (re-run converter)

Next steps