Hadoop to Kubernetes Migration Procedure (Wave-Based)

Modernize replaces each Hadoop component with a Kubernetes-native equivalent on the Ilum platform. Unlike Classic (weekend cutover), Modernize is a multi-month, phased program that uses Apache Iceberg as the bridge format and the strangler-facade pattern for zero downtime. Tables migrate one at a time; revert at the table level is an atomic metadata operation that completes in microseconds.

When to Use Modernize

Modernize is the right path when:

  • The estate already runs open-source Hadoop (any distribution) and is ready to move to Kubernetes.
  • A prior Classic replatform has been completed and the next modernization step is planned.
  • An existing Kubernetes footprint is in place and data workloads can be consolidated onto it.
  • The goal is to arrive at a Kubernetes-native lakehouse with Apache Iceberg, Trino, Airflow, and object storage.

Modernize is appropriate for programs with a 2- to 5-month horizon and a phased modernization budget.

Key Principles

Modernize is designed around five principles. Each principle has a direct impact on how the program is planned and executed.

Apache Iceberg as the bridge format

Every Hive table is first converted to Iceberg before any storage or compute migration. Because Iceberg is engine-agnostic, legacy engines (Hive, Spark-on-YARN) and modern engines (Trino, Spark-on-Kubernetes) can read the same tables simultaneously.

Strangler facade for zero downtime

Trino table redirection routes consumers to the correct catalog per table. BI tools, dbt models, and application queries continue to work unchanged while the underlying storage moves from HDFS to object storage.

Table-by-table with instantaneous revert

Tables migrate independently. If validation fails for one table, the revert is an atomic Iceberg metadata swap that completes in microseconds, not a multi-hour data copy.

Dual compatibility via Ilum

Ilum runs Spark on both YARN (through its Livy-compatible proxy) and Kubernetes. This allows step-by-step workload migration without breaking existing job-submission clients.

Honest automation ceilings

Every migration leg has a documented automation ceiling. Bifrost communicates exactly what it can automate and what requires human judgment. See Overview — automation ceilings.

Five-Phase Playbook

Modernize executes as five phases. Each phase can be run iteratively, re-entered after correction, and extended as the estate grows.

Phase 0 — Discover and score

Auto-inventory the estate. Score tables and jobs by criticality, complexity, and freshness. Produce the phased migration plan with wave assignments and TCO projections.

Command: bifrost modernize discover. Output tree: see Getting started — first discovery run.

Phase 1 — Land the target

Deploy the complete Kubernetes platform stack. Validate that every component is healthy. The target topology is described in Target platform architecture.

Command: bifrost modernize land.

Phase 1.5 — Validate-pre (platform pre-flight)

Before configuring the dual-read bridge, run the platform pre-flight to confirm the target is healthy enough to accept migration traffic. Non-destructive; safe to re-run as often as needed.

Checks include:

  • Object storage cluster in HEALTH_OK.
  • Iceberg REST catalog reachable and accepting OAuth2 client_credentials.
  • Trino interactive and ETL clusters healthy.
  • Airflow scheduler heartbeat current (informational; gates Phase 4).
  • OIDC flow via authorization-code server flow and client_credentials.
  • OPA sidecar reachable, with the Bifrost-supplied test-policy bundle returning the expected verdict.
  • Kubernetes pods able to reach legacy HDFS.
  • Container registry reachable.

Command: bifrost modernize validate-pre. See CLI reference.
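The gating logic of the pre-flight can be sketched as follows. This is a minimal illustration, not Bifrost's implementation; the check names are invented, and the Airflow heartbeat is treated as informational (it gates Phase 4, not the bridge), as described above:

```python
def run_preflight(checks):
    """Run each named pre-flight check; non-destructive, so safe to re-run.

    `checks` maps a check name to a zero-argument callable returning True/False.
    The Airflow heartbeat check is reported but does not gate readiness.
    """
    results = {name: bool(fn()) for name, fn in checks.items()}
    gating = {n: ok for n, ok in results.items() if n != "airflow_heartbeat"}
    return {"results": results, "ready": all(gating.values())}
```

A failed run simply reports which checks failed; operators fix the target and re-run until `ready` is true.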

Phase 2 — Dual-read bridge

Configure Trino with both the legacy catalog and the Iceberg REST catalog. Enable table redirection. Start the DistCp warm-sync. This is the point where the strangler facade becomes active; from here on, every migrated table is transparent to its consumers.

Command: bifrost modernize bridge.

Phase 3 — Table-by-table migration

For each wave:

  1. Convert tables to Iceberg using the chosen strategy (snapshot, migrate, or add_files).
  2. Migrate data to object storage if the strategy requires it.
  3. Validate via data-diff (row-count parity and value-level sampling).
  4. Enable table redirection once validation passes.
  5. Preserve the legacy copy for instantaneous revert during the silence period.

Commands: bifrost modernize migrate-table and bifrost modernize migrate-wave.
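Step 3's data-diff validation combines row-count parity with value-level sampling. A toy version, with an invented row shape (dicts keyed by primary key), might look like:

```python
import random

def data_diff(legacy_rows, iceberg_rows, sample_size=100, seed=0):
    """Validate a migrated table: row-count parity plus value-level sampling.

    Both inputs map primary key -> row; the shape is illustrative only.
    """
    if len(legacy_rows) != len(iceberg_rows):
        return {"pass": False, "reason": "row-count mismatch"}
    # Deterministic sample of keys for value-level comparison
    keys = sorted(legacy_rows)
    sample = random.Random(seed).sample(keys, min(sample_size, len(keys)))
    mismatched = [k for k in sample if legacy_rows[k] != iceberg_rows.get(k)]
    return {"pass": not mismatched, "mismatched_keys": mismatched}
```

Only after this validation passes does redirection flip for the table (step 4), with the legacy copy retained for revert (step 5).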

Phase 4 — Workload conversion

Convert Oozie workflows to Airflow 3 DAGs with dbt models. Translate Spark-on-YARN configurations to Spark-on-Kubernetes. HBase migration runs on a separate, slower track (see Component migrations — data store).

Commands: bifrost modernize convert-workflow and bifrost modernize hue-import.

Phase 5 — Decommission

Mark YARN and HDFS as read-only. Enforce a 30-day silence period (or longer, depending on change-management policy). Validate that no production reads or writes have touched the legacy services during the silence period. Decommission physical hardware or repurpose it as Kubernetes worker nodes.

Command: bifrost modernize decommission.

Wave-Based Migration

Waves are how Modernize scales. A wave is a group of tables that migrate together, with internal parallelism controlled by --parallel. Waves are ordered by criticality and complexity; early waves are quick wins, later waves are core business tables.

Wave scoring

Each table is scored on three dimensions during discovery:

  • Criticality — downstream lineage depth. A table with no downstream consumers is low-criticality; a fact table that every report depends on is high-criticality.
  • Complexity — migration difficulty. Parquet Hive tables with standard schemas are low-complexity. Text-format tables, Avro tables with schema evolution history, and tables with custom SerDes are higher-complexity. Tables whose sources are MapReduce or Pig are scored higher than Spark sources.
  • Freshness — last-access timestamp. Tables not accessed in the last 12 months are candidates for retirement rather than migration.

The migration plan (modernize/plans/<cluster>/migration_plan.yaml) assigns each table to a wave. The default is eight waves, which works well for estates up to a few thousand tables. Larger estates (5,000+ tables) typically raise this to 10 or 12 waves; the value is set at planning time via --waves.
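A toy version of this scoring and wave assignment follows; the weights, thresholds, and field names are illustrative, not Bifrost's actual model:

```python
from datetime import datetime, timedelta

def score_table(downstream_depth, fmt, source_engine, last_access):
    """Score one table on the three discovery dimensions (illustrative weights)."""
    criticality = min(downstream_depth, 10)          # lineage depth, capped
    complexity = {"parquet": 1, "orc": 1, "avro": 3, "text": 5}.get(fmt, 5)
    if source_engine in ("mapreduce", "pig"):
        complexity += 2                              # harder than Spark sources
    # Not accessed in 12 months -> retirement candidate, not migration
    stale = datetime.now() - last_access > timedelta(days=365)
    return {"criticality": criticality, "complexity": complexity, "retire": stale}

def assign_wave(score, waves=8):
    """Low-criticality, low-complexity tables land in early waves (quick wins)."""
    rank = score["criticality"] + score["complexity"]
    return min(1 + rank // 3, waves)
```

A dimension table with no consumers lands in wave 1; a high-lineage text-format table sourced from Pig lands in a late wave or is flagged for retirement.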

Wave example

waves:
  - id: 1
    name: "Quick wins"
    tables:
      - name: production_db.dim_date
        strategy: snapshot
        priority: low
      - name: production_db.dim_currency
        strategy: snapshot
        priority: low
  - id: 2
    name: "Core dimensions"
    tables:
      - name: production_db.dim_customer
        strategy: migrate
        priority: high
      - name: production_db.dim_product
        strategy: migrate
        priority: high

Running a wave

# Migrate wave 3 with up to 8 tables in parallel
bifrost modernize migrate-wave --wave 3 --parallel 8

# Check wave status
bifrost modernize migrate-wave --wave 3 --status

When a table fails mid-wave, the table is quarantined (marked FAILED) and the wave continues with the remaining tables. A wave is marked COMPLETE only when every non-quarantined table passes validation. Quarantined tables do not block wave completion but must be resolved before the cluster can be decommissioned. Retry policy is 3 attempts with exponential back-off (5 min, 15 min, 45 min); after 3 failures, manual intervention is required.
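The retry policy can be sketched as follows, reading it as an initial attempt plus retries after 5, 15, and 45 minutes before the table is quarantined (an interpretation of the policy above; the function names are illustrative):

```python
import time

RETRY_DELAYS_MIN = (5, 15, 45)   # exponential back-off schedule

def migrate_with_retry(table, migrate_fn, sleep=time.sleep):
    """Run migrate_fn(table); on failure, retry per the back-off schedule.

    Returns "COMPLETE" on success, or "FAILED" (table quarantined, manual
    intervention required) once the schedule is exhausted.
    """
    delays = iter(RETRY_DELAYS_MIN)
    while True:
        try:
            migrate_fn(table)
            return "COMPLETE"
        except Exception:
            delay = next(delays, None)
            if delay is None:        # schedule exhausted: quarantine
                return "FAILED"
            sleep(delay * 60)        # 5, then 15, then 45 minutes
```

A quarantined table does not block its wave, but the wave's cluster cannot be decommissioned until the table is resolved.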

Table Migration Strategies

Three strategies are available. The right one depends on table size, format, and partitioning needs.

Snapshot

Creates an Iceberg table that reads the existing Hive files in place. No data is copied. Fully reversible.

  • Best for validation, quick wins, and tables that are already in Parquet or ORC.
  • Revert is instantaneous via Iceberg metadata swap.
  • Limitations: the Iceberg table points at the source files until migrate is run later. If the source is deleted, the Iceberg table breaks.

Migrate

Destructive in-place conversion. The original Hive table is renamed to <name>_BACKUP_ (suffix appended, matching the Iceberg migrate procedure) and a new Iceberg table takes its place.

  • Best for large, stable tables that have already been validated via snapshot.
  • Faster than snapshot-then-rewrite because metadata already points at the right files.
  • Limitations: blocked on non-Parquet, non-ORC, non-Avro formats (Text, RC, and SequenceFile must be converted first).

Add-files

Additive import into an existing Iceberg table. Best for re-partitioning during migration.

  • Best for tables that need to change partition scheme as part of the migration (for example, moving from per-day partitions to Iceberg hidden-partitioning on year(event_date)).
  • Use --partition-spec to express the new partitioning.

Example:

bifrost modernize migrate-table \
--table production_db.events \
--strategy add_files \
--partition-spec "year(event_date)"

Bulk catalog migration

For thousands of tables with no data copy required, bifrost modernize migrate-catalog moves table references between catalogs in a single command. Use this when the estate is large and most tables can take the default strategy.
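The strategy guidance above can be condensed into a small decision function. This is a sketch; the inputs are illustrative flags, not Bifrost options:

```python
def choose_strategy(fmt, validated, repartition):
    """Pick a per-table strategy following the guidance above.

    fmt         -- storage format of the source Hive table
    validated   -- table already validated via an earlier snapshot pass
    repartition -- table changes partition scheme during migration
    """
    if repartition:
        return "add_files"           # additive import with a new partition spec
    if fmt not in ("parquet", "orc", "avro"):
        return "convert-first"       # Text/RC/SequenceFile block migrate
    return "migrate" if validated else "snapshot"
```

Quick wins start as snapshots; once validated, large stable tables graduate to the destructive migrate path.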

Dual-Read Bridge

The dual-read bridge is the mechanism that makes Modernize safe to execute incrementally. It has three components:

  • Two Trino catalogs — one pointing at the legacy Hive Metastore, one pointing at the Iceberg REST catalog.
  • Table redirection — per-table configuration that routes queries to the correct catalog.
  • Periodic sync — DistCp warm-sync from HDFS to object storage at a configurable interval.

The legacy catalog remains authoritative until a table migrates. After migration and validation, table redirection flips for that table, and its queries transparently land on the Iceberg copy. The legacy table is preserved until the silence period elapses, so revert remains a metadata operation.
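In Trino terms, the two catalogs might be defined with properties along these lines. Hostnames and catalog names are illustrative; note that the Hive connector's hive.iceberg-catalog-name property is what redirects already-converted Iceberg tables to the other catalog:

```properties
# legacy.properties -- Hive connector pointing at the legacy metastore
connector.name=hive
hive.metastore.uri=thrift://legacy-metastore:9083
# Redirect tables already converted to Iceberg to the "lakehouse" catalog
hive.iceberg-catalog-name=lakehouse

# lakehouse.properties -- Iceberg connector against the REST catalog
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=https://iceberg-rest:8181
```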

See Target platform architecture — dual-read bridge for the topology diagram.

Silence Period and Decommission

Modernize does not decommission anything by default. Every migrated service stays available alongside its replacement for a silence period — by default 30 days — during which Bifrost verifies that no production reads or writes have touched the legacy service.

The silence period is enforced at decommission time:

bifrost modernize decommission \
--service hdfs \
--cluster PROD01 \
--after-silence 30d

If Bifrost detects access during the silence period, decommission is refused. This protects against forgotten clients, third-party integrations, and scheduled jobs that have not been migrated.
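The silence-period gate amounts to scanning access records for the legacy service inside the window. A minimal sketch, where the audit-log shape (a list of access timestamps) is assumed:

```python
from datetime import datetime, timedelta

def can_decommission(access_log, now, silence=timedelta(days=30)):
    """Refuse decommission if the legacy service was touched inside the window.

    access_log -- timestamps of reads/writes against the legacy service
    Returns (allowed, offending_accesses).
    """
    recent = [t for t in access_log if now - t < silence]
    return (False, recent) if recent else (True, [])
```

Any hit inside the window surfaces the forgotten client or unmigrated job, and the clock effectively restarts once it is dealt with.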

Status Dashboard

bifrost modernize status produces a text-mode snapshot of progress:

Bifrost Modernize Status: PROD01 -> production
==================================================

Storage Migration: ################.... 78% (179 TB / 230 TB)
Table Migration: ##############...... 68% (2,847 / 4,187 tables)
Workload Conversion: ########............ 42% (156 / 371 workflows)
Service Replacement: ##################.. 88% (7 / 8 services)

Current Wave: 5 of 8
Tables in Progress: 23
Tables Blocked: 2 (manual review required)

Active Issues:
WARN production_db.risk_scores: 22% slower on Trino (within tolerance)
TODO etl-risk-model coordinator: manual Airflow Dataset definition needed
BLOCK production_db.hbase_audit: HBase schema review pending
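The text bars in the snapshot are a straightforward proportional fill. For reference, a 20-character bar like the ones above can be produced with a helper along these lines (an approximation of the output format, not Bifrost code):

```python
def progress_bar(done, total, width=20):
    """Render a '#'/'.' progress bar plus a rounded percentage."""
    filled = round(done / total * width)
    pct = round(done / total * 100)
    return "#" * filled + "." * (width - filled), pct
```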

Parallel to the text output, Bifrost ships a set of Grafana dashboards that are deployed by bifrost modernize land:

  • Migration Progress — Tables migrated per wave, storage migrated, workloads converted, services replaced. Executive-level view.
  • Data Quality — Row-count parity per table, data-diff pass rates, schema drift alerts.
  • Object Storage Health — OSD latency heatmaps, pool utilization, recovery progress, IOPS.
  • Trino Performance — Query latency p50/p95/p99, queue depth, CPU and memory per cluster.
  • Spark Job Metrics — Job duration trends, executor utilization, shuffle data volume.
  • Airflow Operations — DAG success rates, task duration, scheduler lag.

Next Steps