
Hadoop Migration Troubleshooting and Compatibility Reference

This page gathers the most frequent operational issues encountered during Bifrost migrations, along with the supported version matrix and a glossary of the terms used throughout the documentation. If your symptom is not listed here, consult Operations for monitoring entry points or contact [email protected].

Common Classic (Path 1) Issues

Discover fails with an API timeout

Symptom: bifrost classic discover times out while talking to the cluster manager.

Cause: The deployment export endpoint is memory-intensive on clusters with many services. The default 30-second timeout is too short.

Fix: Increase --cm-timeout to 120 seconds or more. Clusters with 100+ services commonly need this. Also check the manager's heap usage; a manager approaching heap exhaustion will produce inconsistent timeouts.

fsimage parsing fails in validate-pre but works manually

Symptom: bifrost classic validate-pre reports an fsimage parse failure, but running hdfs oiv manually against the same fsimage on the source NameNode succeeds.

Cause: The fsimage compatibility test runs a standalone Hadoop installation at the target version on isolated test infrastructure, loads the production fsimage into its NameNode, and calls hdfs oiv plus a safe-mode boot cycle. Either (a) that standalone installation is not the expected target version, (b) the fsimage was truncated during copy, or (c) the test host has insufficient heap to load the full fsimage.

Fix: Confirm that the target Hadoop installation under /opt/bifrost/test-hadoop matches the documented target version (check hadoop version). Compare the fsimage file size and MD5 between the test host and the live NameNode. If the fsimage exceeds ~2 GB, set HADOOP_HEAPSIZE_MAX=16g on the test host before re-running bifrost classic validate-pre.
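To rule out cause (b), the size-and-checksum comparison can be scripted. The following is a minimal Python sketch, not part of the Bifrost tooling: run it against the fsimage on both the test host and the live NameNode, then diff the two outputs.

```python
import hashlib
import os

def fsimage_fingerprint(path: str, chunk_size: int = 8 * 1024 * 1024) -> tuple[int, str]:
    """Return (size_in_bytes, md5_hexdigest) for an fsimage file.

    Compute this on both the test host and the live NameNode and
    compare the results; any mismatch means the copy was truncated
    or corrupted in transit.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Read in large chunks so multi-GB fsimages do not exhaust memory.
        while chunk := f.read(chunk_size):
            md5.update(chunk)
    return os.path.getsize(path), md5.hexdigest()
```

A matching fingerprint does not prove the standalone installation is at the right version (cause (a)); check `hadoop version` separately as described above.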

Package swap fails on some nodes

Symptom: The swap phase aborts mid-play with a package installation error on a subset of nodes.

Cause: A locked RPM database (another package manager is running), insufficient disk space in /usr, or a network issue reaching the local package mirror.

Fix: The inline rescue block automatically reverts failed nodes to the source distribution. Inspect /var/log/bifrost/bifrost.log for the specific error. Clear the RPM database lock, free disk space, or fix the mirror and re-run the swap; Bifrost's idempotency means already-migrated nodes are skipped.

NameNode does not exit safe mode

Symptom: After the start phase, the NameNode remains in safe mode past the gate time.

Cause: Insufficient DataNodes reporting, corrupt blocks requiring manual hdfs fsck repair, or insufficient heap for processing a large fsimage.

Fix: Check hdfs dfsadmin -report to confirm DataNode count. Run hdfs fsck / to identify corrupt blocks. Increase the NameNode heap in hadoop-env.sh for unusually large fsimages. If the gate fires before the NameNode exits safe mode, Bifrost will have rolled the cluster back; investigate root cause and retry after correction.

Common Modernize and Direct Issues

S3A writes fail with 403 Forbidden

Symptom: DistCp or Spark writes to the object storage endpoint return HTTP 403.

Cause: Incorrect credentials in core-site.xml, missing bucket, missing write permission, or (for Direct) an expired Kerberos ticket on the source HDFS.

Fix: Verify credentials in the rendered core-site.xml. Confirm the bucket exists and the user has write permission. For Kerberos sources, verify the keytab with klist -kt /var/lib/bifrost/keytabs/bifrost.keytab.
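For reference, the S3A properties to inspect in the rendered core-site.xml look like the following sketch. The endpoint URL and credential values are placeholders; the property names are the standard hadoop-aws S3A settings.

```xml
<configuration>
  <!-- Object store endpoint (placeholder URL) -->
  <property>
    <name>fs.s3a.endpoint</name>
    <value>https://objectstore.example.internal</value>
  </property>
  <!-- Must belong to a principal with write permission on the target bucket -->
  <property>
    <name>fs.s3a.access.key</name>
    <value>REDACTED</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>REDACTED</value>
  </property>
  <!-- Path-style access is typical for non-AWS endpoints such as Ceph RGW -->
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>
```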

Trino returns different results than Hive

Symptom: Query parity validation fails — a query that worked on Hive returns different values on Trino.

Cause: Timezone settings, decimal precision handling, or NULL semantics differ between engines.

Fix:

  • Set spark.sql.session.timeZone=UTC globally and ensure Trino sessions use UTC.
  • Verify decimal precision: Hive and Trino differ in rounding behaviour; adjust column types if necessary.
  • NULL handling: Hive treats empty strings as NULL in some contexts; Trino does not. Explicit NULLIF(col, '') may be required in affected queries.
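As an illustration of the NULLIF workaround, with hypothetical table and column names:

```sql
-- Hive may treat '' as NULL in some contexts; Trino does not.
-- Normalising empty strings explicitly makes both engines agree.
-- (sales.orders and customer_ref are illustrative names.)
SELECT order_id,
       NULLIF(customer_ref, '') AS customer_ref
FROM   sales.orders
WHERE  NULLIF(customer_ref, '') IS NULL;  -- rows with no customer reference
```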

Remote shuffle failures

Symptom: Spark jobs fail with shuffle read or write errors under load.

Cause: The remote shuffle workers are out of disk, under-provisioned, or unreachable from executors.

Fix:

  • Check shuffle worker disk space.
  • Increase the shuffle flusher buffer size for large shuffle workloads.
  • Verify network connectivity between Spark executors and shuffle workers (NetworkPolicy changes can break this).
  • Confirm shuffle-master HA status.
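The disk and buffer items above correspond to ordinary Celeborn worker settings. A sketch, assuming the standard celeborn.worker.* property names; values are illustrative, so verify both names and defaults against the configuration reference for your Celeborn release:

```properties
# celeborn-defaults.conf on the shuffle workers (illustrative values).
# A larger flusher buffer batches more shuffle data per disk write,
# which helps heavy shuffle workloads at the cost of worker memory.
celeborn.worker.flusher.buffer.size = 1m

# Directories backing shuffle storage; these are what runs out of disk.
celeborn.worker.storage.dirs = /mnt/disk1/celeborn,/mnt/disk2/celeborn
```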

migrate-table hangs

Symptom: bifrost modernize migrate-table runs for far longer than expected and does not complete.

Cause: Insufficient executor memory for large Parquet files, S3A connection pool exhaustion, or catalog timeout.

Fix:

  • Check the Spark driver logs via kubectl logs <driver-pod>.
  • Increase executor memory for tables with large row groups.
  • Raise fs.s3a.connection.maximum if connection pool exhaustion is suspected.
  • Check catalog pod health if catalog calls are slow or failing.

Version Compatibility Matrix

Bifrost ships with tested combinations of the following components. Using versions outside these ranges is not supported.

| Component | Version | Iceberg version | Notes |
|---|---|---|---|
| Apache Spark | 3.5.x | 1.5.x, 1.6.x, 1.7.x | Spark 3.5 is the default supported line. |
| Apache Spark | 4.0.x | 1.10.x or later | Spark 4.0 requires the iceberg-spark-runtime-4.0_2.13 artifact, first published in Iceberg 1.10.0. |
| Trino | 477 to 480 | 1.5.x, 1.6.x, 1.7.x | Iceberg connector bundled; REST catalog supported. |
| Apache Polaris | 1.3.0 | Iceberg REST spec v1 | |
| Apache Gravitino | 0.8.x | Iceberg REST spec v1 | Exposes Iceberg REST as a subset of its API. |
| Apache Iceberg | 1.5.2, 1.6.1, 1.10.0+ | N/A | Pick the Iceberg release compatible with the chosen Spark line. |
| Object storage (Ceph) | Rook 1.15.5+ for Squid (Ceph 19.2); Rook 1.17 supports Reef and Squid; Rook 1.18-1.19 support Reef and Squid (Quincy dropped) | N/A | Squid support landed mid-1.15 series; verify against Rook release notes for the exact patch version. |
| Apache Airflow | 3.0.x | N/A | KubernetesExecutor. |
| Apache Celeborn | 0.5.x | N/A | Supports Spark 2.4 through 3.5. For Spark 4.x, use Celeborn 0.6.x (which introduced Spark 4.0 support). |
| YuniKorn | 1.5.x or 1.6.x | N/A | Gang scheduling, hierarchical queues. |
| Spark Operator | 2.1.x or 2.2.x (for Spark 3.5); 2.3.x (for Spark 4.0) | N/A | Each Spark Operator release is pinned 1:1 to a base Spark version. |
| Ilum | 6.3.x | N/A | Supports Spark 3.5 and 4.x. |

Critical version constraints

  • Spark with Iceberg. Spark 3.5 requires iceberg-spark-runtime-3.5_2.12. Spark 4.0 requires iceberg-spark-runtime-4.0_2.13, which first appeared in Iceberg 1.10.0. Mixing Spark and Iceberg-runtime artifact versions causes NoSuchMethodError at class-load time.
  • Trino with credential vending. Trino's iceberg.rest-catalog.vended-credentials-enabled=true is supported by Trino releases that ship the full Iceberg REST catalog OAuth2 property set. The tested range is Trino 477-480 (see the compatibility matrix above); production deployments should run the most recent 48x release to pick up security fixes. Verify against the active Trino security advisories for the Trino version in use. Azure vended credentials are not yet supported.
  • Celeborn with Spark. Celeborn 0.5.x supports Spark 2.4 through 3.5. Spark 4.0 and 4.1 require Celeborn 0.6.x. Set spark.shuffle.manager=org.apache.celeborn.client.spark.CelebornShuffleManager.
  • Iceberg table format version. Use v2 tables for row-level deletes (DELETE, UPDATE, MERGE). Trino 411 or later and Spark 3.4 or later support v2 writes; v1 tables are read-only for delete operations.
  • Spark Operator with Spark. Each Spark Operator minor release is bound to a specific Spark base image. Do not mix Spark Operator 2.1.x/2.2.x with a Spark 4.0 image, or Spark Operator 2.3.x with a Spark 3.5 image.
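Taken together, the Spark 3.5 constraints above translate into a spark-defaults fragment along these lines. The artifact versions and the catalog name "lakehouse" are illustrative; verify the exact Maven coordinates against the Iceberg and Celeborn release notes before deploying.

```properties
# spark-defaults.conf sketch for the Spark 3.5 line.
# Iceberg runtime must match the Spark minor version and Scala build;
# Celeborn 0.5.x is the line that supports Spark 3.5.
spark.jars.packages = org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,org.apache.celeborn:celeborn-client-spark-3-shaded_2.12:0.5.4

# Route shuffle traffic through Celeborn (see the constraint above).
spark.shuffle.manager = org.apache.celeborn.client.spark.CelebornShuffleManager

# Iceberg SQL extensions plus a REST-catalog-backed catalog.
spark.sql.extensions = org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.lakehouse = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lakehouse.type = rest
```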

Wave-Level Failure Semantics

Recap of how Modernize and Direct handle wave failures. The full discussion is in Validation and rollback.

  • A failed table is quarantined (marked FAILED), not the whole wave.
  • Other tables in the wave continue normally.
  • A wave is marked COMPLETE only when every non-quarantined table passes validation.
  • Quarantined tables do not block wave completion but must be resolved before decommission.
  • Retry policy is 3 attempts with exponential back-off (5 minutes, 15 minutes, 45 minutes).
  • After 3 failures, manual intervention is required.
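One plausible reading of this retry policy as code, as a sketch rather than the Bifrost implementation: each of the three attempts backs off by the corresponding delay on failure, and a table that exhausts its attempts is quarantined as FAILED. The `migrate` callable stands in for the real per-table migration step.

```python
import time
from enum import Enum

RETRY_DELAYS_MIN = (5, 15, 45)  # exponential back-off, factor 3

class TableState(Enum):
    MIGRATED = "MIGRATED"
    FAILED = "FAILED"  # quarantined: does not block wave completion

def migrate_with_retries(migrate, table, sleep=time.sleep):
    """Attempt `migrate(table)` up to three times with back-off.

    `migrate` raises on failure; `sleep` is injectable for testing.
    """
    for delay_min in RETRY_DELAYS_MIN:
        try:
            migrate(table)
            return TableState.MIGRATED
        except Exception:
            sleep(delay_min * 60)  # back off 5, then 15, then 45 minutes
    return TableState.FAILED  # manual intervention required
```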

CI Integration

Every Bifrost command is suitable for CI pipeline use. Exit codes follow a stable contract:

| Code | Meaning |
|---|---|
| 0 | Success. |
| 1 | Generic failure (configuration error, unexpected exception, network fault). |
| 2 | ABORT verdict from the decision engine — automated rollback triggered where applicable. |
| 3 | Validation failed but did not trigger ABORT (for example, a WARN verdict escalated to a failure by --strict). |
| 4 | Required precondition unmet (missing inventory, unreachable cluster manager, expired credentials). |
Wrap Bifrost commands in a script that distinguishes codes 2 and 3 from code 1 to route notifications correctly. --no-notify suppresses Slack and IT service management channels for dry-runs and local validation.
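A minimal sketch of such a wrapper; the function name and routing messages are illustrative, only the exit-code contract comes from the table above.

```shell
#!/bin/sh
# Route a Bifrost exit code to the right notification channel,
# then return the same code so the pipeline still sees it.
route_exit_code() {
  case "$1" in
    0)   echo "ok: proceed" ;;
    2)   echo "abort: rollback triggered, page on-call" ;;
    3)   echo "validation-failed: review, then retry" ;;
    1|4) echo "error: check config and preconditions" ;;
    *)   echo "unexpected: code $1" ;;
  esac
  return "$1"
}

# Typical use in a pipeline step:
#   bifrost modernize migrate-table --table db.t1; route_exit_code $?
```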


Glossary

| Term | Definition |
|---|---|
| AQE | Adaptive Query Execution. Spark 3.x feature that dynamically adjusts query plans at runtime based on actual data statistics (partition coalescing, skew handling). |
| Celeborn | The remote shuffle service used on the target platform. Decouples shuffle data from Spark executor lifetime, enabling safe Dynamic Resource Allocation on Kubernetes. |
| CRD | Custom Resource Definition. A Kubernetes extension mechanism that allows defining custom object types (for example, SparkApplication, TableMigration). |
| Credential vending | The practice of an Iceberg catalog generating temporary, scoped storage credentials per request, eliminating long-lived storage keys. |
| Decision engine | The component of Bifrost that evaluates automated go/no-go checks at each phase gate and returns PROCEED, WARN, or ABORT. |
| DistCp | Distributed Copy. Tool for large-scale parallel data movement between file systems. |
| DRA | Dynamic Resource Allocation. Spark's ability to add and remove executors during job execution based on workload demand. |
| Dual-read bridge | The strangler-facade configuration during Modernize that exposes both legacy and migrated tables through a single Trino interface. |
| Erasure coding (EC) | A data protection method that splits data into fragments and generates parity fragments. EC(4+2) means 4 data + 2 parity fragments, providing 50% raw-capacity overhead (vs. 200% for triple replication). |
| ESO (External Secrets Operator) | A Kubernetes operator that synchronizes secrets from external stores (HashiCorp Vault, Azure Key Vault, AWS Secrets Manager, GCP Secret Manager, and others) into native Kubernetes Secret resources. |
| Exchange manager | Trino's mechanism for spilling intermediate data to external storage during fault-tolerant execution, enabling task-level retries. |
| Gang scheduling | A scheduling pattern in which all pods of a job (driver + executors) are scheduled atomically. Prevents deadlocks where a driver is scheduled but no executors can be placed. |
| JuiceFS | A distributed file system that exposes object storage through a native HDFS SDK. Used as a compatibility bridge for applications that require HDFS API semantics against an S3-backed store. |
| Magic committer | An S3A committer that uses multipart upload to write data directly to its final location, eliminating the rename-as-copy pathology that plagues naive S3 writers. |
| OPA | Open Policy Agent. The policy engine used by Trino on the target platform for fine-grained SQL authorization. |
| Silence period | A configurable window (default 30 days) during which Bifrost verifies that no production reads or writes touch a legacy service before it is decommissioned. |
| Strangler facade | A migration pattern in which a facade (Trino table redirection) gradually routes traffic from legacy systems to modern replacements, allowing incremental migration without breaking consumers. |
| Table redirection | A Trino feature in which queries to one catalog are transparently redirected to another catalog for specific tables, enabling gradual migration without changing SQL. |
| Wave | A group of tables that migrate together, with internal parallelism, as a single unit of Modernize execution. |
| YuniKorn | The Kubernetes batch scheduler used on the target platform for YARN-style hierarchical queue management, fair scheduling, and gang scheduling. |

Next Steps