
Hadoop Migration: First Discovery Run

This page walks through the first hours of working with Bifrost: confirming prerequisites, obtaining the distribution, installing it on a controller host, understanding the configuration-as-data model, and running a first discovery pass against a non-production cluster.

Requirements

Bifrost runs on a dedicated controller host that has network access to the source cluster and, for Modernize and Direct paths, to the target Kubernetes cluster. The controller host does not need to be part of either cluster.

Controller host

  • Operating system: RHEL 8.x, Rocky Linux 8.x, or Ubuntu 22.04 or later.
  • Python 3.9 or later with requests, pyyaml, lxmlऔर jinja2 installed as system packages.
  • Git 2.x.
  • SSH access to every node in the source cluster, using key-based authentication.
  • A local or NFS-shared directory for state files, logs, and rollback assets.

Source environment

  • Administrative credentials for the source cluster management plane (for example, Cloudera Manager API credentials for Classic and Direct paths).
  • Network reachability from the controller host to the source cluster management API and to every node managed by it.
  • For Classic and Direct paths, sufficient disk space on each source node to cache distribution packages used for rollback.

Identity prerequisites (Modernize and Direct only)

Bifrost deploys Keycloak as part of bifrost modernize land and uses it as the central OIDC Identity Provider for Trino, Airflow, Superset, OpenMetadata, and the Polaris catalog. Two options are supported:

  • Bifrost-managed Keycloak (default). The land step installs Keycloak alongside the other platform components, creates the required realms and clients, and wires every service to it automatically. Customers add their users through the Keycloak admin console or via an import job.
  • Customer-managed IdP via Keycloak federation. If the customer already runs an enterprise IdP (Okta, Azure AD, Google Workspace, Ping), the shipped Keycloak is configured as an identity broker: it federates to the customer IdP for user authentication while keeping the service-to-service OAuth2 clients local. Bifrost exposes a wrapper values path (keycloak.federation.* in the bifrost modernize land deployment configuration) that is compiled into a realm-import JSON at land time; realm import is the upstream Keycloak mechanism, driven through the keycloakConfigCli component of the underlying Helm chart. Customers can alternatively supply their own realm-import JSON directly via keycloak.keycloakConfigCli.configuration.

Running Bifrost without any Keycloak is not supported — service-to-service OAuth2 between Trino, Polaris, Airflow, and Spark on Ilum depends on it.

DNS and ingress prerequisites (Modernize and Direct only)

The target platform exposes services under a customer-specific domain (the global.domain Helm value). Before bifrost modernize land:

  • Provision DNS A or CNAME records pointing <service>.lakehouse.<your-domain> at the ingress controller (or wildcard *.lakehouse.<your-domain>).
  • Supply a wildcard TLS certificate or enable cert-manager for per-host certificates. Let's Encrypt works for internet-reachable deployments; for air-gapped deployments, configure cert-manager with an internal CA issuer.
  • Open firewall rules from end-user networks to the Keycloak hostname — every browser-based login reaches Keycloak directly.

Target Kubernetes (Modernize and Direct only)

  • kubectl 1.28 or later, configured with administrative access to the target cluster.
  • helm 3.14 or later.
  • A reachable container registry for Spark, Trino, Airflow, and other platform images.
  • A Kubernetes cluster with sufficient capacity for the target Ilum deployment. The recommended minimum for a dev or staging deployment is 12 worker nodes, 48 vCPUs, and 192 GB of RAM. Production sizing is described in Operations — capacity planning.

Obtaining Bifrost

Bifrost is delivered through the Ilum Enterprise distribution channel. Customers receive the Bifrost archive and accompanying credentials when the Ilum Enterprise subscription is activated.

To request access, contact your Ilum account representative or email [email protected].

Note

Bifrost is not available through the public Ilum Helm charts. It is not an open-source component and is distributed only to customers with an active Ilum Enterprise agreement.

Installation

Unpack the Bifrost distribution onto the controller host. Install the declared dependencies, then confirm that the bifrost command is on the path.

# Unpack the Bifrost distribution
tar -xzf bifrost-<version>.tar.gz
cd bifrost-<version>

# Install declared dependencies for the migration framework
./bin/bifrost install-deps

# Verify the CLI
bifrost --version
bifrost --help

The install-deps step pulls in the migration framework dependencies and prepares a local workspace at the project root. No external network access is required during migration itself; all dependencies are resolved up front.

For Modernize and Direct paths, verify that the controller host can reach the target Kubernetes cluster before proceeding.

kubectl cluster-info
helm version

Air-gapped deployments

For Modernize and Direct on air-gapped networks, three mirrors must be prepared in advance:

  • Container registry mirror — pull and push every Spark, Trino, Airflow, Polaris, OpenMetadata, Keycloak, Superset, and Ilum image used by the release. The full image list is printed by bifrost modernize land --print-images so customers can script the mirroring pass.
  • Helm chart mirror — publish the Bifrost-bundled Helm charts to an internal chart repository (for example, Harbor or ChartMuseum). Point helm repo add at the mirror in the deployment configuration.
  • OIDC offline configuration — Keycloak's realm export is prepared on the controller host and loaded during land, so no outbound reachability is required.
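The mirroring pass for the container registry is typically a simple loop over the printed image list. A minimal sketch in Python is below; the mirror registry name is a placeholder, and the image list is assumed to be one fully qualified reference per line, as described above:

```python
# Sketch: derive internal-mirror references for the images listed by
# `bifrost modernize land --print-images`. "registry.mirror.internal" is a
# hypothetical internal registry; replace it with your own.
MIRROR = "registry.mirror.internal"

def mirror_ref(image: str) -> str:
    """Swap the upstream registry host for the mirror, keeping repo and tag.

    Assumes fully qualified references like "docker.io/trinodb/trino:443";
    host-less references would need extra handling.
    """
    _, _, repo = image.partition("/")
    return f"{MIRROR}/{repo}"

for src in ("docker.io/trinodb/trino:443", "quay.io/keycloak/keycloak:24.0"):
    # Each printed line is a skopeo invocation that performs the actual copy.
    print(f"skopeo copy docker://{src} docker://{mirror_ref(src)}")
```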

Classic paths already ship an on-node distribution package cache and do not need network access during the critical migration window.

Configuration-as-Data

Bifrost follows a configuration-as-data principle. The migration logic is generic; cluster-specific settings live in YAML inventory files under inventories/. A single Bifrost installation drives migrations for any number of clusters. The only thing that changes between clusters is the inventory.

A Classic or Direct inventory describes the source cluster topology:

# inventories/prod01/hosts.yml
all:
  vars:
    cluster_name: PROD01
    nameservice: prod01-ns
    zk_quorum: "zk1.example.internal:2181,zk2.example.internal:2181,zk3.example.internal:2181"
    kerberos_realm: EXAMPLE.INTERNAL

  children:
    namenodes:
      hosts:
        nn-01.example.internal: { nn_id: nn1, active: true }
        nn-02.example.internal: { nn_id: nn2, active: false }
    datanodes:
      hosts:
        dn-001.example.internal: { rack: /rack-01 }
        dn-002.example.internal: { rack: /rack-01 }
        # ... one entry per DataNode
    journalnodes:
      hosts:
        jn-01.example.internal: { }
        jn-02.example.internal: { }
        jn-03.example.internal: { }
    hive:
      hosts:
        hive-01.example.internal: { }
    kafka_brokers:
      hosts:
        kafka-01.example.internal: { broker_id: 1 }
        kafka-02.example.internal: { broker_id: 2 }
        kafka-03.example.internal: { broker_id: 3 }
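Because the inventory is plain YAML, it is easy to sanity-check on the controller host with pyyaml (already a prerequisite). A minimal sketch that counts hosts per group, using a trimmed copy of the inventory above as inline input:

```python
import yaml

# Trimmed copy of the example hosts.yml for illustration.
doc = """
all:
  children:
    namenodes:
      hosts:
        nn-01.example.internal: { nn_id: nn1, active: true }
        nn-02.example.internal: { nn_id: nn2, active: false }
    datanodes:
      hosts:
        dn-001.example.internal: { rack: /rack-01 }
"""

inventory = yaml.safe_load(doc)
# Map each group name to its list of hostnames.
groups = {name: list(body["hosts"])
          for name, body in inventory["all"]["children"].items()}
for name, hosts in groups.items():
    print(name, len(hosts))
```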

A Modernize or Direct target configuration describes the Kubernetes platform that Bifrost will land:

# modernize/values-production.yaml
global:
  storageClass: rook-ceph-block
  domain: lakehouse.example.internal
  oidc:
    issuerUrl: https://keycloak.example.internal/realms/lakehouse
    clientId: bifrost

rookCeph:
  enabled: true
  cluster:
    objectStore:
      name: lakehouse-rgw
      pool:
        erasureCoded:
          dataChunks: 4
          codingChunks: 2

polaris:
  enabled: true
  replicaCount: 2

trino:
  interactive:
    replicas: 3
    jvm:
      maxHeapSize: "16G"
    config:
      retryPolicy: NONE
  etl:
    replicas: 2
    jvm:
      maxHeapSize: "32G"
    config:
      retryPolicy: TASK

airflow:
  executor: KubernetesExecutor

openmetadata:
  enabled: true
superset:
  enabled: true
keycloak:
  enabled: true

Inventory and values files should be version-controlled. Every Bifrost run reads the current inventory state and records which version produced which result.

Your First Discovery Run

Discovery is non-destructive. It reads the source cluster and produces an inventory, a topology snapshot, and (for Modernize and Direct paths) a scored migration plan. Discovery is safe to run on live production systems.

Classic or Direct discovery against Cloudera CDP

bifrost classic discover \
  --cluster PROD01 \
  --cm-host cm.example.internal \
  --cm-port 7183 \
  --cm-user admin \
  --cm-pass "$CM_PASSWORD" \
  --output inventories/prod01/

Discovery produces a structured inventory tree:

inventories/prod01/
  hosts.yml                  # Topology inventory
  group_vars/
    all.yml                  # Cluster-wide variables
    namenodes.yml
    datanodes.yml
    hbase.yml
    edge_nodes.yml
  discovery/
    topology.json            # Hosts, roles, rack assignments
    services.json            # Services, versions, health
    configs/                 # Per-service configuration export
    security.json            # Kerberos principals, keytab locations, TLS certs
    encryption_zones.json    # HDFS encryption zones and key names
    data_dirs.json           # Per-host data directory paths and usage
    client_deps.json         # JDBC driver classes and classpath references
    version_report.json      # Source vs. target version alignment
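The discovery artifacts are plain JSON, so they can be inspected with standard tooling before planning. A sketch that flags services whose source and target versions differ; the field names here are illustrative assumptions, not Bifrost's actual version_report.json schema:

```python
import json

# Illustrative payload only: real field names are defined by Bifrost.
report = json.loads("""
{
  "services": [
    {"name": "hdfs", "source_version": "3.1.1", "target_version": "3.3.6"},
    {"name": "hive", "source_version": "3.1.3", "target_version": "3.1.3"}
  ]
}
""")

# Collect services where the versions are not aligned.
drift = [s["name"] for s in report["services"]
         if s["source_version"] != s["target_version"]]
print(drift)
```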

Modernize discovery against an open-source Hadoop estate

bifrost modernize discover --source-cluster PROD01

Modernize discovery inventories tables, workloads, and policies, and produces a scored migration plan:

modernize/plans/prod01/
  estate_inventory.json      # Tables, jobs, services
  migration_plan.yaml        # Scored plan with wave assignments
  wave_assignments.csv       # Per-table wave assignment and complexity
  tco_report.md              # TCO delta projections
  risk_assessment.md         # Per-component risk analysis
  manual_review_items.json   # Items requiring human judgment

Each table and job is scored on three dimensions:

  • Criticality — downstream lineage depth (how many other tables and jobs depend on it).
  • Complexity — migration difficulty, derived from workload type (MapReduce and Pig score high; Spark scores low).
  • Freshness — last-access timestamp. Tables not accessed in more than 12 months are flagged as candidates for retirement rather than migration.
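The three dimensions above can be pictured with a toy scoring function. This is an illustration of the idea, not Bifrost's actual scoring logic; the weights and thresholds are assumptions:

```python
from datetime import datetime, timedelta

def score_table(lineage_depth: int, workload: str,
                last_access: datetime, now: datetime) -> dict:
    """Toy scorer: criticality from lineage depth, complexity from workload
    type (MapReduce/Pig high, Spark low), freshness from last access."""
    complexity = {"mapreduce": 3, "pig": 3, "hive": 2, "spark": 1}.get(workload, 2)
    stale = now - last_access > timedelta(days=365)  # ~12 months
    return {
        "criticality": lineage_depth,
        "complexity": complexity,
        "retirement_candidate": stale,  # flag for retirement, not migration
    }

s = score_table(4, "mapreduce", datetime(2023, 1, 1), datetime(2025, 6, 1))
print(s)
```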

Review the discovery output

Review the following before proceeding to planning and execution.

  • migration_plan.yaml — wave assignments and strategies. Structured YAML, one block per wave, with per-table strategy (snapshot / migrate / add_files), priority, and criticality score.
  • manual_review_items.json — items that exceed Bifrost's automation ceiling. Each entry carries component, source_object, reason, and a recommended handling track (for example, custom UDF → "manual port to Trino plugin").
  • tco_report.md — markdown narrative projecting monthly operating cost for the legacy and target platforms, broken down by compute, storage, licensing, and support. Drawn from the estate inventory and customer-supplied unit costs.
  • risk_assessment.md — markdown table of risks per component (for example, "HBase tables with active coprocessors"), each with severity, affected objects, and mitigation recommendations.

Discovery can be re-run as many times as needed without affecting the source cluster.

Next Steps