Apache Spark
Apache Spark is the default execution engine for distributed data processing in Ilum. It runs on Kubernetes (with native CRD-based pod orchestration) or Apache Hadoop YARN, and is exposed through batch jobs, interactive sessions, in-app SQL notebooks, and the Apache Kyuubi SQL gateway.
Ilum bundles Apache Spark 4.x by default, with Spark 3.x available for legacy workloads.
When to use Spark
Spark is the right engine for:
- Large-scale ETL and data transformation pipelines.
- Machine learning workloads using Spark ML or MLlib.
- Complex joins and aggregations across large datasets.
- Streaming workloads with Spark Structured Streaming.
- Workloads that benefit from horizontal scaling across many executors.
For interactive analytics on medium-to-large data, consider Trino. For small-data and local execution, consider DuckDB. For low-latency stream processing, consider Apache Flink.
Execution model
Spark runs as a driver and a configurable number of executors:
- Driver pod: One per job. Coordinates execution, holds the Spark session, and tracks task state.
- Executor pods: Provisioned dynamically based on workload. Run individual tasks in parallel and hold cached data.
Ilum manages the full pod lifecycle, including image selection, resource limits, dynamic allocation, and cleanup on completion.
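As a concrete sketch of how this driver/executor sizing is expressed, the snippet below assembles a spark-submit command for cluster deploy mode. The application path, master URL, and container image are placeholder assumptions; the --conf keys are standard Spark properties.

```python
# Sketch: build a spark-submit invocation sizing the driver pod and
# executor pods described above. App path, master URL, and image are
# placeholders; the --conf keys are standard Spark properties.
def spark_submit_cmd(app, master, executors=4, driver_mem="2g",
                     executor_mem="4g", image="example/spark:4.0.0"):
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", "cluster",  # driver runs as its own pod
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.driver.memory={driver_mem}",
        "--conf", f"spark.executor.memory={executor_mem}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app,
    ]

cmd = spark_submit_cmd("local:///opt/app/etl.py",
                       "k8s://https://kubernetes.default.svc")
```

In practice Ilum builds the equivalent submission for you, so these flags normally surface only as per-job or per-cluster configuration overrides.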
Workload types
Spark powers four kinds of workloads in Ilum:
- Jobs: One-shot batch executions.
- Services: Long-running interactive Spark sessions that execute code on demand without per-call initialization overhead.
- Schedules: Cron-driven recurring jobs.
- Requests: Ad-hoc submissions through the REST API or UI.
All four are managed through the Workloads section of the Ilum UI.
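For the ad-hoc Requests path, a submission reduces to posting a job descriptor to the REST API. The field names and descriptor shape below are illustrative assumptions, not Ilum's documented schema:

```python
import json

# Hypothetical job descriptor for an ad-hoc REST submission.
# Field names here are illustrative, not Ilum's documented API.
def build_job_request(cluster_id, app_path, args=()):
    payload = {
        "clusterId": cluster_id,
        "type": "SINGLE",            # one-shot batch, like a Job above
        "applicationFile": app_path,
        "arguments": list(args),
    }
    return json.dumps(payload)

body = build_job_request("default", "s3a://bucket/jobs/etl.jar",
                         ["--date", "2024-01-01"])
```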
Supported catalogs
Spark connects to all four Ilum catalogs:
- Hive Metastore (default)
- Project Nessie (Iceberg with Git-style branching)
- Unity Catalog (Databricks-compatible governance)
- DuckLake (DuckDB-native, primarily used by DuckDB)
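As one example of how a catalog is wired in, pointing Spark at a Nessie-backed Iceberg catalog is ordinary Spark configuration. The property keys below are the standard Iceberg/Nessie ones; the URI and warehouse path are placeholders:

```properties
# Register a Spark catalog named "nessie" backed by Project Nessie.
# Host and warehouse location are placeholders.
spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
spark.sql.catalog.nessie.uri=http://nessie:19120/api/v2
spark.sql.catalog.nessie.ref=main
spark.sql.catalog.nessie.warehouse=s3a://warehouse/
```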
Supported table formats
Spark reads and writes:
- Delta Lake: ACID transactions, time travel, schema evolution.
- Apache Iceberg: Partition evolution, hidden partitioning.
- Apache Hudi: Record-level upserts, incremental processing.
- Parquet, ORC, CSV, JSON, Avro: Standard file formats.
The Ilum Tables abstraction lets you read and write Delta, Iceberg, and Hudi using the same Spark API.
Configuration
Spark configuration is managed through Helm values and per-cluster settings:
ilum-core:
  spark:
    enabled: true
  cluster:
    defaults:
      spark.dynamicAllocation.enabled: "true"
      spark.dynamicAllocation.minExecutors: "1"
      spark.dynamicAllocation.maxExecutors: "20"
      spark.dynamicAllocation.executorIdleTimeout: "60s"
Per-cluster overrides are configured in the Workloads > Clusters UI and apply to all Spark jobs targeting that cluster.
Spark Connect
Spark Connect provides a client-server architecture for remote Spark execution. Ilum deploys Spark Connect servers as standard jobs and includes a Kubernetes-aware proxy that allows Spark Connect endpoints to be reached across cluster boundaries.
Refer to Spark Connect for details.
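Spark Connect endpoints use the sc:// URI scheme, with 15002 as the protocol's default port. A minimal sketch of constructing such an endpoint, with a placeholder service hostname:

```python
from urllib.parse import urlparse

# Spark Connect uses the sc:// scheme; 15002 is the protocol's default
# port. The hostname is a placeholder for an Ilum-proxied endpoint.
def connect_endpoint(host, port=15002):
    url = f"sc://{host}:{port}"
    parsed = urlparse(url)
    if parsed.scheme != "sc" or parsed.port != port:
        raise ValueError(f"malformed endpoint: {url}")
    return url

endpoint = connect_endpoint("spark-connect.ilum.svc")
# With pyspark installed, a client would attach with:
#   SparkSession.builder.remote(endpoint).getOrCreate()
```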
Submitting a Spark job
For a step-by-step walkthrough, refer to Run a simple Spark job.