Exporting Notebooks to Pipelines · JPE

The JupyterLab Pipeline Exporter (JPE) is a JupyterLab 4 extension that promotes an interactive Sparkmagic notebook into a production artifact - an Airflow DAG, or a native Ilum Spark job in single, service, or cron mode - without leaving the notebook UI. The notebook is parsed in place, Jupyter-only constructs are stripped, and the resulting code is shipped to the configured backend.

JPE ships pre-installed in the helm_jupyter sub-chart. On a standard Ilum deployment, every backend URL is auto-wired from in-cluster Service DNS; no manual configuration is required.

Deployment Targets

JPE offers two export targets, each with its own execution modes. Each target card is enabled only if its backend responds when the panel loads; the probe result is shown as a status badge on the card.

Target	Modes	Mechanism	Backing API
Ilum Spark job	single · service · cron	Native Ilum. The selected mode determines how the notebook is packaged and submitted to `ilum-core`.	`POST /api/v1/job/submit`, `POST /api/v1/group`, `POST /api/v1/schedule`
Airflow DAG	per-cell · batch	Renders a DAG Python file via Jinja2 using the Livy operator, pushes it to Gitea, and lets git-sync deliver it to the Airflow dag-processor. Optional auto-trigger after visibility.	Gitea Contents API + Airflow REST API (`/api/v2/dags`)

Three further targets - SDP (Spark Declarative Pipelines), dbt project, and DuckDB via Quack - are visible in the panel as coming soon and are not yet selectable.

Independently of the target, any notebook can be downloaded as a Spark-runnable .py file for inspection or manual spark-submit; this download is always available, including fully offline.

Execution Modes

Ilum Spark job

The Ilum Spark job target submits to ilum-core in one of three modes, chosen with the mode toggle on the target card:

single - Wraps the notebook as a standalone pyFile and submits it as a one-shot job on an ephemeral Spark application that is stopped automatically when the job completes. (POST /api/v1/job/submit)
service - Wraps the notebook as an IlumJob subclass and registers a long-running Ilum service with a warm Spark driver, then runs the notebook against it on demand. (POST /api/v1/group, executed via POST /api/v1/group//job/execute)
cron - Registers the service together with a cron expression. Each fire is a single-shot Spark job triggered by ilum-core's internal scheduler - no Kubernetes CronJob is created. (POST /api/v1/schedule)
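
The mode-to-endpoint mapping listed above can be written down as a small lookup. This is only an illustrative sketch: the helper name `submit_url` is hypothetical, and the base URL is assumed to be the ilum-core host root (without the `/api/v1` suffix).

```python
# Endpoint paths are taken from the mode list above; everything else is hypothetical.
ILUM_MODE_ENDPOINTS = {
    "single": "/api/v1/job/submit",
    "service": "/api/v1/group",
    "cron": "/api/v1/schedule",
}


def submit_url(ilum_host_url: str, mode: str) -> str:
    """Return the URL a submission in `mode` would POST to."""
    try:
        path = ILUM_MODE_ENDPOINTS[mode]
    except KeyError:
        raise ValueError(f"unknown Ilum mode: {mode!r}") from None
    # Tolerate a trailing slash on the host URL.
    return ilum_host_url.rstrip("/") + path
```

For example, `submit_url("http://ilum-core:9888", "cron")` yields the schedule endpoint on the in-cluster service.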

Airflow DAG

The Airflow DAG target renders in one of two shapes:

per_cell - Default. Each code cell becomes one PythonOperator task chained sequentially; all tasks share a single Livy session opened by a leading livy_session task and closed by a trailing livy_cleanup task. Fine-grained DAG with individually retriable cells. More Livy round-trips than batch.
batch - A single PythonOperator task (run_notebook) runs the whole notebook inside one Livy session as one Spark batch job. Minimizes session startup overhead at the cost of cell-level retriability.
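
The difference between the two shapes is easiest to see as the ordered list of task ids each one produces. The sketch below assumes only the task names given above (`livy_session`, `livy_cleanup`, `run_notebook`); the function itself is hypothetical and is not part of JPE.

```python
def task_ids(mode: str, cell_task_ids: list[str]) -> list[str]:
    """Return the task ids of the generated DAG in execution order."""
    if mode == "per_cell":
        # One task per cell, bracketed by session open and cleanup tasks.
        return ["livy_session", *cell_task_ids, "livy_cleanup"]
    if mode == "batch":
        # The whole notebook runs inside a single task.
        return ["run_notebook"]
    raise ValueError(f"unknown Airflow export mode: {mode!r}")
```

With three cells, per_cell yields five tasks that can be retried individually, while batch always yields one.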

Cell Tags

Two aspects of the generated Airflow DAG are controlled per cell through JupyterLab cell tags. Tags are read from the notebook metadata; they are inert during interactive execution and only consumed at export time.

Tag	Applies to	Effect
`task:`	`per_cell` task id	Replaces the generated id (`cell_3`, or `c2_c3` for merged cells) with a stable, readable `task_id`.
`retries:`	Per-task retry count	Overrides the DAG-level default retry count for that task.

Naming per-cell tasks

In per_cell mode each task defaults to a generated id. Add a task: cell tag (for example task:load_raw_sales) to assign a stable task_id. The name must be a valid Python identifier; invalid tags are ignored and fall back to the generated id. The active panel surfaces a Name your tasks hint while per_cell is selected, and the export preview shows a tagged-cell counter.
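
The fallback rule can be sketched with `str.isidentifier()`. This is an approximation rather than JPE's actual code: the function name is hypothetical, and picking the first valid `task:` tag when several are present is an assumption.

```python
def resolve_task_id(tags: list[str], generated_id: str) -> str:
    """Pick a task id from `task:` cell tags, else keep the generated id."""
    prefix = "task:"
    for tag in tags:
        if tag.startswith(prefix):
            name = tag[len(prefix):]
            # Only valid Python identifiers are accepted; others are ignored.
            if name.isidentifier():
                return name
    return generated_id
```

A tag such as task:load-raw is not an identifier, so the cell keeps its generated id.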

Per-task retry count

Generated DAGs carry a DAG-level default of three retries with exponential backoff. To override the retry count for an individual task, add a retries: cell tag whose value is a non-negative integer (for example retries:5). A flaky ingestion step can be given more retries while a deterministic transform is set to retries:0 to fail fast.

The value is applied per task:

In per_cell mode the tag sets retries= on that cell's PythonOperator, overriding the DAG default for that task only.
In batch mode there is a single run_notebook task; the first valid retries: tag among the exported cells applies to it.

A tag whose value is not a non-negative integer (for example retries:-1, retries:abc, retries:2.5) is ignored, and the task keeps the DAG-level default. Surrounding whitespace is tolerated (retries: 3). When multiple retries: tags are present on merged cells, the first valid one wins.

# Cell tags: task:load_raw_sales, retries:5
raw = spark.read.parquet("s3a://landing/sales/")
raw.write.saveAsTable("bronze.sales")

The cell above renders as:

load_raw_sales = PythonOperator(
    task_id="load_raw_sales",
    python_callable=_submit_statement,
    op_kwargs={"livy_conn_id": "ilum-livy-proxy", "code": "..."},
    retries=5,
)
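
The validation rules for retries: tags described in this section (non-negative integers only, surrounding whitespace tolerated, first valid tag wins) can be sketched as follows. The function name and the `None` return for "keep the DAG-level default" are assumptions made for illustration.

```python
import re
from typing import Iterable, Optional

# Digits only: rejects negatives, floats and non-numeric values.
_RETRIES_TAG = re.compile(r"retries:\s*([0-9]+)\s*")


def parse_retries(tags: Iterable[str]) -> Optional[int]:
    """Return the first valid retries value, or None to keep the DAG default."""
    for tag in tags:
        match = _RETRIES_TAG.fullmatch(tag.strip())
        if match:
            return int(match.group(1))
    return None
```

Invalid tags are skipped rather than treated as errors, so a later valid tag on a merged cell still applies.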

The shipped example notebooks (Pipeline_Exporter_Showcase, Pipeline_Exporter_Ilum_Service_Showcase, and the docker-compose quickstarts) demonstrate both tags, assigning higher retry counts to I/O-bound steps and retries:0 to parameter cells.

Operating Modes

Each backend is probed when the panel loads (≤1.5 s) and reflected as a status badge on the target card.

Standalone mode - No remote backend responds. Both target cards are dashed-bordered with a tooltip, and only the .py download is offered. Typical for a pip install jupyterlab-pipeline-exporter setup whose backend URLs have not been pointed at an Ilum or Airflow instance yet.
Connected mode - Some backends are reachable. Cards light up independently.
Ilum bundled mode - All backends auto-discovered: ilum-core:9888, ilum-airflow-api-server:8080, ilum-gitea-http:3000. DAG auto-trigger mints HS512 tokens locally from AIRFLOW_JWT_SECRET (the FAB+OAuth /auth/token endpoint is incompatible with Airflow 3.x).
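
A probe with the 1.5 s budget mentioned above might look like the sketch below. It is not JPE's implementation: both function names are hypothetical, treating an HTTP error status as "reachable" is an assumption, and the classifier only distinguishes standalone from connected because bundled mode depends on auto-discovery rather than on reachability alone.

```python
import urllib.error
import urllib.request

PROBE_TIMEOUT_S = 1.5  # upper bound per backend, as stated above


def is_reachable(url: str, timeout: float = PROBE_TIMEOUT_S) -> bool:
    """Return True if something answers HTTP at `url` within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even though the status was an error (assumption).
        return True
    except (urllib.error.URLError, OSError):
        return False


def operating_mode(reachable: dict[str, bool]) -> str:
    """Classify probe results as 'standalone' or 'connected'."""
    return "connected" if any(reachable.values()) else "standalone"
```

Each target card would then be enabled from its own entry in the probe results, independently of the others.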

Notebook Sanitization

A Sparkmagic notebook contains constructs that crash an Airflow worker - no IPython display backend, no %magic resolver, no shell. JPE strips these before generating any artifact and reports every stripped line with its cell index and line number.

Removed by default:

Line magics - %manage_spark, %matplotlib, %load_ext
Shell escapes - !pip install
Display calls - print(...), display(...) with no terminal sink in batch execution
Secret-looking literals - AWS keys, JWT payloads, common password patterns. The request is rejected with HTTP 400 unless allow_secrets: true is set.

Stripped lines appear in the rejected list of every response and in the preview panel. Set keep_rejections: true to retain a magic or display call (for example a print that writes to a log aggregator).

Tip

Use Preview before committing. It returns the fully rendered DAG or pyFile together with the rejection report - without pushing to Gitea or Ilum.
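
The stripping rules in this section can be approximated with a line-based pass. The sketch below is a simplification under stated assumptions: the reason labels and the shape of each rejection record are hypothetical, multi-line calls are not handled, and secret detection is left out entirely.

```python
import re

# (reason, pattern) pairs; the reason labels are illustrative, not JPE's own.
_STRIP_RULES = [
    ("line-magic", re.compile(r"^\s*%(?!%)\w+")),
    ("shell-escape", re.compile(r"^\s*!")),
    ("display-call", re.compile(r"^\s*(print|display)\s*\(")),
]


def sanitize(cells: list[str]) -> tuple[list[str], list[dict]]:
    """Strip Jupyter-only lines and report each one with cell and line number."""
    clean, rejected = [], []
    for cell_index, source in enumerate(cells):
        kept = []
        for line_no, line in enumerate(source.splitlines(), start=1):
            reason = next((r for r, pat in _STRIP_RULES if pat.match(line)), None)
            if reason is None:
                kept.append(line)
            else:
                rejected.append(
                    {"cell": cell_index, "line": line_no, "reason": reason, "text": line.strip()}
                )
        clean.append("\n".join(kept))
    return clean, rejected
```

A cell whose lines are all stripped comes back as an empty string, which is the "emptied" case discussed under cell-kind filtering.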

Cell-kind filtering and the exported-cell count

The enabledCellKinds setting decides which cell kinds (spark, pyspark, sql, scala, plain) are exported. The JupyterLab panel defaults to ["spark", "pyspark"], so a Sparkmagic notebook exports only its %%spark / %%pyspark cells; plain Python and %%sql cells are dropped and listed in rejected as filtered-cell:. A parameter cell is always retained regardless of the filter.

A cell that passes the kind filter but sanitizes to an empty body - for example a setup cell containing only line magics such as %load_ext sparkmagic.magics and %manage_spark - produces no task. It is reported in rejected as emptied-cell and excluded from the count.

Because of this, the cells value returned by an export is the number of cells that actually became tasks (or, for batch and pyFile targets, statements) in the generated artifact - not the raw count of code cells in the notebook. Skipped cells, kind-filtered cells, and emptied setup cells do not contribute to it. If every exportable cell is filtered out, the request is rejected with HTTP 422 so an empty pipeline is never produced.
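
The counting rule can be illustrated with a short sketch. It assumes cells arrive as dicts with hypothetical keys `kind`, `body` (already sanitized) and `is_parameter`; the reason label `filtered-cell` is written without its suffix, a retained parameter cell is counted (an assumption), and a plain exception stands in for the HTTP 422 response.

```python
def count_exported(cells: list[dict], enabled_kinds: list[str]) -> tuple[int, list[tuple]]:
    """Return (exported cell count, rejected entries) for sanitized cells."""
    exported, rejected = [], []
    for index, cell in enumerate(cells):
        is_param = cell.get("is_parameter", False)
        if not is_param and cell["kind"] not in enabled_kinds:
            # Dropped by the kind filter; parameter cells bypass it.
            rejected.append((index, "filtered-cell"))
        elif not cell["body"].strip():
            # Passed the filter but nothing is left after sanitization.
            rejected.append((index, "emptied-cell"))
        else:
            exported.append(index)
    if not exported:
        # JPE rejects this case with HTTP 422; ValueError is a stand-in.
        raise ValueError("no exportable cells")
    return len(exported), rejected
```

With one emptied setup cell, one pyspark cell and one %%sql cell under the default filter, the returned count is 1, not 3.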

Pipeline Parameters

JPE recognizes parameter cells through two conventions:

A cell tagged parameters (the papermill convention)
A cell whose first non-blank line is # @pipeline-params (kebab) or # @pipeline_params (snake). The marker is a plain Python comment, so the cell executes interactively in any IPython kernel without raising UsageError: Cell magic … not found

Variables assigned in such a cell are emitted as config.get('', ) lookups in the generated code, so the same notebook runs unchanged interactively (literal default) and in production (values injected by Airflow or the Ilum service request).

# Tagged: parameters
output_table = "gold.daily_kpi"
run_date = "2026-05-21"
threshold = 0.95

In the generated artifact the same variables resolve to:

output_table = config.get("output_table", "gold.daily_kpi")
run_date = config.get("run_date", "2026-05-21")
threshold = config.get("threshold", 0.95)
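
One way to picture the rewrite is a small transformation over the cell's syntax tree. The sketch below uses Python's standard `ast` module and is not JPE's implementation: both function names are hypothetical, only simple `name = value` assignments are rewritten, and comments are dropped as a side effect of parsing.

```python
import ast

PARAM_MARKERS = ("# @pipeline-params", "# @pipeline_params")


def is_parameter_cell(source: str, tags: tuple = ()) -> bool:
    """Detect a parameter cell by the papermill tag or the marker comment."""
    if "parameters" in tags:
        return True
    first = next((ln.strip() for ln in source.splitlines() if ln.strip()), "")
    return first in PARAM_MARKERS


def rewrite_parameters(source: str) -> str:
    """Turn `name = default` assignments into config.get lookups."""
    out = []
    for node in ast.parse(source).body:
        simple = (
            isinstance(node, ast.Assign)
            and len(node.targets) == 1
            and isinstance(node.targets[0], ast.Name)
        )
        if simple:
            name = node.targets[0].id
            # Reuse the literal exactly as written so it stays the default.
            default = ast.get_source_segment(source, node.value)
            out.append(f'{name} = config.get("{name}", {default})')
        else:
            out.append(ast.get_source_segment(source, node))
    return "\n".join(out)
```

Because the original literal is reused as the default, interactive runs and exported runs start from the same values.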

History and Re-Execution

Every submission is recorded in the History tab with the export specification, generated code, and rejection report. The same record is persisted to ~/.jupyter/jpe-audit.jsonl for audit.
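
Because the audit file is JSON Lines, it can be inspected with a few lines of Python. The record schema is not documented here, so the sketch below only parses each line into a dict and makes no assumption about field names; the function name is hypothetical.

```python
import json
from pathlib import Path

DEFAULT_AUDIT_PATH = Path.home() / ".jupyter" / "jpe-audit.jsonl"


def read_audit(path: Path = DEFAULT_AUDIT_PATH) -> list[dict]:
    """Load every audit record; returns an empty list if the file is missing."""
    if not path.exists():
        return []
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                records.append(json.loads(line))
    return records
```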

For Ilum service submissions JPE also keeps a re-execution payload - the service can be re-invoked from the History tab without re-uploading the pyFile or recreating the service.

Configuration

All settings are in Settings → Plugin Settings → Pipeline Exporter. Defaults target in-cluster Ilum services; everything is overridable.

Setting	Default	Description
`ilumApiUrl`	`http://ilum-core:9888/api/v1`	Ilum-core REST API used by single, service, and cron targets.
`airflowApiUrl`	`http://ilum-airflow-api-server:8080`	Airflow REST API used for DAG visibility checks and auto-trigger.
`gitApiUrl`	`http://ilum-gitea-http:3000/api/v1`	Git provider API used to push generated DAGs.
`gitProvider`	`gitea`	Git provider type - currently Gitea in the bundled deployment.
`defaultClusterId`	`default`	Ilum cluster name pre-selected in the dialog.
`defaultLivyConnId`	`ilum-livy-proxy`	Airflow connection ID used by generated operators.
`defaultMode`	`per_cell`	Airflow export shape pre-selected in the dialog (`per_cell` or `batch`).
`sparkImages`	6 presets	Spark image dropdown - Cluster default (empty, inherits the cluster's image) plus Spark 4.1.2 with Delta, Sedona, Iceberg, or Trino, and Spark 3.5.8 with Nessie + Sedona.
`enabledCellKinds`	`["spark", "pyspark"]`	Cell kinds (`spark`, `pyspark`, `sql`, `scala`, `plain`) included in the export. Cells outside this list are dropped and reported.

Override precedence (highest wins):

Request payload fields - per submission.
Pod environment variables (ILUM_API_URL, AIRFLOW_API_URL, GITEA_API_URL, GITEA_TOKEN, …).
Plugin settings (JupyterLab Settings editor).
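
The precedence order above amounts to a three-level lookup. In the sketch below the function name is hypothetical, and the pairing of each setting key with an environment variable is an assumption based on the names listed above.

```python
import os
from typing import Mapping, Optional

# Assumed pairing of plugin setting keys with pod environment variables.
ENV_NAMES = {
    "ilumApiUrl": "ILUM_API_URL",
    "airflowApiUrl": "AIRFLOW_API_URL",
    "gitApiUrl": "GITEA_API_URL",
}


def resolve_setting(
    key: str,
    request_payload: Mapping[str, str],
    plugin_settings: Mapping[str, str],
    environ: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Resolve `key`: request payload, then environment, then plugin settings."""
    env = os.environ if environ is None else environ
    if request_payload.get(key) is not None:
        return request_payload[key]
    env_name = ENV_NAMES.get(key)
    if env_name and env.get(env_name):
        return env[env_name]
    return plugin_settings.get(key)
```

Passing `environ` explicitly keeps the lookup easy to exercise without touching the real pod environment.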

Note

On a standard Ilum deployment the Helm chart wires every URL, secret, and JWT into the Jupyter pod from helm_jupyter/values.yaml. The Settings editor is only needed for non-bundled deployments or per-user overrides.

airflowIntegration.enabled (default true) gates the AIRFLOW_JWT_SECRET / AIRFLOW_API_URL env block. Set it to false on deployments without Airflow so the disabled state is explicit at template render time rather than a silently empty token at runtime.
git.existingSecret alone gates the GITEA_USERNAME / GITEA_PASSWORD env vars JPE uses to push DAGs. The init-container that seeds the work dir into Gitea is gated separately by git.initialCommit.enabled, so a slim deploy without the gitea sub-chart can keep JPE's Gitea credentials without forcing the init loop to run. git.enabled is retained as a deprecated alias for git.initialCommit.enabled.

Quick Start

The two walkthroughs below cover, in order, submitting as an Ilum Service and submitting as an Airflow DAG.

Open the notebook in JupyterLab.
Click the Pipeline Exporter icon in the left sidebar (rank 102, below the file browser).
Pick the Ilum Spark job target card. Select service mode.
Enter a service name (for example daily-kpi-svc) and an optional parameter list.
Click Create Ilum Service. A green banner confirms that the service has been created.
Open the History tab. The new service entry exposes a Run action.
Observe the run under Workloads → Services in the Ilum UI.

# Generated wrapper (excerpt)
from ilum.api import IlumJob

class NotebookService(IlumJob):
    def run(self, spark, config):
        output_table = config.get("output_table", "gold.daily_kpi")
        # ... notebook body, sanitized

Open the notebook in JupyterLab.
Click the Pipeline Exporter icon in the left sidebar.
Pick the Airflow DAG target card. Select per-cell (default) or batch mode.
Enter a DAG identifier (for example etl_pipeline_daily) and a schedule expression (for example @daily).
Leave Auto-push to Gitea and Auto-trigger DAG enabled to fire a first run as soon as the DAG is visible.
Click Create DAG. JPE renders the file, pushes it to Gitea, and (optionally) waits for the DAG to become visible (≈40 s) before triggering it.
Open the Airflow UI at /external/airflow/ and find the new DAG.

# Generated DAG (batch mode, excerpt)
with DAG(
    dag_id="etl_pipeline_daily",
    schedule="@daily",
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    run = PythonOperator(
        task_id="run_notebook",
        python_callable=_livy_run,
        op_kwargs={
            "livy_conn_id": "ilum-livy-proxy",
            "code": NOTEBOOK_CODE,
        },
    )

Generated Artifacts

Target · mode	Artifact	Location
Airflow DAG	`.py`	Gitea repository (`//:/`) - picked up by `git-sync` and parsed by `dag-processor`.
Ilum Spark job · single	`pipeline_exp_/main.py`	Uploaded to Ilum as part of the job submission; visible at `/workloads/details/job/` in the Ilum UI.
Ilum Spark job · service	`NotebookService` class in module `pipeline_exp_`	Registered as the Ilum service's main class FQN; invoked via `POST /api/v1/group//job/execute`.
Ilum Spark job · cron	Service + schedule entry	Same as service plus a row in `/schedules`; each fire is a single-shot job triggered by `ilum-core`.
`.py` download (any target)	Wrapped `.py` file	Browser download - useful for inspection, manual `spark-submit`, or offline development.

Warning

Service and cron modes wrap the notebook as an IlumJob subclass, whose base class comes from the ilum-python-job package (from ilum.api import IlumJob). The Spark image used to run the service must provide this package - the Ilum Spark runtime images ship it. If a custom image is used in service or cron mode, install ilum-python-job into it; otherwise the import fails at driver startup. The single mode and the Airflow DAG target do not require it.

Related

Apache Airflow integration - DAG orchestration, LivyOperator, SparkSubmitOperator, Git Sync.
JupyterLab in Ilum - the host environment for JPE.
How to use Notebooks in Ilum - Sparkmagic, session management, and visualization patterns.
