Parallelization policy

SPECTRAX-GK parallelization claims are separated by workload class and by the identity gates that currently exist. Treat this page as the short policy; the long artifact history remains in Performance and runnable examples remain in Examples.

For release notes and manuscripts, read this page together with Release Scope and Claim Boundaries. Independent scans and ensembles are the current production path. Whole-state nonlinear sharding and nonlinear domain or velocity-space decomposition remain diagnostic correctness/profiler paths until they pass workload-specific identity, conservation, transport-window, and matched profiler gates.

Strategy registry

The metadata API exposes a JSON-friendly strategy table. Release-ready independent-work rows are intentionally ordered first: independent_ky_scan, then uq_ensemble.

| name | readiness | independent_work | changes_solver_layout |
|---|---|---|---|
| independent_ky_scan | release_ready | true | false |
| uq_ensemble | release_ready | true | false |
| whole_state_kx_ky | diagnostic | false | true |
| velocity_species_hermite | diagnostic | false | true |
| fft_axis_domain | diagnostic | false | true |
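As a rough illustration of what a JSON-friendly strategy table makes possible, the sketch below mirrors the rows above as plain Python dictionaries and filters the rows that may back production claims. The list name STRATEGIES and the helper release_ready_names are hypothetical; the real metadata API and its exact field layout are not shown on this page and may differ.

```python
import json

# Hypothetical in-memory mirror of the strategy table above. The real
# metadata API may use a different container or extra fields.
STRATEGIES = [
    {"name": "independent_ky_scan", "readiness": "release_ready",
     "independent_work": True, "changes_solver_layout": False},
    {"name": "uq_ensemble", "readiness": "release_ready",
     "independent_work": True, "changes_solver_layout": False},
    {"name": "whole_state_kx_ky", "readiness": "diagnostic",
     "independent_work": False, "changes_solver_layout": True},
    {"name": "velocity_species_hermite", "readiness": "diagnostic",
     "independent_work": False, "changes_solver_layout": True},
    {"name": "fft_axis_domain", "readiness": "diagnostic",
     "independent_work": False, "changes_solver_layout": True},
]


def release_ready_names(rows):
    """Return names of release-ready independent-work rows, in table order."""
    return [
        row["name"]
        for row in rows
        if row["readiness"] == "release_ready" and row["independent_work"]
    ]


if __name__ == "__main__":
    # The table survives a JSON round trip because it holds only plain types.
    print(json.dumps(STRATEGIES, indent=2))
    print(release_ready_names(STRATEGIES))
```

Filtering on both readiness and independent_work keeps a diagnostic row from being picked up by accident if its readiness label is edited before its gates close.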

Production path: independent work

Production-ready parallelism is currently scoped to independent solver calls:

  • independent k_y scans;

  • quasilinear calibration grids;

  • finite-difference and sensitivity batches;

  • UQ and ensemble workloads.

Use spectraxgk.ky_scan_batches and spectraxgk.batch_map for JAX-array workloads, and spectraxgk.independent_map for file-backed Python tasks. These helpers preserve serial ordering and restrict communication to result aggregation. Any timing claim from this path must be paired with a serial numerical-identity gate for the reported observables, such as gamma, omega, quasilinear weights, or covariance summaries.
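The pairing of an order-preserving parallel map with a serial identity gate can be sketched with the standard library alone. This is not the spectraxgk implementation: solve_member is a hypothetical stand-in for one independent solver call, and threaded_map and identity_gate are illustrative names, not package functions.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_member(ky):
    """Hypothetical stand-in for one independent solver call.

    Returns a (gamma, omega) pair computed from a toy formula.
    """
    return (0.1 * ky - 0.02 * ky * ky, -0.3 * ky)


def serial_map(fn, items):
    """Reference path: one call after another, in input order."""
    return [fn(item) for item in items]


def threaded_map(fn, items, workers):
    """Parallel path: Executor.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))


def identity_gate(reference, candidate, atol=0.0):
    """Pass only if lengths match and every observable agrees within atol."""
    if len(reference) != len(candidate):
        return False
    return all(
        abs(a - b) <= atol
        for ref_row, cand_row in zip(reference, candidate)
        for a, b in zip(ref_row, cand_row)
    )


ky_values = [0.1, 0.2, 0.3, 0.4, 0.5]
serial = serial_map(solve_member, ky_values)
parallel = threaded_map(solve_member, ky_values, workers=4)
gate_passed = identity_gate(serial, parallel)

if __name__ == "__main__":
    print("identity gate passed:", gate_passed)
```

The only communication in this sketch is the final collection of results, which is the property the production helpers are described as preserving.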

For UQ and optimization portfolios, spectraxgk.independent_ensemble_provenance_gate is the compact production-readiness check. It runs the same member function serially and through independent_map, verifies numerical identity and result ordering, checks that oversubscribed worker requests clip to the ensemble size, reconstructs the deterministic independent-work decomposition, and probes IndependentMapExecutionError metadata for worker failures. This is a provenance and identity gate only; it does not make a nonlinear domain-decomposition speedup claim.
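Two of the checks named above, worker clipping and deterministic reconstruction of the decomposition, are easy to picture in isolation. The following is a minimal sketch under assumed conventions (contiguous chunks, remainder spread over the first workers); the function names are hypothetical, and the sketch does not cover the IndependentMapExecutionError probe.

```python
def clip_workers(requested, n_members):
    """Clip an oversubscribed worker request to the ensemble size (minimum 1)."""
    return max(1, min(requested, n_members))


def decompose(n_members, workers):
    """Split member indices into contiguous chunks, one per worker.

    Assumed convention: the first (n_members % workers) workers each take
    one extra member, so the split depends only on the two integers.
    """
    base, extra = divmod(n_members, workers)
    chunks, start = [], 0
    for worker in range(workers):
        size = base + (1 if worker < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def reconstruct(chunks):
    """Flatten chunks back into the serial member order."""
    return [index for chunk in chunks for index in chunk]


workers = clip_workers(requested=16, n_members=5)
chunks = decompose(5, workers)

if __name__ == "__main__":
    print(workers, chunks, reconstruct(chunks))
```

Because the split is a pure function of the member count and worker count, a later audit can rebuild the same assignment without access to the original run.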

Runtime k_y scans can request the same independent-worker policy directly from TOML. This is a scan orchestration path, not a solver-layout sharding path:

[parallel]
strategy = "batch"
axis = "ky"
num_devices = 4      # or batch_size = 4
backend = "auto"     # "thread" or "process" are explicit alternatives

When command-line scan workers are not set explicitly, strategy = "batch" with axis = "ky" resolves to independent per-k_y solver calls and records the resolved worker policy in runtime scan artifacts.

The large tracked artifacts use real solver work rather than synthetic sleeps: docs/_static/independent_ky_scan_scaling_large.json covers Cyclone linear k_y scans, and docs/_static/quasilinear_uq_ensemble_scaling_large.json covers a late-time linear/quasilinear UQ ensemble. These are the figures to cite for current parallelization speedup claims.

Production closure status

The release status artifact combines the production scaling evidence and the diagnostic decomposition gates into one machine-readable claim boundary:

[Figure: SPECTRAX-GK parallelization closure status]

docs/_static/parallelization_completion_status.json reports the release production-completion percentage and the status of each lane. For the current tracked artifacts, production independent-work parallelization is closed: independent k_y scans reach 7.18x on eight CPU workers and 1.88x on two RTX A4000 GPUs, while the quasilinear/UQ ensemble reaches 5.41x on CPU and 1.71x on GPU. The same status now embeds the independent UQ/optimization provenance gate for serial-vs-parallel ordering, worker clipping, exception metadata, and deterministic reconstruction. Whole-state nonlinear sharding and FFT-axis decomposition remain diagnostic, not production nonlinear speedup claims.
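For readers who want to relate the quoted speedups to worker counts, the arithmetic is a single division. The sketch below computes speedup per worker for the two k_y scan figures above, giving roughly 0.90 on eight CPU workers and 0.94 on two GPUs. Parallel efficiency is not a quantity the status artifact is stated to report, so treat it as a reader-side convenience; the ensemble figures are left out because their worker counts are not given here.

```python
def parallel_efficiency(speedup, workers):
    """Speedup divided by worker count; 1.0 would be ideal linear scaling."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup / workers


# Independent k_y scan figures quoted in the paragraph above.
cpu_efficiency = parallel_efficiency(7.18, 8)
gpu_efficiency = parallel_efficiency(1.88, 2)

if __name__ == "__main__":
    print(f"CPU: {cpu_efficiency:.4f}  GPU: {gpu_efficiency:.4f}")
```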

Regenerate the closure status after refreshing any scaling artifact:

python tools/build_parallelization_completion_status.py

The lower-level decomposition-contract status is generated separately. It is useful when editing orchestration code because it checks deterministic shard assignment, serial reconstruction identity, and claim-level separation without rerunning large profiles.

python tools/build_parallel_decomposition_status.py

[Figure: Parallel decomposition contract status]

This status passes for production independent k_y and UQ portfolios and for a diagnostic nonlinear state-domain partition. Passing the diagnostic row does not imply runtime nonlinear domain decomposition: it only proves that the metadata split/reassemble contract is internally consistent and correctly scoped as non-production.

Diagnostic path: whole-state nonlinear sharding

Fixed-step whole-state nonlinear sharding is diagnostic-only. The integrate_nonlinear_sharded / TimeConfig.state_sharding path is useful for control-flow validation, state-axis identity gates, profiler localization, and testing candidate layouts. It is not a production nonlinear domain decomposition and it is not evidence for a multi-GPU or whole-state sharding speedup. It carries no scoped speedup claim until separate identity gates and matched profiler artifacts exist for that exact workload.

In particular, current whole-state sharding does not close the communication problem for nonlinear FFTs, halo exchange, conservation checks, or benchmark-size transport runs. z-axis FFT sharding is not release-gated until it has a separate communication/layout design and a passing identity gate.

The large CPU/GPU sweep in docs/_static/nonlinear_sharding_strong_scaling_large.json confirms the policy: the final state is identity-correct, but logical-CPU speedup saturates near 1.39x and the current two-GPU path is slower than one GPU for the tracked larger fixed-step case. That artifact is therefore valuable engineering evidence, not a production nonlinear speedup result. The combined artifact is intentionally fail-closed: identity_passed may be true while speedup_passed is false, with explicit speedup_blockers naming the backend/device row that regressed.

The companion gate docs/_static/nonlinear_sharding_production_speedup_gate.json is the only artifact that may promote whole-state nonlinear sharding wording beyond diagnostic/profiler evidence, and only for the exact workload it gates. The fast checker tools/check_parallel_scaling_artifacts.py now validates that gate, its CSV sidecar, its CPU/GPU source rows, and its required-backend blockers without rerunning long CPU or GPU profilers.
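The fail-closed combination of identity_passed, speedup_passed, and speedup_blockers can be sketched as follows. Only those three keys and identity_gate_pass come from this page; the row fields backend, devices, and speedup, the min_speedup threshold, the CPU device count, and the GPU ratio of 0.9 are placeholders chosen for illustration, not values from the tracked artifact.

```python
def evaluate_rows(rows, min_speedup=1.0):
    """Combine per-row results into a fail-closed gate summary.

    Nothing passes by default: an empty input, a missing identity flag,
    or a single regressed multi-device row keeps speedup_passed False.
    """
    identity_passed = bool(rows) and all(
        row.get("identity_gate_pass") is True for row in rows
    )
    parallel_rows = [row for row in rows if row["devices"] > 1]
    blockers = [
        f"{row['backend']}:{row['devices']}"
        for row in parallel_rows
        if not row.get("speedup", 0.0) > min_speedup
    ]
    speedup_passed = identity_passed and bool(parallel_rows) and not blockers
    return {
        "identity_passed": identity_passed,
        "speedup_passed": speedup_passed,
        "speedup_blockers": blockers,
    }


# Placeholder rows: 1.39 is the CPU figure quoted above; the device count of 8
# and the GPU ratio of 0.9 are made-up stand-ins for "slower than one GPU".
rows = [
    {"backend": "cpu", "devices": 1, "identity_gate_pass": True, "speedup": 1.0},
    {"backend": "cpu", "devices": 8, "identity_gate_pass": True, "speedup": 1.39},
    {"backend": "gpu", "devices": 1, "identity_gate_pass": True, "speedup": 1.0},
    {"backend": "gpu", "devices": 2, "identity_gate_pass": True, "speedup": 0.9},
]
summary = evaluate_rows(rows)

if __name__ == "__main__":
    print(summary)
```

Keeping identity and speedup as separate flags lets an artifact record that a layout is correct while still refusing to support a timing claim.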

The next decomposition step is also gated, but still diagnostic. The artifact docs/_static/nonlinear_domain_parallel_identity_gate.json exercises a deterministic local nonlinear state update with one-cell halo chunks and checks the decomposed result against the serial update before enabling that prototype path. This validates the fail-closed identity-gate contract for a bounded local stencil. The report records the gate name, plan-validity status, and any explicit blocker reasons such as noncanonical axes, incomplete chunk coverage, or serial/decomposed shape mismatches; any blocker disables the decomposed prototype path even if the arrays being compared are numerically equal.

The same JSON now embeds a stricter transport-window sub-gate, nonlinear_domain_transport_window_identity, that advances the serial and halo-decomposed prototypes over a short fixed-step window and compares final state identity, boundary identity, mass-trace identity, free-energy-proxy trace identity, and boundary-flux-proxy trace identity. The drift values in that sub-gate are serial-vs-decomposed agreement checks for the damped diagnostic stencil; they are not production conservation claims.

The artifact still does not validate distributed FFTs, field solves, runtime routing, benchmark transport acceptance, or speedup.
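The idea of a one-cell halo identity gate for a bounded local stencil can be illustrated on a one-dimensional toy problem. The sketch below assumes a damped three-point stencil with periodic wrap, which is not necessarily the stencil the artifact uses; all function names and the blocker strings other than "incomplete chunk coverage" are hypothetical.

```python
def serial_update(state, nu=0.1):
    """One step of a damped three-point stencil with periodic wrap."""
    n = len(state)
    return [
        state[i] + nu * (state[(i - 1) % n] - 2.0 * state[i] + state[(i + 1) % n])
        for i in range(n)
    ]


def halo_update(state, chunks, nu=0.1):
    """Update each owned chunk from a local copy padded with one halo cell per side."""
    n = len(state)
    out = [None] * n
    for start, stop in chunks:
        local = [state[(start - 1) % n]] + state[start:stop] + [state[stop % n]]
        for j in range(1, len(local) - 1):
            out[start + j - 1] = local[j] + nu * (
                local[j - 1] - 2.0 * local[j] + local[j + 1]
            )
    return out


def halo_identity_gate(state, chunks, nu=0.1):
    """Fail closed: any blocker disables the decomposed path."""
    owned = sorted(i for start, stop in chunks for i in range(start, stop))
    if owned != list(range(len(state))):
        # Plan is invalid, so the numerical comparison is not even attempted.
        return {"passed": False, "blockers": ["incomplete chunk coverage"]}
    blockers = []
    if halo_update(state, chunks, nu) != serial_update(state, nu):
        blockers.append("serial/decomposed mismatch")
    return {"passed": not blockers, "blockers": blockers}


state = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]
good = halo_identity_gate(state, [(0, 3), (3, 6), (6, 8)])
bad = halo_identity_gate(state, [(0, 3), (4, 8)])

if __name__ == "__main__":
    print(good, bad)
```

Because each owned cell sees exactly the same three inputs in the same order as the serial update, the comparison here can be exact rather than tolerance-based.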

The spectral communication layer now has the same fail-closed treatment. The artifact docs/_static/nonlinear_spectral_communication_identity_gate.json uses deterministic complex spectral coefficients in (N_l,N_m,N_y,N_x,N_z) layout, applies the split/reassemble and axis-transpose operations that a distributed FFT route would need, and compares three serial observables against the communicated layout: FFT forward/inverse round trip, pseudo-spectral nonlinear bracket, and spectral field-solve layout. Passing this gate promotes fft_axis_domain from blocked to diagnostic. It still does not add runtime distributed FFT routing, conservation checks, nonlinear transport-window acceptance, profiler evidence, or any speedup claim.
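The split, transpose, and reassemble pattern behind a distributed FFT route can be shown on a small two-dimensional grid. This sketch covers only the FFT round-trip observable, not the bracket or field-solve checks, and it uses a naive pure-Python DFT on a 4x4 grid rather than the five-axis layout above. Every name in it is hypothetical.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform; the inverse divides by n."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = [
        sum(v * cmath.exp(sign * 2j * cmath.pi * k * idx / n)
            for idx, v in enumerate(values))
        for k in range(n)
    ]
    return [x / n for x in out] if inverse else out


def transpose(grid):
    return [list(column) for column in zip(*grid)]


def split_rows(grid, parts):
    """Split a grid into contiguous row blocks (the 'owned' shards)."""
    size = -(-len(grid) // parts)  # ceiling division
    return [grid[i:i + size] for i in range(0, len(grid), size)]


def reassemble(shards):
    return [row for shard in shards for row in shard]


def fft2_serial(grid, inverse=False):
    rows = [dft(row, inverse) for row in grid]
    return transpose([dft(col, inverse) for col in transpose(rows)])


def fft2_communicated(grid, parts, inverse=False):
    # Stage 1: each shard owns whole rows, so the first-axis transform is local.
    shards = [[dft(row, inverse) for row in shard] for shard in split_rows(grid, parts)]
    # Stage 2: the transpose stands in for the all-to-all exchange; after it,
    # each shard owns whole columns and the second transform is local again.
    swapped = transpose(reassemble(shards))
    shards = [[dft(col, inverse) for col in shard] for shard in split_rows(swapped, parts)]
    return transpose(reassemble(shards))


def max_abs_diff(a, b):
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


grid = [[complex(i + 1, j - i) for j in range(4)] for i in range(4)]
spectrum = fft2_communicated(grid, parts=2)
round_trip = fft2_communicated(spectrum, parts=2, inverse=True)

if __name__ == "__main__":
    print("layout identity:", spectrum == fft2_serial(grid))
    print("round-trip error:", max_abs_diff(round_trip, grid))
```

The layout identity is exact because both routes perform the same transforms on the same data; the round trip is compared with a tolerance because the inverse transform accumulates rounding error.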

The package also exposes spectraxgk.nonlinear_parallel.nonlinear_spectral_rhs_identity_gate as the next diagnostic micro-route. This gate owns logical row-major (k_y,k_x) tiles, reconstructs them, recomputes the spectral field and pseudo-spectral bracket, and compares the serial nonlinear RHS contribution -{phi, g} against the tile-reassembled route. It is useful because it exercises the field/bracket/RHS dataflow instead of only layout round trips. It remains fail-closed and diagnostic-only: logical tiles are reconstructed for identity validation, not executed through a production pjit/shard_map distributed FFT path.

Before nonlinear domain decomposition can be promoted beyond this diagnostic state, the runtime route must pass all of the following gates on the same workload family that appears in the speedup figure:

  • full nonlinear RHS identity for dG, phi, the nonlinear bracket, density/field-solve layout, Hermitian projection, and dealiasing;

  • fixed-step serial-vs-decomposed integration identity for final state, final fields, final RHS, and per-step scalar traces;

  • boundary/interface identity for owned and halo cells, not only a global norm;

  • conservation agreement for density/mass, a free-energy-like diagnostic, zonal response, and heat-flux proxies;

  • post-transient transport-window agreement for Cyclone, KBM, and at least one stellarator smoke case;

  • CPU serial, CPU decomposed, one-GPU serial, and two-GPU decomposed parity under the same observable contract;

  • matched profiler artifacts for the exact backend, device count, software stack, grid, warmups/repeats, and identity tolerance being claimed.

Until those gates exist, nonlinear decomposition work can be documented as diagnostic engineering evidence only, even if a new profile shows positive timing on one machine.

Velocity-space communication gates

Velocity-space decomposition is gated from the bottom up. The accepted planning contract is species-first and Hermite-second, with explicit communication flags for field reductions/broadcasts and Hermite ghost exchange. Each added runtime path must preserve those contracts before being used for performance claims.

The currently gated communication and call-graph layers are:

  • species/Hermite velocity-sharding planner metadata;

  • nearest-neighbor Hermite ghost exchange;

  • Hermite-sharded field reduction and electrostatic field reduction;

  • Hermite streaming-ladder coefficients;

  • periodic streaming microkernel and streaming-only linear-RHS call graph;

  • electrostatic streaming, drift-slice, diamagnetic-drive, and composed single-species periodic electrostatic linear-slices gates.

These gates validate communication and numerical identity for bounded linear or microkernel paths. They do not validate collisions, linked boundaries, electromagnetic terms, multi-species nonlinear field solves, nonlinear brackets, or nonlinear transport speedup unless those paths have their own identity gates and profiler artifacts.
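Nearest-neighbor Hermite ghost exchange and a streaming ladder can be pictured together on a single Hermite axis. The sketch below assumes the standard normalized-Hermite recurrence, out[m] = sqrt(m+1) g[m+1] + sqrt(m) g[m-1], with zero closure beyond the truncation; the coefficients and closure SPECTRAX-GK actually uses are not stated on this page. Function names are hypothetical.

```python
import math


def ladder_serial(g):
    """Apply the assumed ladder to every Hermite index of one species."""
    n = len(g)
    out = []
    for m in range(n):
        upper = g[m + 1] if m + 1 < n else 0.0  # zero closure above truncation
        lower = g[m - 1] if m >= 1 else 0.0
        out.append(math.sqrt(m + 1) * upper + math.sqrt(m) * lower)
    return out


def ladder_sharded(g, shards):
    """Apply the same ladder shard by shard with one ghost value per side."""
    n = len(g)
    out = [0.0] * n
    for start, stop in shards:
        # Ghost exchange: one coefficient from each neighboring shard,
        # or zero at the two ends of the Hermite axis.
        left_ghost = g[start - 1] if start > 0 else 0.0
        right_ghost = g[stop] if stop < n else 0.0
        local = [left_ghost] + g[start:stop] + [right_ghost]
        for j in range(1, len(local) - 1):
            m = start + j - 1  # global Hermite index of the owned entry
            out[m] = math.sqrt(m + 1) * local[j + 1] + math.sqrt(m) * local[j - 1]
    return out


coefficients = [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
shards = [(0, 2), (2, 4), (4, 6)]
identity_ok = ladder_sharded(coefficients, shards) == ladder_serial(coefficients)

if __name__ == "__main__":
    print("ghost-exchange identity:", identity_ok)
```

A ladder that couples only m to m-1 and m+1 needs exactly one ghost value per shard boundary, which is why the exchange can stay nearest-neighbor.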

Claim rules

Use the following rules when writing docs, release notes, or papers:

  • Call independent k_y/UQ/ensemble batching the production-ready parallelization path when the serial identity gate is current.

  • For runtime scan TOMLs, use [parallel] strategy = "batch" with axis = "ky" only for independent k_y scan orchestration.

  • Call whole-state nonlinear sharding a diagnostic correctness/profiler gate, not production nonlinear parallelism, unless the exact workload has passed its identity and profiler promotion gates.

  • Call velocity-space shard_map work communication-gated and opt-in until the relevant full-RHS and workload gates are closed.

  • Do not claim nonlinear speedup from sharding, velocity decomposition, spectral toggles, or linear-slice profiles without passing identity gates and fresh profiler artifacts for the exact workload, backend, device count, software stack, and identity tolerance being claimed.

  • Keep speedup plots separate from identity gates: identity establishes correctness; profiler artifacts establish only the scoped timing claim they measure.

Large-run scaling acceptance checklist

A CPU/GPU strong-scaling result is release-ready only when the tracked artifacts satisfy all of the following:

  • the combined *_large JSON/CSV/PNG/PDF files point back to split CPU and GPU source artifacts for the same workload family;

  • every split artifact records the actual problem size, backend, requested device counts, warmup/repeat policy, and positive per-worker or per-profile timing samples;

  • every row has identity_gate_pass = true and compares against the one-worker or one-device serial reference for the observable being claimed;

  • nonlinear whole-state sharding rows embed the per-device profiler/profile payload, including trace-request status, serial timing stats, sharded timing stats, selected axis, and final-state error metrics;

  • any speedup wording names the exact backend, device count, workload, grid, software stack, identity tolerance, and artifact files that produced it.

If any item is missing, the result can be kept as local engineering evidence only. In particular, whole-state nonlinear sharding still does not support a production nonlinear speedup claim, even when the embedded profile reports a positive engineering timing ratio. Promoting that lane requires fresh profiler artifacts for the exact workload plus full nonlinear identity, conservation, field-solve, FFT/bracket communication, and transport-window gates.
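Part of the checklist above can be expressed as a per-row check that returns explicit blockers. The sketch below is an assumption-heavy illustration: identity_gate_pass is the only field name taken from this page, while REQUIRED_FIELDS, the other keys, and the example values are hypothetical and do not describe the tracked JSON schema.

```python
# Hypothetical field names; the tracked artifacts may use different keys.
REQUIRED_FIELDS = (
    "problem_size", "backend", "device_counts", "warmups", "repeats",
    "timing_samples",
)


def row_blockers(row):
    """List every reason this row cannot support a release-ready claim."""
    blockers = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in row]
    if row.get("identity_gate_pass") is not True:
        blockers.append("identity_gate_pass is not true")
    samples = row.get("timing_samples", [])
    # "not (s > 0.0)" also rejects NaN, which compares false to everything.
    if not samples or any(not (s > 0.0) for s in samples):
        blockers.append("timing samples missing or non-positive")
    return blockers


def release_ready(rows):
    """True only when there is at least one row and no row has a blocker."""
    return bool(rows) and all(not row_blockers(row) for row in rows)


complete_row = {
    "problem_size": [16, 8, 32], "backend": "cpu", "device_counts": [1, 2],
    "warmups": 1, "repeats": 3, "timing_samples": [0.52, 0.50, 0.51],
    "identity_gate_pass": True,
}
incomplete_row = {"backend": "gpu", "timing_samples": [0.4, 0.0]}

if __name__ == "__main__":
    print(row_blockers(complete_row))
    print(row_blockers(incomplete_row))
```

Returning the list of blockers, rather than a bare boolean, makes it clear which checklist item keeps a result at the engineering-evidence level.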

Fast artifact contract check

Before editing scaling docs or manifests, run the checked-in artifact contract:

python tools/check_parallel_scaling_artifacts.py

This command does not rerun large profiles and does not enforce any minimum speedup. It validates that the tracked JSON/CSV/PNG/PDF sidecars exist, the parallel_scaling manifest lists them, split CPU/GPU source artifacts are attached where required, numerical identity gates pass, error fields are finite, and timing/profiler payloads are positive and scoped to their documented claim.
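One of the listed checks, that sidecar files exist for each tracked artifact, is simple enough to sketch independently. The helper below is not the checker script and does not reproduce its logic; missing_sidecars is a hypothetical name, and the demo runs against a temporary directory with a made-up stem rather than docs/_static.

```python
import tempfile
from pathlib import Path


def missing_sidecars(directory, stem, extensions=("json", "csv", "png", "pdf")):
    """Return the sidecar file names for `stem` that are absent from `directory`."""
    root = Path(directory)
    return [
        f"{stem}.{ext}"
        for ext in extensions
        if not (root / f"{stem}.{ext}").is_file()
    ]


# Demo in a throwaway directory: only the JSON and CSV sidecars are created,
# so the PNG and PDF are reported as missing.
with tempfile.TemporaryDirectory() as tmp:
    for ext in ("json", "csv"):
        (Path(tmp) / f"demo_scaling_large.{ext}").write_text("{}")
    missing = missing_sidecars(tmp, "demo_scaling_large")

if __name__ == "__main__":
    print(missing)
```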

Release artifact policy

The release-gated parallelization artifacts are grouped by what they are allowed to support:

| Artifact family | Primary files | Claim allowed | Claim not allowed |
|---|---|---|---|
| Independent k_y scans | independent_ky_scan_scaling_large.{json,csv,png,pdf} | Production parallelization for independent linear scans when gamma/omega identity is current. | Nonlinear domain decomposition or nonlinear transport speedup. |
| Quasilinear/UQ ensembles | quasilinear_uq_ensemble_scaling_large.{json,csv,png,pdf} | Production batching for independent reduced-feature and UQ workloads. | Promoted absolute nonlinear heat-flux prediction. |
| Whole-state nonlinear sharding | nonlinear_sharding_strong_scaling_large.{json,csv,png,pdf} | Correctness and profiler-direction evidence for the current pjit state-axis layout. | Production nonlinear multi-GPU speedup. |
| Prototype nonlinear state-domain gate | nonlinear_domain_parallel_identity_gate.{json,png} | Fail-closed serial-vs-halo-decomposed identity evidence for one bounded local stencil, including the embedded transport-window proxy traces. | Distributed FFT, field-solve, production conservation, transport-runtime, or speedup claims. |
| Prototype nonlinear spectral communication gate | nonlinear_spectral_communication_identity_gate.{json,png} | Fail-closed split/reassemble identity evidence for FFT round trip, pseudo-spectral bracket, and spectral field-solve layout. | Runtime distributed FFT routing, nonlinear conservation, transport-window, or speedup claims. |
| Velocity-space linear slices | linear_rhs_parallel_slices_sweep.{json,png,pdf} | Bounded engineering evidence for opt-in electrostatic linear RHS slices. | Electromagnetic, linked-boundary, collision, or nonlinear speedup. |

Both tools/performance_optimization_manifest.toml and tools/validation_coverage_manifest.toml list these artifacts explicitly. The tests require the manifests, files, and claim scopes to stay synchronized, so deleting or silently reinterpreting a scaling artifact fails the fast parallelization gate.