Parallelization policy
SPECTRAX-GK parallelization claims are separated by workload class and by the identity gates that currently exist. Treat this page as the short policy; the long artifact history remains in Performance and runnable examples remain in Examples.
For release notes and manuscripts, read this page together with Release Scope and Claim Boundaries. Independent scans and ensembles are the current production path. Whole-state nonlinear sharding and nonlinear domain or velocity-space decomposition remain diagnostic correctness/profiler paths until they pass workload-specific identity, conservation, transport-window, and matched profiler gates.
Strategy registry
The metadata API exposes a JSON-friendly strategy table. Release-ready
independent-work rows are intentionally ordered first: independent_ky_scan,
then uq_ensemble.
Production path: independent work
Production-ready parallelism is currently scoped to independent solver calls:
independent k_y scans;
quasilinear calibration grids;
finite-difference and sensitivity batches;
UQ and ensemble workloads.
Use spectraxgk.ky_scan_batches and spectraxgk.batch_map for JAX-array
workloads, and spectraxgk.independent_map for file-backed Python tasks.
These helpers preserve serial ordering and restrict communication to result
aggregation. Any timing claim from this path must be paired with a serial
numerical-identity gate for the reported observables, such as gamma,
omega, quasilinear weights, or covariance summaries.
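The pairing of an order-preserving map with a serial identity gate can be sketched with the standard library alone. This is a minimal illustration, not the spectraxgk implementation: `solve_member` and `independent_map_sketch` are hypothetical names, and the toy growth-rate formula stands in for a real solver call.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_member(ky):
    """Hypothetical stand-in for one independent solver call."""
    gamma = ky * (1.0 - ky)  # toy growth rate, not a physical model
    omega = -0.5 * ky        # toy frequency
    return gamma, omega


def independent_map_sketch(fn, items, workers):
    """Run fn over items in parallel; Executor.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, items))


ky_values = [0.1, 0.2, 0.3, 0.4, 0.5]
serial = [solve_member(ky) for ky in ky_values]
parallel = independent_map_sketch(solve_member, ky_values, workers=4)

# Serial numerical-identity gate: same values, same ordering.
identity_gate_pass = parallel == serial
print(identity_gate_pass)
```

Because the only communication is result aggregation, exact equality is a reasonable gate here; a real gate may instead compare within a documented tolerance.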
For UQ and optimization portfolios, spectraxgk.independent_ensemble_provenance_gate
is the compact production-readiness check. It runs the same member function
serially and through independent_map, verifies numerical identity and result
ordering, checks that oversubscribed worker requests clip to the ensemble size,
reconstructs the deterministic independent-work decomposition, and probes
IndependentMapExecutionError metadata for worker failures. This is a
provenance and identity gate only; it does not make a nonlinear
domain-decomposition speedup claim.
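Two of the checks named above, worker clipping and deterministic reconstruction, can be illustrated in a few lines. The helper names `clip_workers` and `decompose` are hypothetical and the contiguous split is an assumption; the actual gate may partition members differently.

```python
def clip_workers(requested, n_members):
    """Clip an oversubscribed worker request to the ensemble size (at least 1)."""
    return max(1, min(requested, n_members))


def decompose(n_members, workers):
    """Deterministic contiguous split of member indices across workers."""
    base, extra = divmod(n_members, workers)
    chunks, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)  # first `extra` workers get one more
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


workers = clip_workers(requested=16, n_members=5)
chunks = decompose(5, workers)

# Reconstruction check: concatenating the chunks recovers the serial ordering.
reconstructed = [i for chunk in chunks for i in chunk]
print(workers, reconstructed == list(range(5)))
```

The same inputs always produce the same chunks, which is what allows a later run to reconstruct the decomposition from recorded metadata.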
Runtime k_y scans can request the same independent-worker policy directly
from TOML. This is a scan orchestration path, not a solver-layout sharding path:
[parallel]
strategy = "batch"
axis = "ky"
num_devices = 4 # or batch_size = 4
backend = "auto" # "thread" or "process" are explicit alternatives
When command-line scan workers are not set explicitly, strategy = "batch"
with axis = "ky" resolves to independent per-k_y solver calls and
records the resolved worker policy in runtime scan artifacts.
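The resolution order described above (explicit command-line workers first, then the `[parallel]` table) might look like the following sketch. It assumes the TOML table has already been parsed into a dict; `resolve_scan_workers`, the returned keys, and the clipping to the number of k_y points are all hypothetical and are not taken from the spectraxgk runtime.

```python
def resolve_scan_workers(parallel_cfg, cli_workers=None, n_ky=1):
    """Resolve an independent-worker policy from a parsed [parallel] table."""
    if cli_workers is not None:
        workers, source = cli_workers, "cli"      # explicit CLI setting wins
    elif parallel_cfg.get("strategy") == "batch" and parallel_cfg.get("axis") == "ky":
        # num_devices and batch_size are treated as alternatives, as in the TOML comment
        workers = parallel_cfg.get("num_devices", parallel_cfg.get("batch_size", 1))
        source = "toml"
    else:
        workers, source = 1, "default"            # fall back to a serial scan
    workers = max(1, min(int(workers), n_ky))      # assumed clipping to the scan size
    return {"workers": workers, "source": source,
            "backend": parallel_cfg.get("backend", "auto")}


cfg = {"strategy": "batch", "axis": "ky", "num_devices": 4, "backend": "auto"}
policy = resolve_scan_workers(cfg, n_ky=12)
print(policy)
```

Returning the resolved policy as plain data makes it straightforward to record alongside the scan results.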
The large tracked artifacts use real solver work rather than synthetic sleeps:
docs/_static/independent_ky_scan_scaling_large.json covers Cyclone linear
k_y scans, and docs/_static/quasilinear_uq_ensemble_scaling_large.json
covers a late-time linear/quasilinear UQ ensemble. These are the figures to cite
for current parallelization speedup claims.
Production closure status
The release status artifact combines the production scaling evidence and the diagnostic decomposition gates into one machine-readable claim boundary:
docs/_static/parallelization_completion_status.json reports the release
production-completion percentage and the status of each lane. For the current
tracked artifacts, production independent-work parallelization is closed:
independent k_y scans reach 7.18x on eight CPU workers and 1.88x on
two RTX A4000 GPUs, while the quasilinear/UQ ensemble reaches 5.41x on CPU
and 1.71x on GPU. The same status now embeds the independent
UQ/optimization provenance gate for serial-vs-parallel ordering, worker
clipping, exception metadata, and deterministic reconstruction. Whole-state
nonlinear sharding and FFT-axis decomposition remain diagnostic, not production
nonlinear speedup claims.
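For readers who prefer parallel efficiency to raw speedup, the k_y scan figures quoted above convert directly. The snippet only divides the quoted speedup by the quoted worker count; the quasilinear/UQ figures are omitted because this page does not state their worker counts.

```python
def parallel_efficiency(speedup, workers):
    """Strong-scaling efficiency: measured speedup divided by worker count."""
    return speedup / workers


cpu_eff = parallel_efficiency(7.18, 8)  # independent k_y scans, eight CPU workers
gpu_eff = parallel_efficiency(1.88, 2)  # independent k_y scans, two GPUs
print(f"CPU: {cpu_eff:.2%}, GPU: {gpu_eff:.2%}")
```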
Regenerate the closure status after refreshing any scaling artifact:
python tools/build_parallelization_completion_status.py
The lower-level decomposition-contract status is generated separately. It is useful when editing orchestration code because it checks deterministic shard assignment, serial reconstruction identity, and claim-level separation without rerunning large profiles.
python tools/build_parallel_decomposition_status.py
This status passes for production independent k_y and UQ portfolios and
for a diagnostic nonlinear state-domain partition. Passing the diagnostic row
does not imply runtime nonlinear domain decomposition: it only proves that the
metadata split/reassemble contract is internally consistent and correctly
scoped as non-production.
Diagnostic path: whole-state nonlinear sharding
Fixed-step whole-state nonlinear sharding is diagnostic-only. The
integrate_nonlinear_sharded / TimeConfig.state_sharding path is useful
for control-flow validation, state-axis identity gates, profiler localization,
and testing candidate layouts. It is not a production nonlinear domain
decomposition or multi-GPU speedup claim, and it must not be cited as speedup
evidence until separate identity gates and matched profiler artifacts exist for
that exact workload.
In particular, current whole-state sharding does not close the communication
problem for nonlinear FFTs, halo exchange, conservation checks, or benchmark-size
transport runs. z-axis FFT sharding is not release-gated until it has a
separate communication/layout design and a passing identity gate.
The large CPU/GPU sweep in
docs/_static/nonlinear_sharding_strong_scaling_large.json confirms the
policy: the final state is identity-correct, but logical-CPU speedup saturates
near 1.39x and the current two-GPU path is slower than one GPU for the
tracked larger fixed-step case. That artifact is therefore valuable engineering
evidence, not a production nonlinear speedup result.
The combined artifact is intentionally fail-closed: identity_passed may be
true while speedup_passed is false, with explicit speedup_blockers naming
the backend/device row that regressed.
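A fail-closed summary of this kind can be sketched as follows. The field names `identity_passed`, `speedup_passed`, and `speedup_blockers` come from the text above; the row schema, the `min_speedup` threshold, and the numbers are illustrative assumptions, not values read from the artifact.

```python
def combine_sharding_gate(rows, min_speedup=1.0):
    """Fail-closed summary: identity and speedup are judged separately."""
    identity_passed = all(r["identity_gate_pass"] for r in rows)
    speedup_blockers = [
        f"{r['backend']} x{r['devices']}: speedup {r['speedup']:.2f} below {min_speedup:.2f}"
        for r in rows
        if r["speedup"] < min_speedup
    ]
    return {
        "identity_passed": identity_passed,
        # Speedup can only pass if identity passed and no row is blocked.
        "speedup_passed": identity_passed and not speedup_blockers,
        "speedup_blockers": speedup_blockers,
    }


# Illustrative rows only; these are not the tracked measurements.
rows = [
    {"backend": "cpu", "devices": 8, "speedup": 1.3, "identity_gate_pass": True},
    {"backend": "gpu", "devices": 2, "speedup": 0.9, "identity_gate_pass": True},
]
status = combine_sharding_gate(rows)
print(status)
```

Here the final state is identity-correct on every row, yet the single regressed GPU row is enough to keep `speedup_passed` false.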
The companion gate
docs/_static/nonlinear_sharding_production_speedup_gate.json is the only
artifact that may promote whole-state nonlinear sharding wording beyond
diagnostic/profiler evidence, and only for the exact workload it gates. The fast
checker
tools/check_parallel_scaling_artifacts.py now validates that gate, its CSV
sidecar, its CPU/GPU source rows, and its required-backend blockers without
rerunning long CPU or GPU profilers.
The next decomposition step is also gated, but still diagnostic. The artifact
docs/_static/nonlinear_domain_parallel_identity_gate.json exercises a
deterministic local nonlinear state update with one-cell halo chunks and checks
the decomposed result against the serial update before enabling that prototype
path. This validates the fail-closed identity-gate contract for a bounded local
stencil. The report records the gate name, plan-validity status, and any
explicit blocker reasons such as noncanonical axes, incomplete chunk coverage,
or serial/decomposed shape mismatches; any blocker disables the decomposed
prototype path even if the arrays being compared are numerically equal. The
same JSON now embeds a stricter transport-window sub-gate,
nonlinear_domain_transport_window_identity, that advances the serial and
halo-decomposed prototypes over a short fixed-step window and compares final
state identity, boundary identity, mass-trace identity, free-energy-proxy trace
identity, and boundary-flux-proxy trace identity. The drift values in that
sub-gate are serial-vs-decomposed agreement checks for the damped diagnostic
stencil; they are not production conservation claims. The artifact still does
not validate distributed FFTs, field solves, runtime routing, benchmark
transport acceptance, or speedup.
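The shape of such a gate can be shown on a toy problem. The sketch below assumes a damped 3-point stencil on a periodic 1D grid, equal-size chunks with one halo cell per side, and a mass trace defined as the sum of the state; none of these choices are taken from the SPECTRAX-GK prototype, and all function names are hypothetical.

```python
import math


def step_serial(u, dt=0.1):
    """Damped local nonlinear 3-point stencil on a periodic grid."""
    n = len(u)
    return [u[i] + dt * (u[(i - 1) % n] - 2.0 * u[i] + u[(i + 1) % n]) - dt * u[i] ** 3
            for i in range(n)]


def step_decomposed(u, n_chunks, dt=0.1):
    """Same update, computed chunk by chunk with one-cell halos."""
    n = len(u)
    size = n // n_chunks
    out = []
    for c in range(n_chunks):
        lo, hi = c * size, (c + 1) * size
        local = [u[(lo - 1) % n]] + u[lo:hi] + [u[hi % n]]  # halo + owned + halo
        for j in range(1, size + 1):
            out.append(local[j] + dt * (local[j - 1] - 2.0 * local[j] + local[j + 1])
                       - dt * local[j] ** 3)
    return out


def transport_window_gate(u0, n_chunks, n_steps=20):
    """Advance both routes over a short window; any blocker fails the gate."""
    if len(u0) % n_chunks != 0:
        return {"passed": False, "blockers": ["incomplete chunk coverage"]}
    blockers = []
    s, d = list(u0), list(u0)
    mass_s, mass_d = [], []
    for _ in range(n_steps):
        s, d = step_serial(s), step_decomposed(d, n_chunks)
        mass_s.append(sum(s))
        mass_d.append(sum(d))
    if s != d:
        blockers.append("final state mismatch")
    if mass_s != mass_d:
        blockers.append("mass trace mismatch")
    return {"passed": not blockers, "blockers": blockers}


u0 = [math.sin(2.0 * math.pi * i / 16) for i in range(16)]
report = transport_window_gate(u0, n_chunks=4)
print(report)
```

An invalid plan (for example five chunks over sixteen cells) is reported as a blocker before any arrays are compared, mirroring the fail-closed rule that a blocker disables the path regardless of numerical agreement.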
The spectral communication layer now has the same fail-closed treatment. The
artifact docs/_static/nonlinear_spectral_communication_identity_gate.json
uses deterministic complex spectral coefficients in (N_l,N_m,N_y,N_x,N_z)
layout, applies the split/reassemble and axis-transpose operations that a
distributed FFT route would need, and compares three serial observables against
the communicated layout: FFT forward/inverse round trip, pseudo-spectral
nonlinear bracket, and spectral field-solve layout. Passing this gate promotes
fft_axis_domain from blocked to diagnostic. It still does not add runtime
distributed FFT routing, conservation checks, nonlinear transport-window
acceptance, profiler evidence, or any speedup claim.
The package also exposes
spectraxgk.nonlinear_parallel.nonlinear_spectral_rhs_identity_gate as the
next diagnostic micro-route. This gate owns logical row-major (k_y,k_x)
tiles, reconstructs them, recomputes the spectral field and pseudo-spectral
bracket, and compares the serial nonlinear RHS contribution -\{\phi,g\}
against the tile-reassembled route. It is useful because it exercises the
field/bracket/RHS dataflow instead of only layout round trips. It remains
fail-closed and diagnostic-only: logical tiles are reconstructed for identity
validation, not executed through a production pjit/shard_map distributed
FFT path.
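Logical tile ownership and reassembly can be illustrated without any FFT machinery. In this sketch the tiles are row-major blocks over a small (k_y,k_x) grid, and a simple algebraic division stands in for the spectral field solve; the helper names, tile sizes, and the stand-in field formula are assumptions, and the real gate additionally recomputes the pseudo-spectral bracket.

```python
def make_tiles(n_ky, n_kx, tile_ky, tile_kx):
    """Row-major list of (ky_start, ky_stop, kx_start, kx_stop) tiles."""
    return [(y, min(y + tile_ky, n_ky), x, min(x + tile_kx, n_kx))
            for y in range(0, n_ky, tile_ky)
            for x in range(0, n_kx, tile_kx)]


def split(field, tiles):
    """Extract the block owned by each tile."""
    return [[row[x0:x1] for row in field[y0:y1]] for (y0, y1, x0, x1) in tiles]


def reassemble(blocks, tiles, n_ky, n_kx):
    """Write each owned block back into a full-size array."""
    out = [[None] * n_kx for _ in range(n_ky)]
    for block, (y0, y1, x0, x1) in zip(blocks, tiles):
        for dy, row in enumerate(block):
            out[y0 + dy][x0:x1] = row
    return out


def field_solve(g):
    """Stand-in spectral field solve: phi = g / (1 + ky^2 + kx^2)."""
    return [[g[ky][kx] / (1.0 + ky * ky + kx * kx) for kx in range(len(g[0]))]
            for ky in range(len(g))]


n_ky, n_kx = 4, 6
g = [[complex(ky + 1, -kx) for kx in range(n_kx)] for ky in range(n_ky)]
tiles = make_tiles(n_ky, n_kx, tile_ky=2, tile_kx=4)
g_tiled = reassemble(split(g, tiles), tiles, n_ky, n_kx)

complete = all(v is not None for row in g_tiled for v in row)
gate_pass = complete and field_solve(g_tiled) == field_solve(g)
print(gate_pass)
```

As in the gate described above, the tiles are reconstructed only to validate identity; nothing here executes on distributed devices.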
Before nonlinear domain decomposition can be promoted beyond this diagnostic state, the runtime route must pass all of the following gates on the same workload family that appears in the speedup figure:
full nonlinear RHS identity for dG, phi, the nonlinear bracket, density/field-solve layout, Hermitian projection, and dealiasing;
fixed-step serial-vs-decomposed integration identity for final state, final fields, final RHS, and per-step scalar traces;
boundary/interface identity for owned and halo cells, not only a global norm;
conservation agreement for density/mass, a free-energy-like diagnostic, zonal response, and heat-flux proxies;
post-transient transport-window agreement for Cyclone, KBM, and at least one stellarator smoke case;
CPU serial, CPU decomposed, one-GPU serial, and two-GPU decomposed parity under the same observable contract;
matched profiler artifacts for the exact backend, device count, software stack, grid, warmups/repeats, and identity tolerance being claimed.
Until those gates exist, nonlinear decomposition work can be documented as diagnostic engineering evidence only, even if a new profile shows positive timing on one machine.
Velocity-space communication gates
Velocity-space decomposition is gated from the bottom up. The accepted planning contract is species-first and Hermite-second, with explicit communication flags for field reductions/broadcasts and Hermite ghost exchange. Each added runtime path must preserve those contracts before being used for performance claims.
The currently gated communication and call-graph layers are:
species/Hermite velocity-sharding planner metadata;
nearest-neighbor Hermite ghost exchange;
Hermite-sharded field reduction and electrostatic field reduction;
Hermite streaming-ladder coefficients;
periodic streaming microkernel and streaming-only linear-RHS call graph;
electrostatic streaming, drift-slice, diamagnetic-drive, and composed single-species periodic electrostatic linear-slices gates.
These gates validate communication and numerical identity for bounded linear or microkernel paths. They do not validate collisions, linked boundaries, electromagnetic terms, multi-species nonlinear field solves, nonlinear brackets, or nonlinear transport speedup unless those paths have their own identity gates and profiler artifacts.
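The nearest-neighbor Hermite ghost exchange can be illustrated with a toy streaming ladder. The sketch assumes the common form in which moment m couples to m-1 and m+1 with coefficients sqrt(m) and sqrt(m+1), a zero closure at both ends, and equal-size contiguous Hermite shards; the normalization and closure used in SPECTRAX-GK may differ, and the function names are hypothetical.

```python
import math


def ladder_serial(g):
    """Serial ladder: out[m] = sqrt(m+1)*g[m+1] + sqrt(m)*g[m-1]."""
    n = len(g)
    out = []
    for m in range(n):
        up = g[m + 1] if m + 1 < n else 0.0   # truncation closure at the top
        down = g[m - 1] if m > 0 else 0.0
        out.append(math.sqrt(m + 1) * up + math.sqrt(m) * down)
    return out


def ladder_sharded(g, n_shards):
    """Same ladder, with each shard receiving one ghost moment per neighbor."""
    size = len(g) // n_shards                 # assumes an even split
    shards = [g[s * size:(s + 1) * size] for s in range(n_shards)]
    out = []
    for s, owned in enumerate(shards):
        ghost_lo = shards[s - 1][-1] if s > 0 else 0.0
        ghost_hi = shards[s + 1][0] if s + 1 < n_shards else 0.0
        local = [ghost_lo] + owned + [ghost_hi]
        for j in range(1, size + 1):
            m = s * size + (j - 1)            # global Hermite index
            out.append(math.sqrt(m + 1) * local[j + 1] + math.sqrt(m) * local[j - 1])
    return out


g = [1.0 / (m + 1) for m in range(8)]
identity_gate_pass = ladder_sharded(g, n_shards=4) == ladder_serial(g)
print(identity_gate_pass)
```

Only one moment crosses each shard boundary in each direction, which is why the exchange is nearest-neighbor; collisions or other operators with wider coupling would need their own gate.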
Claim rules
Use the following rules when writing docs, release notes, or papers:
Call independent k_y/UQ/ensemble batching the production-ready parallelization path when the serial identity gate is current.
For runtime scan TOMLs, use [parallel] strategy = "batch" with axis = "ky" only for independent k_y scan orchestration.
Call whole-state nonlinear sharding a diagnostic correctness/profiler gate, not production nonlinear parallelism unless the exact workload has passed its identity and profiler promotion gates.
Call velocity-space shard_map work communication-gated and opt-in until the relevant full-RHS and workload gates are closed.
Do not claim nonlinear speedup from sharding, velocity decomposition, spectral toggles, or linear-slice profiles without passing identity gates and fresh profiler artifacts for the exact workload, backend, device count, software stack, and identity tolerance being claimed.
Keep speedup plots separate from identity gates: identity establishes correctness; profiler artifacts establish only the scoped timing claim they measure.
Large-run scaling acceptance checklist
A CPU/GPU strong-scaling result is release-ready only when the tracked artifacts satisfy all of the following:
the combined *_large JSON/CSV/PNG/PDF files point back to split CPU and GPU source artifacts for the same workload family;
every split artifact records the actual problem size, backend, requested device counts, warmup/repeat policy, and positive per-worker or per-profile timing samples;
every row has identity_gate_pass = true and compares against the one-worker or one-device serial reference for the observable being claimed;
nonlinear whole-state sharding rows embed the per-device profiler/profile payload, including trace-request status, serial timing stats, sharded timing stats, selected axis, and final-state error metrics;
any speedup wording names the exact backend, device count, workload, grid, software stack, identity tolerance, and artifact files that produced it.
If any item is missing, the result can be kept only as local engineering evidence. In particular, whole-state nonlinear sharding is still not a production nonlinear speedup claim, even when the embedded profile reports a favorable engineering timing ratio. Promoting that lane requires fresh profiler artifacts for the exact workload plus full nonlinear identity, conservation, field-solve, FFT/bracket communication, and transport-window gates.
Fast artifact contract check
Before editing scaling docs or manifests, run the checked-in artifact contract:
python tools/check_parallel_scaling_artifacts.py
This command does not rerun large profiles and does not enforce any minimum
speedup. It validates that the tracked JSON/CSV/PNG/PDF sidecars exist, the
parallel_scaling manifest lists them, split CPU/GPU source artifacts are
attached where required, numerical identity gates pass, error fields are finite,
and timing/profiler payloads are positive and scoped to their documented claim.
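The per-row part of such a contract can be pictured as below. This is a sketch of the kind of checks listed above, not the logic of `tools/check_parallel_scaling_artifacts.py`; `identity_gate_pass` appears elsewhere on this page, while `max_abs_error`, `timings_s`, and `check_row` are hypothetical names.

```python
import math


def check_row(row):
    """Return a list of contract problems for one artifact row (empty if clean)."""
    problems = []
    if row.get("identity_gate_pass") is not True:
        problems.append("identity gate did not pass")
    if not math.isfinite(row.get("max_abs_error", float("nan"))):
        problems.append("error field is not finite")
    timings = row.get("timings_s", [])
    if not timings or any(t <= 0 for t in timings):
        problems.append("timing samples missing or non-positive")
    # Deliberately no minimum-speedup check: the contract is about
    # well-formed, scoped evidence, not about performance thresholds.
    return problems


good = {"identity_gate_pass": True, "max_abs_error": 0.0, "timings_s": [1.2, 1.1]}
bad = {"identity_gate_pass": True, "max_abs_error": float("inf"), "timings_s": []}
print(check_row(good), check_row(bad))
```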
Release artifact policy
The release-gated parallelization artifacts are grouped by what they are allowed to support:
| Artifact family | Primary files | Claim allowed | Claim not allowed |
|---|---|---|---|
| Independent | | Production parallelization for independent linear scans when | Nonlinear domain decomposition or nonlinear transport speedup. |
| Quasilinear/UQ ensembles | | Production batching for independent reduced-feature and UQ workloads. | Promoted absolute nonlinear heat-flux prediction. |
| Whole-state nonlinear sharding | | Correctness and profiler-direction evidence for the current | Production nonlinear multi-GPU speedup. |
| Prototype nonlinear state-domain gate | | Fail-closed serial-vs-halo-decomposed identity evidence for one bounded local stencil, including the embedded transport-window proxy traces. | Distributed FFT, field-solve, production conservation, transport-runtime, or speedup claims. |
| Prototype nonlinear spectral communication gate | | Fail-closed split/reassemble identity evidence for FFT round trip, pseudo-spectral bracket, and spectral field-solve layout. | Runtime distributed FFT routing, nonlinear conservation, transport-window, or speedup claims. |
| Velocity-space linear slices | | Bounded engineering evidence for opt-in electrostatic linear RHS slices. | Electromagnetic, linked-boundary, collision, or nonlinear speedup. |
Both tools/performance_optimization_manifest.toml and
tools/validation_coverage_manifest.toml list these artifacts explicitly.
The tests require the manifests, files, and claim scopes to stay synchronized,
so deleting or silently reinterpreting a scaling artifact fails the fast
parallelization gate.