Parallelization policy
======================

SPECTRAX-GK parallelization claims are separated by workload class and by the
identity gates that currently exist. Treat this page as the short policy; the
long artifact history remains in :doc:`performance` and runnable examples remain
in :doc:`examples`.

For release notes and manuscripts, read this page together with
:doc:`release_scope`. Independent scans and ensembles are the current
production path. Whole-state nonlinear sharding and velocity-space
decomposition are correctness/profiler development paths until they pass
workload-specific identity, conservation, and profiler gates.

Strategy registry
-----------------

The metadata API exposes a JSON-friendly strategy table. Release-ready
independent-work rows are intentionally ordered first: ``independent_ky_scan``,
then ``uq_ensemble``.

.. list-table::
   :header-rows: 1
   :widths: 28 18 18 24

   * - ``name``
     - ``readiness``
     - ``independent_work``
     - ``changes_solver_layout``
   * - ``independent_ky_scan``
     - ``release_ready``
     - ``true``
     - ``false``
   * - ``uq_ensemble``
     - ``release_ready``
     - ``true``
     - ``false``
   * - ``whole_state_kx_ky``
     - ``diagnostic``
     - ``false``
     - ``true``
   * - ``velocity_species_hermite``
     - ``diagnostic``
     - ``false``
     - ``true``
   * - ``fft_axis_domain``
     - ``diagnostic``
     - ``false``
     - ``true``

Production path: independent work
---------------------------------

Production-ready parallelism is currently scoped to independent solver calls:

- independent ``k_y`` scans;
- quasilinear calibration grids;
- finite-difference and sensitivity batches;
- UQ and ensemble workloads.

Use ``spectraxgk.ky_scan_batches`` and ``spectraxgk.batch_map`` for JAX-array
workloads, and ``spectraxgk.independent_map`` for file-backed Python tasks.
These helpers preserve serial ordering and restrict communication to result
aggregation. Any timing claim from this path must be paired with a serial
numerical-identity gate for the reported observables, such as ``gamma``,
``omega``, quasilinear weights, or covariance summaries.

For UQ and optimization portfolios, ``spectraxgk.independent_ensemble_provenance_gate``
is the compact production-readiness check. It runs the same member function
serially and through ``independent_map``, verifies numerical identity and result
ordering, checks that oversubscribed worker requests clip to the ensemble size,
reconstructs the deterministic independent-work decomposition, and probes
``IndependentMapExecutionError`` metadata for worker failures. This is a
provenance and identity gate only; it does not make a nonlinear
domain-decomposition speedup claim.

Runtime ``k_y`` scans can request the same independent-worker policy directly
from TOML. This is a scan orchestration path, not a solver-layout sharding path:

.. code-block:: toml

   [parallel]
   strategy = "batch"
   axis = "ky"
   num_devices = 4      # or batch_size = 4
   backend = "auto"     # "thread" or "process" are explicit alternatives

When command-line scan workers are not set explicitly, ``strategy = "batch"``
with ``axis = "ky"`` resolves to independent per-``k_y`` solver calls and
records the resolved worker policy in runtime scan artifacts.

The large tracked artifacts use real solver work rather than synthetic sleeps:
``docs/_static/independent_ky_scan_scaling_large.json`` covers Cyclone linear
``k_y`` scans, and ``docs/_static/quasilinear_uq_ensemble_scaling_large.json``
covers a late-time linear/quasilinear UQ ensemble. These are the figures to cite
for current parallelization speedup claims.

Production closure status
-------------------------

The release status artifact combines the production scaling evidence and the
diagnostic decomposition gates into one machine-readable claim boundary:

.. image:: _static/parallelization_completion_status.png
   :alt: SPECTRAX-GK parallelization closure status
   :align: center

``docs/_static/parallelization_completion_status.json`` reports the release
production-completion percentage and the status of each lane. For the current
tracked artifacts, production independent-work parallelization is closed:
independent ``k_y`` scans reach ``7.18x`` on eight CPU workers and ``1.88x`` on
two RTX A4000 GPUs, while the quasilinear/UQ ensemble reaches ``5.41x`` on CPU
and ``1.71x`` on GPU. The same status now embeds the independent
UQ/optimization provenance gate for serial-vs-parallel ordering, worker
clipping, exception metadata, and deterministic reconstruction. Whole-state
nonlinear sharding and FFT-axis decomposition remain diagnostic, not production
nonlinear speedup claims.

Regenerate the closure status after refreshing any scaling artifact:

.. code-block:: bash

   python tools/build_parallelization_completion_status.py

The lower-level decomposition-contract status is generated separately. It is
useful when editing orchestration code because it checks deterministic shard
assignment, serial reconstruction identity, and claim-level separation without
rerunning large profiles.

.. code-block:: bash

   python tools/build_parallel_decomposition_status.py

.. image:: _static/parallel_decomposition_status.png
   :alt: Parallel decomposition contract status
   :align: center

This status passes for production independent ``k_y`` and UQ portfolios and
for a diagnostic nonlinear state-domain partition. Passing the diagnostic row
does not imply runtime nonlinear domain decomposition: it only proves that the
metadata split/reassemble contract is internally consistent and correctly
scoped as non-production.

Diagnostic path: whole-state nonlinear sharding
-----------------------------------------------

Fixed-step whole-state nonlinear sharding is diagnostic-only. The
``integrate_nonlinear_sharded`` / ``TimeConfig.state_sharding`` path is useful
for control-flow validation, state-axis identity gates, profiler localization,
and testing candidate layouts. It is not a production nonlinear domain
decomposition or multi-GPU speedup claim. Do not use it as evidence for a
whole-state nonlinear sharding speedup; it has no scoped speedup claim until
separate identity gates and matched profiler artifacts exist for that exact
workload.

In particular, current whole-state sharding does not close the communication
problem for nonlinear FFTs, halo exchange, conservation checks, or benchmark-size
transport runs. ``z``-axis FFT sharding is not release-gated until it has a
separate communication/layout design and a passing identity gate.

The large CPU/GPU sweep in
``docs/_static/nonlinear_sharding_strong_scaling_large.json`` confirms the
policy: the final state is identity-correct, but logical-CPU speedup saturates
near ``1.39x`` and the current two-GPU path is slower than one GPU for the
tracked larger fixed-step case. That artifact is therefore valuable engineering
evidence, not a production nonlinear speedup result.
The combined artifact is intentionally fail-closed: ``identity_passed`` may be
true while ``speedup_passed`` is false, with explicit ``speedup_blockers`` naming
the backend/device row that regressed.

The next decomposition step is also gated, but still diagnostic. The artifact
``docs/_static/nonlinear_domain_parallel_identity_gate.json`` exercises a
deterministic local nonlinear state update with one-cell halo chunks and checks
the decomposed result against the serial update before enabling that prototype
path. This validates the fail-closed identity-gate contract for a bounded local
stencil. The report records the gate name, plan-validity status, and any
explicit blocker reasons such as noncanonical axes, incomplete chunk coverage,
or serial/decomposed shape mismatches; any blocker disables the decomposed
prototype path even if the arrays being compared are numerically equal. The
same JSON now embeds a stricter transport-window sub-gate,
``nonlinear_domain_transport_window_identity``, that advances the serial and
halo-decomposed prototypes over a short fixed-step window and compares final
state identity, boundary identity, mass-trace identity, free-energy-proxy trace
identity, and boundary-flux-proxy trace identity. The drift values in that
sub-gate are serial-vs-decomposed agreement checks for the damped diagnostic
stencil; they are not production conservation claims. The artifact still does
not validate distributed FFTs, field solves, runtime routing, benchmark
transport acceptance, or speedup.

The spectral communication layer now has the same fail-closed treatment. The
artifact ``docs/_static/nonlinear_spectral_communication_identity_gate.json``
uses deterministic complex spectral coefficients in ``(N_l,N_m,N_y,N_x,N_z)``
layout, applies the split/reassemble and axis-transpose operations that a
distributed FFT route would need, and compares three serial observables against
the communicated layout: FFT forward/inverse round trip, pseudo-spectral
nonlinear bracket, and spectral field-solve layout. Passing this gate promotes
``fft_axis_domain`` from blocked to diagnostic. It still does not add runtime
distributed FFT routing, conservation checks, nonlinear transport-window
acceptance, profiler evidence, or any speedup claim.

Before nonlinear domain decomposition can be promoted beyond this diagnostic
state, the runtime route must pass all of the following gates on the same
workload family that appears in the speedup figure:

- full nonlinear RHS identity for ``dG``, ``phi``, the nonlinear bracket,
  density/field-solve layout, Hermitian projection, and dealiasing;
- fixed-step serial-vs-decomposed integration identity for final state,
  final fields, final RHS, and per-step scalar traces;
- boundary/interface identity for owned and halo cells, not only a global norm;
- conservation agreement for density/mass, a free-energy-like diagnostic,
  zonal response, and heat-flux proxies;
- post-transient transport-window agreement for Cyclone, KBM, and at least one
  stellarator smoke case;
- CPU serial, CPU decomposed, one-GPU serial, and two-GPU decomposed parity
  under the same observable contract;
- matched profiler artifacts for the exact backend, device count, software
  stack, grid, warmups/repeats, and identity tolerance being claimed.

Until those gates exist, nonlinear decomposition work can be documented as
engineering evidence only, even if a new profile shows positive timing on one
machine.

Velocity-space communication gates
----------------------------------

Velocity-space decomposition is gated from the bottom up. The accepted planning
contract is species-first and Hermite-second, with explicit communication flags
for field reductions/broadcasts and Hermite ghost exchange. Each added runtime
path must preserve those contracts before being used for performance claims.

The currently gated communication and call-graph layers are:

- species/Hermite velocity-sharding planner metadata;
- nearest-neighbor Hermite ghost exchange;
- Hermite-sharded field reduction and electrostatic field reduction;
- Hermite streaming-ladder coefficients;
- periodic streaming microkernel and streaming-only linear-RHS call graph;
- electrostatic streaming, drift-slice, diamagnetic-drive, and composed
  single-species periodic electrostatic linear-slices gates.

These gates validate communication and numerical identity for bounded linear or
microkernel paths. They do not validate collisions, linked boundaries,
electromagnetic terms, multi-species nonlinear field solves, nonlinear brackets,
or nonlinear transport speedup unless those paths have their own identity gates
and profiler artifacts.

Claim rules
-----------

Use the following rules when writing docs, release notes, or papers:

- Call independent ``k_y``/UQ/ensemble batching the production-ready
  parallelization path when the serial identity gate is current.
- For runtime scan TOMLs, use ``[parallel] strategy = "batch"`` with
  ``axis = "ky"`` only for independent ``k_y`` scan orchestration.
- Call whole-state nonlinear sharding a diagnostic correctness/profiler gate,
  not production nonlinear parallelism.
- Call velocity-space ``shard_map`` work communication-gated and opt-in until
  the relevant full-RHS and workload gates are closed.
- Do not claim nonlinear speedup from sharding, velocity decomposition, spectral
  toggles, or linear-slice profiles without fresh profiler artifacts for the
  exact workload, backend, device count, software stack, and identity tolerance
  being claimed.
- Keep speedup plots separate from identity gates: identity establishes
  correctness; profiler artifacts establish only the scoped timing claim they
  measure.

Large-run scaling acceptance checklist
--------------------------------------

A CPU/GPU strong-scaling result is release-ready only when the tracked
artifacts satisfy all of the following:

- the combined ``*_large`` JSON/CSV/PNG/PDF files point back to split CPU and
  GPU source artifacts for the same workload family;
- every split artifact records the actual problem size, backend, requested
  device counts, warmup/repeat policy, and positive per-worker or per-profile
  timing samples;
- every row has ``identity_gate_pass = true`` and compares against the
  one-worker or one-device serial reference for the observable being claimed;
- nonlinear whole-state sharding rows embed the per-device profiler/profile
  payload, including trace-request status, serial timing stats, sharded timing
  stats, selected axis, and final-state error metrics;
- any speedup wording names the exact backend, device count, workload, grid,
  software stack, identity tolerance, and artifact files that produced it.

If any item is missing, the result can be kept as local engineering evidence
only. In particular, whole-state nonlinear sharding remains not a production
nonlinear speedup claim, even when the embedded profile reports a positive
engineering timing ratio. Promoting that lane requires fresh profiler artifacts
for the exact workload plus full nonlinear identity, conservation, field-solve,
FFT/bracket communication, and transport-window gates.

Fast artifact contract check
----------------------------

Before editing scaling docs or manifests, run the checked-in artifact contract:

.. code-block:: bash

   python tools/check_parallel_scaling_artifacts.py

This command does not rerun large profiles and does not enforce any minimum
speedup. It validates that the tracked JSON/CSV/PNG/PDF sidecars exist, the
``parallel_scaling`` manifest lists them, split CPU/GPU source artifacts are
attached where required, numerical identity gates pass, error fields are finite,
and timing/profiler payloads are positive and scoped to their documented claim.

Release artifact policy
-----------------------

The release-gated parallelization artifacts are grouped by what they are
allowed to support:

.. list-table::
   :header-rows: 1
   :widths: 28 24 24 24

   * - Artifact family
     - Primary files
     - Claim allowed
     - Claim not allowed
   * - Independent ``k_y`` scans
     - ``independent_ky_scan_scaling_large.{json,csv,png,pdf}``
     - Production parallelization for independent linear scans when
       ``gamma``/``omega`` identity is current.
     - Nonlinear domain decomposition or nonlinear transport speedup.
   * - Quasilinear/UQ ensembles
     - ``quasilinear_uq_ensemble_scaling_large.{json,csv,png,pdf}``
     - Production batching for independent reduced-feature and UQ workloads.
     - Promoted absolute nonlinear heat-flux prediction.
   * - Whole-state nonlinear sharding
     - ``nonlinear_sharding_strong_scaling_large.{json,csv,png,pdf}``
     - Correctness and profiler-direction evidence for the current ``pjit``
       state-axis layout.
     - Production nonlinear multi-GPU speedup.
   * - Prototype nonlinear state-domain gate
     - ``nonlinear_domain_parallel_identity_gate.{json,png}``
     - Fail-closed serial-vs-halo-decomposed identity evidence for one bounded
       local stencil, including the embedded transport-window proxy traces.
     - Distributed FFT, field-solve, production conservation, transport-runtime,
       or speedup claims.
   * - Prototype nonlinear spectral communication gate
     - ``nonlinear_spectral_communication_identity_gate.{json,png}``
     - Fail-closed split/reassemble identity evidence for FFT round trip,
       pseudo-spectral bracket, and spectral field-solve layout.
     - Runtime distributed FFT routing, nonlinear conservation,
       transport-window, or speedup claims.
   * - Velocity-space linear slices
     - ``linear_rhs_parallel_slices_sweep.{json,png,pdf}``
     - Bounded engineering evidence for opt-in electrostatic linear RHS slices.
     - Electromagnetic, linked-boundary, collision, or nonlinear speedup.

Both ``tools/performance_optimization_manifest.toml`` and
``tools/validation_coverage_manifest.toml`` list these artifacts explicitly.
The tests require the manifests, files, and claim scopes to stay synchronized,
so deleting or silently reinterpreting a scaling artifact fails the fast
parallelization gate.