Performance
===========

JAX performance model
---------------------

SPECTRAX-GK uses JAX to compile array kernels ahead of time, enabling
vectorized, accelerator-ready performance while retaining automatic
differentiation. The linear operator and time integrator are designed to be
``jit``-friendly and to avoid Python-side loops in performance-critical paths.

The linear solver precomputes geometry-dependent arrays (gyroaverage
coefficients, drift components, mirror term, and zero-mode masks) in a ``LinearCache`` to
avoid recomputing them at each time step. This cache is reused inside the JIT
compiled integrator.

Cache profiling
---------------

We include a small timing harness that compares cached and uncached RHS
evaluation on a modest grid:

.. code-block:: bash

   python tools/profile_linear_cache.py

On a reference CPU run (Nx=Ny=16, Nz=32, Nl=2, Nm=4), this reported:

.. code-block:: text

   uncached_s=0.000426
   cached_s=0.000455
   speedup=0.94x

The exact speedup depends on hardware and problem size. As more geometry and
operator terms are cached (cv/gb/bgrad, hyper ratios), the overhead balance may
shift; in this run the cached path was roughly cost-neutral.

No speedup claim should be made from a local profile, scaling panel, or
parallelization artifact unless the matching numerical-identity gate and
hardware-specific profiler artifact are cited together. Current production
parallelization claims are limited to independent-work batching; whole-state
nonlinear sharding and nonlinear domain decomposition remain diagnostic unless
their workload-specific identity and profiler promotion gates pass.

Nonlinear profiling
-------------------

For end-to-end nonlinear performance, use the dedicated Cyclone profiling
driver. It supports Perfetto traces, XLA HLO dumps, and memory snapshots.

.. code-block:: bash

   python tools/profile_nonlinear_cyclone.py \
     --trace-dir /tmp/spectrax_nl_trace \
     --xla-dump-dir /tmp/spectrax_nl_xla \
     --steps 400 --dt 0.0377 --Nl 4 --Nm 8

The trace directory can be opened with Perfetto. For GPU profiling, set
``JAX_PLATFORM_NAME=gpu`` before invoking the script.
JAX writes the trace under
``<trace-dir>/plugins/profile/<timestamp>/*.trace.json.gz`` together with the
corresponding ``*.xplane.pb`` metadata; the same directory can be opened in
XProf, while the optional ``memory.prof`` snapshot can be inspected with
``pprof`` or XProf's memory tooling.

JAX/XProf operational notes
---------------------------

Two JAX runtime details matter when reading short-run performance numbers:

- JAX's persistent compilation cache can remove repeated recompilation cost for
  fixed signatures. For repeated local profiling runs, set
  ``JAX_COMPILATION_CACHE_DIR`` before the first compilation. This is useful
  for engineering sweeps, but the shipped runtime panel should remain a cold
  end-to-end measurement unless stated otherwise.
- JAX GPU runs preallocate most device memory by default. When diagnosing an
  out-of-memory failure on a shared machine, use
  ``XLA_PYTHON_CLIENT_PREALLOCATE=false`` or a reduced
  ``XLA_PYTHON_CLIENT_MEM_FRACTION`` during the profiling run. Those knobs are
  useful for debugging and tracing, but they should not silently change the
  published benchmark contract.

Recent nonlinear profiling (Cyclone, benchmark-locked config)
-------------------------------------------------------------

Reference run configuration (March 4, 2026):

- ``ky=0.3``, ``Nl=4``, ``Nm=8``
- ``dt=0.01``, ``steps=400``
- ``sample_stride=10``, ``diagnostics_stride=10``
- ``tools/profile_nonlinear_cyclone.py`` with the tracked Cyclone runtime config

CPU profiling (Apple CPU, JAX CPU backend):

.. code-block:: text

   warmup_time_s=117.803
   run_time_s=109.147

GPU profiling (A100-class GPU, JAX CUDA backend):

.. code-block:: text

   warmup_time_s=38.950
   run_time_s=21.350

HLO summary (``jit_scan.*_after_optimizations``):

- CPU: ``fft=623``, ``scatter=72``, ``gather=375``, ``dot=88``, ``fusion=1053``
- GPU: ``fft=440``, ``scatter=30``, ``gather=322``, ``dot=44``, ``fusion=831``

The nonlinear RHS remains FFT-heavy with nontrivial gather/scatter density.
Primary optimization targets are the FFT pipeline (channel stacking, reuse of
real-space gradients) and scatter removal in linked-FFT paths.

GPU memory report (jit_scan module):

- Total bytes used: ``228.21 MiB`` (XLA memory usage report).

Nonlinear benchmark harness
---------------------------

To capture per-step runtime and end-of-run diagnostics, use the nonlinear
benchmark harness:

.. code-block:: bash

   python tools/benchmark_nonlinear_suite.py --steps 200 --dt 0.0377 \
     --out /tmp/spectrax_nl_bench.csv

The harness records scalar diagnostics through the compact diagnostics path, so
it measures runtime without materializing mode-resolved history arrays unless a
separate publication artifact explicitly requests them.

To test the optional spectral nonlinear mode (no Laguerre quadrature grid):

.. code-block:: bash

   python tools/benchmark_nonlinear_suite.py --laguerre-mode spectral

You can optionally pass a reference-code log file to compare runtime per step:

.. code-block:: bash

   python tools/benchmark_nonlinear_suite.py --gx-log /path/to/gx_run.out

RHS kernel profile (nonlinear Cyclone)
--------------------------------------

The RHS split profiler measures field solve, nonlinear bracket, linear RHS, and
full RHS kernels after compilation:

.. code-block:: bash

   python tools/profile_nonlinear_step_split.py \
     --config examples/nonlinear/axisymmetric/runtime_cyclone_nonlinear_short.toml \
     --repeats 10 \
     --out docs/_static/nonlinear_rhs_profile_gpu.csv

.. image:: _static/nonlinear_rhs_profile.png
   :alt: SPECTRAX-GK nonlinear RHS kernel profile
   :align: center

The current bounded Cyclone profile separates CPU and ``office`` GPU timings
for default grid-mode and optional spectral-mode nonlinear brackets. The
machine-readable companion ``docs/_static/nonlinear_rhs_profile.json`` records
the dominant measured kernel, kernel fractions relative to the full RHS, and
grid-to-spectral speedups for each backend. The May 9, 2026 refresh used the
same short-case 10-repeat harness on local CPU and one ``office`` RTX A4000
with ``CUDA_VISIBLE_DEVICES=0`` after the linear-RHS fast-path and linked-FFT
refactor tranche. The GPU environment reported
``jax==0.6.2``/``jaxlib==0.6.2``; these are profiler-local hot-path
measurements, not a broad production runtime claim. The refreshed GPU
grid-mode split is:

.. code-block:: text

   field_solve=4.65e-4 s
   nonlinear_bracket=3.36e-3 s
   linear_rhs=6.13e-3 s
   full_rhs=9.66e-3 s

The same GPU profile with ``laguerre_mode="spectral"`` measured
``nonlinear_bracket=1.50e-3 s`` and ``full_rhs=6.38e-3 s``. CPU full-RHS
timings from the same harness were ``1.01e-1 s`` for grid mode and
``7.73e-2 s`` for spectral mode. The short-harness spectral full-RHS ratios
are now ``1.30`` on CPU and ``1.51`` on GPU for this Cyclone case, while the
nonlinear-bracket-only ratios are ``1.54`` on CPU and ``2.24`` on GPU. The
spectral mode therefore remains an opt-in mode guarded by the case-level
parity gate below rather than a global default.

The dominant remaining warm-throughput cost is the compiled linear RHS, with
the nonlinear FFT pipeline still relevant for larger production grids. The next
performance step is to repeat this split on larger benchmark-size cases and
then use profiler traces to decide whether fusion, layout changes, or
production decomposition give the largest verified win.

Benchmark-size Cyclone Miller RHS profile
-----------------------------------------

The larger Cyclone Miller profile uses the shipped nonlinear Miller input with
``Nx=192``, ``Ny=64``, ``Nz=24``, ``Nl=4``, and ``Nm=8``. This is still a
single compiled-RHS split profile rather than a full transport-average runtime
claim, but it is large enough to expose a different bottleneck balance than the
short Cyclone case.

.. code-block:: bash

   python tools/profile_nonlinear_step_split.py \
     --config examples/nonlinear/axisymmetric/runtime_cyclone_nonlinear_miller.toml \
     --repeats 5 \
     --out docs/_static/nonlinear_rhs_profile_miller_cpu.csv

.. image:: _static/nonlinear_rhs_profile_miller.png
   :alt: SPECTRAX-GK nonlinear RHS kernel profile on the Cyclone Miller benchmark-size case
   :align: center

The May 10, 2026 local CPU refresh after the independent-worker
parallelization tranche measured CPU full-RHS timings of ``3.48e-1 s`` in grid
mode and ``2.20e-1 s`` in spectral Laguerre mode, with measured sub-kernels
``linear_rhs=1.24e-1 s`` and ``nonlinear_bracket=9.89e-2 s`` in grid mode. On
one ``office`` RTX A4000, the tracked artifact still records corresponding
full-RHS timings of ``1.28e-2 s`` and ``1.48e-2 s``. Spectral mode reduces the
GPU nonlinear bracket by ``1.63x``, but the full GPU RHS remains faster in grid
mode because the optimized Laguerre transform removes enough layout overhead
while preserving the production grid-quadrature convention. The CPU refresh
continues to point the next optimization pass at linear-RHS fusion/cache
layout and larger-grid bracket decomposition, not at claiming a broad
nonlinear speedup from spectral mode alone.

The full fused nonlinear-RHS trace companion is generated with:

.. code-block:: bash

   python tools/profile_full_nonlinear_rhs_trace.py \
     --config examples/nonlinear/axisymmetric/runtime_cyclone_nonlinear_miller.toml \
     --ky 0.3 \
     --Nl 4 \
     --Nm 8 \
     --repeats 5 \
     --summary-json docs/_static/full_nonlinear_rhs_trace_summary.json

The tracked local CPU artifact
``docs/_static/full_nonlinear_rhs_trace_summary.json`` reports
``warm_seconds=3.35e-1`` and ``3343`` HLO lines. The matched one-RTX-A4000
artifact ``docs/_static/full_nonlinear_rhs_trace_gpu_summary.json`` reports
``warm_seconds=1.28e-2`` and ``3336`` HLO lines. The GPU token triage is
dominated by reshapes (``1545``), broadcasts (``1822``), multiplies (``871``),
FFTs (``229``), slices (``215``), and reductions (``132``). This confirms that
the next nonlinear performance tranche should target fused layout and bracket
data movement rather than claiming a new runtime speedup from the linear-RHS
specialization alone. The same tranche removed a duplicated non-Laguerre field
mask from the nonlinear bracket path and then replaced the Laguerre-grid
``moveaxis``/``tensordot``/``moveaxis`` transforms with precision-controlled
``einsum`` calls. Transform-only CPU/GPU probes showed exact agreement with
the previous algebra at the tested precision; on one RTX A4000 the full fused
nonlinear-RHS trace improved from ``1.49e-2 s`` to ``1.28e-2 s`` while
transposes dropped from ``44`` to ``32``. This is a bounded profiler-state
source improvement, not a full transport runtime claim.

Runtime-mode stellarator RHS smoke profile
------------------------------------------

The release-performance gate also tracks W7-X and HSX at their documented
adiabatic-electron nonlinear runtime mode (``Nx=96``, ``Ny=96``, ``Nz=48``,
``Nl=4``, ``Nm=8``, and the runtime ``k_y=1/21``). These are not full
transport-average timings; they are single-state RHS split profiles used to
verify that the optimized grid-Laguerre path, VMEC/EIK geometry inputs, and
CPU/GPU hot-path accounting remain consistent on non-axisymmetric cases.

.. image:: _static/nonlinear_rhs_profile_stellarator_runtime.png
   :alt: SPECTRAX-GK nonlinear RHS kernel profile on W7-X and HSX runtime-mode stellarator cases
   :align: center

The tracked artifact
``docs/_static/nonlinear_rhs_profile_stellarator_runtime.json`` reports W7-X
CPU/GPU full-RHS timings of ``3.09e-1 s`` and ``2.73e-2 s`` and HSX CPU/GPU
full-RHS timings of ``3.09e-1 s`` and ``2.71e-2 s``. In both stellarator
runtime-mode profiles the GPU row is dominated by the nonlinear bracket
(``~59-60%`` of the measured full RHS), with the linear RHS contributing
``~42%``. That closes the release-level performance evidence: CPU/GPU
profiler artifacts are current, the sharding identity gate is tracked, and the
documentation makes only bounded profiler claims. Further nonlinear speedup
work should target larger-state bracket/linear-RHS layout and must keep using
identity gates before any production runtime claim.

Linear RHS term profile
-----------------------

The linear RHS split profiler drills into the compiled linear contribution used
inside nonlinear runs:

.. code-block:: bash

   python tools/profile_linear_rhs_terms.py \
     --config examples/nonlinear/axisymmetric/runtime_cyclone_nonlinear.toml \
     --ky 0.3 \
     --Nl 4 \
     --Nm 8 \
     --repeats 8 \
     --out docs/_static/linear_rhs_terms_profile_cpu.csv \
     --summary-json docs/_static/linear_rhs_terms_profile.json

After the zero-collision fast path and linked-FFT refactor, the May 9, 2026
CPU Cyclone artifact reports ``full_linear_rhs=1.08e-1 s`` for the compiled
full linear RHS call in this profiling harness. The independently timed term
kernels sum to ``1.68e-2 s``; this remaining gap is a localization signal, not
a speedup claim, because the full path recomputes the field solve, ``H``
assembly, and all weighted contributions as one compiled graph. The largest
standalone terms are hypercollisions (``2.39e-3 s``), linked ``|k_z|`` setup
(``2.38e-3 s``), and streaming (``2.34e-3 s``). The accepted
zero-collision branch now costs ``1.11e-3 s`` in the standalone CPU timing and
is guarded by the state-window identity gate below.

The active-state CPU companion
``docs/_static/linear_rhs_terms_profile_z_wave_cpu.json`` profiles the same
state after injecting a resolved parallel perturbation. There the hypercollision
and linked ``|k_z|`` norms are both ``2.35e-4`` and the linked ``|k_z|`` path
costs ``2.33e-3 s`` on CPU. This is the artifact that should be used for
linked-``|k_z|`` optimization decisions; the initial-state profile is only a
zero-source baseline.

For the larger Cyclone Miller benchmark-size RHS profile above, the active-state
CPU companion is
``docs/_static/linear_rhs_terms_profile_miller_cpu.json``. It uses the same
``Nl=4``, ``Nm=8``, ``k_y=0.3`` state as the nonlinear Miller profiler and
reports ``full_linear_rhs=2.93e-1 s`` with independently timed terms summing to
``4.83e-2 s``. The largest nonzero standalone row is streaming
(``7.33e-3 s``), followed by linked ``\partial_z`` (``6.39e-3 s``), linked
``|k_z|`` (``6.18e-3 s``), and hypercollisions (``6.20e-3 s``). That keeps the
next bounded optimization focused on full-graph layout/fusion and reusable
state transforms rather than making a standalone-term speedup claim.

The matching ``office`` GPU profile is tracked in
``docs/_static/linear_rhs_terms_profile_gpu.json`` and
``docs/_static/linear_rhs_terms_profile_gpu.csv``. On one RTX A4000 with the
same ``jax==0.6.2``/``jaxlib==0.6.2`` environment used for the nonlinear RHS
refresh, it reports ``full_linear_rhs=5.50e-3 s`` and independently timed terms
summing to ``3.41e-3 s``. The accepted zero-collision branch costs
``1.24e-4 s`` in the standalone GPU timing; hypercollisions and linked
``|k_z|`` remain present as separately profiled rows. The active-state GPU
companion
``docs/_static/linear_rhs_terms_profile_z_wave_gpu.json`` activates the same
operator pair with matched norms ``2.35e-4`` and records linked ``|k_z|`` at
``3.63e-4 s`` with ``full_linear_rhs=5.48e-3 s``.

The companion state-window gate is generated with:

.. code-block:: bash

   python tools/gate_linear_rhs_zero_norm_state_window.py \
     --config examples/nonlinear/axisymmetric/runtime_cyclone_nonlinear.toml \
     --ky 0.3 \
     --Nl 4 \
     --Nm 8 \
     --out-json docs/_static/linear_rhs_zero_norm_state_window_gate.json

The current gate passes by accepting the zero-collision skip for this
``nu=0`` Cyclone window while rejecting a hypercollision skip: the initial
state has zero relative hypercollision skip error, but the resolved
``z``-varying state reaches ``3.59e-3``. This protects the optimization path
from incorrectly disabling linked ``|k_z|`` hypercollisions based only on the
initial-state profile.

Full fused linear RHS trace
---------------------------

The term profiler above times independently isolated kernels. The companion
full-graph profiler lowers and times the production ``linear_rhs_cached`` entry
point for a real runtime TOML so optimization work can target the compiled
graph seen by executable linear runs rather than only the standalone assembly
helper:

.. code-block:: bash

   python tools/profile_full_linear_rhs_trace.py \
     --config examples/nonlinear/axisymmetric/runtime_cyclone_nonlinear_miller.toml \
     --ky 0.3 \
     --Nl 4 \
     --Nm 8 \
     --repeats 3 \
     --summary-json docs/_static/full_linear_rhs_trace_summary.json

The May 11, 2026 local CPU production-path artifacts record
``source="spectraxgk.linear.linear_rhs_cached"`` and
``force_electrostatic_fields=true``. The initial-state companion reports
``warm_seconds=1.54e-1`` and ``compile_execute_seconds=1.02``. The active
``z_wave`` companion injects resolved parallel variation and reports
``warm_seconds=8.38e-2`` with the same specialized HLO shape. Both summaries
contain ``2779`` HLO lines and highlight the remaining graph-level pressure
points: broadcasts (``983`` coarse token hits), reshapes (``578``), FFT
mentions (``312``), reductions (``316``), multiplies (``200``), and gathers
(``51``). These are localization metrics, not standalone runtime claims. The
source-path change means these artifacts should be compared against future
production-path refreshes, not against older lower-level assembly-helper
artifacts.

The May 11, 2026 one-RTX-A4000 production-path artifacts
``docs/_static/full_linear_rhs_trace_gpu_summary.json`` and
``docs/_static/full_linear_rhs_trace_gpu_z_wave_summary.json`` report
``source="spectraxgk.linear.linear_rhs_cached"``, ``2779`` HLO lines, and
``force_electrostatic_fields=true``. The initial and active ``z_wave`` states
measure ``warm_seconds=5.13e-3`` and ``5.15e-3``, respectively. These GPU
artifacts show that the production linear-RHS path remains about five
milliseconds on one RTX A4000 for this benchmark-size RHS call, but they remain
kernel-localization evidence rather than a full nonlinear runtime claim.

Parallelization scaling guardrail
---------------------------------

The legacy two-device linear scaling figure remains an engineering artifact, not
the headline production parallelization claim. Current user-facing scaling
claims should point to the independent ``k_y`` scan and quasilinear/UQ ensemble
figures below, because those paths preserve serial ordering and have explicit
solver-observable identity gates.

.. image:: _static/scaling_speedup.png
   :alt: SPECTRAX-GK scaling speedup
   :align: center

The raw sweep data lives in ``docs/_static/scaling_speedup_data.csv`` and can
be replotted with:

.. code-block:: bash

   python tools/plot_scaling_speedup.py

The exploratory distributed-RK2 strong-scaling data is still tracked for
engineering work, but it is intentionally not presented as a headline
publication figure because the current curve is dominated by communication
overhead rather than near-ideal scaling.

Production parallelization should start with independent work rather than
nonlinear domain decomposition. The public helpers
``spectraxgk.ky_scan_batches`` and ``spectraxgk.batch_map`` split ``k_y``
scans, sensitivity sweeps, and UQ ensembles while preserving serial ordering.
On one device they reduce to batched ``vmap`` execution; on multiple devices
they use JAX device batching and trim padded edge samples deterministically.
Every performance claim from this path should include a numerical-identity
gate against the serial result before a speedup plot is promoted.

For the release-scale CPU/GPU panels below, the acceptance contract is
machine-checkable: the combined ``*_large`` artifact must cite split CPU and
GPU JSON/CSV/PNG/PDF companions, each split artifact must include the grid,
warmup/repeat policy, backend/device counts, positive timing samples, and
per-row identity results, and any speedup statement must name the artifact that
supports it. Whole-state nonlinear sharding uses the same large-run artifact
shape, but its timing ratios remain profiler evidence and not a production
nonlinear speedup claim unless a future matched workload refresh adds the
missing identity, full nonlinear communication, transport-window, and profiler
promotion gates.

The first release-grade gate for this policy is a real Cyclone linear
``k_y``-scan comparison:

.. image:: _static/parallel_ky_scan_gate.png
   :alt: SPECTRAX-GK ky-batch parallelization identity gate
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/generate_parallel_ky_scan_gate.py

This gate runs the same linear solver serially and with fixed-shape
``k_y`` batching, checks ``gamma`` and ``omega`` numerical identity, and
reports observed speedup as an engineering metric. The gate intentionally does
not claim nonlinear domain scaling; that remains a separate communication and
FFT-decomposition problem.

The complementary logical-CPU gate exercises the public
``RuntimeParallelConfig`` and ``batch_map`` interface on a structured JAX
pytree output. It is not a gyrokinetic physics validation; it verifies that the
parallel API preserves serial numerical identity for independent scan/UQ-style
workloads before those workloads are connected to heavier solver paths.

.. image:: _static/logical_cpu_parallel_scan_gate.png
   :alt: SPECTRAX-GK logical CPU parallel scan identity gate
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/generate_logical_cpu_parallel_scan_gate.py --logical-devices 2

The tracked artifact used two logical CPU devices and passed the identity gate:
``max_gamma_rel_error=6.7e-8``, ``max_ql_rel_error=1.1e-7``, and
``max_omega_abs_error=0``. The observed timing is retained as engineering
metadata only; a speedup claim requires a solver-backed workload and fresh
CPU/GPU profiler artifacts.

The solver-backed strong-scaling artifact now exercises that production
policy on a larger real Cyclone linear scan with 64 independent
``k_y`` values, ``Ny=128``, ``Nz=96``, ``Nl=4``, ``Nm=8``, and ``240`` RK2
steps per mode. Each worker performs one warmup scan before the timed repeats,
and every multi-worker result is compared against the one-worker reference for
``gamma`` and ``omega`` identity:

.. image:: _static/independent_ky_scan_scaling_large.png
   :alt: SPECTRAX-GK independent ky scan CPU/GPU strong-scaling artifact
   :align: center

It is regenerated with:

.. code-block:: bash

   KY=$(python -c "print(','.join(f'{0.04 + 0.0125*i:.3f}' for i in range(64)))")

   python tools/profile_independent_ky_scan_scaling.py \
     --backend cpu --devices 1,2,4,8 \
     --ky "$KY" \
     --ny 128 --nz 96 --nl 4 --nm 8 --steps 240 \
     --out-prefix docs/_static/independent_ky_scan_scaling_cpu_large

   python tools/profile_independent_ky_scan_scaling.py \
     --backend gpu --devices 1,2 \
     --ky "$KY" \
     --ny 128 --nz 96 --nl 4 --nm 8 --steps 240 \
     --out-prefix docs/_static/independent_ky_scan_scaling_gpu_large

   python tools/plot_independent_ky_scan_scaling.py

The May 12, 2026 refresh passes the identity gate with zero reported
``gamma``/``omega`` mismatch. CPU process scaling reaches ``1.94x`` on two
workers, ``3.78x`` on four workers, and ``7.18x`` on eight workers. The
two-GPU RTX A4000 run reaches ``1.88x`` with about ``94%`` parallel
efficiency. This is the current recommended production parallelization path
for linear scans, quasilinear studies, sensitivity sweeps, and UQ ensembles:
it has much better scaling behavior than whole-state nonlinear sharding
because communication is restricted to post-run result aggregation.

The release closure status is machine-readable and separates production claims
from diagnostic decomposition work:

.. image:: _static/parallelization_completion_status.png
   :alt: SPECTRAX-GK parallelization closure status
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/build_parallelization_completion_status.py

The tracked ``docs/_static/parallelization_completion_status.json`` reports
``production_completion_percent = 100`` for independent ``k_y`` scans and
quasilinear/UQ ensembles. The same artifact keeps whole-state nonlinear
sharding and FFT-axis decomposition at diagnostic status until runtime
distributed communication, conservation, transport-window, and profiler-backed
speedup gates are closed.

The nonlinear state-domain prototype now has a stronger diagnostic gate in
``docs/_static/nonlinear_domain_parallel_identity_gate.json``. In addition to
the one-step serial-vs-halo-decomposed state check, the embedded
``nonlinear_domain_transport_window_identity`` report advances a short
fixed-step window and compares boundary identity plus mass, free-energy-proxy,
and boundary-flux-proxy traces. Those trace drifts are agreement metadata for
the diagnostic local stencil only. They do not validate production conservation,
distributed FFT routing, field solves, benchmark transport windows, or any
speedup claim.

The same independent-worker policy is also gated on a quasilinear/UQ-style
ensemble: six late-time Cyclone ITG gradient samples, five ``k_y`` values per
sample, ``Ny=96``, ``Nz=64``, ``Nl=3``, ``Nm=6``, and ``2000`` RK2 steps per
mode. Each worker computes real late-time linear growth/frequency fits and a
reduced mixing-length feature observable. The observable is useful for
parallelization and UQ plumbing, but it is not promoted as an absolute
nonlinear heat-flux predictor.

.. image:: _static/quasilinear_uq_ensemble_scaling_large.png
   :alt: SPECTRAX-GK quasilinear UQ ensemble CPU/GPU strong-scaling artifact
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/profile_quasilinear_uq_ensemble_scaling.py \
     --backend cpu --devices 1,2,4,8 \
     --out-prefix docs/_static/quasilinear_uq_ensemble_scaling_cpu_large

   python tools/profile_quasilinear_uq_ensemble_scaling.py \
     --backend gpu --devices 1,2 \
     --out-prefix docs/_static/quasilinear_uq_ensemble_scaling_gpu_large

   python tools/plot_quasilinear_uq_ensemble_scaling.py

The May 10, 2026 ``office`` sweep passes the serial identity gate for both the
reduced quasilinear proxy and ``gamma``. The CPU run reaches ``1.70x`` on two
workers, ``2.75x`` on four workers, and ``5.41x`` on eight requested workers
using six actual ensemble chunks. The two-GPU RTX A4000 run reaches ``1.71x``
with about ``86%`` parallel efficiency. This closes the release engineering
gate for quasilinear calibration grids, finite-difference checks, sensitivity
sweeps, and UQ ensembles that can be decomposed into independent solver calls.

The future nonlinear-decomposition promotion plan follows the same conservative
rule. ``spectraxgk.build_velocity_sharding_plan`` records a GX-inspired
species-first, Hermite-second velocity-space layout, including which axes need
Hermite ghost exchange and which axes need field-solve reductions and
broadcasts. This is planning metadata, not yet a nonlinear speedup path. It is
used to keep future ``shard_map`` work explicit about communication before any
transport-runtime claim is made.

The first concrete communication-kernel gate is the Hermite ghost exchange.
It uses ``jax.shard_map`` to exchange nearest-neighbor Hermite moments across a
two-device logical CPU mesh and compares the result against the full-array
reference shift with zero physical boundaries:

.. image:: _static/hermite_exchange_gate.png
   :alt: SPECTRAX-GK Hermite ghost-exchange identity gate
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/generate_hermite_exchange_gate.py --logical-devices 2

The tracked artifact passes with zero reported lower/upper neighbor error. It
only validates the communication primitive. Promoting nonlinear velocity-space
decomposition beyond diagnostic evidence still needs field-reduction/broadcast
gates, streaming-operator identity gates, full-RHS identity gates, and profiler
artifacts before any speedup claim.

The matching velocity-space field-reduction gate validates the second required
communication primitive. It reduces the Hermite-sharded local contributions
with ``lax.psum`` and compares against the full-array reference sum:

.. image:: _static/velocity_field_reduce_gate.png
   :alt: SPECTRAX-GK velocity field-reduction identity gate
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/generate_velocity_field_reduce_gate.py --logical-devices 2

The tracked artifact passes with ``max_abs_error=3.9e-6`` under an absolute
tolerance of ``1e-5``. This tolerance reflects expected float32 roundoff from a
different reduction tree; it is not a physics tolerance. The next gate must
combine Hermite exchange and field reduction with the actual streaming
coefficients.

The electrostatic field-reduction gate applies the same ``lax.psum`` pattern
to the actual ``m=0`` density moment used by quasineutrality and compares the
resulting ``phi`` against the production field solve:

.. image:: _static/electrostatic_field_reduce_gate.png
   :alt: SPECTRAX-GK electrostatic field-reduction identity gate
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/generate_electrostatic_field_reduce_gate.py --logical-devices 2

The tracked artifact passes exactly on the current single-species periodic
gate with ``phi_norm=1.68e-1`` and zero reported absolute/relative error. This
is the first true sharded field-reduction solve gate; multi-species,
linked-boundary, electromagnetic, and nonlinear field solves remain separate
gates.

That coefficient gate is now tracked separately. It applies the
``sqrt(m+1)`` upper-neighbor and ``sqrt(m)`` lower-neighbor Hermite streaming
ladder on top of the shard-map exchange and records the paired field-reduction
error:

.. image:: _static/hermite_streaming_ladder_gate.png
   :alt: SPECTRAX-GK Hermite streaming-ladder identity gate
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/generate_hermite_streaming_ladder_gate.py --logical-devices 2

The tracked artifact passes with zero ladder error and records an accompanying
Hermite field-reduction error of ``1.9e-6``. This closes the communication and
coefficient layer for a one-dimensional Hermite mesh. The next step is an
opt-in linear streaming microkernel that includes the actual parallel
derivative contract.

The electrostatic drift-slice gate then uses offset-1 and offset-2 Hermite
exchanges for mirror and curvature terms, together with the electrostatic
field-reduction gate:

.. image:: _static/electrostatic_drift_gate.png
   :alt: SPECTRAX-GK electrostatic drift-slice identity gate
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/generate_electrostatic_drift_gate.py --logical-devices 2

The tracked artifact passes with ``phi_norm=1.21e-1`` and zero reported
absolute/relative error for the mirror, curvature/grad-B, and combined drift
slices. This is a single-species periodic electrostatic identity gate, not a
full-RHS, linked-boundary, electromagnetic, or nonlinear performance claim.
The gated slices are available together through
``spectraxgk.linear_rhs_parallel_cached`` with
``RuntimeParallelConfig(strategy="velocity", axis="hermite",
backend="electrostatic_linear_slices")``.

The electrostatic diamagnetic-drive gate validates the remaining local
electrostatic drive slice. It uses the Hermite-sharded electrostatic field
reduction, then applies the local ``m=0`` and ``m=2`` density/temperature
gradient masks on each Hermite shard:

.. image:: _static/electrostatic_diamagnetic_gate.png
   :alt: SPECTRAX-GK electrostatic diamagnetic-drive identity gate
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/generate_electrostatic_diamagnetic_gate.py --logical-devices 2

The tracked artifact passes with ``phi_norm=1.68e-1`` and zero reported
absolute/relative error against the production diamagnetic-only linear RHS.
The opt-in ``backend="electrostatic_linear_slices"`` route now combines
streaming, mirror, curvature, grad-B, and diamagnetic slices. It still rejects
collision, electromagnetic, linked-boundary, multi-species, and nonlinear
terms until each path has its own identity gate.

The periodic linear-streaming microkernel gate then adds the spectral
parallel derivative along the field-line direction and compares the resulting
``shard_map`` path directly against the production
``spectraxgk.terms.operators.streaming_term``:

.. image:: _static/periodic_streaming_microkernel_gate.png
   :alt: SPECTRAX-GK periodic streaming microkernel identity gate
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/generate_periodic_streaming_microkernel_gate.py --logical-devices 2

The tracked artifact passes with zero reported absolute and relative error.
This is still a linear streaming microkernel gate, not a full linear RHS or
nonlinear performance claim.

The next release gate exercises the same periodic streaming path through the
production ``linear_rhs_cached`` call graph. The artifact disables all
non-streaming terms, keeps electromagnetic channels off, and uses non-density
Hermite moments so that the electrostatic field solve is exactly zero:

.. image:: _static/linear_rhs_streaming_gate.png
   :alt: SPECTRAX-GK streaming-only linear RHS identity gate
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/generate_linear_rhs_streaming_gate.py --logical-devices 2

The tracked artifact passes with ``max_abs_error=9.7e-7``,
``max_rel_error=5.6e-7``, and ``phi_norm=0``. This closes a streaming-only
linear-RHS identity gate. It deliberately does not claim full-RHS, nonlinear,
or production speedup parity; those remain separate gates with additional
field-solve, drive, collision, bracket, and profiler coverage.

For code-level experiments the same route is available through
``spectraxgk.linear_rhs_parallel_cached`` with
``RuntimeParallelConfig(strategy="velocity", axis="hermite",
backend="streaming_only")``. The helper rejects any non-streaming term weights
so this remains a disabled-by-default diagnostic path rather than a hidden
solver change.

The follow-on electrostatic streaming gate keeps the term weights identical
but initializes an ``m=0`` density perturbation so that the production
electrostatic field solve produces nonzero ``phi``. It then compares
``linear_rhs_cached`` against the explicit
``backend="streaming_electrostatic"`` route:

.. image:: _static/linear_rhs_streaming_electrostatic_gate.png
   :alt: SPECTRAX-GK electrostatic streaming linear RHS identity gate
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/generate_linear_rhs_streaming_electrostatic_gate.py --logical-devices 2

The tracked artifact passes with ``phi_norm=1.34e-1``,
``max_phi_abs_error=1.9e-9``, ``max_abs_error=1.4e-7``, and
``max_rel_error=4.1e-7``. The field solve uses the single-species
Hermite-sharded electrostatic reduction gate above; this validates the
field-reduction-to-streaming call graph before the drift, diamagnetic-drive,
and nonlinear paths are introduced.

The current composed electrostatic linear-RHS gate then exercises the opt-in
``backend="electrostatic_linear_slices"`` route against the serial production
RHS with streaming, mirror, curvature, grad-B, and diamagnetic drive enabled:

.. image:: _static/linear_rhs_electrostatic_slices_gate.png
   :alt: SPECTRAX-GK composed electrostatic linear-slices identity gate
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/generate_linear_rhs_electrostatic_slices_gate.py --logical-devices 2

The tracked artifact passes with ``phi_norm=1.68e-1``,
``max_abs_error=1.5e-7``, ``max_rel_error=3.7e-7``, and zero reported
electrostatic-potential error. This is the current single-species periodic
electrostatic linear-RHS identity gate for velocity-space parallelization. It
is not a linked-boundary, collision, electromagnetic, nonlinear, or speedup
claim.

The matching engineering profile intentionally stays separate from the
identity gate:

.. image:: _static/linear_rhs_parallel_slices_profile.png
   :alt: SPECTRAX-GK electrostatic linear-slices parallelization profile
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/profile_linear_rhs_parallel_slices.py \
     --logical-devices 8 --nl 4 --nm 128 --ny 32 --nz 128 --rtol 1e-5

The tracked CPU artifact uses a Hermite-heavy workload and keeps the sharded
route within a float32 reduction-order engineering tolerance:
``max_abs_error=2.4e-6``, ``max_rel_error=6.0e-6``, and
``max_phi_abs_error=7.5e-9``. The warm timings are
``serial_median_s=4.37e-2``, ``sharded_median_s=3.11e-2``, and
``speedup=1.40x`` on eight logical CPU devices. This remains an engineering
profile rather than a publication speedup claim; the stricter small-grid
identity gate above is the release correctness gate.

A compact CPU sweep maps the same opt-in route across Hermite resolution and
logical device count:

.. image:: _static/linear_rhs_parallel_slices_sweep.png
   :alt: SPECTRAX-GK electrostatic linear-slices parallelization sweep
   :align: center

It is regenerated with:

.. code-block:: bash

   python tools/profile_linear_rhs_parallel_slices_sweep.py \
     --platform cpu --devices 1,2,4,8 --nms 64,128 \
     --nl 4 --ny 32 --nz 128 --rtol 1e-5

The tracked sweep passes identity for all points. It shows the current
Hermite-sharded electrostatic route is overhead-limited at one and two logical
CPU devices, becomes competitive near four devices, and reaches the best
bounded engineering point of ``1.57x`` at ``Nm=128`` on four logical CPU
devices. This figure is a regime map for development, not a broad scaling
claim. The machine-readable release contract is
``docs/_static/linear_rhs_parallel_slices_sweep.json`` with CSV/PNG/PDF
companions.

The same profiler can target GPUs on the office node:

.. code-block:: bash

   PYTHONPATH=/tmp/spectrax-gk-profile/src python3 \
     tools/profile_linear_rhs_parallel_slices.py \
     --platform gpu --logical-devices 2 \
     --nl 4 --nm 64 --ny 32 --nz 128 --rtol 1e-5 \
     --out-prefix docs/_static/linear_rhs_parallel_slices_profile_gpu

The tracked two-RTX-A4000 artifact passes the engineering identity check
(``max_abs_error=1.9e-6``, ``max_rel_error=4.7e-6``), but it is much slower
than the single-GPU serial JIT path (``speedup=0.03x``). This keeps the GPU
Hermite-sharding lane open: do not claim GPU speedup until the communication
layout is redesigned or a larger production workload shows a real gain.

Fixed-step nonlinear state sharding
-----------------------------------

The fixed-step nonlinear runner now has the same full-state sharding contract
as the linear path for release-gated state axes. Set
``TimeConfig.state_sharding = "auto"`` (or a concrete axis such as ``"ky"`` or
``"kx"``) with ``use_diffrax = false`` to route through
``spectraxgk.integrate_nonlinear_sharded``. The implementation uses a ``pjit``
scan and preserves the serial Runge-Kutta update; it is therefore an
identity-gated state-sharding primitive, not a halo-exchange FFT domain
decomposition claim. Sharding the ``z`` FFT axis is deliberately not exposed as
a release-gated nonlinear runtime path because the current JAX/XLA FFT layout
does not pass the multi-device identity gate.

The profiler/identity artifact is generated with:

.. code-block:: bash

   python tools/profile_nonlinear_sharding.py \
     --sharding auto --sharding-options auto,kx \
     --out-json docs/_static/nonlinear_sharding_profile.json

The JSON records device count, requested sharding axis, warm serial/sharded
timings, profiler-trace status, final-state errors, final-field/RHS diagnostic
errors, and the fastest identity-preserving candidate among the requested
state-axis options. Refreshed profiler outputs also carry a versioned source
contract with the exact command, command argv, source artifact, backend, device
count, sharding axis, warmup/repeat policy, and Python/JAX/NumPy/SPECTRAX-GK
versions. The fast parallel artifact checker validates that contract whenever
it is present; older checked-in diagnostic profiles remain scoped until they
are refreshed with the same metadata. The
local checked-in artifact is deliberately small and only establishes the
control-flow and single-device identity gate. The two-GPU office artifact at
``docs/_static/nonlinear_sharding_profile_office_gpu.json`` records active
``auto``/``kx`` state sharding with zero final-state, final-field, and final-RHS
diagnostic error on both candidate axes. In the current bounded run the
requested ``auto`` path is slower (``0.81x``), while the best
identity-preserving candidate is explicit ``kx`` sharding at about ``0.96x``.
That is not a speedup, so this artifact should be treated as a correctness and
profiler-localization gate rather than a publication runtime claim.

The larger strong-scaling sweep is regenerated with isolated subprocesses so
each device count gets a clean JAX runtime:

.. code-block:: bash

   python tools/profile_nonlinear_sharding_sweep.py \
     --backend cpu --devices 1,2,4,8 \
     --nx 24 --ny 48 --nz 96 --nl 4 --nm 8 --steps 8 \
     --out-prefix docs/_static/nonlinear_sharding_strong_scaling_cpu_large

   python tools/profile_nonlinear_sharding_sweep.py \
     --backend gpu --devices 1,2 \
     --nx 48 --ny 96 --nz 128 --nl 4 --nm 8 --steps 12 \
     --out-prefix docs/_static/nonlinear_sharding_strong_scaling_gpu_xlarge

   # Equivalent office two-GPU profile preset with JAX traces enabled.
   python tools/profile_nonlinear_sharding_sweep.py --office-gpu-xlarge

   python tools/plot_nonlinear_sharding_strong_scaling.py

   python tools/generate_nonlinear_sharding_production_gate.py

.. image:: _static/nonlinear_sharding_strong_scaling_large.png
   :alt: SPECTRAX-GK large nonlinear whole-state sharding strong-scaling artifact
   :align: center

The May 10, 2026 large sweep passes the final-state identity gate at every
tracked point. The CPU logical-device path improves from one to four devices
but saturates at about ``1.39x`` and does not improve further at eight logical
devices. The two-RTX-A4000 GPU path is slower than one GPU even for the larger
``Nx=48, Ny=96, Nz=128, Nl=4, Nm=8`` fixed-step case, with a measured speedup
of about ``0.63x``. This makes the technical conclusion explicit: the current
whole-state nonlinear sharding path is useful as a correctness/profiler gate,
but production parallelization should prioritize independent ``k_y`` scans and
UQ/ensemble batching. Communication-aware nonlinear domain decomposition remains
diagnostic until the exact workload has identity, communication,
transport-window, and matched profiler evidence for any nonlinear multi-GPU
speedup claim.
The production gate fails closed as ``diagnostic_only`` unless the refreshed
CPU and GPU rows both pass serial identity, use active state sharding, and meet
the configured speedup and parallel-efficiency thresholds. The tracked gate
artifact is ``docs/_static/nonlinear_sharding_production_speedup_gate.json``;
in the current artifact set the CPU two-device row passes, but the GPU row
blocks production speedup claims.
The raw sweep JSON files also carry ``speedup_passed``, ``status``, and
``speedup_blockers`` fields so a timeout, profiler failure, or identity-correct
slowdown is visible before the stricter production gate is evaluated.

This claim boundary is mirrored in :doc:`parallelization` and
:doc:`release_scope`. If a future optimization changes the conclusion, refresh
the CPU and GPU sweep artifacts before changing README or release-note wording.

Spectral nonlinear mode (gated fast toggle)
-------------------------------------------

The spectral nonlinear mode skips Laguerre quadrature for the nonlinear bracket
(``laguerre_nonlinear_mode = "spectral"`` or ``"fast"``). It is not the default
mode because the speedup is case and backend dependent. The release gate runs
the same bounded nonlinear case twice, once with default grid-mode brackets and
once with spectral brackets, then compares end-of-run scalar diagnostics.

.. code-block:: bash

   python tools/gate_laguerre_nonlinear_modes.py \
     --case cyclone --case kbm --case w7x --case hsx \
     --out-json docs/_static/laguerre_mode_gate.json \
     --out-csv docs/_static/laguerre_mode_gate.csv \
     --plot-out docs/_static/laguerre_mode_gate.png

For a GPU reference artifact, run the same command on the target GPU node with
GPU-specific output paths, for example:

.. code-block:: bash

   python tools/gate_laguerre_nonlinear_modes.py \
     --case cyclone --case kbm --case w7x --case hsx \
     --out-json docs/_static/laguerre_mode_gate_gpu.json \
     --out-csv docs/_static/laguerre_mode_gate_gpu.csv \
     --plot-out docs/_static/laguerre_mode_gate_gpu.png

For W7-X/HSX runs, pass ``--w7x-geometry-file`` and
``--hsx-geometry-file`` if the local pre-generated ``*.eik.nc`` files live
outside the default cache paths.

.. image:: _static/laguerre_mode_gate.png
   :alt: SPECTRAX-GK spectral Laguerre nonlinear mode gate on CPU
   :align: center

.. image:: _static/laguerre_mode_gate_gpu.png
   :alt: SPECTRAX-GK spectral Laguerre nonlinear mode gate on GPU
   :align: center

On the bounded local CPU gate, Cyclone, KBM, W7-X, and HSX all passed the
scalar-diagnostic parity threshold with maximum relative differences below
``8.9e-4``. The measured grid/spectral runtime ratios were:

- Cyclone: ``2.90``
- KBM: ``3.31``
- W7-X: ``1.67``
- HSX: ``0.66``

On the bounded ``office`` GPU gate, all four cases also passed with maximum
relative differences below ``2.2e-5``. The measured grid/spectral runtime
ratios were:

- Cyclone: ``1.66``
- KBM: ``2.69``
- W7-X: ``1.63``
- HSX: ``0.74``

Because HSX is slower in both bounded gates, the spectral mode should be
treated as a validated optional engineering mode, not a global fast default.
Production use should rerun the gate on the target case and backend before
claiming speedup.

Runtime and memory comparison workflow
--------------------------------------

For the publication runtime comparison pass, use the manifest-driven runner:

.. code-block:: bash

   python tools/benchmark_runtime_memory.py --list
   python tools/benchmark_runtime_memory.py --dry-run --case cyclone-linear --backend spectrax_cpu
   python tools/benchmark_runtime_memory.py --continue-on-error --log-dir tools_out/runtime_memory_logs

The runner reads ``tools/runtime_memory_manifest.toml`` and writes:

- ``tools_out/runtime_memory_results.csv``
- ``tools_out/runtime_memory_summary.json``
- ``tools_out/runtime_memory_logs/*.stdout.log``
- ``tools_out/runtime_memory_logs/*.stderr.log``
- ``docs/_static/runtime_memory_benchmark.png``

The manifest is designed to hold three rows per case:

- ``spectrax_cpu``
- ``spectrax_gpu``
- ``gx``

Each row may also carry a ``host`` so the same runner can execute local and
remote measurements through one manifest while still collecting wall time and
peak RSS from the target machine.
Rows may also carry a ``profile_command``. When that secondary command succeeds
and prints ``warmup_time_s=...`` / ``run_time_s=...``, the runner merges those
warm measurements back into the same CSV/JSON summary row as the cold pass.
If a profiling command prints ``warmup_time_s=...`` or ``run_time_s=...``, the
runner also records those fields in the CSV/JSON summary so cold and warm JAX
timings can be tracked without a separate sidecar note.

The checked-in case inventory for the current release panel covers the shipped
runtime families:

- Cyclone ITG linear and nonlinear
- ETG linear
- KBM linear and nonlinear
- W7-X linear and nonlinear
- HSX linear and nonlinear
- Cyclone Miller nonlinear

These rows are the ones shown in the README/runtime panel. ETG nonlinear,
KAW, and TEM remain separate tracked work items and are intentionally excluded
from the shipped runtime figure until their release-grade benchmark contracts
are closed.

For the stellarator rows on the `office` benchmark host, the shipped panel
uses pre-generated `*.eik.nc` geometry files instead of live VMEC
regeneration. The GX reference rows on that host also need a consistent local
`netcdf-c` / `hdf5` runtime stack; the default `office` stellarator runtime
environment mixed incompatible HDF5 / NetCDF libraries and lacked the Python
geometry helper dependencies needed for VMEC-driven geometry generation.

Final runtime/memory figure
---------------------------

.. image:: _static/runtime_memory_benchmark.png
   :alt: Runtime and memory comparison across published benchmark cases
   :width: 100%

The runtime subplot uses a log scale because the measured wall times span
roughly three orders of magnitude across the linear, nonlinear, and imported
geometry cases. The memory subplot stays linear because the peak RSS spread is
much narrower.

The assembled figure is generated from the collected per-case summaries with
``tools/benchmark_runtime_memory.py --summary-glob ...`` and written to:

- ``docs/_static/runtime_memory_benchmark.png``
- ``docs/_static/runtime_memory_benchmark.pdf``

For the shipped refresh shown here, use the successful release summary rather
than the older interrupted summary that contains failed W7-X/HSX nonlinear
rows:

.. code-block:: bash

   python tools/benchmark_runtime_memory.py \
     --summary-glob tools_out/runtime_memory_summary_ship_refresh.json \
     --csv-out tools_out/runtime_memory_results_ship_refresh_regenerated.csv \
     --summary-out tools_out/runtime_memory_summary_ship_refresh_regenerated.json \
     --plot-out docs/_static/runtime_memory_benchmark.png

The published runtime figure complements the atlas instead of duplicating it:
the atlas carries growth/frequency and nonlinear transport/energy comparisons,
while the runtime figure carries CPU/GPU/reference wall time and peak RSS for
the shipped runtime cases.

Interpretation of short nonlinear GPU rows
------------------------------------------

The shipped runtime panel reports cold wall time. For the JAX backends, this
includes startup and compilation, so short nonlinear cases can look worse than
their steady-state throughput would suggest.

Targeted ``office`` GPU profiles on the same shipped short nonlinear configs
measured:

.. code-block:: text

   Cyclone nonlinear: warmup_time_s=33.957  run_time_s=15.054
   KBM nonlinear:     warmup_time_s=27.485  run_time_s= 9.725

Compared with the cold runtime panel rows:

- Cyclone nonlinear GPU: ``38.27 s`` in the shipped panel, versus ``15.05 s``
  for the second run on the same compiled executable.
- KBM nonlinear GPU: ``44.33 s`` in the shipped panel, versus ``9.73 s`` for
  the second run on the same compiled executable.

This changes the optimization reading:

- the current short-run Cyclone GPU deficit in the shipped panel is primarily a
  cold-start effect, since the warm run is already faster than the tracked GX
  row,
- the current short-run KBM GPU gap is mostly compile amortization, with warm
  performance already close to GX.

The runtime figure now overlays those warm second-run measurements as hollow
diamond markers on the runtime bars wherever ``run_time_s`` is present in the
summary input.

The highest-value performance work for these short nonlinear lanes is therefore
compile/startup reduction and executable reuse, not just per-step kernel work.

Startup phase profiler
----------------------

For cold-start deep dives, use the dedicated startup profiler:

.. code-block:: bash

   python tools/profile_runtime_startup.py \
     --config examples/nonlinear/axisymmetric/runtime_cyclone_nonlinear.toml \
     --ky 0.3 --Nl 4 --Nm 8 --compile-steps 1 \
     --json-out tools_out/startup_cyclone_gpu.json \
     --csv-out tools_out/startup_cyclone_gpu.csv

The profiler breaks the cold path into the main setup and first-compile phases:

- runtime config load
- geometry resolution
- grid/default construction
- parameter and term setup
- initial-condition construction
- linear-cache construction
- first field solve compile+execute
- first linear/full RHS compile+execute
- first nonlinear integrator compile+execute

It supports ``--trace-dir`` and ``--memory-profile`` for XProf/Perfetto
inspection with phase-level annotations, and ``--debug-log-cache`` /
``--explain-cache-misses`` for JAX cache diagnostics when a repeated compile
path looks suspicious.
By default the trace tools now start JAX profiling with
``python_tracer_level=0`` and ``host_tracer_level=0``. On the lightweight
``office`` environment this avoids the optional TensorFlow Python-hook import
path, so traces are emitted cleanly without installing TensorFlow just to
silence profiler startup noise.

The current ``office`` GPU startup profiles for the shipped short nonlinear
cases show the same dominant structure:

- Cyclone nonlinear startup total: ``35-36 s`` after the low-rank
  collision-cache and host-cache cleanup passes (previously ``41.47 s`` on the
  earlier office snapshot)
- KBM nonlinear startup total: ``32.23 s``
- dominant phases in both cases:

  - ``compile_first_integrator_run``: about ``22 s`` (Cyclone), ``19.28 s`` (KBM)
  - ``build_linear_cache``: about ``5.6 s`` (Cyclone), ``7.73 s`` (KBM)
  - ``compile_first_linear_rhs`` / ``compile_first_full_rhs``: another
    ``3.0 + 3.0 s`` (Cyclone) or ``1.7 + 1.7 s`` (KBM)

So the next high-value performance work is no longer the analytic geometry
startup path or the collision prefactor path; it is the first compiled
nonlinear integrator path, followed by the remaining Laguerre and drift/cache
construction subphases.

To break the cache-construction lump down further, use:

.. code-block:: bash

   python tools/profile_linear_cache_build.py \
     --config examples/nonlinear/axisymmetric/runtime_cyclone_nonlinear.toml \
     --Nl 4 --Nm 8 \
     --json-out tools_out/linear_cache_cyclone_gpu.json \
     --csv-out tools_out/linear_cache_cyclone_gpu.csv

The current ``office`` GPU decomposition for the shipped Cyclone short
nonlinear lane is:

- total measured decomposition: ``6.86 s`` after the low-rank collision-cache,
  host-cache, and broadcasted-gyroaverage passes
- dominant subphases:

  - ``gyro_bessel_cache``: ``1.33 s``
  - ``laguerre_cache``: ``1.21 s``
  - ``kperp_and_drifts``: ``0.99 s``
  - ``geometry_coefficients``: ``0.68 s``
  - ``collision_and_damping_cache``: ``0.17 s``

The low-rank collision cache, host-built moment/damping factors, and
broadcasted gyroaverage construction remove the old collision/damping
bottleneck from the cache profile. The overall cold-start wall clock is still
dominated by the first nonlinear integrator compile. The next cache-build
optimization work should therefore focus on Laguerre and drift/cache
construction, while the broader startup campaign should prioritize the first
integrator compile surface.

Cached basis indices
--------------------

To reduce per-step overhead, the linear cache now stores Laguerre/Hermite index
arrays (:math:`l`, :math:`m`) and derived coefficients (``l+1``, ``m+1``,
``sqrt(m)``, ``sqrt(m+1)``). These are reused inside the mirror/curvature
terms and the implicit preconditioner instead of re-allocating on every RHS
call. The change is small in absolute cost for low-order runs, but becomes
noticeable in higher-order scans and tight profiling loops.

GMRES preconditioner iterations
--------------------------------

For the implicit linear solver, we include a small iteration-count harness that
solves a reduced system and compares the GMRES iteration count with multiple
preconditioners:

- ``diag``: full diagonal (damping + drift + mirror)
- ``pas``: PAS line preconditioner (streaming + diagonal damping/drifts)
- ``pas-coarse``: line + kx-coarse additive correction (Schur-style)
- ``hermite-line``: Hermite streaming line solve (tridiagonal in ``m`` at fixed :math:`k_z`)
- ``hermite-line-coarse``: Hermite line solve + kx-coarse correction

.. code-block:: bash

   python tools/profile_gmres_precond.py

On the reference run (Nl=2, Nm=3, Ny=4, Nz=8), this reported:

.. code-block:: text

   iters_plain=6
   iters_diag=6
   iters_pas=6
   iters_pas_coarse=6
   iters_hermite_line=4
   iters_hermite_line_coarse=4

On a larger run (Ny=8, Nz=64, Nl=12, Nm=12, dt=0.1), this reported:

.. code-block:: text

   iters_plain=38
   iters_diag=38
   iters_pas=39
   iters_pas_coarse=39
   iters_hermite_line=22
   iters_hermite_line_coarse=22

JIT considerations
------------------

The linear integrator is ``jit``-compiled with the number of steps and method
as static arguments. The operator term switches (:class:`spectraxgk.linear.LinearTerms`)
should also remain static inside a compiled loop to avoid recompilation. The
cached operator arrays can be constructed once and reused across multiple runs
to avoid repeated geometry setup costs. Nonlinear IMEX paths now reuse the
electrostatic compiled linear-RHS route whenever ``apar=bpar=0``; this is a
bounded fast path for adiabatic-electron electrostatic runs, not a new runtime
claim until a fresh end-to-end profile is recorded.

Planned optimizations
---------------------

- ``vmap`` over species and parameter scans
- JAX mesh-based parallelization across multiple devices
- FFT acceleration and layout tuning
- operator fusion for nonlinear terms

Linear-to-nonlinear optimization roadmap
----------------------------------------

The current benchmark runtime gap on CPU is dominated by JAX compile latency and repeated small-shape scan launches. The next implementation phase
targets both linear and nonlinear performance with a single operator strategy:

1. **Compile-once scan kernels**

   - enforce fixed batch shapes across ``ky`` and ``beta`` scans,
   - pre-JIT a small set of canonical ``(Nl, Nm, Ny, Nz)`` signatures,
   - cache compiled executables on disk for repeated benchmark sweeps.

2. **Operator fusion in RHS assembly**

   - merge streaming/mirror/curvature/grad-B stencils into one fused kernel,
   - remove scatter-heavy intermediate writes,
   - keep field coupling and species sums contiguous in memory.

3. **Matrix-free eigen path as default for linear scans**

   - use Krylov/shift-invert for scan tables and figures,
   - reserve long time integration for spot-check diagnostics only.

4. **Preconditioner reuse**

   - persist Hermite-line and shift-invert preconditioner structures across
     neighboring scan points (same geometry/grid),
   - reuse Jacobian-like linearization objects in IMEX stages.

5. **Streaming diagnostics by default**

   - avoid storing full time traces unless explicitly requested,
   - compute growth/frequency online from selected mode signals.

These steps are chosen to carry directly into nonlinear runs, where the same
fused RHS, scan batching, and preconditioner reuse will dominate throughput and
memory behavior.