Testing
=======

Testing philosophy
------------------

SPECTRAX-GK enforces high coverage on critical solver modules and requires
physics-based checks for each numerical component. The test suite is designed
to be:

- **pedagogic**: each test explains the concept being validated
- **deterministic**: no stochastic outcomes or tolerance drift
- **future-proof**: targeted at invariants and well-posed regressions

Current testing target
----------------------

The package-wide target is 95% coverage, but the coverage number is a guardrail
rather than the scientific objective. New tests should be accepted because they
protect one of the following contracts:

- an implemented equation or reduced physical limit;
- a numerical method, convergence rate, or conservation/free-energy identity;
- a geometry, normalization, or diagnostic convention;
- a benchmark artifact and its documented fit/window policy;
- an autodiff contract checked against finite differences, tangent tests, or an
  adjoint consistency relation;
- a regression for a bug found in parity, restart, runtime, plotting, or
  geometry-adapter work.

Long reference-code runs and office/GPU comparisons should not be hidden inside
the default local suite. They should live behind explicit manifests or CI/manual
lanes so local tests remain fast enough for routine development.

The refactor branch also carries a machine-readable validation/coverage
manifest at ``tools/validation_coverage_manifest.toml``. It is checked by
``tools/check_validation_coverage_manifest.py`` and maps each critical module
to reference anchors, physics contracts, numerical contracts, fast tests,
tracked artifacts, and next tests. This is the working guardrail for reaching
95% package-wide coverage without adding shallow tests that do not validate the
implemented physics or numerics.

The manifest now has two levels of coverage ownership:

- direct ``[[modules]]`` rows for public, high-risk, or actively refactored
  surfaces that need their own contracts and artifact traceability;
- ``owned_modules`` entries for smaller implementation modules whose fast-test
  responsibility is intentionally carried by a direct row.

The checker inventories ``src/spectraxgk`` and fails unless every package
module is directly listed, owned by a listed row, or explicitly excluded as
package plumbing such as ``__init__.py`` or version metadata. This makes source
extractions fail fast until the coverage owner, fast tests, and next-test debt
are declared. New manifest tests for this policy should stay cheap and live in
``tests/test_validation_coverage_manifest.py`` or
``tests/test_refactor_coverage_*.py``.
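
The ownership rule above can be illustrated with a minimal sketch. The
dictionary layout and the ``module`` and ``excluded`` keys are assumptions made
for illustration; only ``[[modules]]`` rows and ``owned_modules`` are named in
this document, and the real checker may be organized differently.

.. code-block:: python

   # Hypothetical in-memory view of the manifest; the real schema lives in
   # tools/validation_coverage_manifest.toml and may use different keys.
   MANIFEST = {
       "modules": [
           {
               "module": "spectraxgk/linear.py",
               "owned_modules": ["spectraxgk/linear_cache.py"],
           },
           {"module": "spectraxgk/grids.py", "owned_modules": []},
       ],
       "excluded": ["spectraxgk/__init__.py"],
   }


   def unowned_modules(package_files, manifest):
       """Return package modules with no direct row, owner, or exclusion."""
       covered = set(manifest["excluded"])
       for row in manifest["modules"]:
           covered.add(row["module"])
           covered.update(row["owned_modules"])
       return sorted(f for f in package_files if f not in covered)


   package_files = [
       "spectraxgk/__init__.py",
       "spectraxgk/linear.py",
       "spectraxgk/linear_cache.py",
       "spectraxgk/grids.py",
       "spectraxgk/new_extraction.py",  # freshly extracted, not yet declared
   ]
   missing = unowned_modules(package_files, MANIFEST)
   print(missing)

In this sketch a freshly extracted module appears in ``missing`` until it gets
a direct row, an owner, or an explicit exclusion.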

Manifest paths are intentionally concrete. ``fast_tests`` and
``artifact_paths`` must name files, not directories or placeholder buckets, and
list fields must not repeat the same module, test, artifact, contract, or next
test. The optional Cobertura XML pass also rejects duplicate measured entries
for the same package module so coverage enforcement cannot depend on whichever
duplicate XML row happened to be parsed last.
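
The duplicate checks described above could look roughly like the following
sketch. It assumes the usual Cobertura layout with ``<class filename="...">``
elements; the function names are hypothetical and are not taken from the
checker.

.. code-block:: python

   import xml.etree.ElementTree as ET
   from collections import Counter

   # Small Cobertura-style sample with one module measured twice.
   COBERTURA_XML = """<coverage>
     <packages>
       <package name="spectraxgk">
         <classes>
           <class filename="src/spectraxgk/linear.py" line-rate="0.97"/>
           <class filename="src/spectraxgk/grids.py" line-rate="0.99"/>
           <class filename="src/spectraxgk/linear.py" line-rate="0.41"/>
         </classes>
       </package>
     </packages>
   </coverage>"""


   def duplicate_entries(values):
       """Return the sorted values that occur more than once."""
       counts = Counter(values)
       return sorted(value for value, n in counts.items() if n > 1)


   def duplicate_measured_files(xml_text):
       """Return file names that have more than one measured <class> row."""
       root = ET.fromstring(xml_text)
       return duplicate_entries(node.get("filename") for node in root.iter("class"))


   dups = duplicate_measured_files(COBERTURA_XML)
   print(dups)

Rejecting the report when ``dups`` is non-empty avoids silently picking either
the 0.97 or the 0.41 row. The same ``duplicate_entries`` helper applies to
manifest list fields such as ``fast_tests``.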

The wide CI matrix also feeds the manifest checker with ``coverage-wide.xml``.
That pass enforces the declared package-wide coverage target and writes the
measured summary to ``docs/_static/validation_coverage_manifest_summary.json``.
Module-level rows in that summary are a debt map: they identify direct and
owned modules below their row target, but release blocking remains tied to the
package-wide gate unless the CI command is explicitly upgraded to
``--enforce-module-coverage``.

Optional external-backend artifact builders that require local ``vmec_jax`` or
``booz_xform_jax`` checkouts are kept out of the default package-wide coverage
denominator when the public CI cannot install or execute those repositories.
Their fast contracts are still covered by mocked backend tests and low-level
geometry/numerics tests, while the real physics claims are validated by the
tracked JSON/PDF artifact gates documented below. This avoids treating
unavailable optional backends as missing unit coverage while preserving the
requirement that every differentiable-geometry claim has an explicit
finite-difference or parity artifact.

Test categories
---------------

- **Basis tests**: orthonormality and recurrence checks.
- **Operator tests**: Hermite ladder streaming and mode extraction.
- **Benchmark tests**: loading reference data and growth-rate fitting.
- **Physics sanity checks**: conservation properties under simplified limits.
- **Response-function tests**: zonal-flow residuals, GAM damping, and late-time
  envelopes.
- **Spectral tests**: fluctuation spectra and windowed nonlinear statistics.
- **Autodiff tests**: tangent, finite-difference, and inverse/UQ consistency.

Unit tests (numerical invariants)
---------------------------------

Representative unit checks include:

- **Hermite/Laguerre ladder identities**:
  :func:`spectraxgk.linear.apply_hermite_v`,
  :func:`spectraxgk.linear.apply_laguerre_x`.
- **Quasineutrality consistency**:
  :func:`spectraxgk.linear.quasineutrality_phi`.
- **Streaming term validation**:
  :func:`spectraxgk.linear.grad_z_periodic`,
  :func:`spectraxgk.linear.streaming_term`.
- **Growth-rate fitting windows**:
  :func:`spectraxgk.analysis.select_fit_window`,
  :func:`spectraxgk.analysis.fit_growth_rate_auto`.
- **Grid construction and normalization**:
  :func:`spectraxgk.grids.build_spectral_grid`.
- **Normalization contract consistency**:
  :func:`spectraxgk.normalization.get_normalization_contract`,
  :func:`spectraxgk.normalization.apply_diagnostic_normalization`.
- **Modular RHS equivalence**:
  :func:`spectraxgk.linear.linear_terms_to_term_config`,
  :func:`spectraxgk.terms.assemble_rhs_cached`,
  :func:`spectraxgk.linear.linear_rhs_cached`.

These tests live in ``tests/test_linear.py``, ``tests/test_grids.py``,
``tests/test_normalization.py``, and ``tests/test_terms_assembly.py``, and are
designed to fail deterministically if a discretization, assembly path, or
normalization changes.
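
As an illustration of a ladder-identity check, the sketch below verifies that
multiplying a Hermite coefficient vector by :math:`x` matches pointwise
multiplication. It assumes the orthonormal probabilists' basis
:math:`h_m = \mathrm{He}_m/\sqrt{m!}`, for which
:math:`x h_m = \sqrt{m+1}\,h_{m+1} + \sqrt{m}\,h_{m-1}`. The helper names are
hypothetical, and the normalization used by
:func:`spectraxgk.linear.apply_hermite_v` may differ.

.. code-block:: python

   import math


   def hermite_orthonormal(m_max, x):
       """Evaluate h_0..h_m_max at x, with h_m = He_m / sqrt(m!)."""
       h = [1.0]
       if m_max >= 1:
           h.append(x)
       for m in range(1, m_max):
           # Three-term recurrence solved for h_{m+1}.
           h.append((x * h[m] - math.sqrt(m) * h[m - 1]) / math.sqrt(m + 1))
       return h


   def ladder_apply(coeffs):
       """Multiply a Hermite coefficient vector by x (truncated at the top)."""
       n = len(coeffs)
       out = [0.0] * n
       for m, c in enumerate(coeffs):
           if m + 1 < n:
               out[m + 1] += math.sqrt(m + 1) * c
           if m >= 1:
               out[m - 1] += math.sqrt(m) * c
       return out


   # The top coefficient is zero, so the truncated ladder is exact here.
   coeffs = [0.3, -1.2, 0.5, 0.0]
   x = 0.7
   basis = hermite_orthonormal(len(coeffs) - 1, x)
   pointwise = x * sum(c * h for c, h in zip(coeffs, basis))
   via_ladder = sum(c * h for c, h in zip(ladder_apply(coeffs), basis))
   print(abs(pointwise - via_ladder))

A test of this shape is deterministic: it uses fixed inputs and compares two
algebraically equivalent evaluations to round-off.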

Physics regression tests
------------------------

The physics-focused tests exercise reduced or symmetry limits that should
remain invariant across refactors:

- **Term toggles**: :class:`spectraxgk.linear.LinearTerms` switches individual
  operator components without changing the equation structure.
- **Mirror/curvature activation**: nonzero drift terms create nonzero response
  when streaming and drive are turned off.
- **Diamagnetic drive structure**: the energy-weighted drive produces a
  nonzero response when gradients are enabled and vanishes at :math:`k_y=0`.
- **Normalization scaling**: ``rho_star`` rescales the cached :math:`k_y`
  values exactly.
- **End-cap damping**: the linked-boundary taper only affects :math:`k_y>0`
  modes and vanishes when ``damp_ends_amp = 0``.

These checks are in ``tests/test_linear.py`` and are meant to be future-proof
physics invariants.

Benchmark regression tests
--------------------------

Benchmark regression tests validate the Cyclone base case reference dataset and
growth-rate extraction pipeline:

- Loading the reference CSV via :func:`spectraxgk.benchmarks.load_cyclone_reference`.
- Running short linear scans via :func:`spectraxgk.benchmarks.run_cyclone_linear`
  and :func:`spectraxgk.benchmarks.run_cyclone_scan`.
- Reduced ky regression with tightened tolerances on the field-aligned grid.

These tests live in ``tests/test_benchmarks.py`` and ``tests/test_full_operator.py``.
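
The idea behind growth-rate extraction can be sketched as a least-squares fit
of :math:`\log|\phi|` over a chosen window. This is a generic illustration on
synthetic data, not the implementation of
:func:`spectraxgk.analysis.fit_growth_rate_auto`; the function name and the
window bounds are assumptions.

.. code-block:: python

   import math


   def fit_growth_rate(times, amplitudes, tmin, tmax):
       """Least-squares slope of log|amplitude| over tmin <= t <= tmax."""
       pts = [
           (t, math.log(abs(a)))
           for t, a in zip(times, amplitudes)
           if tmin <= t <= tmax
       ]
       if len(pts) < 2:
           raise ValueError("fit window contains fewer than two samples")
       n = len(pts)
       t_mean = sum(t for t, _ in pts) / n
       y_mean = sum(y for _, y in pts) / n
       sxx = sum((t - t_mean) ** 2 for t, _ in pts)
       sxy = sum((t - t_mean) * (y - y_mean) for t, y in pts)
       return sxy / sxx


   times = [0.1 * i for i in range(200)]
   # Synthetic signal: a decaying transient plus a mode growing at gamma = 0.25.
   signal = [math.exp(-2.0 * t) + 1e-3 * math.exp(0.25 * t) for t in times]
   gamma = fit_growth_rate(times, signal, tmin=10.0, tmax=19.0)
   print(gamma)

Starting the window after the transient has decayed is what makes the fitted
slope reproducible, which is why the fit/window policy is treated as part of
the benchmark contract.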

Literature-anchored response and spectrum tests
-----------------------------------------------

The next research-facing additions should follow the published benchmark
observables rather than inventing repo-local metrics:

- **Rosenbluth-Hinton / GAM response in shaped tokamaks**: use the shaped
  benchmark conventions summarized by Merlo et al. to track residual levels and
  GAM damping alongside the linear shaping scan.
- **W7-X zonal-flow response**: use the stella/GENE W7-X benchmark conventions
  for residual level and damping envelope.
- **W7-X fluctuation spectra**: follow the W7-X Doppler-reflectometry
  comparison work for density and zonal-flow frequency spectra. The current
  closed artifact is a simulation-spectrum diagnostic; experimental transfer
  functions remain outside the release claim.
- **Electromagnetic stellarator verification**: adopt a heavy-electron
  electromagnetic lane before realistic-mass claims, following the GENE-3D
  verification pattern.

These should be implemented as reproducible, script-owned figure/artifact
lanes, not as ad hoc notebooks.

The first reusable tooling for this lane now exists:

- :func:`spectraxgk.benchmarking.zonal_flow_response_metrics`
- :func:`spectraxgk.benchmarking.load_diagnostic_time_series`
- :func:`spectraxgk.validation_gates.evaluate_scalar_gate`
- :func:`spectraxgk.validation_gates.observed_order_gate_report`
- :func:`spectraxgk.validation_gates.branch_continuity_gate_report`
- :func:`spectraxgk.validation_gates.eigenfunction_gate_report`
- :func:`spectraxgk.validation_gates.linear_metrics_gate_report`
- :func:`spectraxgk.validation_gates.nonlinear_window_gate_report`
- :func:`spectraxgk.validation_gates.zonal_response_gate_report`
- :func:`spectraxgk.zonal_validation.reference_residual_table`
- :func:`spectraxgk.zonal_validation.tail_trace_metrics`
- :func:`spectraxgk.plotting.zonal_flow_response_figure`
- ``tools/plot_zonal_flow_response.py``
- ``tools/plot_zonal_flow_response_from_output.py``
- ``tools/generate_miller_zonal_response_pilot.py``
- ``tools/generate_w7x_zonal_response_panel.py``
- ``tools/plot_w7x_zonal_contract_audit.py``
- ``tools/plot_w7x_zonal_moment_tail_audit.py``
- ``tools/plot_w7x_zonal_closure_ladder.py``
- ``tools/write_w7x_zonal_closure_sweep.py``
- ``tools/plot_w7x_zonal_state_convention_audit.py``
- ``tools/plot_w7x_zonal_recurrence_sweep.py``
- ``tools/plot_w7x_fluctuation_spectrum_panel.py``

The gate-report helpers are intentionally small and JSON-ready. They should be
used by manuscript refresh scripts so every reported artifact has the same
observable, reference, absolute/relative tolerance, and pass/fail convention.
The companion coverage manifest should be updated when a new gate helper,
artifact script, or refactor extraction changes module ownership or test
responsibility.
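
A gate record with that shape can be sketched as follows. The field names and
the pass rule are assumptions for illustration and are not the schema of
:func:`spectraxgk.validation_gates.evaluate_scalar_gate`; the numerical values
are illustrative only.

.. code-block:: python

   import json
   import math


   def scalar_gate_report(observable, value, reference, atol=0.0, rtol=0.0):
       """Hypothetical JSON-ready gate record for one scalar observable."""
       abs_err = abs(value - reference)
       rel_err = abs_err / abs(reference) if reference != 0 else math.inf
       passed = abs_err <= atol + rtol * abs(reference)
       return {
           "observable": observable,
           "value": value,
           "reference": reference,
           "atol": atol,
           "rtol": rtol,
           "abs_error": abs_err,
           # Strict JSON has no Infinity/NaN, so nonfinite errors become null.
           "rel_error": rel_err if math.isfinite(rel_err) else None,
           "passed": passed,
       }


   report = scalar_gate_report("zonal_residual", 0.192, 0.19, rtol=0.05)
   zero_ref = scalar_gate_report("late_envelope", 1e-9, 0.0, atol=1e-6)
   text = json.dumps([report, zero_ref], allow_nan=False)
   print(text)

Serializing with ``allow_nan=False`` makes the writer fail loudly if a
nonfinite number slips into a report instead of emitting invalid JSON.
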
``tools/generate_miller_zonal_response_pilot.py`` now writes the first such
gate report into its JSON metadata for the residual, GAM frequency, and signed
GAM growth/damping comparison against the Merlo Case-III paper-scale read-off.

``tools/generate_kbm_reference_overlay.py`` writes the same gate structure for
the raw KBM eigenfunction overlay, using a strict overlap/relative-L2 policy.
The current refreshed KBM overlay passes that policy with overlap ``0.999985``
and relative ``L^2`` mismatch ``0.00721`` against the frozen GX raw mode.

``tools/generate_w7x_reference_overlay.py`` applies the same raw-mode policy to
the imported W7-X linear benchmark at ``k_y rho_i = 0.3``. It refreshes the
frozen finite GX raw-mode bundle when a matching ``.big.nc`` file is supplied
and writes ``docs/_static/w7x_eigenfunction_reference_overlay_ky0p3000.png``
plus JSON/CSV companions. The current artifact passes with overlap
``0.9999999994`` and relative ``L^2`` mismatch ``3.33e-5``.
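
One common way to define an overlap and a relative :math:`L^2` mismatch for
complex eigenfunctions is sketched below. It assumes an unweighted discrete
inner product and removes the arbitrary complex amplitude by a least-squares
projection; the overlay tools may use different weights or normalizations.

.. code-block:: python

   import cmath
   import math


   def inner(a, b):
       """Discrete inner product <a, b> = sum(conj(a_i) * b_i)."""
       return sum(x.conjugate() * y for x, y in zip(a, b))


   def overlap_and_rel_l2(mode, reference):
       """Return (overlap, relative L2 mismatch) up to a complex scale."""
       norm_m = math.sqrt(inner(mode, mode).real)
       norm_r = math.sqrt(inner(reference, reference).real)
       overlap = abs(inner(reference, mode)) / (norm_m * norm_r)
       # Complex scale s that minimises ||s * mode - reference||.
       scale = inner(mode, reference) / inner(mode, mode)
       diff = [scale * m - r for m, r in zip(mode, reference)]
       rel_l2 = math.sqrt(inner(diff, diff).real) / norm_r
       return overlap, rel_l2


   grid = [-3.0 + 0.1 * i for i in range(61)]
   reference = [cmath.exp(-0.5 * z * z) * (1 + 0.3j * z) for z in grid]
   # Same shape with a different amplitude and phase.
   rescaled = [2.0 * cmath.exp(0.7j) * r for r in reference]
   # Same shape plus a small off-centre bump.
   perturbed = [r + 0.01 * cmath.exp(-0.5 * (z - 1.0) ** 2)
                for r, z in zip(reference, grid)]

   overlap_same, mismatch_same = overlap_and_rel_l2(rescaled, reference)
   overlap_pert, mismatch_pert = overlap_and_rel_l2(perturbed, reference)
   print(overlap_same, mismatch_same, overlap_pert, mismatch_pert)

With this definition the two metrics are tied together by
:math:`\text{mismatch}^2 = 1 - \text{overlap}^2`, so a strict policy on one
implies a bound on the other.
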
``tools/compare_gx_nonlinear_diagnostics.py --summary-json`` now emits a
matching gate report for nonlinear diagnostic comparison figures, using the
window mean relative mismatch as the scalar acceptance metric. The summary
writer now accepts case/source labels and explicit ``tmin/tmax`` windows, and
it writes strict JSON, replacing nonfinite absolute-gate relative errors with
``null``. The tracked release-window summaries cover Cyclone, Cyclone Miller,
KBM, HSX, and W7-X. The older short Cyclone diagnostic remains available as an
exploratory startup/resolved-spectrum audit, but it is not counted in the
release-gate index.

Observed-order and branch-continuity gate helpers are also available so
velocity-space convergence panels and branch-followed scan tables can use the
same JSON-ready acceptance convention.
``tools/generate_observed_order_gate.py`` is the generic no-rerun path for
CSV-backed convergence studies: it reads either an explicit step column or a
resolution column, writes an observed-order JSON gate report, and can generate
a log-log convergence figure. The tracked Cyclone velocity-space convergence
artifact lives at ``docs/_static/cyclone_resolution_observed_order.json`` and
``docs/_static/cyclone_resolution_observed_order.png``. It uses an office/GPU
``ky=0.30`` time-path sweep through ``(Nl,Nm)=(4,8),(6,12),(12,24),(16,32)``
with ``tmax=150`` and passes the strict pairwise-order and final-error gates.
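
The pairwise observed order used by such a gate is conventionally
:math:`p = \log(e_i/e_{i+1}) / \log(h_i/h_{i+1})` under the assumption
:math:`e \approx C h^p`. The sketch below applies it to synthetic second-order
data; the function name is hypothetical and the mapping :math:`h = 1/N` for a
resolution column is an assumption.

.. code-block:: python

   import math


   def pairwise_observed_orders(steps, errors):
       """Observed order between successive levels, assuming error ~ C * h**p."""
       orders = []
       for i in range(len(steps) - 1):
           ratio_e = errors[i] / errors[i + 1]
           ratio_h = steps[i] / steps[i + 1]
           orders.append(math.log(ratio_e) / math.log(ratio_h))
       return orders


   # Synthetic data only: errors constructed to converge at second order.
   resolutions = [8, 12, 24, 32]
   steps = [1.0 / n for n in resolutions]
   errors = [0.5 * h ** 2 for h in steps]
   orders = pairwise_observed_orders(steps, errors)
   print(orders)

A strict gate would then require every pairwise order to stay above a declared
minimum and the final error to fall below a declared tolerance.
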
``tools/compare_gx_kbm.py --branch-summary-json`` wires that convention into
the KBM branch-following workflow by summarizing adjacent ``gamma``/``omega``
jumps and successive eigenfunction-overlap continuity for the selected branch.
``tools/generate_kbm_branch_gate_summary.py`` provides the corresponding
no-rerun artifact path: it reads the existing selected KBM candidate table and
writes ``docs/_static/kbm_branch_gate_summary.json`` with the same strict gate
schema. The current continuity-first selected branch passes the adjacent
growth/frequency jump and successive-overlap gates.

``tools/make_validation_gate_index.py`` scans tracked JSON metadata and writes
``docs/_static/validation_gate_index.json``, ``.csv``, and ``.png`` so the docs
always have one compact pass/open view of the currently materialized release
validation gates. The current JSON index has ``14/14`` tracked reports passing.
Exploratory diagnostics can set ``gate_index_include=false`` to remain
documented without being treated as release blockers.

``tools/plot_nonlinear_window_statistics.py`` provides the companion
manuscript-facing statistics panel for the nonlinear GX comparison gates by
plotting the per-diagnostic ``mean_rel_abs`` and ``max_rel_abs`` values from
those same tracked JSON summaries.

``tools/plot_nonlinear_feasibility_pilot.py`` is the analogous tool for new
finite nonlinear pilots that do not yet have a reference comparison or
production-resolution convergence gate. It writes PNG/PDF/JSON/CSV artifacts
with explicit ``claim_level`` and ``promotion_gate.passed = false`` metadata,
so exploratory external-VMEC runs can be documented without being promoted to
transport validation claims.

``tools/plot_external_vmec_nonlinear_convergence_gate.py`` is the promotion
gate for those pilots once at least two grid levels exist. It replays the
pilot JSON/CSV traces, compares common and least-trending late windows,
requires enough samples, bounds relative heat-flux trend and coefficient of
variation, and finally checks pairwise grid-refined heat-flux agreement. The
tracked CTH-like external-VMEC artifact intentionally fails this gate and sets
``gate_index_include=false`` because it is a research-planning negative result,
not a release-blocking validation gate.
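
The late-window checks listed above (sample count, relative trend, and
coefficient of variation) can be sketched on a synthetic heat-flux trace. The
thresholds, the trend definition, and the function names below are assumptions
for illustration and are not the values used by the tracked gate.

.. code-block:: python

   import math


   def window_statistics(times, flux, tmin, tmax):
       """Mean, coefficient of variation, and fractional drift in a window."""
       pts = [(t, q) for t, q in zip(times, flux) if tmin <= t <= tmax]
       n = len(pts)
       mean = sum(q for _, q in pts) / n
       var = sum((q - mean) ** 2 for _, q in pts) / (n - 1)
       t_mean = sum(t for t, _ in pts) / n
       sxx = sum((t - t_mean) ** 2 for t, _ in pts)
       slope = sum((t - t_mean) * (q - mean) for t, q in pts) / sxx
       return {
           "n_samples": n,
           "mean": mean,
           "cv": math.sqrt(var) / abs(mean),
           # Drift of the fitted line across the window, relative to the mean.
           "rel_trend": slope * (pts[-1][0] - pts[0][0]) / abs(mean),
       }


   def window_gate(stats, min_samples=50, max_cv=0.3, max_rel_trend=0.1):
       """Hypothetical thresholds; a real gate would document its own."""
       return (
           stats["n_samples"] >= min_samples
           and stats["cv"] <= max_cv
           and abs(stats["rel_trend"]) <= max_rel_trend
       )


   times = [0.5 * i for i in range(501)]  # t = 0 .. 250
   stationary = [2.0 + 0.1 * math.sin(0.7 * t) for t in times]
   drifting = [2.0 + 0.01 * t for t in times]

   ok = window_gate(window_statistics(times, stationary, 150.0, 250.0))
   bad = window_gate(window_statistics(times, drifting, 150.0, 250.0))
   print(ok, bad)

The drifting trace has a small coefficient of variation but a large relative
trend, which is why the two bounds are checked separately.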

``tools/write_external_vmec_holdout_configs.py`` is the reproducibility
companion for that lane. It writes the fixed-step nonlinear TOMLs and restart
copy commands for the standard two-grid external-VMEC holdout ladder, e.g.
``t = 150`` initial runs followed by ``t = 250`` restart continuations at
``48x48x32`` and ``64x64x40``. The script does not promote any data by itself;
the resulting traces must still pass the convergence gate above before they can
enter quasilinear calibration reports or optimization studies. For the
production nonlinear optimization evidence lane the same generator also accepts
``--seed-variant`` and ``--dt-variant`` entries. Those options write explicit
``[metadata]`` blocks and variant-specific filenames so seed and timestep
replicate windows can be launched on the office GPUs, extracted with the same
transport-window protocol, and checked by
``tools/check_nonlinear_window_ensemble_readiness.py`` before any
absolute-flux or turbulent-flux optimization wording can be considered.

For external-VMEC replicate campaigns,
``tools/build_external_vmec_replicate_ensemble.py`` is the reproducible
NetCDF-to-evidence wrapper: it extracts heat-flux traces from finished
``*.out.nc`` files, writes the transport-window summaries and convergence
reports, runs the readiness and ensemble gates, and produces the documentation
figure used by the manuscript ledger.

Before those files enter the ensemble builder, run
``tools/check_nonlinear_runtime_outputs.py`` on every produced ``*.out.nc``.
That gate verifies the grouped NetCDF contains ``Grids/time`` and the requested
heat-flux diagnostic, checks finite monotone time samples, enforces optional
``tmin/tmax`` coverage, and fails closed for restart-only or metadata-only
artifacts. It is the first campaign-level smoke check after a long office GPU
batch exits with ``rc=0``.
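
The time-axis part of such a smoke check can be sketched without any NetCDF
dependency, assuming the time and heat-flux arrays have already been read into
plain lists. The function name and failure messages are hypothetical, and
"monotone" is interpreted here as strictly increasing.

.. code-block:: python

   import math


   def check_time_coverage(times, flux, tmin=None, tmax=None):
       """Return failure reasons; an empty list means the sketch gate passes."""
       if len(times) == 0 or len(times) != len(flux):
           # Restart-only or metadata-only outputs carry no usable samples.
           return ["missing or mismatched time/flux samples"]
       failures = []
       if not all(math.isfinite(v) for v in list(times) + list(flux)):
           failures.append("nonfinite sample")
       if any(b <= a for a, b in zip(times, times[1:])):
           failures.append("time axis is not strictly increasing")
       if tmin is not None and times[0] > tmin:
           failures.append("trace starts after tmin")
       if tmax is not None and times[-1] < tmax:
           failures.append("trace ends before tmax")
       return failures


   full_times = [float(t) for t in range(0, 901, 10)]
   full_flux = [1.0 + 0.01 * t for t in full_times]
   complete = check_time_coverage(full_times, full_flux, tmin=450.0, tmax=900.0)
   truncated = check_time_coverage(full_times[:60], full_flux[:60],
                                   tmin=450.0, tmax=900.0)
   empty = check_time_coverage([], [], tmin=450.0, tmax=900.0)
   print(complete, truncated, empty)

Returning reasons rather than a bare boolean keeps the failure diagnosable
when a batch exits cleanly but wrote an incomplete trace.
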
``tools/check_production_nonlinear_optimization_guard.py`` then consumes those
replicated long-window ensembles together with the reduced optimization and
startup finite-difference artifacts. It is the fail-closed check that allows
release-safe scoped wording while blocking production nonlinear turbulent-flux
optimization promotion until optimized equilibria have replicated
post-transient transport-window audits.

For actual nonlinear turbulence-gradient promotion, use
``tools/write_vmec_boundary_perturbation_inputs.py`` when the perturbation is a
VMEC boundary coefficient. It writes the matched ``input.*`` files and records
the exact ``vmec_jax`` commands needed to create the three real re-equilibrated
``wout`` files. Then use
``tools/write_nonlinear_turbulence_gradient_campaign.py`` to write the matched
baseline/plus/minus VMEC launch ladders and replay commands. The campaign
writer rejects missing files, duplicate resolved paths, and byte-identical VMEC
contents unless ``--allow-identical-vmec-content`` is explicitly used for a
plumbing-only smoke test; production evidence therefore requires real
``wout`` files. The generated TOMLs are restart-ladder segments: a final
``t=900`` config only advances the last segment unless the earlier restart
artifacts have been seeded. The manifest therefore records
``direct_full_horizon_launch_commands`` for one-shot final-horizon campaigns
and an ``output_gate_command`` that must pass before ensemble evidence is built.
For the direct one-shot route, launch the recorded commands with
``tools/run_nonlinear_gradient_direct_campaign.py`` instead of an ad-hoc shell
loop. The launcher reads the manifest, assigns one worker per listed GPU, writes
per-task logs and a status JSON, supports ``--skip-existing`` for safe restarts,
and keeps the command provenance identical to the manifest.
Then use
``tools/build_nonlinear_turbulence_gradient_fd_gate.py`` after the matched
``baseline``/``plus_delta``/``minus_delta`` ensembles finish. The builder writes
the central finite-difference gradient sidecar and checks response resolution,
forward/backward asymmetry, subtraction conditioning, propagated uncertainty,
and the uncertainty gates on all three replicated nonlinear windows.
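
The quantities named above can be sketched with textbook definitions. The
formulas below are assumptions: they show one plausible way to define the
central gradient, forward/backward asymmetry, propagated uncertainty, and
response fraction, and may not match the definitions used by the builder. The
input values are synthetic.

.. code-block:: python

   import math


   def central_fd_summary(f_minus, f_base, f_plus, sem_minus, sem_plus, delta):
       """Central finite difference with simple diagnostic ratios."""
       forward = (f_plus - f_base) / delta
       backward = (f_base - f_minus) / delta
       central = (f_plus - f_minus) / (2.0 * delta)
       # Independent standard errors propagated through the subtraction.
       sigma = math.sqrt(sem_plus ** 2 + sem_minus ** 2) / (2.0 * delta)
       return {
           "gradient": central,
           "fd_asymmetry_rel": abs(forward - backward) / abs(central),
           "gradient_uncertainty_rel": sigma / abs(central),
           "response_fraction": abs(f_plus - f_minus) / abs(f_base),
       }


   def fd_gate(summary, max_uncertainty=0.5, max_asymmetry=0.25, min_response=0.05):
       """Only the 0.5 uncertainty limit is quoted in this document; the other
       two thresholds are hypothetical placeholders."""
       return (
           summary["gradient_uncertainty_rel"] <= max_uncertainty
           and summary["fd_asymmetry_rel"] <= max_asymmetry
           and summary["response_fraction"] >= min_response
       )


   # Synthetic window means and standard errors, not tracked results.
   summary = central_fd_summary(9.0, 10.0, 11.2, 0.3, 0.3, delta=0.05)
   noisy = central_fd_summary(9.9, 10.0, 10.1, 0.3, 0.3, delta=0.05)
   print(summary, fd_gate(summary), fd_gate(noisy))

The ``noisy`` case shows the subtraction-conditioning problem: when the
plus/minus difference is comparable to the standard errors, the relative
gradient uncertainty exceeds the limit even though the bracket is symmetric.
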
The tracked optimized-QA/ESS ``ZBS(1,0)`` example is deliberately kept as a
fail-closed regression: the real ``vmec_jax`` re-equilibrated ``t=[450,900]``
baseline/plus/minus ensembles pass their replicated transport-window gates and
the initial three-replicate central finite difference is local, but
``gradient_uncertainty_rel = 0.655``, so the artifact does not promote a
turbulence-gradient claim. A seed-5 follow-up for the same ``ZBS(1,0)``
bracket also remains blocked: the response fraction weakens to about ``0.037``,
``gradient_uncertainty_rel`` rises to about ``1.18``, and ``fd_asymmetry_rel``
is about ``0.520``. The companion ``RBC(1,1)`` and ``ZBS(1,1)`` controls fail
the locality/asymmetry gates. The central-FD artifact now includes
diagnostic-only paired-replicate rows when matching seed or timestep labels are
available; these rows are useful for identifying sign reversals or weak
responses, but they do not relax the production gates. A future passing
artifact must satisfy both uncertainty and locality thresholds without
weakening either threshold.

For future perturbation refreshes, keep each coefficient/amplitude in a
distinct artifact slug such as
``docs/_static/qa_ess_zbs10_rel5_nonlinear_gradient_zbs_1_0_central_fd_gradient_gate.*``.
Do not promote new prose until
``tools/check_nonlinear_turbulence_gradient_evidence.py`` reports
``passed = true`` and the JSON sidecar sets
``nonlinear_turbulence_gradient_gate = true``. Until then, describe the result
as a bounded production-candidate finite-difference audit, not as a nonlinear
turbulence-gradient claim.

The current QA/ESS composite profile-direction follow-up demonstrates this
policy. The targeted ``plus_delta`` cross variants ``seed22_dt0p05``,
``seed32_dt0p04``, and ``seed33_dt0p05`` completed and all six plus-state
outputs passed the runtime-output gate. The extended plus ensemble still fails
the spread gate with ``mean_rel_spread = 0.166`` against the ``0.15`` limit,
and the central finite-difference artifact remains blocked by
``fd_asymmetry_rel = 2.84`` and ``gradient_uncertainty_rel = 1.22``. That
artifact is tracked as
``docs/_static/qa_ess_descent_profile_rel2_nonlinear_gradient_plus_delta_followup_central_fd_gradient_gate.json``.
It is a regression target for the fail-closed workflow and a design input for
the next campaign, not promotion evidence.

``tools/rank_nonlinear_turbulence_gradient_candidates.py`` is the companion
planning utility for failed candidates. It ranks completed central-FD artifacts
by response, locality, conditioning, and propagated uncertainty margins, writes
a fail-closed JSON summary, and recommends whether the next campaign should add
replicas, shrink a bracket, or move to an overdetermined
least-squares/profile-gradient design. The current tracked ranking artifact is
``docs/_static/nonlinear_turbulence_gradient_candidate_ranking.json`` and is
not itself promotion evidence.

``tools/summarize_nonlinear_gradient_bracket_sweep.py`` is the next
same-control locality utility. It consumes one or more central-FD JSON
artifacts for the same control at different perturbation amplitudes, writes
JSON/CSV/PNG sidecars plus an optional PDF, and decides whether to promote an
already passing bracket, shrink/enlarge the amplitude, add statistical power,
or abandon the single-control direction. It also reads the diagnostic-only
paired-replicate rows when present. If those same-seed rows show sign reversals
or large paired uncertainty, the utility explicitly recommends against spending
more GPU time on additional replicas at that bracket. When resolved central
finite differences change sign across nearby amplitudes, it also steers the
campaign-planning recommendation toward a new locality sweep or a smoother
composite control. The
tracked ``RBC(1,1)`` 5%/8% result,
``docs/_static/qa_ess_rbc11_bracket_sweep.json``, is a same-control negative
audit: response is resolved at both amplitudes, but finite-difference
asymmetry grows with amplitude, so the correct next action is a smaller
locality sweep or an overdetermined profile-gradient control.

``tools/write_overdetermined_nonlinear_gradient_campaign.py`` implements that
next launch-contract step. It writes multiple matched boundary-control VMEC
perturbation manifests from one baseline input, records the per-control
nonlinear campaign commands, and writes the final candidate-ranking command.
The tracked QA/ESS profile-gradient launch plan is
``docs/_static/qa_ess_overdetermined_nonlinear_gradient_campaign_plan.json``.
Use ``tools/check_overdetermined_nonlinear_gradient_campaign.py`` to turn that
multi-control launch plan into a machine-readable status artifact and
``tools/run_overdetermined_nonlinear_gradient_campaign.py`` to run all nested
long-window tasks through one shared CPU/GPU worker queue. The checker must
remain fail-closed until the VMEC states, nonlinear runtime outputs, ensemble
gates, central finite-difference gates, and candidate ranking all exist and
pass. Runtime outputs are only counted complete when their recorded
``Grids/time`` coverage reaches the campaign analysis-window endpoint, so
in-progress NetCDF files cannot accidentally promote a result.
After the long runtime queue completes,
``tools/postprocess_overdetermined_nonlinear_gradient_campaign.py`` runs the
per-control output gates, ensemble gates, central finite-difference gates,
candidate ranking, and final fail-closed status check in one reproducible
sequence.

The completed QA/ESS overdetermined campaign is intentionally tracked as a
negative gate result: all 27 full-horizon nonlinear outputs pass the runtime
coverage checks, but no control passes every production central-FD gate. The
best candidate is ``RBC(1,1)`` with resolved response and bounded locality, but
``gradient_uncertainty_rel = 0.559`` remains above the ``0.5`` promotion gate.
The status artifact
``docs/_static/qa_ess_overdetermined_nonlinear_gradient_campaign_status.json``
therefore reports complete runtime coverage and zero promoted controls. This is
a regression target for the fail-closed workflow and a design input for future
variance-reduction or smaller-bracket campaigns, not a nonlinear turbulence
gradient validation claim.

``tools/write_vmec_boundary_profile_perturbation_inputs.py`` is the companion
for a single smoother composite direction. It perturbs several VMEC boundary
coefficients together, normalizes the finite-difference scalar by the Euclidean
norm of the coefficient-change vector, and writes the same
baseline/plus/minus VMEC launch contract. The tracked
``docs/_static/qa_ess_descent_profile_direction_rel2_manifest.json`` uses the
current QA/ESS long-window evidence signs to define a 2% descent-oriented
``ZBS(1,1)``, ``ZBS(1,0)``, ``RBC(1,1)`` direction. This is still a launch
artifact; promotion requires the resulting re-equilibrated VMEC files and
long-window nonlinear FD gate.

After a detached office campaign finishes, run
``tools/run_nonlinear_gradient_manifest_postprocess.py`` on the generated
``gradient_campaign_manifest.json`` rather than replaying individual commands
by hand. With ``--require-outputs`` it fails before post-processing if any
expected ``*.out.nc`` file is missing; otherwise it runs the output gates,
baseline/plus/minus replicated ensemble builders, the central-FD gate, and the
final nonlinear-gradient evidence check in dependency order. Use
``--allow-blocked`` only when collecting a failure artifact for diagnosis; a
promotion run should keep the default fail-closed behavior.

If that central-FD gate is blocked by a replicated state, run
``tools/summarize_nonlinear_replicate_spread.py`` on the baseline, plus, and
minus ensemble JSON files before launching more nonlinear simulations. The
tool enriches the ensemble rows with seed/timestep labels and convergence
statistics, writes JSON/CSV/PNG sidecars, and classifies whether the failed
state is seed-limited, timestep-limited, mixed seed/timestep spread, or missing
metadata. The current QA/ESS composite profile-direction diagnostic is
``docs/_static/qa_ess_descent_profile_rel2_replicate_spread_diagnostic.json``:
the plus state is a mixed seed/timestep failure, so the next GPU campaign must
disambiguate timestep sensitivity or shrink the bracket rather than adding
blind replicas.

``tools/write_nonlinear_replicate_followup_campaign.py`` turns that diagnostic
back into a minimal run list. It reads the original
``gradient_campaign_manifest.json`` and the spread diagnostic, infers the seed
and timestep metadata from the already-generated TOMLs, and writes only the
cross variants needed to disambiguate the failed state. For the current QA/ESS
profile-direction audit, the tracked launch artifact is
``docs/_static/qa_ess_descent_profile_rel2_plus_delta_replicate_followup_plan.json``;
it selects ``seed22_dt0p05``, ``seed32_dt0p04``, and ``seed33_dt0p05`` for the
``plus_delta`` state. After those three GPU runs finish, rebuild the plus
ensemble with the added outputs, rerun
``tools/summarize_nonlinear_replicate_spread.py``, and only then rerun the
central-FD/evidence gates.

``tools/write_optimized_equilibrium_transport_configs.py`` is the production
optimization companion for that final audit. Given a concrete post-optimization
``wout*.nc`` file, it writes the ``t=250,350,450,700`` fixed-step nonlinear
ladder on the release ``n64`` grid, two seed replicates, one timestep
replicate, restart-copy commands, and the exact
``tools/build_external_vmec_replicate_ensemble.py`` plus
``tools/check_production_nonlinear_optimization_guard.py`` commands needed
after the runs finish. This wrapper is a launch contract only: the production
optimization claim remains blocked until the generated ``t=[350,700]`` ensemble
actually passes finite-flux, running-window, block/SEM, replicate-spread, and
optimized-equilibrium marker gates.

``tools/prepare_external_vmec_holdout_from_screen.py`` is the selector that
feeds that generator. It reads the tracked linear candidate screen, skips
excluded or already-audited cases, resolves the chosen VMEC file from the local
``vmec_jax`` checkout, and writes the next bounded holdout ladder plus a JSON
selection summary. This removes another manual step from the external-VMEC
nonlinear campaign and makes office reruns deterministic.

``tools/build_external_vmec_holdout_runbook.py`` is stricter than a positive
growth-rate sorter. It requires a configurable minimum screened growth rate
(``gamma >= 0.02`` by default) before writing nonlinear launch commands. This
keeps near-marginal branches in the manuscript evidence chain as linear/QI
feasibility data without silently promoting them to expensive nonlinear
transport holdout campaigns.

``tools/build_qi_branch_refinement_gate.py`` is the focused companion for that
near-marginal QI evidence. It checks finite low-``k_y`` branch rows, contiguous
positive support, optional Krylov consistency, and the same nonlinear-launch
growth threshold. A failed launch-growth subgate is a useful documented result,
not a release failure, because it prevents QI feasibility scans from being
misread as transport validation.

``tools/write_w7x_zonal_closure_sweep.py`` is the analogous reproducibility
companion for the open W7-X zonal-response lane. It writes a manifest of
single-``k_x`` closure probes for the paper-facing test-4 contract, separated
by operator family: baseline, constant-Hermite, ``|k_z|``-weighted Hermite,
mixed Laguerre-Hermite, Laguerre-only, and isotropic hypercollision variants.
The manifest includes the exact
``tools/generate_w7x_zonal_response_panel.py`` launch commands plus the
companion ``tools/plot_w7x_zonal_closure_ladder.py`` command needed to refresh
the bounded closure audit after the remote runs complete. Each launch command
writes a case-local ``panel.png`` and the final ladder command writes
``w7x_zonal_closure_ladder_full.{png,json,csv}``, preventing exploratory
office runs from overwriting the frozen documentation figure before the
candidate passes the residual, late-envelope, and moment-tail screens.

``tools/check_quasilinear_calibration_inputs.py`` is the corresponding
calibration-admission guard. It scans quasilinear train/holdout reports and
requires every non-audit nonlinear artifact to match a passed nonlinear gate.
This makes validation provenance executable: finite-but-unconverged pilots can
be documented in the docs, but they cannot silently become calibration or
optimization data. The public CI runs this audit during the docs/packaging
job, and the fast test suite checks the current tracked train/holdout reports
against the same gate index.

``tools/check_quasilinear_promotion_guardrails.py`` is the higher-level
absolute-flux promotion guard. It scans the tracked quasilinear reports plus
the claim-scope docs, fails if a promoted report lacks train/holdout points,
finite nonlinear window statistics, a passed holdout gate, or calibration
policy metadata, and writes
``docs/_static/quasilinear_promotion_guardrails.json`` with a normal
``gate_report`` for the validation index. This is not a runtime/TOML
absolute-flux predictor; it is a fast metadata and wording guard that prevents
overclaiming current diagnostics.
The model-development figure scripts for saturation-rule sweeps,
shape-aware saturation, and uncertainty-aware candidate scoring also validate
their nonlinear summary inputs by default and serialize an ``input_validation``
block into the tracked JSON artifacts.

The diagnostics stream now also carries ``Diagnostics/Phi_zonal_mode_kxt``, a
signed complex zonal-potential history reduced over ``z`` with the same volume
weights used elsewhere. That is the primitive to use for manuscript-grade
Rosenbluth-Hinton / GAM work. ``Diagnostics/Phi2_zonal_t`` remains useful as a
zonal-energy proxy for intermediate checks, but it is no longer the target
observable for the final paper lane.
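
The difference between the signed primitive and the energy proxy can be seen
in a small sketch. It assumes that the reduction is a weighted mean over the
``z`` samples and that the energy proxy is the weighted mean of
``|phi|^2``; the function names and the sample values are hypothetical and do
not reproduce the diagnostics code.

.. code-block:: python

   def zonal_mode_reduce(phi_z, weights):
       """Weighted mean over z of a complex zonal (k_y = 0) potential."""
       if len(phi_z) != len(weights):
           raise ValueError("phi_z and weights must have the same length")
       return sum(w * p for w, p in zip(weights, phi_z)) / sum(weights)


   def zonal_energy_proxy(phi_z, weights):
       """Weighted mean over z of |phi|^2 (assumed form of an energy proxy)."""
       if len(phi_z) != len(weights):
           raise ValueError("phi_z and weights must have the same length")
       return sum(w * abs(p) ** 2 for w, p in zip(weights, phi_z)) / sum(weights)


   # A profile that alternates in sign along z: the signed reduction cancels,
   # while the energy proxy cannot distinguish it from a uniform profile.
   phi_z = [3 + 4j, -3 - 4j, 3 + 4j, -3 - 4j]
   weights = [1.0, 1.0, 1.0, 1.0]
   signed = zonal_mode_reduce(phi_z, weights)
   energy = zonal_energy_proxy(phi_z, weights)

The signed reduction keeps the phase and sign information that residual and
GAM fits need, which a quadratic quantity discards.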

The first case-specific shaped-Miller pilot for this lane is now reproducible
through ``examples/benchmarks/runtime_miller_zonal_response.toml`` and
``tools/generate_miller_zonal_response_pilot.py``. Its frozen artifact lives in
``docs/_static/miller_zonal_response_pilot.png``. The current frozen artifact
is pinned to Merlo et al. Case III: adiabatic electrons, zero gradients,
``k_xρ_i≈0.05``, ``k_y=0``, and an initial ion-density perturbation.  It uses
``Nz=32``, ``Nl=4``, ``Nm=24``, ``dt=0.005``, and runs to ``t≈60`` through the
same checkpoint-capable artifact writer used by long nonlinear runs.  Using the
Rosenbluth-Hinton convention ``phi(t -> infinity) / phi(0)`` gives a residual
of about ``0.192`` against the Merlo Case-III figure read-off of about
``0.19``.  The shipped extraction now follows the paper convention more
closely: positive and negative extrema of the signed residual-subtracted trace
are fit separately over a common pre-recurrence window, and the GAM frequency
is extracted from the instantaneous phase of that same window via a Hilbert
analytic signal.  With the current ``t≈30`` pre-recurrence window the artifact
gives ``ω_GAM R0 / v_i≈2.20`` and ``γ_GAM R0 / v_i≈-0.176``, both close to
the Merlo figure read-off.  The explicit remaining follow-up item is the
long-time recurrence visible in finite moment runs, rather than the
benchmark-scale residual/frequency/damping gate itself.
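
The extraction steps can be sketched on a synthetic damped cosine. This is a
minimal illustration under stated assumptions, not the shipped extraction: the
residual is taken as a late-window mean, the damping comes from separate
log-linear fits to the positive and negative extrema, and the frequency is
estimated from zero-crossing spacing rather than from the Hilbert phase, so
that the sketch needs only the standard library. All function names, the fit
windows, and the synthetic parameters are hypothetical.

.. code-block:: python

   import math


   def local_extrema(t, y):
       """Split interior extrema into positive peaks and negative troughs."""
       peaks, troughs = [], []
       for i in range(1, len(y) - 1):
           if y[i] > 0 and y[i - 1] < y[i] >= y[i + 1]:
               peaks.append((t[i], y[i]))
           elif y[i] < 0 and y[i - 1] > y[i] <= y[i + 1]:
               troughs.append((t[i], -y[i]))
       return peaks, troughs


   def log_slope(points):
       """Least-squares slope of log(amplitude) against time."""
       n = len(points)
       t_mean = sum(p[0] for p in points) / n
       l_mean = sum(math.log(p[1]) for p in points) / n
       num = sum((p[0] - t_mean) * (math.log(p[1]) - l_mean) for p in points)
       den = sum((p[0] - t_mean) ** 2 for p in points)
       return num / den


   def zero_crossings(t, y):
       """Linearly interpolated times at which y changes sign."""
       out = []
       for i in range(len(y) - 1):
           if y[i] * y[i + 1] < 0:
               frac = y[i] / (y[i] - y[i + 1])
               out.append(t[i] + frac * (t[i + 1] - t[i]))
       return out


   # Synthetic stand-in for a normalized zonal trace phi(t) / phi(0).
   residual_true, gamma_true, omega_true = 0.19, -0.176, 2.20
   dt = 0.005
   t = [i * dt for i in range(6001)]
   phi = [
       residual_true
       + (1.0 - residual_true) * math.exp(gamma_true * s) * math.cos(omega_true * s)
       for s in t
   ]

   # Residual: mean over a late window (assumed here to be t >= 20).
   late = [p for s, p in zip(t, phi) if s >= 20.0]
   residual = sum(late) / len(late)

   # Signed residual-subtracted trace on a common fit window (assumed t <= 20).
   tw = [s for s in t if s <= 20.0]
   yw = [p - residual for s, p in zip(t, phi) if s <= 20.0]

   peaks, troughs = local_extrema(tw, yw)
   gamma_fit = 0.5 * (log_slope(peaks) + log_slope(troughs))

   crossings = zero_crossings(tw, yw)
   omega_fit = math.pi * (len(crossings) - 1) / (crossings[-1] - crossings[0])

Fitting the positive and negative extrema separately means that a small error
in the subtracted residual biases the two envelopes in opposite directions, so
averaging the two slopes cancels that bias to first order.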

An additional recurrence audit now brackets the numerical trade-off more
explicitly: raising the Hermite resolution to ``Nm=28`` at fixed ``Nl=4`` lowers the
late-time recurrence ratio from about ``0.60`` to about ``0.54`` and brings
``ω_GAM R0 / v_i`` nearly onto the Merlo read-off, but it also pushes the
damping to roughly ``γ_GAM R0 / v_i≈-0.192``, which is more damped than the
paper-scale target near ``-0.17``. A minimal ``hypercollisions_const`` ladder
through ``10^{-4}`` is effectively inert for this case, while ``10^{-3}``
only lowers the recurrence ratio to roughly ``0.589`` and still does not beat
the clean higher-moment run. The shipped artifact therefore remains on the
``Nm=24``, ``Nl=4`` baseline until the long-time recurrence can be reduced
without moving the benchmark-scale damping gate.

The next literature lane now has a dedicated runtime contract as well:
``examples/benchmarks/runtime_w7x_zonal_response_vmec.toml`` and
``tools/generate_w7x_zonal_response_panel.py`` define the W7-X high-mirror
bean-tube zonal-flow relaxation benchmark from the stella/GENE paper. The
tool sweeps ``k_x rho_i`` over ``[0.05, 0.07, 0.10, 0.30]``. The runtime
contract seeds the published electrostatic-potential perturbation with
``init_field = "phi"`` and a Gaussian profile, while the panel extracts the
unweighted signed line-average diagnostic ``Phi_zonal_line_kxt``. The paper
text states that the line-average trace is normalized to its value at ``t=0``;
the caption also mentions the maximum value, but the source figure is clipped
at the initial point. The paper-facing default is therefore
``--initial-normalization=line_first`` and ``--time-scale=1``. The ``init_amp``
normalization and non-unit time-scale options are retained as explicit audits,
not as the validation contract. The default early-time fit-window cap is an
explicit analysis policy chosen to isolate the initial GAM before the slower
stellarator-specific oscillation. The generator forces a periodic radial box
for this ``k_y=0`` zonal response so the selected ``k_x rho_i`` values match
the published test-4 targets exactly; this avoids the linked-boundary
aspect-ratio override that is appropriate for drift-wave flux-tube runs but
wrong for this radial zonal scan.
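
The line-first normalization itself is simple to state in code. The sketch
below is a minimal illustration; the function name and the ``tiny`` threshold
are hypothetical and are not the options parsed by the panel generator.

.. code-block:: python

   def normalize_line_first(trace, tiny=0.0):
       """Divide a signed trace by its first sample with magnitude above `tiny`."""
       for value in trace:
           if abs(value) > tiny:
               return [v / value for v in trace]
       raise ValueError("trace has no nonzero sample to normalize by")


   # The leading zero is skipped; the sign of later samples is preserved.
   normalized = normalize_line_first([0.0, 0.5, 0.25, -0.125])

Because the divisor is a sample of the observable itself, the normalized trace
starts at one regardless of ``init_amp``, which is what distinguishes this
convention from the ``init_amp``-normalized audit.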

The current frozen VMEC-backed artifact lives at
``docs/_static/w7x_zonal_response_panel.png`` with strict JSON metadata at
``docs/_static/w7x_zonal_response_panel.json``. The tracked combined trace CSV
``docs/_static/w7x_zonal_response_panel.traces.csv`` is written next to the
figure so comparison and audit scripts can be rerun without office-only
per-``k_x`` directories. It is a long-window run: ``k_x rho_i=0.05`` reaches
``t≈3460`` and the other three wavelengths reach ``t≈1980``. After the
paper-faithful line-first normalization, the late residuals are about
``0.0189``, ``0.137``, ``0.0938``, and ``0.526`` for ``k_x rho_i = 0.05``,
``0.07``, ``0.10``, and ``0.30``.

``tools/digitize_w7x_zonal_reference.py`` now extracts the stella/GENE Fig. 11
main traces and inset residual levels from the arXiv source ``figs/ZF.pdf``.
The resulting reference artifacts are
``docs/_static/w7x_zonal_reference_digitized.csv``,
``docs/_static/w7x_zonal_reference_digitized_residuals.csv``,
``docs/_static/w7x_zonal_reference_digitized.json``, and
``docs/_static/w7x_zonal_reference_digitized.png``. The comparison contract is
implemented in ``tools/compare_w7x_zonal_reference.py`` and materialized at
``docs/_static/w7x_zonal_reference_compare.png`` with JSON metadata in
``docs/_static/w7x_zonal_reference_compare.json``. The current long-window
artifact passes the time-coverage gate for all four wavelengths, but the
residual gate only passes at ``k_x rho_i=0.05`` and the late-envelope gate
fails by orders of magnitude. A previous ``init_amp``-normalized audit happened
to pass the residual gate at all four wavelengths, but that comparison is no
longer treated as a validation result because it does not follow the
normalization stated in the paper text. A later ``gaussian_width=4`` probe matched the clipped apparent
initial level of Fig. 11 better than the tracked width-1 profile, but the
source figure shows that the apparent ``0.8`` start is a plot-limit artifact,
not a reliable normalization target. The tracked TOML therefore keeps
``gaussian_width=1``, matching the source expression ``exp[-(z-z0)^2]``.
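
A short sketch shows how the Gaussian width changes the initial line average.
It assumes a uniform ``z`` grid on ``[-pi, pi)``, an unweighted mean, and that
the width enters as ``exp[-((z-z0)/width)^2]``; these are illustrative
assumptions, and the function names are hypothetical rather than the runtime
initializer.

.. code-block:: python

   import math


   def gaussian_profile(z, z0=0.0, width=1.0):
       """exp[-((z - z0) / width)^2]; width=1 gives exp[-(z - z0)^2]."""
       return math.exp(-(((z - z0) / width) ** 2))


   def line_average(values):
       """Unweighted mean over the z samples."""
       return sum(values) / len(values)


   nz = 64
   z = [-math.pi + 2.0 * math.pi * i / nz for i in range(nz)]

   avg_w1 = line_average([gaussian_profile(s) for s in z])
   avg_w4 = line_average([gaussian_profile(s, width=4.0) for s in z])

   # Closed form for width=1 on [-pi, pi]: sqrt(pi) * erf(pi) / (2 * pi).
   analytic_w1 = math.sqrt(math.pi) * math.erf(math.pi) / (2.0 * math.pi)

Under these assumptions the width-1 line average is about ``0.28209`` times
the peak amplitude, and a wider Gaussian raises that initial level
substantially. This is why a width change can alter the apparent starting
level of an ``init_amp``-referenced trace without being evidence about the
correct normalization.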

The runtime path now has three safeguards for this lane. First, strided nonlinear
diagnostics always retain the final step, so long traces do not silently stop
one stride before the intended horizon. Second, checkpointed artifact
generation validates each chunk for non-finite diagnostics, state, and fields
before writing or continuing. This makes high-moment W7-X recurrence sweeps
fail fast instead of running thousands of extra steps after a NaN. Third,
default VMEC/eik cache outputs are reused when valid and generated through a
unique temporary netCDF followed by atomic replacement, so parallel W7-X
validation sweeps cannot observe or corrupt a partially written geometry file.

A bounded
``k_x rho_i=0.07``, ``Nl=16``, ``Nm=64``, ``dt=0.05`` probe remained finite to
``t≈200`` and a post-fix ``t≈50`` rerun verified nonzero signed line-average
diagnostics through the retained final sample. A separate external-restart
artifact bug was then isolated to double-condensing already-active ``kx``/``ky``
diagnostic axes when appending loaded history. The writer now accepts either
full spectral axes or already-active GX output axes, and a W7-X VMEC external
resume smoke verified nonzero ``Phi_zonal_line_kxt`` and
``Phi_zonal_mode_kxt`` throughout the appended tail. A higher-moment follow-up
with ``Nl=16``, ``Nm=64``, ``dt=0.05`` then restart-continued the
``k_x rho_i=0.07`` trace to ``t≈100`` with finite diagnostics and nonzero
signed line/mode samples across the post-restart tail. A full four-wavelength
refresh at the same moment resolution also reached ``t≈100`` with finite,
nonzero signed traces for every target ``k_x rho_i``. A width-4 full-window
low-moment audit reached the digitized windows but flipped the residual sign at
``k_x rho_i=0.07``, ``0.10``, and ``0.30``. The remaining open item is
therefore not restart diagnostic continuity; it is the W7-X zonal damping,
closure, and velocity-space recurrence behavior under the paper-facing
line-first normalization.

``tools/plot_w7x_zonal_contract_audit.py`` turns the same tracked CSV/JSON
artifacts into ``docs/_static/w7x_zonal_contract_audit.png``. That panel is a
publication-facing diagnostic of the open mismatch rather than a release gate;
its JSON metadata has ``gate_index_include=false`` so the validation index does
not count it as closed.

``tools/plot_w7x_zonal_moment_tail_audit.py`` adds a no-rerun velocity-space
audit at ``docs/_static/w7x_zonal_moment_tail_audit.png``. It shows that the
long ``Nl=8``, ``Nm=32`` traces have large late normalized-trace standard
deviations and non-negligible final high-Hermite/high-Laguerre free-energy
fractions. The existing ``Nl=16``, ``Nm=64``, ``t≈100`` audit lowers the early
trace standard deviation but already carries a large high-Hermite tail, so the
next closure experiment should be a bounded moment/closure or recurrence
control sweep, not a change to the paper normalization.

``tools/plot_w7x_zonal_closure_ladder.py`` makes that bounded sweep explicit
for ``k_x rho_i=0.07`` in
``docs/_static/w7x_zonal_closure_ladder_kx070.png``. The ladder separates
closure families one knob at a time under the paper-facing initializer and
line-average observable. The refreshed office-GPU ladder covers baseline,
constant Hermite, ``k_z``-weighted Hermite, mixed Laguerre-Hermite,
Laguerre-only, and isotropic hypercollision variants at ``0.01`` and
``0.03``. The best early-window trace error is the isotropic ``nu_hyper=0.01``
case with mean absolute error ``0.2755`` versus baseline ``0.2861``, but its
late-window standard-deviation ratio is ``4.25`` versus baseline ``4.10`` and
therefore worsens the recurrence/envelope metric. Laguerre-only and mixed
Laguerre-Hermite closures show the same pattern: strong tail suppression with
no simultaneous improvement of trace error and late envelope. The ladder is
therefore a documented negative result for these bounded closure families, not
a hidden validation setting.

``tools/plot_w7x_zonal_state_convention_audit.py`` closes the state-level
initializer and observable convention layer for the same paper-facing setup.
At ``k_x rho_i=0.07``, ``Nl=16``, and ``Nm=64``, the recovered Gaussian
potential has relative ``L2`` error ``1.85e-6``, off-target spectral potential
content is zero to the reported precision, and the signed line-average and
volume-average helper diagnostics agree with manual reductions to about
``2e-16``. The line-first initial level is ``0.28209 init_amp`` while the
volume-weighted level is ``0.28450 init_amp``; that explicit difference is why
the paper-facing observable must remain ``Phi_zonal_line_kxt`` normalized by
its first nonzero sample.

``tools/plot_w7x_zonal_recurrence_sweep.py`` then performs the bounded
recurrence sweep requested for the paper lane without changing initializer or
normalization conventions. Moment resolution and closure source are varied
separately at ``k_x rho_i=0.07`` over the common ``t v_t/a <= 100`` window.
The no-closure rows give mean absolute reference errors ``0.295`` for
``Nl=8,Nm=32``, ``0.276`` for ``Nl=12,Nm=48``, and ``0.283`` for
``Nl=16,Nm=64``. At fixed ``Nl=16,Nm=64``, constant-source closure suppresses
the final Hermite-tail fraction from ``0.388`` to ``0.062`` but worsens the
trace mean absolute error to ``0.291``; the ``k_z``-weighted closure remains
close to no closure. This separates the remaining recurrence/closure problem
from a state-convention error.

The newest constant-hypercollision follow-up keeps the paper-facing
normalization and compares ``nu_hyper_m=0.01`` and ``0.03`` at
``Nl=16,Nm=64`` to ``t v_t/a=100``. Increasing ``nu_hyper_m`` lowers the final
Hermite-tail fraction from ``0.220`` to ``0.099`` and lowers the free-energy
ratio from ``0.759`` to ``0.600``, but the mean trace error remains
``0.289`` and the late-window standard deviation remains more than four times
the digitized reference. The W7-X zonal lane therefore remains a physical
closure/recurrence problem, not a normalization problem and not a simple
constant-damping fix.

The mixed Laguerre-Hermite closure audit then tests the best bounded closure
candidate under a moment-resolution increase. At ``Nl=16,Nm=64`` and
``dt=0.05``, the mixed closure gives mean absolute trace error ``0.2753`` and
late-window standard-deviation ratio ``4.24``. Raising the resolution to
``Nl=24,Nm=96`` requires ``dt=0.025`` for a finite run; it lowers the
late-window standard-deviation ratio slightly to ``4.11`` and further reduces
the Hermite/Laguerre tail fractions, but the trace error remains ``0.2768``.
The more aggressive ``Nl=32,Nm=128`` run still becomes non-finite by
``t v_t/a≈10`` even at ``dt=0.025``. This separates a real high-moment
time-step limitation from the larger physical result: the current mixed
closure does not converge toward the digitized W7-X trace in a way that can be
promoted as validation.

``tools/generate_w7x_zonal_response_panel.py`` now exposes explicit
``--nu-hyper``, ``--nu-hyper-l``, ``--nu-hyper-m``, ``--nu-hyper-lm``,
``--p-hyper-*``, ``--hypercollisions-const``, ``--hypercollisions-kz``,
``--enable-hypercollisions``, and ``--gaussian-width`` overrides so future
closure probes can be launched from the tracked benchmark tool rather than from
unrecorded local TOML edits. Non-unit Gaussian widths remain initializer
audits, not validation defaults.
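
The two trace metrics quoted for the closure and recurrence audits can be
sketched as follows. The definitions used here are assumptions made for
illustration: the mean absolute error is evaluated on the reference time base
after linear interpolation, and the late-window ratio divides the candidate
standard deviation by the reference standard deviation for ``t >= t_start``.
The function names and the synthetic traces are hypothetical and do not
reproduce the tracked plotting tools.

.. code-block:: python

   import math
   from bisect import bisect_right


   def interp(s, t, y):
       """Linearly interpolate the samples (t, y) at time s, with t increasing."""
       if not t[0] <= s <= t[-1]:
           raise ValueError("s lies outside the sampled time range")
       i = min(max(bisect_right(t, s), 1), len(t) - 1)
       w = (s - t[i - 1]) / (t[i] - t[i - 1])
       return (1.0 - w) * y[i - 1] + w * y[i]


   def mean_abs_error(t, y, t_ref, y_ref):
       """Mean |candidate - reference| evaluated on the reference times."""
       return sum(abs(interp(s, t, y) - r) for s, r in zip(t_ref, y_ref)) / len(t_ref)


   def std(values):
       """Population standard deviation."""
       mean = sum(values) / len(values)
       return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


   def late_std_ratio(t, y, t_ref, y_ref, t_start):
       """Ratio of late-window standard deviations, candidate over reference."""
       late = [v for s, v in zip(t, y) if s >= t_start]
       late_ref = [v for s, v in zip(t_ref, y_ref) if s >= t_start]
       return std(late) / std(late_ref)


   # Synthetic traces: the candidate oscillates four times more strongly.
   t = [0.5 * i for i in range(201)]
   reference = [0.1 * math.cos(s) for s in t]
   candidate = [4.0 * r for r in reference]
   shifted = [r + 0.25 for r in reference]

   ratio = late_std_ratio(t, candidate, t, reference, t_start=50.0)
   mae = mean_abs_error(t, shifted, t, reference)

Keeping the two metrics separate matters for the negative results above: a
closure can lower the mean trace error slightly while leaving the late-window
ratio near four, so neither number alone is sufficient for promotion.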

.. figure:: _static/w7x_zonal_response_panel.png
   :alt: W7-X high-mirror bean-tube zonal-flow response panel

   W7-X high-mirror bean-tube zonal-flow response for the stella/GENE test-4
   target ``k_x rho_i`` values. The response is normalized to the first
   nonzero line-average sample, following the paper text. The red dashed line
   is the late-window residual estimate and the shaded band is the common
   initial-GAM extraction window.

.. figure:: _static/w7x_zonal_reference_digitized.png
   :alt: Digitized W7-X test-4 stella and GENE zonal-flow reference traces

   Digitized stella/GENE reference traces from the W7-X benchmark paper's
   Fig. 11. The horizontal lines are residual levels read from the figure
   insets and are the reference targets for the next long-window SPECTRAX
   zonal-response gate.

.. figure:: _static/w7x_zonal_reference_compare.png
   :alt: Current W7-X zonal SPECTRAX comparison against digitized references

   Current W7-X zonal comparison gate. Time coverage passes for all four
   wavelengths, but the paper-normalized residuals and late-window envelopes
   remain open validation issues.

.. figure:: _static/w7x_zonal_contract_audit.png
   :alt: W7-X zonal-response literature-contract audit

   Publication-facing audit of the open W7-X test-4 zonal-response lane. The
   top row separates residual and late-envelope discrepancies; the bottom row
   overlays representative paper-normalized traces against the digitized
   stella/GENE mean. This figure is intended to localize the remaining
   velocity-space recurrence / closure problem, not to claim validation closure.

.. figure:: _static/w7x_zonal_moment_tail_audit.png
   :alt: W7-X zonal-response velocity-space tail audit

   Velocity-space tail audit for existing W7-X test-4 outputs. The long
   ``Nl=8``, ``Nm=32`` traces have large late normalized-trace variance and
   visible Hermite/Laguerre tail content. The short ``Nl=16``, ``Nm=64`` run
   reduces the early trace envelope but does not by itself close the
   long-window recurrence question.

.. figure:: _static/w7x_zonal_closure_ladder_kx070.png
   :alt: W7-X zonal-response closure ladder at kx rho_i 0.07

   Bounded closure ladder for ``k_x rho_i=0.07``. Constant Hermite,
   ``k_z``-weighted Hermite, mixed Laguerre-Hermite, Laguerre-only, and
   isotropic hypercollision families are compared with the no-closure baseline.
   Some variants reduce mean trace error or velocity-space tails, but none
   improves the trace and late-envelope recurrence metrics together.

.. figure:: _static/w7x_zonal_state_convention_audit.png
   :alt: W7-X zonal-response state convention audit at kx rho_i 0.07

   State-level W7-X test-4 convention audit. The runtime path recovers the
   paper Gaussian potential initializer, selects only the requested zonal
   spectral mode, and verifies that the signed line-average and
   volume-weighted zonal observables are intentionally distinct but internally
   consistent.

.. figure:: _static/w7x_zonal_recurrence_sweep_kx070.png
   :alt: W7-X zonal-response recurrence sweep at kx rho_i 0.07

   Bounded W7-X test-4 recurrence sweep at ``k_x rho_i=0.07``. The left trace
   panel varies moment resolution with no closure; the right trace panel varies
   closure source at fixed high resolution. The bottom panels show that tail
   suppression alone does not yet close the literature-trace mismatch.

.. figure:: _static/w7x_zonal_hypercollision_probe_kx070.png
   :alt: W7-X zonal-response constant hypercollision probe at kx rho_i 0.07

   Constant-Hermite-hypercollision follow-up for ``k_x rho_i=0.07``. Stronger
   constant damping reduces Hermite-tail and free-energy metrics but does not
   reduce the long-window trace error or recurrence envelope enough to match
   the digitized stella/GENE reference. This is a documented negative result
   that motivates a more physical closure/operator study.

.. figure:: _static/w7x_zonal_mixedlm_resolution_kx070.png
   :alt: W7-X zonal-response mixed Laguerre-Hermite resolution audit at kx rho_i 0.07

   Mixed Laguerre-Hermite closure resolution audit for ``k_x rho_i=0.07``. The
   ``Nl=24,Nm=96`` run is finite only with the smaller ``dt=0.025`` and lowers
   the late-window variability modestly, but it does not improve the trace
   error relative to ``Nl=16,Nm=64``. The omitted ``Nl=32,Nm=128`` point is a
   tracked non-finite result under the same closure family, so this remains an
   open physics/numerics lane rather than a closed W7-X zonal validation.

Diffrax and nonlinear smoke tests
---------------------------------

Diffrax integration and the nonlinear driver are exercised with fast smoke
tests:

- ``tests/test_diffrax_integrators.py`` runs explicit and IMEX diffrax solvers
  on tiny grids.
- ``tests/test_diffrax_integrators_core.py`` hardens branch coverage for
  diffrax helper paths (solver selection, save modes, streaming fits, IMEX
  branches, parallelization, and validation errors).
- ``tests/test_linear_krylov_core.py`` hardens matrix-free Krylov internals
  (mode-family targeting, shift-invert preconditioner selection, fallback
  policy, and dominant eigenpair wrappers).
- ``tests/test_example_smoke.py`` verifies the config-driven runner (diffrax
  enabled) and a short nonlinear scan through the assembled E×B nonlinear
  bracket.
- ``tests/test_nonlinear_exb.py`` exercises the nonlinear bracket sign,
  real-FFT path, flutter coupling, scalar/precomputed gyroaverage paths, and
  EM component accounting. The targeted nonlinear-term tranche covers the
  pseudo-spectral bracket and electromagnetic decomposition branches without
  launching benchmark-size turbulence runs.
- ``tests/test_nonlinear_helpers_extra.py`` locks the higher-level nonlinear
  diagnostic contracts: Hermitian real-FFT projection, signed-mode masks,
  explicit Runge-Kutta variants, fixed-mode frequency extraction, collision
  splitting, and IMEX nonlinear terms.
- ``tests/test_runtime_config.py`` and ``tests/test_runtime_runner.py`` verify
  unified runtime TOML loading and case-agnostic linear runs (Cyclone/ETG/KBM)
  through the same solver path.
- ``tests/test_runtime_config.py`` also locks the public nonlinear stellarator
  runtime contract, including the absence of adaptive-step truncation caps and
  the presence of default ``tools_out/...`` artifact paths for W7-X and HSX.
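
As an illustration of the kind of fast, deterministic check this list
describes, the sketch below measures the observed convergence order of an
explicit midpoint Runge-Kutta step on a scalar linear test problem. It is a
generic pattern written for this page, not code from the listed test files;
the function name and the test problem are hypothetical.

.. code-block:: python

   import cmath
   import math


   def rk2_midpoint(rhs, y0, t_end, n_steps):
       """Integrate dy/dt = rhs(y) with the explicit midpoint method."""
       dt = t_end / n_steps
       y = y0
       for _ in range(n_steps):
           k1 = rhs(y)
           k2 = rhs(y + 0.5 * dt * k1)
           y = y + dt * k2
       return y


   # Damped oscillator test problem dy/dt = lam * y with a known solution.
   lam = complex(-0.1, 1.0)


   def rhs(y):
       return lam * y


   exact = cmath.exp(lam * 1.0)
   err_coarse = abs(rk2_midpoint(rhs, 1.0 + 0.0j, 1.0, 50) - exact)
   err_fine = abs(rk2_midpoint(rhs, 1.0 + 0.0j, 1.0, 100) - exact)

   # Halving the step should reduce the error by about 2**2 for a
   # second-order method.
   observed_order = math.log2(err_coarse / err_fine)

A check of this form runs in milliseconds and asserts a convergence rate
rather than a tuned tolerance, which keeps it deterministic.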

Parallelization identity gates
------------------------------

Independent scan and ensemble parallelization is tested before it is used for
performance claims:

- ``tests/test_parallel.py`` locks the ``batch_map`` / ``ky_scan_batches``
  helper semantics, including deterministic padding, one-device fallback, and
  pytree outputs used by UQ and sensitivity workflows.
- ``tests/test_velocity_sharding.py`` locks the GX-inspired species/Hermite
  velocity-decomposition planner. These tests verify load balance metadata,
  Hermite ghost-exchange flags, and field-reduction axes before any production
  ``shard_map`` implementation can use that layout. The same test file also
  covers the full-array Hermite-neighbor reference and one-device fallback for
  the communication kernel.
- ``tests/test_sharded_integrators.py`` locks the sharded linear RK2 wrapper in
  both no-sharding and explicit-sharding modes using a mocked RHS and mocked
  ``pjit``. It also locks the fixed-step nonlinear state-sharded wrapper,
  including final-state-only profiling mode and the config-runner route through
  ``TimeConfig.state_sharding``. These are numerical-identity and control-flow
  gates, not speedup claims.
- ``tests/test_nonlinear_domain_parallel.py`` and
  ``tests/test_nonlinear_spectral_communication_gate.py`` lock the diagnostic
  nonlinear decomposition gates. The first covers one-cell halo chunks for a
  bounded local stencil. The second covers split/reassemble spectral layout
  identity for FFT round trip, pseudo-spectral bracket, and field-solve layout.
  Both fail closed and carry no production routing or speedup claim.
- ``tests/test_generate_parallel_ky_scan_gate.py`` tests the artifact writer
  for the real Cyclone ``k_y``-batch gate.
- ``tests/test_parallel_artifact_contracts.py`` locks the tracked large-run
  scaling artifacts themselves. It requires the performance and validation
  manifests to list the CPU/GPU split artifacts, verifies serial numerical
  identity for independent ``k_y`` and quasilinear/UQ rows, checks that
  nonlinear whole-state sharding embeds per-device profiler/profile payloads,
  and fails if docs detach speedup wording from the current artifact set.
- ``tools/generate_parallel_ky_scan_gate.py`` runs the actual linear solver
  serially and with fixed-shape ``k_y`` batching, then writes
  ``docs/_static/parallel_ky_scan_gate.{png,pdf,csv,json}``. The JSON gate
  requires numerical identity for growth rate and frequency; the speedup value
  is reported separately for engineering tracking.
- ``tools/generate_logical_cpu_parallel_scan_gate.py`` exercises
  ``RuntimeParallelConfig`` and ``batch_map`` over logical CPU devices with a
  structured JAX-native scan output. Its artifact
  ``docs/_static/logical_cpu_parallel_scan_gate.{png,pdf,csv,json}`` is an API
  identity gate, not a gyrokinetic physics benchmark.
- ``tools/generate_hermite_exchange_gate.py`` runs the first actual
  ``jax.shard_map`` communication-kernel gate for nearest-neighbor Hermite
  ghost exchange and writes
  ``docs/_static/hermite_exchange_gate.{png,pdf,csv,json}``. This is a
  prerequisite for production velocity-space decomposition, but it is not a
  nonlinear runtime speedup claim.
- ``tools/generate_velocity_field_reduce_gate.py`` runs the matching
  ``jax.shard_map`` field-reduction gate with ``lax.psum`` over the Hermite
  mesh and writes
  ``docs/_static/velocity_field_reduce_gate.{png,pdf,csv,json}``. Its
  tolerance is a float32 communication/reduction-tree tolerance, not a physics
  acceptance tolerance.
- ``tools/generate_electrostatic_field_reduce_gate.py`` applies that reduction
  pattern to the production electrostatic quasineutrality density moment and
  writes ``docs/_static/electrostatic_field_reduce_gate.{png,pdf,csv,json}``.
  It is currently scoped to single-species periodic electrostatic cases.
- ``tools/generate_hermite_streaming_ladder_gate.py`` combines the Hermite
  exchange with the actual ``sqrt(m+1)`` / ``sqrt(m)`` streaming-ladder
  coefficients and writes
  ``docs/_static/hermite_streaming_ladder_gate.{png,pdf,csv,json}``. This is
  the last isolated communication/coefficient gate before a linear streaming
  microkernel can be wired.
- ``tools/generate_electrostatic_drift_gate.py`` gates the single-species
  periodic electrostatic mirror and curvature/grad-B drift slices against the
  production linear RHS. It uses offset-1 and offset-2 Hermite exchanges and
  writes ``docs/_static/electrostatic_drift_gate.{png,pdf,csv,json}``.
- ``tools/generate_electrostatic_diamagnetic_gate.py`` gates the
  single-species periodic electrostatic diamagnetic drive against the
  production diamagnetic-only linear RHS. It uses the Hermite-sharded
  electrostatic field reduction plus local ``m=0`` and ``m=2`` drive masks and
  writes ``docs/_static/electrostatic_diamagnetic_gate.{png,pdf,csv,json}``.
- ``tools/generate_periodic_streaming_microkernel_gate.py`` adds the periodic
  spectral parallel derivative and compares the shard-map path directly
  against ``spectraxgk.terms.operators.streaming_term``. Its artifact
  ``docs/_static/periodic_streaming_microkernel_gate.{png,pdf,csv,json}``
  gates the first opt-in linear streaming microkernel before full RHS wiring.
- ``tools/generate_linear_rhs_streaming_gate.py`` routes the same sharded
  periodic streaming kernel through production ``linear_rhs_cached`` with all
  non-streaming terms and electromagnetic channels disabled. Its artifact
  ``docs/_static/linear_rhs_streaming_gate.{png,pdf,csv,json}`` is the first
  full-call-graph linear-RHS identity gate for velocity-space streaming.
- ``tools/generate_linear_rhs_streaming_electrostatic_gate.py`` repeats that
  gate with an ``m=0`` density perturbation and nonzero electrostatic ``phi``.
  Its artifact
  ``docs/_static/linear_rhs_streaming_electrostatic_gate.{png,pdf,csv,json}``
  gates the field-reduction-to-streaming call graph for the current
  single-species periodic electrostatic route.
- ``tools/generate_linear_rhs_electrostatic_slices_gate.py`` compares the
  composed opt-in ``backend="electrostatic_linear_slices"`` route against
  serial ``linear_rhs_cached`` with streaming, mirror, curvature, grad-B, and
  diamagnetic drive enabled. Its artifact
  ``docs/_static/linear_rhs_electrostatic_slices_gate.{png,pdf,csv,json}``
  is the current single-species periodic electrostatic linear-RHS identity
  gate for velocity-space parallelization.
- ``tools/profile_linear_rhs_parallel_slices.py`` times that same composed
  route on a larger bounded CPU workload and writes
  ``docs/_static/linear_rhs_parallel_slices_profile.{png,pdf,csv,json}``.
  The tracked profile is explicitly an engineering artifact, not a publication
  speedup claim; it uses a Hermite-heavy workload and a float32
  reduction-order tolerance so the stricter composed identity gate remains the
  release correctness check. The office GPU companion artifact
  ``docs/_static/linear_rhs_parallel_slices_profile_gpu.{png,pdf,csv,json}``
  is currently a negative performance baseline: it passes identity but is much
  slower than the single-GPU serial JIT path.
- ``tools/profile_nonlinear_sharding.py`` runs a bounded fixed-step nonlinear
  serial-vs-sharded final-state comparison and writes
  ``docs/_static/nonlinear_sharding_profile.json`` locally and
  ``docs/_static/nonlinear_sharding_profile_office_gpu.json`` for the two-GPU
  office run. The release-gated nonlinear axes are ``auto``/``ky`` and ``kx``;
  ``z``-axis FFT sharding remains an exploratory domain-decomposition lane and
  must pass its own identity gate before it can be exposed as a runtime option.
  This keeps nonlinear state-sharding work profiler-backed while preventing
  unsupported runtime claims from entering the README.
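
The identity-gate pattern behind the Hermite exchange and streaming-ladder
gates can be sketched without any device code. The example below splits the
Hermite index into contiguous shards, exchanges one ghost value with each
neighbor, applies the ``sqrt(m+1)`` / ``sqrt(m)`` ladder coupling locally, and
compares against a full-array reference. It is plain Python rather than
``jax.shard_map``, it omits the parallel derivative and all prefactors, and it
assumes a zero closure beyond the last retained moment; the function names are
hypothetical.

.. code-block:: python

   import math


   def ladder_reference(g):
       """Full-array ladder: sqrt(m+1) * g[m+1] + sqrt(m) * g[m-1]."""
       nm = len(g)
       out = []
       for m in range(nm):
           up = g[m + 1] if m + 1 < nm else 0.0   # assumed zero closure
           down = g[m - 1] if m > 0 else 0.0
           out.append(math.sqrt(m + 1) * up + math.sqrt(m) * down)
       return out


   def ladder_sharded(g, n_shards):
       """Same ladder, computed shard by shard with one-cell ghost exchange."""
       nm = len(g)
       if nm % n_shards != 0:
           raise ValueError("number of moments must divide evenly across shards")
       size = nm // n_shards
       shards = [g[i * size:(i + 1) * size] for i in range(n_shards)]
       out = []
       for s, local in enumerate(shards):
           # Ghost exchange: last value of the lower neighbor and first value
           # of the upper neighbor; zeros at the two ends of the Hermite range.
           lower_ghost = shards[s - 1][-1] if s > 0 else 0.0
           upper_ghost = shards[s + 1][0] if s + 1 < n_shards else 0.0
           padded = [lower_ghost] + local + [upper_ghost]
           for j in range(size):
               m = s * size + j   # global Hermite index
               out.append(math.sqrt(m + 1) * padded[j + 2] + math.sqrt(m) * padded[j])
       return out


   g = [1.0, -0.5, 0.25, 2.0, -1.5, 0.75, 0.1, -0.2]
   reference = ladder_reference(g)
   sharded = ladder_sharded(g, n_shards=4)
   identical = sharded == reference

Because each shard performs the same floating-point operations in the same
order as the reference, this sketch agrees exactly. Gates that involve a
cross-device reduction, such as the field-reduction gates above, instead need
a reduction-order tolerance.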

Nonlinear parity snapshots
--------------------------

Recent GX parity spot checks are tracked outside the automated test suite:

- **Cyclone nonlinear short replay**: the GX ``cyclone_salpha_short.in`` replay
  (``dt=0.05``, ``t_max=5``, collisions off, diagnostics stride 1) now uses the
  explicit short-reference runtime contract in
  ``examples/nonlinear/axisymmetric/runtime_cyclone_nonlinear_short.toml``.
  The main short-run drift turned out to be configuration-level: the replay
  needed ``p_hyper = 2`` and no end damping to match the public GX short input.
  With that contract restored, the tracked comparison improves to
  ``mean_rel_abs(Wphi) ~= 2.11e-1`` and
  ``mean_rel_abs(HeatFlux) ~= 2.51e-1``. The resolved audit remains in
  ``docs/_static/nonlinear_cyclone_short_resolved_audit_t5.{png,csv}``, where
  ``Wphi_kyst`` is still the dominant residual mismatch.
- **Secondary** (``kh01a``): the tracked secondary comparison now uses a dense
  real GX run (``kh01a_shortdense.out.nc``, 10 samples in ``omega_kxkyt``) and
  the rebuilt ``secondary_gx_out_compare.csv``. The comparison helper now uses
  the GX file horizon automatically in ``out-nc`` mode, so it no longer mixes a
  short GX replay with a ``t_max = 100`` SPECTRAX stage-2 run. On the matched
  short window, growth rates match tightly (``max rel_gamma ~= 1.87e-4``) and
  the non-zonal ``omega`` modes also close tightly
  (``rel_omega ~= 3.23e-4`` and ``9.92e-4`` on the ``k_y = 0.1`` sidebands).
  The only large relative ``omega`` values left are the effectively
  zero-frequency ``k_y = 0`` sidebands, where the absolute mismatch stays
  ``O(1e-6)``.
- **W7-X nonlinear** (``t≈200``): the refreshed long-window NetCDF-backed
  comparison now closes at
  ``mean_rel_abs(Phi2) ~= 9.74e-2``,
  ``mean_rel_abs(Wg) ~= 3.20e-2``,
  ``mean_rel_abs(Wphi) ~= 3.02e-2``,
  ``mean_rel_abs(HeatFlux) ~= 4.53e-2``.
- **W7-X fluctuation spectrum**: ``tools/plot_w7x_fluctuation_spectrum_panel.py``
  reuses the same gated nonlinear NetCDF artifact and writes
  ``docs/_static/w7x_fluctuation_spectrum_panel.{png,pdf,json,csv}``. The JSON
  records the time window, dominant nonzonal ``k_y``, dominant heat-flux
  ``k_y``, dominant zonal ``k_x``, and ``claim_level``. This is a reproducible
  simulation diagnostic and explicitly not a Doppler-reflectometry
  transfer-function validation.
- **W7-X/TEM extension status**:
  ``tools/build_w7x_tem_extension_status.py`` reads the W7-X fluctuation panel
  plus the current TEM branch audit and writes
  ``docs/_static/w7x_tem_extension_status.{png,pdf,json,csv}``. It closes only
  the simulation-spectrum estimator. ``tools/build_tem_branch_parity_audit.py``
  writes ``docs/_static/tem_branch_parity_audit.{png,pdf,json,csv}`` from the
  tracked TEM mismatch table. TEM linear parity remains open with maximum
  absolute relative growth-rate mismatch about ``4.25``, maximum absolute
  relative frequency mismatch about ``3.3`` when near-zero reference
  denominators are excluded, one growth-rate sign mismatch, three frequency
  sign mismatches, and an inverted frequency-branch rank ordering
  (Spearman ``≈ -0.986``). Because this reference is a provisional literature
  digitization rather than a direct case dump, the audit blocks broad TEM
  claims but is not a standalone tuning target. W7-X multi-alpha,
  multi-surface, and kinetic-electron nonlinear windows remain unstarted.
- **HSX nonlinear** (``t = 50``): the refreshed comparison closes at
  ``mean_rel_abs(Wg) ~= 2.75e-2``,
  ``mean_rel_abs(Wphi) ~= 3.61e-2``,
  ``mean_rel_abs(HeatFlux) ~= 2.91e-2``.
- **KBM nonlinear** (``t = 100``): the refreshed long-window comparison closes at
  roughly ``9.3e-3`` mean-relative error across
  ``Wg/Wphi/Wapar/HeatFlux/ParticleFlux``.
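
A relative-error summary of the kind quoted above can be sketched as follows.
The exact definition behind ``mean_rel_abs`` in the tracked comparisons is not
stated on this page, so the form below (mean of ``|value - ref| / |ref|`` with
near-zero reference samples excluded) is an assumption, and the ``floor``
parameter is hypothetical.

.. code-block:: python

   def mean_rel_abs(values, reference, floor=0.0):
       """Mean of |value - ref| / |ref| over samples with |ref| > floor."""
       terms = [
           abs(v - r) / abs(r)
           for v, r in zip(values, reference)
           if abs(r) > floor
       ]
       if not terms:
           raise ValueError("no reference sample exceeds the floor")
       return sum(terms) / len(terms)


   # The third sample has a zero reference and is excluded from the mean.
   values = [1.1, 2.0, 1.0e-6]
   reference = [1.0, 2.5, 0.0]
   error = mean_rel_abs(values, reference, floor=1.0e-9)

Excluding near-zero denominators is the same policy that keeps the
effectively zero-frequency ``k_y = 0`` sidebands and the near-zero TEM
reference points from dominating a relative metric.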

.. figure:: _static/w7x_fluctuation_spectrum_panel.png
   :alt: W7-X nonlinear fluctuation-spectrum diagnostic panel
   :width: 100%

   W7-X nonlinear fluctuation-spectrum diagnostic from the gated ``t≈200``
   VMEC-backed run. The panel summarizes resolved simulation spectra and is
   intentionally scoped below an experimental Doppler-reflectometry comparison.

.. figure:: _static/tem_branch_parity_audit.png
   :alt: TEM branch parity audit
   :width: 100%

   Executable TEM branch audit. The growth-rate and frequency branches fail
   simultaneously, with the frequency branch ordered oppositely to the
   digitized reference over the tracked low-``k_y`` interval.
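
The audit quantities quoted for the TEM lane (maximum relative mismatch with
near-zero reference denominators excluded, sign-mismatch counts, and a Spearman
rank correlation) can be sketched in a few lines of NumPy. This is a minimal
illustration under stated assumptions, not the code in
``tools/build_tem_branch_parity_audit.py``: the names ``spearman`` and
``branch_audit``, the ``denom_floor`` threshold, and the tie-free ranking are
all hypothetical choices made for the example.

.. code-block:: python

   import numpy as np

   def spearman(a, b):
       """Spearman rank correlation for tie-free 1-D samples."""
       rank_a = np.argsort(np.argsort(a)).astype(float)
       rank_b = np.argsort(np.argsort(b)).astype(float)
       return float(np.corrcoef(rank_a, rank_b)[0, 1])

   def branch_audit(code, ref, denom_floor=1e-3):
       """Summarize the mismatch between a computed branch and a reference."""
       code = np.asarray(code, dtype=float)
       ref = np.asarray(ref, dtype=float)
       # Assumption: points with |ref| <= denom_floor are excluded from the
       # relative metric so near-zero denominators do not dominate it.
       keep = np.abs(ref) > denom_floor
       rel = np.abs(code[keep] - ref[keep]) / np.abs(ref[keep])
       return {
           "max_abs_rel": float(rel.max()) if rel.size else float("nan"),
           "sign_mismatches": int(np.sum(np.sign(code) != np.sign(ref))),
           "spearman": spearman(code, ref),
       }

A branch whose values are ordered oppositely to the reference gives a Spearman
coefficient near ``-1`` even when every individual sign agrees, which is why
the rank ordering is reported separately from the sign-mismatch counts.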

.. figure:: _static/w7x_tem_extension_status.png
   :alt: W7-X fluctuation/TEM extension validation status
   :width: 100%

   Executable status of the W7-X fluctuation/TEM extension lane. The released
   simulation-spectrum diagnostic is closed, but TEM linear parity,
   alpha/surface-resolved W7-X scans, and kinetic-electron nonlinear windows
   remain open before broad W7-X/TEM validation claims.

Linear physics checks
---------------------

Before nonlinear validation, we exercise linear physics checks grounded in
published benchmarks and trend tests:

- **ITG/Cyclone base case**: reproduce the standard Cyclone base case growth
  rates and frequencies across a reduced ``k_y`` scan. [Dimits00]_ [Lin99]_
- **GX term-by-term audit**: use the term-dump tooling to compare SPECTRAX-GK
  streaming and linear-kernel RHS components against GX for a single Cyclone
  state (see ``tools/dump_rhs_terms.py`` and ``tools/compare_gx_rhs_terms.py``).
- **GX nonlinear term audit (KBM/Cyclone)**: compare nonlinear
  derivative, bracket, electromagnetic split, and total RHS dumps using
  ``tools/compare_gx_nonlinear_terms.py``. The tool supports GX dump folders
  with ``nl_apar.bin``/``nl_bpar.bin`` and can infer shape metadata when
  ``rhs_terms_shape.txt`` is absent.
- **ETG linear instability**: verify that growth rates remain positive across
  reduced electron-scale gradients and that the real frequency follows the
  electron diamagnetic direction. [Dorland00]_ [Jenko00]_
- **KBM beta scan**: verify the transition between ITG-like and KBM branches
  in a fixed-:math:`k_y` beta sweep against the tracked benchmark reference and
  exact-diagnostic audits.
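
Each of these checks ultimately compares a growth rate and a real frequency
extracted from a linear run. A minimal sketch of one common extraction,
assuming the mode amplitude behaves as
:math:`\phi(t) \propto e^{(\gamma - i\omega) t}` over the fit window, is a
least-squares fit to the log-amplitude and the unwrapped phase. The sign
convention, the function name ``fit_gamma_omega``, and the synthetic signal are
assumptions for illustration; the extractors and fit-window policy used by
SPECTRAX-GK may differ.

.. code-block:: python

   import numpy as np

   def fit_gamma_omega(t, phi):
       """Fit (gamma, omega) assuming phi(t) ~ exp((gamma - 1j*omega) * t)."""
       t = np.asarray(t, dtype=float)
       phi = np.asarray(phi, dtype=complex)
       # Slope of log|phi| gives the growth rate.
       gamma = np.polyfit(t, np.log(np.abs(phi)), 1)[0]
       # Slope of the unwrapped phase gives -omega under the assumed convention.
       omega = -np.polyfit(t, np.unwrap(np.angle(phi)), 1)[0]
       return float(gamma), float(omega)

   # Synthetic signal with known gamma = 0.2 and omega = 0.7 (made-up values).
   t = np.linspace(0.0, 10.0, 201)
   phi = 1e-3 * np.exp((0.2 - 1j * 0.7) * t)
   gamma, omega = fit_gamma_omega(t, phi)

With this convention, a trend test such as "growth rate stays positive" or
"frequency has the expected sign" reduces to an assertion on the fitted pair
for each point of the scan.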

Running tests
-------------

.. code-block:: bash

   pytest

Benchmark reproducibility stack
-------------------------------

The public CI and the tracked benchmark atlas are currently validated against a
tested numerical stack:

- ``jax>=0.8,<0.9``
- ``jaxlib>=0.8,<0.9``
- ``numpy>=2.3,<2.4``
- ``diffrax>=0.7,<0.8``
- ``equinox>=0.13,<0.14``

This is not a claim that newer releases are unsupported. It is a statement
about benchmark reproducibility. Near-marginal or branch-sensitive lanes such
as TEM, ETG runtime scans, and some imported-linear stellarator cases can move
materially under newer JAX/NumPy combinations even when the code still runs.
When investigating parity regressions, reproduce the issue on the tested stack
before changing solver logic.

For runtime-example parity reproduction across recent precision-policy changes,
also set ``JAX_ENABLE_X64=1``. Default precision can be faster, but it can also
shift parity-sensitive linear example outputs.
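
Before chasing a parity regression, it can help to confirm that the local
environment actually falls inside the version ranges listed in this section.
The following is a minimal standard-library sketch, not a tool shipped with the
package: ``TESTED_STACK``, ``in_tested_range``, and ``report_stack`` are
hypothetical names, and the check compares only ``major.minor`` while ignoring
pre-release ordering, so it is a coarse approximation of the full version
specifier semantics.

.. code-block:: python

   import re
   from importlib import metadata

   # (inclusive lower bound, exclusive upper bound) as (major, minor) tuples,
   # transcribed from the ranges listed above.
   TESTED_STACK = {
       "jax": ((0, 8), (0, 9)),
       "jaxlib": ((0, 8), (0, 9)),
       "numpy": ((2, 3), (2, 4)),
       "diffrax": ((0, 7), (0, 8)),
       "equinox": ((0, 13), (0, 14)),
   }

   def major_minor(version):
       """Extract (major, minor) from a version string such as '0.8.2'."""
       match = re.match(r"(\d+)\.(\d+)", version)
       if match is None:
           raise ValueError(f"unrecognized version string: {version!r}")
       return int(match.group(1)), int(match.group(2))

   def in_tested_range(version, low, high):
       """True when low <= (major, minor) < high."""
       return low <= major_minor(version) < high

   def report_stack():
       """Map each package to a short status string for the local environment."""
       status = {}
       for name, (low, high) in TESTED_STACK.items():
           try:
               installed = metadata.version(name)
           except metadata.PackageNotFoundError:
               status[name] = "not installed"
               continue
           ok = in_tested_range(installed, low, high)
           status[name] = f"{installed} ({'tested' if ok else 'outside tested range'})"
       return status

Printing ``report_stack()`` alongside a parity failure makes it easier to tell
whether a moved benchmark reflects a solver change or a dependency change.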

Stress-matrix parity gates
--------------------------

In addition to unit/regression tests, SPECTRAX-GK includes a small set of
"stress-matrix" gates meant to catch parity regressions early (before tracked
benchmark figures move):

- **Restart parity**: ``tests/test_restart_gate.py`` verifies that a nonlinear
  run resumed from a compatible restart reproduces the same final state as a
  continuous run. This now covers both the raw binary state path and the
  nonlinear ``*.restart.nc`` bundle path, together with append-on-restart
  history preservation in ``*.out.nc``.
- **CPU/GPU short-window parity** (optional): ``tests/test_device_parity_gate.py``
  compares a short nonlinear trajectory norm on CPU vs GPU. Enable explicitly:

  .. code-block:: bash

     SPECTRAXGK_DEVICE_PARITY=1 pytest -q tests/test_device_parity_gate.py

- **VMEC roundtrip determinism** (optional): ``tests/test_vmec_roundtrip_gate.py``
  regenerates an ``*.eik.nc`` from a provided VMEC file twice and asserts the
  imported geometry arrays are bitwise identical. Enable explicitly:

  .. code-block:: bash

     SPECTRAXGK_VMEC_FILE=/path/to/wout.nc pytest -q tests/test_vmec_roundtrip_gate.py
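
The idea behind the restart parity gate can be shown on a toy problem: a run
that is interrupted, serialized, reloaded, and continued must end in exactly
the same state as an uninterrupted run. The sketch below uses a small
hand-written RK4 integrator and an in-memory NumPy buffer as a stand-in for a
restart file. The ODE, the function names, and the step counts are assumptions
for illustration and are unrelated to the solver and file formats exercised by
``tests/test_restart_gate.py``.

.. code-block:: python

   import io
   import numpy as np

   def rhs(state):
       """Toy nonlinear right-hand side (illustrative only)."""
       x, y = state
       return np.array([y - x**3, -x - 0.1 * y])

   def rk4_step(state, dt):
       k1 = rhs(state)
       k2 = rhs(state + 0.5 * dt * k1)
       k3 = rhs(state + 0.5 * dt * k2)
       k4 = rhs(state + dt * k3)
       return state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

   def run(state, nsteps, dt=0.01):
       for _ in range(nsteps):
           state = rk4_step(state, dt)
       return state

   def save_and_reload(state):
       """Round-trip the state through a serialized buffer (the 'restart')."""
       buffer = io.BytesIO()
       np.save(buffer, state)
       buffer.seek(0)
       return np.load(buffer)

   initial = np.array([1.0, 0.0])
   continuous = run(initial, 200)
   resumed = run(save_and_reload(run(initial, 100)), 100)
   restart_matches = np.array_equal(continuous, resumed)

The comparison is exact rather than tolerance-based, so the gate fails if the
restart path drops precision or omits any piece of state that the integrator
carries between steps.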

For developer workflows that require local reference benchmark NetCDFs or dump
artifacts, use:

- ``tools/run_gx_linear_stress_matrix.py`` (KAW, Cyclone kinetic electrons, KBM Miller)
- ``tools/run_imported_linear_targeted_audit.py`` (generic per-``ky`` targeted imported-linear wrapper)
- ``tools/compare_gx_imported_window.py`` (exact imported-linear one-window replay against reference ``diag_state`` dumps)
- ``tools/run_kbm_lowky_extractor_audit.py`` (direct cached-trajectory KBM low-``ky`` extractor audit)
- ``tools/run_exact_state_audit.py`` (manifest-driven wrapper around the exact-state audit tools)
- ``tools/plot_w7x_exact_state_audit.py`` (no-rerun W7-X exact-state convention audit panel)
- ``tools/run_restart_parity_gate.py`` (manifest-driven nonlinear restart/continuation parity gate)
- ``tools/run_device_parity_gate.py`` (manifest-driven CPU/GPU short-window parity gate)
- ``tools/run_vmec_roundtrip_gate.py`` (manifest-driven VMEC ``vmec -> eik.nc`` determinism gate)

The full-GK nonlinear ETG lane is now explicitly tracked as a pilot
runtime contract via
``examples/nonlinear/axisymmetric/runtime_etg_nonlinear.toml``. That lane is
separate from the reduced ``cETG`` solver and should be used for future
GX-backed nonlinear ETG parity work.

For ETG nonlinear audit runs, use dense short-window overrides first:

.. code-block:: bash

   JAX_ENABLE_X64=1 spectrax-gk examples/nonlinear/axisymmetric/runtime_etg_nonlinear.toml \
     --steps 10 \
     --sample-stride 1 \
     --diagnostics-stride 1

This lane is currently expensive enough that short persisted windows are the
right first diagnostic step before attempting long production horizons.

The ETG short-window startup mismatch was traced to the GX input contract, not
the nonlinear ETG operator. GX reads ``init_single`` from ``[Expert]`` rather
than ``[Initialization]``, so the audited GX pilot was actually using the
Gaussian startup branch. The shipped runtime ETG pilot now matches that
contract with ``gaussian_init = true``, ``init_single = false``,
``Lx = 1.25``, and GX-style ``kz`` hypercollisions. On the matched
``Nx=10``, ``Ny=22``, ``ntheta=16``, ``Nl=4``, ``Nm=4``, ``dt=1e-4``,
``t_max=0.001`` pilot, the refreshed short-window comparison lands at
``mean_rel_abs(Wg) ~= 1.31e-2`` and ``mean_rel_abs(Wphi) ~= 5.18e-3``, with
the final heat-flux point within a few percent of GX.

The targeted imported-linear wrapper and the underlying
``compare_gx_imported_linear.py`` comparator now support two important controls
for honest stress-lane scoring without changing the default full-window
behavior:

- ``--sample-step-stride``: subsample the saved diagnostic sample indices
  before scoring.
- ``--max-samples``: truncate scoring to the first N selected samples.
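
The combined effect of the two controls can be pictured as a stride followed by
a truncation over the saved sample indices. This is a hypothetical sketch that
assumes the stride is applied before the truncation, as the option descriptions
suggest; ``select_samples`` is not a function from the comparator.

.. code-block:: python

   import numpy as np

   def select_samples(sample_indices, sample_step_stride=1, max_samples=None):
       """Subsample by stride, then keep at most the first max_samples entries."""
       selected = np.asarray(sample_indices)[::sample_step_stride]
       if max_samples is not None:
           selected = selected[:max_samples]
       return selected

   # Ten saved samples, every third one, first two kept: indices 0 and 3.
   picked = select_samples(np.arange(10), sample_step_stride=3, max_samples=2)

With the defaults, every saved sample is scored, which corresponds to the
unchanged full-window behavior.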

The lower-level comparator also supports ``--cache-dir`` plus ``--reuse-cache``
to persist per-``ky`` trajectory/result arrays (``gamma``, ``omega``,
``Wg``, ``Wphi``, ``Wapar``) as compressed ``.npz`` files keyed by the actual
reference file, geometry file, reference input, selected ``ky``, Hermite/Laguerre
resolution, mode selector, and sample-window contract. This makes the
stress-lane tooling incremental instead of rerunning a full lane every time.
It now also writes absolute diagnostic-error columns and the reference
``|gamma|`` / ``|omega|`` scales alongside the relative metrics. That matters
for near-marginal imported-linear stellarator lanes such as HSX, where
``mean_rel_gamma`` can look large simply because the reference growth rate is
close to zero even while the absolute growth-rate mismatch and the field-energy
diagnostics remain small.
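
The incremental caching pattern described above can be sketched as a
content-keyed ``.npz`` store: hash everything that defines the run contract,
and recompute only when no file exists for that hash. This is a minimal
illustration under stated assumptions. The names ``cache_key`` and
``load_or_compute``, the ``ky_<hash>.npz`` file naming, and the use of a JSON
payload are hypothetical and do not describe the comparator's actual key
format.

.. code-block:: python

   import hashlib
   import json
   from pathlib import Path

   import numpy as np

   def cache_key(contract):
       """Stable short hash of a JSON-serializable run contract."""
       payload = json.dumps(contract, sort_keys=True).encode("utf-8")
       return hashlib.sha256(payload).hexdigest()[:16]

   def load_or_compute(cache_dir, contract, compute, reuse_cache=True):
       """Return cached arrays for this contract, computing them on a miss."""
       path = Path(cache_dir) / f"ky_{cache_key(contract)}.npz"
       if reuse_cache and path.exists():
           with np.load(path) as data:
               return {name: data[name] for name in data.files}
       arrays = compute()  # expected to return a dict of name -> ndarray
       path.parent.mkdir(parents=True, exist_ok=True)
       np.savez_compressed(path, **arrays)
       return arrays

Because the key covers every input that can change the result, editing any one
of them (for example the selected ``ky`` or the sample window) produces a new
cache entry instead of silently reusing a stale trajectory.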

For VMEC-backed exact-state audits, the runtime bridge now prefers a local
``booz_xform_jax`` checkout and injects a temporary ``booz_xform`` compatibility
shim only into the external geometry-helper subprocess. This preserves the
audited reference workflow while avoiding a host-level dependency on the original ``booz_xform``
Python package.

The bridge auto-discovers ``booz_xform_jax`` from
``BOOZ_XFORM_JAX_PATH`` / ``SPECTRAX_BOOZ_XFORM_JAX_PATH`` or from a checkout placed
next to the SPECTRAX-GK workspace. When a specific
Python environment is needed for the helper subprocesses, set
``geometry.gx_python`` in the runtime TOML. On ``office``, the normal audited
path is:

.. code-block:: bash

   export BOOZ_XFORM_JAX_PATH=/path/to/booz_xform_jax
   export SPECTRAX_VENV_PYTHON=/path/to/venv/bin/python
   export SPECTRAX_OFFICE_ROOT=/path/to/SPECTRAX-GK
   W7X_VMEC_FILE=/path/to/wout_w7x.nc \
   HSX_VMEC_FILE=/path/to/wout_HSX_QHS_vac.nc \
   "$SPECTRAX_VENV_PYTHON" tools/run_exact_state_audit.py \
     --manifest tools/exact_state_lanes.office.toml \
     --outdir tools_out/exact_state_audit_office

The tracked ``office`` manifest now pins these audit lanes to
``JAX_PLATFORMS=cpu``. These are parity/reference jobs, not performance runs,
and CPU pinning avoids spurious GPU ``RESOURCE_EXHAUSTED`` failures when
``booz_xform_jax`` or grid-default assembly would otherwise grab a busy device.

The restart/continuation gate uses the same environment model and should be
run against the tracked nonlinear lanes with ``PYTHONPATH`` set to the source
tree so the office venv does not pick up a stale installed package:

.. code-block:: bash

   PYTHONPATH="$SPECTRAX_OFFICE_ROOT/src" \
   "$SPECTRAX_VENV_PYTHON" tools/run_restart_parity_gate.py \
     --manifest tools/restart_gate_lanes.office.toml \
     --outdir tools_out/restart_parity_office

The ``office`` exact-state manifest currently includes:

- startup audits for Cyclone, KBM, W7-X, and HSX
- late dumped-state audits for Cyclone Miller, Cyclone runtime, W7-X, and KBM

The tracked W7-X exact-state convention panel is generated by
``tools/plot_w7x_exact_state_audit.py`` from the ``office`` W7-X startup and
late diagnostic-state dumps. It closes the VMEC geometry, Fourier-grid,
fieldsolve, and scalar-diagnostic convention layer against GX with a
``1e-4`` pointwise relative-error gate: startup ``g_state``/``phi`` are below
``7.4e-7``, late ``kperp2``/``fluxfac``/``kx``/``ky``/``phi`` arrays have
maximum finite relative error ``4.62e-5`` with ``phi`` RMS relative error
``3.77e-7``, and late scalar diagnostics are below ``1.8e-7``. This panel is
not a replacement for the open W7-X zonal-response literature lane; it rules
out the geometry/diagnostic convention layer as the source of that separate
recurrence/damping-envelope mismatch.
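
A pointwise relative-error gate of this kind can be sketched as follows. The
example assumes that "maximum finite relative error" means the largest
:math:`|a - b| / |b|` over entries where that ratio is finite, and that the RMS
relative error is :math:`\lVert a - b \rVert_2 / \lVert b \rVert_2`. Both
definitions, and the name ``pointwise_gate``, are assumptions for illustration
rather than the formulas used by ``tools/plot_w7x_exact_state_audit.py``.

.. code-block:: python

   import numpy as np

   def pointwise_gate(code, ref, tol=1e-4):
       """Compare two arrays (real or complex) against a relative tolerance."""
       code = np.asarray(code)
       ref = np.asarray(ref)
       # Entries with ref == 0 give inf or nan and are dropped below.
       with np.errstate(divide="ignore", invalid="ignore"):
           rel = np.abs(code - ref) / np.abs(ref)
       finite = rel[np.isfinite(rel)]
       max_rel = float(finite.max()) if finite.size else 0.0
       # Assumes the reference array is not identically zero.
       rms_rel = float(np.linalg.norm(code - ref) / np.linalg.norm(ref))
       return {"max_finite_rel": max_rel, "rms_rel": rms_rel, "passed": max_rel < tol}

Reporting both numbers is useful because a single small-amplitude grid point
can dominate the maximum while leaving the RMS error orders of magnitude lower.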

.. figure:: _static/w7x_exact_state_audit.png
   :alt: W7-X nonlinear exact-state convention audit against GX

   W7-X nonlinear exact-state convention audit. Startup state, late dumped
   geometry/field arrays, and re-evaluated scalar diagnostics are compared
   directly against GX dumps from the same VMEC equilibrium and nonlinear
   runtime contract.

For KBM specifically, the startup audit, late dumped-state audit, nonlinear
term replay, and first RK4 partial-step replay now all close on the shipped
nonlinear config for the current release pass. The remaining KBM work is
therefore future long-window cleanup rather than a blocking startup-state,
diagnostic-reconstruction, or first-step assembly mismatch.

The device-parity gate now has audited ``office`` manifests for one tokamak and
one stellarator lane, both requiring stable nonzero outputs rather than the
older zero-norm smoke probe:

.. code-block:: bash

   PYTHONPATH="$SPECTRAX_OFFICE_ROOT/src" \
   "$SPECTRAX_VENV_PYTHON" tools/run_device_parity_gate.py \
     --manifest tools/device_parity_lanes.office.toml \
     --outdir tools_out/device_parity_office

The VMEC roundtrip gate uses the same manifest pattern and currently covers the
tracked W7-X and HSX VMEC lanes:

.. code-block:: bash

   PYTHONPATH="$SPECTRAX_OFFICE_ROOT/src" \
   "$SPECTRAX_VENV_PYTHON" tools/run_vmec_roundtrip_gate.py \
     --manifest tools/vmec_roundtrip_lanes.office.toml \
     --outdir tools_out/vmec_roundtrip_office

If the helper must be forced to another interpreter, set ``geometry.gx_python``
in the runtime TOML used by the audit and rerun the same command. The old
environment-variable override is no longer documented because the preferred
path is the internal ``booz_xform_jax`` backend.

CI split: fast PR vs manual full
--------------------------------

CI is split into three tiers to keep pull requests fast while preserving full
physics rigor:

- **Fast PR/push tier**: the quick-test matrix runs mypy and targeted test
  subsets across fundamentals, release artifacts, linear core, runtime,
  nonlinear, and parallel/autodiff contracts. This catches solver and dtype
  regressions quickly.
- **Wide coverage tier**: CI runs the 48 top-level coverage shards as a matrix,
  uploads the per-shard ``coverage.py`` data, then combines the artifacts in one
  final ``wide-coverage`` check that enforces the package-wide ``>=95%`` target.
  The same helper, ``tools/run_wide_coverage_gate.py``, is used locally and in
  CI so the threshold is not weakened when the job is parallelized. Each shard
  has its own timeout so a single slow validation slice cannot become an
  unbounded release job. The combine step also requires labeled coverage data
  for every CI shard and writes ``coverage-wide-shard-manifest.json`` before
  refreshing the package-wide Codecov flag.
  Optional VMEC/Boozer artifact builders remain validated by their tracked
  offline artifact gates and mocked CI contracts, not by importing unavailable
  external repositories in the public coverage job.
- **Manual full tier**: full ``pytest`` suite plus strict coverage gates:
  ``spectraxgk.terms >= 90%`` and per-module core gates for
  ``linear_krylov.py`` and ``diffrax_integrators.py``.

This keeps iteration latency low for development and still enforces complete
coverage and regression checks on demand without relying on scheduled runners.

For bounded local feedback, use the per-file runner:

.. code-block:: bash

   python tools/run_tests_fast.py

It enforces both a per-file timeout and a whole-run timeout of 300 seconds by
default, then reports any remaining files as ``not_run(total_timeout)`` instead
of leaving orphaned pytest children. Use ``--total-timeout 0`` only for an
explicit full sequential local pass.
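
The two-level timeout behavior can be illustrated with a short standard-library
sketch. It is not the implementation of ``tools/run_tests_fast.py``: the
function name ``run_bounded``, the status strings other than
``not_run(total_timeout)``, and the choice to make the per-file limit a
required argument are assumptions for the example.

.. code-block:: python

   import subprocess
   import sys
   import time

   def run_bounded(commands, per_file_timeout, total_timeout=300.0):
       """Run labeled commands in order under per-command and whole-run limits.

       A total_timeout of 0 disables the whole-run limit.
       """
       deadline = time.monotonic() + total_timeout if total_timeout > 0 else None
       results = {}
       for label, argv in commands.items():
           remaining = None if deadline is None else deadline - time.monotonic()
           if remaining is not None and remaining <= 0:
               # Budget exhausted: report the file instead of starting it.
               results[label] = "not_run(total_timeout)"
               continue
           limit = per_file_timeout if remaining is None else min(per_file_timeout, remaining)
           try:
               # subprocess.run kills and reaps the direct child on timeout.
               proc = subprocess.run(argv, timeout=limit)
               results[label] = "passed" if proc.returncode == 0 else "failed"
           except subprocess.TimeoutExpired:
               results[label] = "timeout"
       return results

   # Hypothetical usage with one pytest process per test file:
   # run_bounded(
   #     {"tests/test_restart_gate.py":
   #          [sys.executable, "-m", "pytest", "-q", "tests/test_restart_gate.py"]},
   #     per_file_timeout=120.0,
   # )

Note that ``subprocess.run`` only terminates the direct child process; a runner
that must also clean up grandchildren would need process groups or similar,
which this sketch does not attempt.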

The same wide gate can be run locally in one process with:

.. code-block:: bash

   python tools/run_wide_coverage_gate.py \
     --shards 48 \
     --timeout 300 \
     --fail-under 95 \
     --pytest-arg=-o \
     --pytest-arg=addopts= \
     --pytest-arg=-m \
     --pytest-arg="not slow"

On local machines where every pytest process must stay below the five-minute
release timeout, run one shard at a time and combine afterward. This is the
same data flow used by CI, except CI runs the ``--only-shard`` jobs in
parallel and downloads the resulting coverage artifacts before the
``--combine-only`` gate:

.. code-block:: bash

   python -m coverage erase
   for shard in $(seq 1 48); do
     python tools/run_wide_coverage_gate.py \
       --shards 48 \
       --timeout 300 \
       --only-shard "${shard}" \
       --keep-existing-coverage \
       --skip-combine \
       --pytest-arg=-o \
       --pytest-arg=addopts= \
       --pytest-arg=-m \
       --pytest-arg="not slow"
   done
   python tools/run_wide_coverage_gate.py \
     --shards 48 \
     --combine-only \
     --fail-under 95 \
     --pytest-arg=-o \
     --pytest-arg=addopts= \
     --pytest-arg=-m \
     --pytest-arg="not slow"

Core modular coverage gate
--------------------------

To keep the modular RHS path future-proof, CI also enforces a dedicated
coverage gate for ``spectraxgk.terms``:

.. code-block:: bash

   pytest -q tests/test_terms_assembly.py \
          tests/test_terms_operators.py \
          tests/test_terms_fields.py \
          tests/test_terms_integrators.py \
          tests/test_terms_validation.py \
          --maxfail=1 --disable-warnings \
          --cov=src/spectraxgk/terms \
          --cov-fail-under=90

This guard ensures term-wise kernels, field solves, custom-VJP behavior, and
assembly plumbing stay highly covered while the rest of the benchmark and
cross-code harness keeps evolving.

Core solver coverage gates
--------------------------

CI also enforces dedicated per-module thresholds for the two linear solver
engines that are most likely to regress during algorithm work:

- ``spectraxgk.linear_krylov`` (matrix-free Arnoldi/shift-invert path)
- ``spectraxgk.diffrax_integrators`` (explicit/IMEX/implicit diffrax path)

The gate runs focused tests and then checks each module's line coverage as
recorded in ``coverage-core.xml``:

.. code-block:: bash

   pytest -q tests/test_linear_krylov_core.py \
          tests/test_diffrax_integrators.py \
          tests/test_diffrax_integrators_core.py \
          --maxfail=1 --disable-warnings \
          --cov=src/spectraxgk \
          --cov-report=xml:coverage-core.xml

Both modules are required to stay at or above 90% line coverage in CI.
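
A per-module threshold check over a Cobertura-style report such as
``coverage-core.xml`` can be sketched with the standard library. The sample XML
below is synthetic and heavily trimmed, the coverage numbers in it are made up,
and the names ``module_line_rates`` and ``check_modules`` are hypothetical; the
checker used in CI is not shown in this document and may work differently.

.. code-block:: python

   import xml.etree.ElementTree as ET

   def module_line_rates(xml_text):
       """Map each reported file name to its line-rate (0.0 to 1.0)."""
       root = ET.fromstring(xml_text)
       return {
           cls.get("filename"): float(cls.get("line-rate"))
           for cls in root.iter("class")
       }

   def check_modules(xml_text, required, threshold=0.90):
       """Return {module: rate or None} for modules below threshold or missing."""
       rates = module_line_rates(xml_text)
       failures = {}
       for suffix in required:
           matches = [rate for name, rate in rates.items() if name.endswith(suffix)]
           if not matches or min(matches) < threshold:
               failures[suffix] = min(matches) if matches else None
       return failures

   # Synthetic report fragment for illustration only.
   SAMPLE = """<coverage><packages><package name="spectraxgk"><classes>
   <class name="linear_krylov.py" filename="spectraxgk/linear_krylov.py" line-rate="0.95"/>
   <class name="diffrax_integrators.py" filename="spectraxgk/diffrax_integrators.py" line-rate="0.88"/>
   </classes></package></packages></coverage>"""

   failures = check_modules(SAMPLE, ["linear_krylov.py", "diffrax_integrators.py"])

For a report on disk, ``ET.parse("coverage-core.xml").getroot()`` can replace
the ``ET.fromstring`` call. Treating a missing module as a failure keeps the
gate from passing silently when a file is renamed or dropped from the report.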