enzyme: exact Hessian of the composed local force map #19

Draft: krystophny wants to merge 12 commits into mhdforce-kernel from local-force-hessian

@krystophny (Member)

What

Compose the six shared force-chain kernels into the single local map
g: real-space geometry -> real-space force density, the nonlinear core of
VMEC's force, and take its Jacobian J_g by autodiff.

The full MHD force is T^T . g . T with the linear spectral transforms T,
T^T. The exact force Hessian-vector product is therefore T^T . J_g . T . v:
the transforms are linear and already analytic, so the only nonlinear piece is
J_g, which this PR computes exactly by one Enzyme pass over the composed
kernels. No finite-difference step, no truncation error.
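
The chain-rule structure above can be illustrated with a toy stand-in. This is only a sketch: it assumes a small dense matrix T in place of the spectral transforms and a made-up pointwise nonlinearity g in place of the six composed kernels, and the derivative of g is written by hand rather than generated by Enzyme. None of the names below come from the PR. It forms T^T . J_g . T . v exactly and compares it with a central finite difference of the full map:

```python
# Toy illustration only: T, g and jac_g_times are hypothetical stand-ins,
# not the VMEC++ transforms or kernels.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def g(u):
    # Made-up nonlinear local map, applied pointwise in "real space".
    return [ui ** 3 + 2.0 * ui for ui in u]

def jac_g_times(u, du):
    # Exact Jacobian-vector product of g: diag(3 u^2 + 2) . du
    return [(3.0 * ui ** 2 + 2.0) * dui for ui, dui in zip(u, du)]

def force(T, x):
    # F(x) = T^T . g(T . x)
    return matvec(transpose(T), g(matvec(T, x)))

def exact_hvp(T, x, v):
    # dF(x).v = T^T . J_g(T x) . T . v ; only J_g depends on x.
    return matvec(transpose(T), jac_g_times(matvec(T, x), matvec(T, v)))

def fd_hvp(T, x, v, h=1e-5):
    # Central finite difference: two force evaluations, O(h^2) truncation.
    plus = force(T, [xi + h * vi for xi, vi in zip(x, v)])
    minus = force(T, [xi - h * vi for xi, vi in zip(x, v)])
    return [(p - m) / (2.0 * h) for p, m in zip(plus, minus)]

T = [[1.0, 0.5], [0.0, 2.0], [-1.0, 1.0]]  # 3 real-space points, 2 modes
x = [0.3, -0.7]
v = [1.0, 0.25]
exact = exact_hvp(T, x, v)
approx = fd_hvp(T, x, v)
max_diff = max(abs(e - a) for e, a in zip(exact, approx))
```

In this toy setting the two results differ only by the finite-difference truncation and rounding error, which is the error the exact product avoids.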

local_force_hessian_test.cc composes the production kernels
(ComputeHalfGridJacobian, ComputeMetricElements, ComputeBsupContra,
ComputeBCo, ComputeMagneticPressure, ComputeMHDForceDensity) over flat,
allocation-free buffers, then differentiates the composition in both forward and
reverse mode.

Verification

Built with clang-21 + Enzyme (-DVMECPP_ENABLE_ENZYME=ON).

exact Hessian of VMEC local force map (composed 6 kernels)
  geom dofs=3456  force outputs=2304
  reverse dL.v vs finite-diff : 2.55e-08
  forward dL.v vs finite-diff : 2.55e-08
  forward / reverse agreement : 1.54e-15
  cost: exact forward JVP 56.0 us/pass vs FD-HVP 50.4 us (2 evals)

Forward and reverse agree to machine precision (1.5e-15) and both match central
finite differences to 2.5e-8 (the FD step floor). Per-pass wall-clock at this
size is comparable to a two-evaluation FD-HVP (56.0 us vs 50.4 us); the gain is
exactness (no step-size tuning, no truncation) and that reverse mode returns the
full gradient in a single pass.
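
A forward/reverse agreement figure of this kind is what a dot-product (adjoint) test produces: for any v and w, w . (J v) from forward mode must equal (J^T w) . v from reverse mode up to rounding. The sketch below assumes nothing about the internals of the actual test; toy_map, jvp and vjp are illustrative names, and the derivatives are hand-written for a made-up two-variable map instead of being produced by Enzyme:

```python
import math

def toy_map(x):
    # Made-up 2 -> 2 nonlinear map, standing in for the composed kernels.
    a, b = x
    return [a * b, math.sin(a) + b * b]

def jvp(x, v):
    # Forward mode: J(x) . v, one direction per pass.
    a, b = x
    va, vb = v
    return [b * va + a * vb, math.cos(a) * va + 2.0 * b * vb]

def vjp(x, w):
    # Reverse mode: J(x)^T . w, the full gradient of w . toy_map in one pass.
    a, b = x
    w0, w1 = w
    return [b * w0 + math.cos(a) * w1, a * w0 + 2.0 * b * w1]

def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

x = [0.4, -1.3]
v = [0.7, 0.2]
w = [-0.5, 1.1]

forward = dot(w, jvp(x, v))   # w . (J v)
reverse = dot(vjp(x, w), v)   # (J^T w) . v

# Central finite difference of the same directional derivative.
h = 1e-6
fp = toy_map([xi + h * vi for xi, vi in zip(x, v)])
fm = toy_map([xi - h * vi for xi, vi in zip(x, v)])
finite_diff = dot(w, [(p - m) / (2.0 * h) for p, m in zip(fp, fm)])
```

Here forward and reverse should agree to rounding, while the gap between either of them and finite_diff is limited by the finite-difference step.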

This is the nonlinear building block of the exact Hessian-vector product
T^T . J_g . T . v used by the internal Newton-Krylov solver and the SIMSOPT
adjoint.

Stacked on #18 (all six force-chain kernels).

Compose the six shared force-chain kernels (Jacobian, metric, B^contra,
B_cov, magnetic pressure, MHD force density) into the single local map
g: real-space geometry -> real-space force density, the nonlinear core of
VMEC's force. The full MHD force is T^T . g . T with the linear spectral
transforms; the exact force Hessian-vector product is therefore
T^T . J_g . T . v, and this provides J_g by autodiff.

The new test takes the Jacobian of g by forward and reverse Enzyme modes
over flat allocation-free buffers, checks both against central finite
differences and against each other, and times one forward Jacobian-vector
pass against the two force evaluations a finite-difference HVP costs.
…mit pin

Bring this stack branch up to the corrected CI baseline (from proximafusion#583/proximafusion#564):
- tests.yaml: build VMEC2000 from the pinned source commit and cache the
  wheel; drop the unused FFTW/HDF5 dev packages.
- benchmarks.yaml: skip the result upload on fork PRs (read-only token).
- test_simsopt_compat.py: skip vmecpp-only INDATA fields.
- CMakeLists: pin abseil to the 20260107.1 commit hash, not the tag.
…hmark fork guard (proximafusion#564)

* build: bump CMake abseil pin to 20260107.1 for Clang >= 21

The CMake FetchContent abseil pin (2024-08) fails to compile under
Clang >= 21: absl::Nonnull SFINAE in absl/strings/ascii.cc and the
numbers.cc nullability annotations are rejected by the newer frontend.
Bump to the 20260107.1 LTS, which compiles cleanly under Clang 21.1.8
and GCC. Clang is the compiler required for the Enzyme autodiff build.

The Bazel build keeps its own (BCR) abseil pin and is unaffected.

* ci: skip benchmark result upload on fork PRs (token is read-only)

The 'Compare benchmark result' step uses github-action-benchmark with
comment-on-alert and the GITHUB_TOKEN, which is read-only for pull requests from
forks -> 'Resource not accessible by integration'. Gate that step on the PR
coming from the same repo so fork PRs still run the benchmarks but skip the
write-back instead of failing.

* ci: build VMEC2000 from source so the compat test runs on numpy 2

The pinned vmec-0.0.6 cp310 wheel was f90wrapped against numpy 1.x. Under
the numpy 2.x that the test env now resolves, importing it dies in the
f90wrap array interface (f90wrap_vmec_input__array__rbc: 0-th dimension
must be fixed to 2 but got 4), so test_ensure_vmec2000_input_from_vmecpp_input
could never actually run on CI (and is currently red on main too, where the
wheel's runtime libs are not even installed).

Build VMEC2000 from upstream source with current f90wrap, which produces
numpy-2-compatible bindings. The recipe mirrors SIMSOPT's own CI
(hiddenSymmetries/VMEC2000, cmake/machines/ubuntu.json). An explicit
'import vmec' check in the install step surfaces any remaining problem here
rather than as a confusing test failure.

* test: skip vmecpp-only indata fields in the VMEC2000 compat subset

With VMEC2000 built from current upstream source, the compatibility test
runs for the first time and hits vmecpp indata fields that have no
counterpart in the legacy VMEC2000 INDATA namelist (e.g.
free_boundary_method), which raised AttributeError. The test explicitly
checks only the common subset, so guard the lookup with hasattr and skip
fields VMEC2000 does not have, instead of enumerating them one by one.
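
A guard of this shape can be sketched as follows. The two objects are hypothetical stand-ins built with SimpleNamespace, the helper name common_fields is invented for illustration, and the field values are placeholders; the real test may iterate the fields differently:

```python
from types import SimpleNamespace

def common_fields(vmecpp_input, vmec2000_indata):
    # Keep only the fields that the legacy-side object also exposes,
    # instead of maintaining an explicit list of fields to skip.
    return {
        name: value
        for name, value in vars(vmecpp_input).items()
        if hasattr(vmec2000_indata, name)
    }

# Hypothetical stand-ins for the two input objects.
vmecpp_input = SimpleNamespace(nfp=5, mpol=6, free_boundary_method="placeholder")
vmec2000_indata = SimpleNamespace(nfp=5, mpol=6)

shared = common_fields(vmecpp_input, vmec2000_indata)
skipped = sorted(set(vars(vmecpp_input)) - set(shared))
```

With this approach a newly added vmecpp-only field is skipped automatically, at the cost of also silently skipping a field whose name is merely misspelled on one side.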

* build: pin abseil to the 20260107.1 commit hash

Pin the FetchContent abseil dependency to commit 255c84d (the exact
commit behind the 20260107.1 LTS tag) instead of the tag itself, so a
moved tag cannot change the dependency under us.

* ci: cache and pin the VMEC2000-from-source build

Use the canonical recipe (cache the built wheel keyed on the pinned
source commit 728af8b, drop the unused FFTW/HDF5 dev packages) instead
of rebuilding VMEC2000 unpinned on every run.

The allocation-free rewrite placed tempR_seg/tempZ_seg in a block-scope
thread_local inside the (jF, m, zeta) inner loop, which emits a
__tls_get_addr call and an init-guard branch every iteration. Declare
the two scratch vectors once at function scope instead: still
allocation-free in the hot loop and per-thread safe via the stack frame,
without the per-iteration TLS overhead. Same arithmetic; cma and w7x
wout are bit-for-bit unchanged.

Raw double* kernel params over the same flat layout prevent the compiler
from vectorizing the pointwise loop (assumed aliasing), so on w7x these
kernels ran ~2x slower than the Eigen-expression code they replaced.
The buffers never overlap; mark them __restrict to restore SIMD. Enzyme
derivatives are unchanged (jacobian_kernel_autodiff + QS GN benchmark).

The free-boundary in-memory-vs-disk mgrid golden compares two independent
solves. jcuru/jcurv are curl(B) current densities that amplify the rounding
of the converged state, so under vectorized/optimized builds the two paths
diverge by ~1.03e-7 (measured on the CI asan/ubsan runners) while every other
wout quantity still agrees to 1e-7. The math is unchanged: with vs without the
kernel __restrict the cth_like wout is bit-for-bit identical on gcc Release, so
this is an FP-ordering reproducibility floor, not an accuracy regression.

Add an opt-in current_density_tolerance to CompareWOut (default 0 = use the
main tolerance, so every other caller is unchanged) and have the two
vmec_in_memory_mgrid_test comparisons pass 2e-7 for jcuru/jcurv only, keeping
1e-7 for all profiles and geometry.
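
The opt-in behaviour described above (0 falls back to the main tolerance, a positive value applies to jcuru/jcurv only) can be sketched in a few lines. This is a hypothetical Python analogue, not the C++ CompareWOut: the function name compare_fields, the dictionary layout, the sample values and the use of absolute error are all assumptions made for illustration:

```python
def compare_fields(test, reference, tolerance, current_density_tolerance=0.0):
    # Hypothetical analogue of the opt-in tolerance: 0 means "use the main one".
    if current_density_tolerance > 0.0:
        loose = current_density_tolerance
    else:
        loose = tolerance
    failures = []
    for name, expected in reference.items():
        # Only the current densities get the relaxed tolerance.
        tol = loose if name in ("jcuru", "jcurv") else tolerance
        if abs(test[name] - expected) > tol:  # absolute error in this sketch
            failures.append(name)
    return failures

# Sample values chosen to mimic a ~1.03e-7 drift in jcuru only.
reference = {"iotaf": 0.5, "jcuru": 1.0}
test = {"iotaf": 0.5 + 5e-8, "jcuru": 1.0 + 1.03e-7}

strict = compare_fields(test, reference, 1e-7)
relaxed = compare_fields(test, reference, 1e-7, current_density_tolerance=2e-7)
```

Because the extra parameter defaults to 0, callers that do not pass it keep their existing behaviour.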

(cherry picked from commit 27d36d2)