
Landau Damping + FFT periodic solver crashes on Lumi #481

@s-mayani

Description

The Landau Damping test case with the FFT periodic Poisson solver crashes on LUMI since the January 2026 software update (it ran without issues in December). This only happens for certain problem sizes and node counts.

Specifically, it happens for:

  • 1024^3 grid points, with 8 particles per cell, 10 timesteps
  • 8, 16, 32, 64, 128 nodes (x8 to get number of GPUs)
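For reference, the total particle count passed to the executable in the jobscript below follows directly from these numbers; a quick sanity check of the arithmetic (illustration only):

```python
# Total particles = grid points * particles per cell.
grid = 1024 ** 3        # 1024^3 grid points
ppc = 8                 # particles per cell
total_particles = grid * ppc
print(total_particles)  # 8589934592, the count passed to LandauDamping

# Each node provides 8 GPUs (one MPI rank per GPU in the jobscript).
for nodes in (8, 16, 32, 64, 128):
    print(nodes, "nodes ->", nodes * 8, "GPUs")
```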

The error is: Memory access fault by GPU node-5 (Agent handle: 0x745fc0) on address 0x147299919000. Reason: Unknown.

The cause is unknown, and the crash does not seem to be reproducible with the heFFTe benchmarks at the same problem size. Furthermore, the simulation still runs for several timesteps (e.g. 6 or 8) before crashing with this error, so initially the run appears healthy. The same problem sizes and node counts ran without this error on LUMI before the update.

To reproduce it, here are the modules I load on LUMI:

module load LUMI/25.03 partition/G cpeAMD rocm buildtools/25.03

Then clone IPPL master, create a build directory, and run cmake with the following options:

cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_STANDARD=20 -DCMAKE_CXX_COMPILER=hipcc -DBUILD_SHARED_LIBS=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a -DCMAKE_HIP_FLAGS=--offload-arch=gfx90a -DKokkos_ARCH_ZEN3=ON -DKokkos_ARCH_AMD_GFX90A=ON -DKokkos_ENABLE_HIP=ON -DIPPL_PLATFORMS="HIP;OPENMP" -DIPPL_ENABLE_TESTS=ON -DIPPL_ENABLE_UNIT_TESTS=OFF -DIPPL_ENABLE_FFT=ON -DIPPL_ENABLE_SOLVERS=ON -DIPPL_ENABLE_ALPINE=ON -DHeffte_ENABLE_ROCM=ON -DHeffte_ENABLE_GPU_AWARE_MPI=ON -DIPPL_ENABLE_ALIGNED_COMMUNICATION_BUFFERS=ON -DCMAKE_EXE_LINKER_FLAGS="-L/opt/cray/pe/mpich/8.1.32/ofi/amd/6.0/lib -L/opt/cray/pe/mpich/8.1.32/gtl/lib -L/opt/cray/pe/libsci/25.03.0/AMD/6.0/x86_64/lib -L/opt/cray/pe/dsmml/0.3.0/dsmml//lib -L/opt/cray/xpmem/2.8.2-1.0_5.1__g84a27a5.shasta/lib64 -lsci_amd_mpi -lsci_amd -ldl -lmpi_amd -lmpi_gtl_hsa -ldsmml -lxpmem -L/opt/rocm-6.3.4/lib/lib -L/opt/rocm-6.3.4/lib/lib64 -L/opt/rocm-6.3.4/lib/llvm/lib"
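For completeness, a rough sketch of the clone-and-configure sequence described above (the repository URL and the make target name are assumptions; substitute your usual IPPL remote and target):

```shell
# Sketch only: clone IPPL master and configure a build directory.
# Repository URL is an assumption; use your usual IPPL remote.
git clone https://github.com/IPPL-framework/ippl.git
cd ippl
mkdir build && cd build
cmake .. <options as above>
# Target name assumed; builds the alpine executable used below.
make LandauDamping
```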

The jobscript I am using is:

#!/bin/bash
# ---
# sbatch settings for AMD Lumi
# ---
#SBATCH --job-name=landauFFT
#SBATCH --output=landauFFT_%j.out
#SBATCH --error=landauFFT_%j.err
#SBATCH --time=00:15:00
#SBATCH --nodes 8
#SBATCH --partition=standard-g
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --account=[account number]
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive

CPU_BIND="map_cpu:49,57,17,25,1,9,33,41"

export MPICH_GPU_SUPPORT_ENABLED=1

ulimit -s unlimited

mkdir -p data

# Threading and pinning
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=threads               # one place per hardware thread
export OMP_PROC_BIND=spread             # spread threads across the available places
# (optional, redundant for OpenMP backend)
export KOKKOS_NUM_THREADS=$SLURM_CPUS_PER_TASK

cat << EOF > select_gpu
#!/bin/bash

export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF
chmod +x ./select_gpu

export exedir=[your_exe_dir]

srun ./select_gpu ${exedir}/LandauDamping 1024 1024 1024 8589934592 10 FFT 1.0 LeapFrog --overallocate 2.0 --info 5

where the "exedir" should correspond to the alpine folder in the build directory of IPPL, where the LandauDamping executable can be found.
