Description
The Landau Damping test case with the FFT periodic Poisson solver crashes on Lumi since the January 2026 system update (the same runs completed without issues in December). This only happens for certain problem sizes and node counts.
Specifically, the crash occurs for:
- 1024^3 grid points, with 8 particles per cell, 10 timesteps
- 8, 16, 32, 64, 128 nodes (× 8 GPUs per node)
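For reference, these numbers determine the total particle count that appears in the srun line of the jobscript below: 1024^3 grid points × 8 particles per cell = 8,589,934,592 particles, which matches the 8589934592 argument passed to LandauDamping. A quick shell check (plain bash arithmetic, nothing IPPL-specific):
echo $(( 1024 * 1024 * 1024 * 8 ))   # prints 8589934592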
The error is: Memory access fault by GPU node-5 (Agent handle: 0x745fc0) on address 0x147299919000. Reason: Unknown.
The reason for this is unknown, and the crash does not seem to be reproducible with the heffte benchmarks at the same problem size. Furthermore, the simulation still runs for a number of timesteps (e.g. 6 or 8) before crashing with this error, so at first it looks like there is no problem. Such an error was not seen when running the same problem sizes and node counts on Lumi before the update.
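In case someone wants to repeat the standalone heFFTe comparison, a minimal sketch of running the bundled speed3d_c2c benchmark at the same transform size is below; the benchmark name and argument order follow the stock heFFTe benchmarks, and the backend tag and launch line should be adapted to the actual heFFTe build, so treat this as a sketch rather than the exact command:
srun ./speed3d_c2c rocfft double 1024 1024 1024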
To reproduce it, these are the modules I load on Lumi:
module load LUMI/25.03 partition/G cpeAMD rocm buildtools/25.03
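As an optional sanity check after loading the modules (just to confirm which compiler and ROCm the environment exposes; the linker flags below assume ROCm 6.3.4):
hipcc --version
module list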
Then clone the IPPL master branch, create a build directory, and run cmake with the following options:
cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_STANDARD=20 \
  -DCMAKE_CXX_COMPILER=hipcc \
  -DBUILD_SHARED_LIBS=ON \
  -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DCMAKE_HIP_FLAGS=--offload-arch=gfx90a \
  -DKokkos_ARCH_ZEN3=ON \
  -DKokkos_ARCH_AMD_GFX90A=ON \
  -DKokkos_ENABLE_HIP=ON \
  -DIPPL_PLATFORMS="HIP;OPENMP" \
  -DIPPL_ENABLE_TESTS=ON \
  -DIPPL_ENABLE_UNIT_TESTS=OFF \
  -DIPPL_ENABLE_FFT=ON \
  -DIPPL_ENABLE_SOLVERS=ON \
  -DIPPL_ENABLE_ALPINE=ON \
  -DHeffte_ENABLE_ROCM=ON \
  -DHeffte_ENABLE_GPU_AWARE_MPI=ON \
  -DIPPL_ENABLE_ALIGNED_COMMUNICATION_BUFFERS=ON \
  -DCMAKE_EXE_LINKER_FLAGS="-L/opt/cray/pe/mpich/8.1.32/ofi/amd/6.0/lib -L/opt/cray/pe/mpich/8.1.32/gtl/lib -L/opt/cray/pe/libsci/25.03.0/AMD/6.0/x86_64/lib -L/opt/cray/pe/dsmml/0.3.0/dsmml//lib -L/opt/cray/xpmem/2.8.2-1.0_5.1__g84a27a5.shasta/lib64 -lsci_amd_mpi -lsci_amd -ldl -lmpi_amd -lmpi_gtl_hsa -ldsmml -lxpmem -L/opt/rocm-6.3.4/lib/lib -L/opt/rocm-6.3.4/lib/lib64 -L/opt/rocm-6.3.4/lib/llvm/lib"
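After configuring, the build itself is a plain CMake build (adjust the parallelism as needed); the LandauDamping executable then ends up in the alpine folder of the build directory:
cmake --build . --parallel 32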
The jobscript I am using is:
#!/bin/bash
# ---
# sbatch settings for AMD Lumi
# ---
#SBATCH --job-name=landauFFT
#SBATCH --output=landauFFT_%j.out
#SBATCH --error=landauFFT_%j.err
#SBATCH --time=00:15:00
#SBATCH --nodes=8
#SBATCH --partition=standard-g
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --account=[account number]
#SBATCH --hint=nomultithread
#SBATCH --hint=exclusive
# one core per rank, chosen to sit in the NUMA domain closest to the matching GCD (LUMI-G layout)
CPU_BIND="map_cpu:49,57,17,25,1,9,33,41"
export MPICH_GPU_SUPPORT_ENABLED=1
ulimit -s unlimited
mkdir -p data
# Threading and pinning
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=threads   # pin each OpenMP thread to a hardware thread
export OMP_PROC_BIND=spread # spread the rank's threads over its assigned places
# (optional, redundant with OMP_NUM_THREADS for the OpenMP backend)
export KOKKOS_NUM_THREADS=$SLURM_CPUS_PER_TASK
# wrapper that exposes exactly one GCD to each MPI rank via ROCR_VISIBLE_DEVICES
cat << EOF > select_gpu
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF
chmod +x ./select_gpu
export exedir=[your_exe_dir]
srun --cpu-bind=${CPU_BIND} ./select_gpu ${exedir}/LandauDamping 1024 1024 1024 8589934592 10 FFT 1.0 LeapFrog --overallocate 2.0 --info 5
where the "exedir" should correspond to the alpine folder in the build directory of IPPL, where the LandauDamping executable can be found.