ROCm GPU Fix #711

Open
eessmann wants to merge 3 commits into QuEST-Kit:devel from eessmann:rocm_fix

Conversation

@eessmann
Contributor

Initial hack to get us working on ROCm 7.

Not very happy about the casts and the explicit calls to addCuQcomp and mulCuQcomp.

@otbrown changed the title from "ROMc GPU Fix" to "ROCm GPU Fix" on Feb 23, 2026
@TysonRayJones
Member

TysonRayJones commented Feb 26, 2026

Are you sure addCuQcomp is necessary? The ROCm 6.0 changelog mentions only * and / issues, and your diff doesn't change every + operator into an addCuQcomp call (consider this one). So I suspect

amp = addCuQcomp(amp, mulCuQcomp(coeffs[q], inAmps[q][n]));

could instead be just

amp = amp + mulCuQcomp(coeffs[q], inAmps[q][n]);

and addCuQcomp() deleted.

I'm not even sure that's better, but it'd help the investigation by shrinking the diff.

@eessmann
Contributor Author

amp = amp + mulCuQcomp(coeffs[q], inAmps[q][n]);

Yup, I agree that's a better way to do it. I'll update the PR.

@otbrown
Collaborator

otbrown commented Apr 9, 2026

Minimal example output on 8 GPUs on Frontier using ROCm 7.2.0:

QuEST execution environment:
  [precision]
    qreal.................double (8 bytes)
    qcomp.................std::complex<double> (16 bytes)
    qindex................long long int (8 bytes)
    validationEpsilon.....1e-12
  [compilation]
    isMpiCompiled...........1
    isGpuCompiled...........1
    isOmpCompiled...........1
    isCuQuantumCompiled.....0
  [deployment]
    isMpiEnabled............1
    isGpuEnabled............1
    isOmpEnabled............1
    isCuQuantumEnabled......0
    isGpuSharingEnabled.....0
  [cpu]
    numCpuCores.......128 per machine
    numOmpProcs.......128 per machine
    numOmpThrds.......7 per node
    cpuMemory.........502.6 GiB per machine
    cpuMemoryFree.....unknown
  [gpu]
    numGpus...........8
    gpuDirect.........0
    gpuMemPools.......1
    gpuMemory.........63.98 GiB per gpu
    gpuMemoryFree.....63.9 GiB per gpu
    gpuCache..........0 bytes per gpu
  [distribution]
    isMpiGpuAware.....0
    numMpiNodes.......8
  [statevector limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............34
    maxQubitsForGpu.............31
    maxQubitsForMpiCpu..........36
    maxQubitsForMpiGpu..........33
    maxQubitsForMemOverflow.....58
    maxQubitsForIndOverflow.....63
  [density matrix limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............17
    maxQubitsForGpu.............15
    maxQubitsForMpiCpu..........19
    maxQubitsForMpiGpu..........18
    maxQubitsForMemOverflow.....28
    maxQubitsForIndOverflow.....31
  [statevector autodeployment]
    8 qubits......[omp]
    12 qubits.....[omp] [gpu]
    29 qubits.....[omp] [gpu] [mpi]
  [density matrix autodeployment]
    4 qubits......[omp]
    6 qubits......[omp] [gpu]
    15 qubits.....[omp] [gpu] [mpi]

Qureg:
  [deployment]
    isMpiEnabled.....1
    isGpuEnabled.....1
    isOmpEnabled.....1
  [dimension]
    isDensMatr.....0
    numQubits......20
    numCols........N/A
    numAmps........2^20 = 1048576
  [distribution]
    numNodes.....2^3 = 8
    numCols......N/A
    numAmps......2^17 = 131072 per node
  [memory]
    cpuAmps...........2 MiB per node
    gpuAmps...........2 MiB per node
    cpuCommBuffer.....2 MiB per node
    gpuCommBuffer.....2 MiB per node
    globalTotal.......64 MiB

Qureg (20 qubit statevector, 1048576 qcomps over 8 nodes, 4 MiB per gpu):
    -0.00025368+0.00010867i   |0⟩        [node 0]
    -0.00071384+0.00059215i   |1⟩
    0.00016356+0.0003021i     |2⟩
    0.00034209+0.0011542i     |3⟩
    0.00034146-0.00023145i    |4⟩
    -0.00050332+0.00055775i   |5⟩
    0.00040364+0.00059691i    |6⟩
    0.0009239-(9.5636e-5)i    |7⟩
    0.0011029-0.00081217i     |8⟩
    -0.00042481+0.00064243i   |9⟩
    -0.00050878-0.00050273i   |10⟩
    0.00011366+0.00021342i    |11⟩
    0.00085193+0.0013929i     |12⟩
    -0.00019395-0.0006723i    |13⟩
    0.00014992+0.0010192i     |14⟩
    -0.0001247+0.0012711i     |15⟩
                ⋮
    (2.3332e-5)+(5.4964e-5)i  |1048560⟩  [node 7]
    0.00051552-0.0007637i     |1048561⟩
    -0.00094351-0.00018347i   |1048562⟩
    0.0010981+0.001146i       |1048563⟩
    -0.00047254-0.00027003i   |1048564⟩
    -0.00049763+0.0013388i    |1048565⟩
    0.00034088-0.00031098i    |1048566⟩
    (-3.3804e-5)-0.00064161i  |1048567⟩
    -0.00081207-0.00020891i   |1048568⟩
    0.00019263+0.00062731i    |1048569⟩
    -0.00035998+0.00096104i   |1048570⟩
    -0.0016944-0.00040577i    |1048571⟩
    0.00050406+(3.8155e-5)i   |1048572⟩
    (5.8668e-5)-0.00054072i   |1048573⟩
    -0.0021844-0.00086032i    |1048574⟩
    -0.00047689-0.00012361i   |1048575⟩

Total probability: 1

Test output on 1 GPU on Frontier:


QuEST execution environment:
  precision:       2
  multithreaded:   1
  distributed:     1
  GPU-accelerated: 1
  GPU-sharing ok:  0
  cuQuantum:       0
  num nodes:       1

Testing configuration:
  test all deployments:  0
  num qubits in qureg:   6
  max num qubit perms:   50
  max num superop targs: 4
  num mixed-deploy reps: 10

Tested Qureg deployments:
  GPU + OMP + MPI

Randomness seeded to: 2089935186
===============================================================================
All tests passed (343063 assertions in 277 test cases)

Looks good to me!

@eessmann two things:

  1. Could you post test results from NVIDIA to a) confirm we haven't broken it, and b) so I can find them easily if needed.
  2. Could you please merge devel into this branch, as I recently updated it. There shouldn't be any clashes, but importantly it will fix the HIP compile tests.

I will walk back ROCm versions to see what our backwards compatibility looks like here.

@otbrown
Collaborator

otbrown commented Apr 9, 2026

Okay, all tests pass on ROCm 5.7.1, so I lazily declare ROCm back-compat 'fine'.


QuEST execution environment:
  precision:       2
  multithreaded:   1
  distributed:     1
  GPU-accelerated: 1
  GPU-sharing ok:  0
  cuQuantum:       0
  num nodes:       1

Testing configuration:
  test all deployments:  0
  num qubits in qureg:   6
  max num qubit perms:   10
  max num superop targs: 4
  num mixed-deploy reps: 10

Tested Qureg deployments:
  GPU + OMP + MPI

Randomness seeded to: 1219776663
===============================================================================
All tests passed (294177 assertions in 277 test cases)

@otbrown
Collaborator

otbrown commented Apr 9, 2026

@eessmann Final thing before I forget: could you do a performance regression check on one of the NVIDIA systems as well?
