ROCm GPU Fix #711

Open
eessmann wants to merge 3 commits into QuEST-Kit:devel from eessmann:rocm_fix

Conversation

@eessmann
Contributor

Initial hack to get us working on ROCm 7.

Not very happy about the casts and the explicit calls to addCuQcomp and mulCuQcomp.

@otbrown changed the title from "ROMc GPU Fix" to "ROCm GPU Fix" on Feb 23, 2026
@TysonRayJones
Member

TysonRayJones commented Feb 26, 2026

Are you sure addCuQcomp is necessary? The ROCm 6.0 changelog mentions only * and / issues, and your diff doesn't change every + operator into an addCuQcomp call (consider this one). So I suspect

amp = addCuQcomp(amp, mulCuQcomp(coeffs[q], inAmps[q][n]));

could instead be just

amp = amp + mulCuQcomp(coeffs[q], inAmps[q][n]);

and addCuQcomp() deleted.

I'm not even sure that's better, but it'd help the investigation by shrinking the diff.

@eessmann
Contributor Author

amp = amp + mulCuQcomp(coeffs[q], inAmps[q][n]);

Yup, I agree that's a better way to do it. I'll update the PR.

@otbrown
Collaborator

otbrown commented Apr 9, 2026

Minimal example output on 8 GPUs on Frontier using ROCm 7.2.0:

QuEST execution environment:
  [precision]
    qreal.................double (8 bytes)
    qcomp.................std::complex<double> (16 bytes)
    qindex................long long int (8 bytes)
    validationEpsilon.....1e-12
  [compilation]
    isMpiCompiled...........1
    isGpuCompiled...........1
    isOmpCompiled...........1
    isCuQuantumCompiled.....0
  [deployment]
    isMpiEnabled............1
    isGpuEnabled............1
    isOmpEnabled............1
    isCuQuantumEnabled......0
    isGpuSharingEnabled.....0
  [cpu]
    numCpuCores.......128 per machine
    numOmpProcs.......128 per machine
    numOmpThrds.......7 per node
    cpuMemory.........502.6 GiB per machine
    cpuMemoryFree.....unknown
  [gpu]
    numGpus...........8
    gpuDirect.........0
    gpuMemPools.......1
    gpuMemory.........63.98 GiB per gpu
    gpuMemoryFree.....63.9 GiB per gpu
    gpuCache..........0 bytes per gpu
  [distribution]
    isMpiGpuAware.....0
    numMpiNodes.......8
  [statevector limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............34
    maxQubitsForGpu.............31
    maxQubitsForMpiCpu..........36
    maxQubitsForMpiGpu..........33
    maxQubitsForMemOverflow.....58
    maxQubitsForIndOverflow.....63
  [density matrix limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............17
    maxQubitsForGpu.............15
    maxQubitsForMpiCpu..........19
    maxQubitsForMpiGpu..........18
    maxQubitsForMemOverflow.....28
    maxQubitsForIndOverflow.....31
  [statevector autodeployment]
    8 qubits......[omp]
    12 qubits.....[omp] [gpu]
    29 qubits.....[omp] [gpu] [mpi]
  [density matrix autodeployment]
    4 qubits......[omp]
    6 qubits......[omp] [gpu]
    15 qubits.....[omp] [gpu] [mpi]

Qureg:
  [deployment]
    isMpiEnabled.....1
    isGpuEnabled.....1
    isOmpEnabled.....1
  [dimension]
    isDensMatr.....0
    numQubits......20
    numCols........N/A
    numAmps........2^20 = 1048576
  [distribution]
    numNodes.....2^3 = 8
    numCols......N/A
    numAmps......2^17 = 131072 per node
  [memory]
    cpuAmps...........2 MiB per node
    gpuAmps...........2 MiB per node
    cpuCommBuffer.....2 MiB per node
    gpuCommBuffer.....2 MiB per node
    globalTotal.......64 MiB

Qureg (20 qubit statevector, 1048576 qcomps over 8 nodes, 4 MiB per gpu):
    -0.00025368+0.00010867i   |0⟩        [node 0]
    -0.00071384+0.00059215i   |1⟩
    0.00016356+0.0003021i     |2⟩
    0.00034209+0.0011542i     |3⟩
    0.00034146-0.00023145i    |4⟩
    -0.00050332+0.00055775i   |5⟩
    0.00040364+0.00059691i    |6⟩
    0.0009239-(9.5636e-5)i    |7⟩
    0.0011029-0.00081217i     |8⟩
    -0.00042481+0.00064243i   |9⟩
    -0.00050878-0.00050273i   |10⟩
    0.00011366+0.00021342i    |11⟩
    0.00085193+0.0013929i     |12⟩
    -0.00019395-0.0006723i    |13⟩
    0.00014992+0.0010192i     |14⟩
    -0.0001247+0.0012711i     |15⟩
                ⋮
    (2.3332e-5)+(5.4964e-5)i  |1048560⟩  [node 7]
    0.00051552-0.0007637i     |1048561⟩
    -0.00094351-0.00018347i   |1048562⟩
    0.0010981+0.001146i       |1048563⟩
    -0.00047254-0.00027003i   |1048564⟩
    -0.00049763+0.0013388i    |1048565⟩
    0.00034088-0.00031098i    |1048566⟩
    (-3.3804e-5)-0.00064161i  |1048567⟩
    -0.00081207-0.00020891i   |1048568⟩
    0.00019263+0.00062731i    |1048569⟩
    -0.00035998+0.00096104i   |1048570⟩
    -0.0016944-0.00040577i    |1048571⟩
    0.00050406+(3.8155e-5)i   |1048572⟩
    (5.8668e-5)-0.00054072i   |1048573⟩
    -0.0021844-0.00086032i    |1048574⟩
    -0.00047689-0.00012361i   |1048575⟩

Total probability: 1

Test output on 1 GPU on Frontier:


QuEST execution environment:
  precision:       2
  multithreaded:   1
  distributed:     1
  GPU-accelerated: 1
  GPU-sharing ok:  0
  cuQuantum:       0
  num nodes:       1

Testing configuration:
  test all deployments:  0
  num qubits in qureg:   6
  max num qubit perms:   50
  max num superop targs: 4
  num mixed-deploy reps: 10

Tested Qureg deployments:
  GPU + OMP + MPI

Randomness seeded to: 2089935186
===============================================================================
All tests passed (343063 assertions in 277 test cases)

Looks good to me!

@eessmann two things:

  1. Could you post test results from NVIDIA to a) confirm we haven't broken it, and b) so I can find them easily if needed.
  2. Could you please merge devel into this branch, as I recently updated it. There shouldn't be any clashes, but importantly it will fix the HIP compile tests.

I will walk back ROCm versions to see what our backwards compatibility looks like here.

@otbrown
Collaborator

otbrown commented Apr 9, 2026

Okay, all tests pass on ROCm 5.7.1, so I lazily declare ROCm back-compat 'fine'.


QuEST execution environment:
  precision:       2
  multithreaded:   1
  distributed:     1
  GPU-accelerated: 1
  GPU-sharing ok:  0
  cuQuantum:       0
  num nodes:       1

Testing configuration:
  test all deployments:  0
  num qubits in qureg:   6
  max num qubit perms:   10
  max num superop targs: 4
  num mixed-deploy reps: 10

Tested Qureg deployments:
  GPU + OMP + MPI

Randomness seeded to: 1219776663
===============================================================================
All tests passed (294177 assertions in 277 test cases)

@otbrown
Collaborator

otbrown commented Apr 9, 2026

@eessmann Final thing before I forget: could you do a performance regression check on one of the NVIDIA systems as well?
