GPU race condition in MPI communication #98
Merged
diogosiebert merged 2 commits into OPM:master on Oct 15, 2025
Conversation
JamesEMcClure approved these changes Oct 15, 2025
Collaborator
The build-and-test action failed due to an issue downloading silo. However, I can confirm that the code builds successfully and resolves issue #94 using a different case from the one originally reported. Since James has also approved the changes, and given the impact of this bug, I will proceed to merge the fix directly into the master branch.
Bugfix
Fixes a race condition within the MPI communication of the GPU execution of the Color model.
In the ScaLBL_Communicator::BiSendD3Q7AA routine, the GPU kernel must finish packing the MPI buffer before the message is sent. Currently, there is no guarantee that the kernel has finished, which creates a race condition in the MPI communication: a partially uninitialized message can be sent, producing non-reproducible results that depend on the number of subdomains. This manifests as noise at the domain decomposition boundary, as shown in this water invasion of an oil-saturated cubic sphere pack:
before_bugfix.mp4
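To illustrate the failure mode only (this is not the project's code), here is a minimal CUDA-aware-MPI sketch with hypothetical names (PackSendBuffer, BiSendRacy, d_sendbuf): the kernel launch returns asynchronously, so the non-blocking send can read the device buffer before packing has completed.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical pack kernel: gathers distribution values into a contiguous send buffer.
__global__ void PackSendBuffer(const double *dist, const int *list, double *sendbuf, int count) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < count) sendbuf[n] = dist[list[n]];
}

// Racy pattern: the kernel launch is asynchronous, so a CUDA-aware MPI_Isend may
// read the device-resident send buffer before packing has finished.
void BiSendRacy(const double *dist, const int *d_list, double *d_sendbuf, int count,
                int dest, int tag, MPI_Comm comm, MPI_Request *req) {
    PackSendBuffer<<<(count + 255) / 256, 256>>>(dist, d_list, d_sendbuf, count);
    // No synchronization here: the host posts the send immediately, so the
    // message may contain partially uninitialized data.
    MPI_Isend(d_sendbuf, count, MPI_DOUBLE, dest, tag, comm, req);
}
```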
Adding a device synchronization before the MPI_Isend calls ensures the GPU kernels have finished packing the message, yielding reproducible results independent of the number of subdomains and no spurious water phase at the domain decomposition boundary:
after_bugfix.mp4
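A sketch of the fix under the same assumptions as the snippet above: a cudaDeviceSynchronize() (the codebase may use its own device-barrier wrapper instead) between the pack kernel launch and the MPI_Isend guarantees the buffer is fully written before the non-blocking send is posted.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

// Reuses the hypothetical PackSendBuffer kernel from the sketch above.
__global__ void PackSendBuffer(const double *dist, const int *list, double *sendbuf, int count);

// Fixed pattern: block the host until all queued device work has completed,
// so the send buffer is fully packed before MPI_Isend is posted.
void BiSendFixed(const double *dist, const int *d_list, double *d_sendbuf, int count,
                 int dest, int tag, MPI_Comm comm, MPI_Request *req) {
    PackSendBuffer<<<(count + 255) / 256, 256>>>(dist, d_list, d_sendbuf, count);
    cudaDeviceSynchronize();  // wait for the pack kernel to finish writing d_sendbuf
    MPI_Isend(d_sendbuf, count, MPI_DOUBLE, dest, tag, comm, req);
}
```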
I have not extensively checked the other models to see whether this fix needs to be extended elsewhere in the code. I am also not certain whether some compilers detect this dependency and force a device synchronization before the send, so the bug may or may not have affected other users.
Commit 1364d10 contains some minor compilation fixes I needed to get the code to compile with nvhpc.
Resolves #94.