Add intrinsic for launch-sized workgroup memory on GPUs #146181
Conversation
rustbot has assigned @petrochenkov. Use r? to explicitly pick a reviewer.

Some changes occurred in src/tools/compiletest (cc @jieyouxu). Some changes occurred in compiler/rustc_codegen_ssa. Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter is adapted for the changes, if necessary.
Force-pushed from 0aa0e58 to 3ebaccb
Force-pushed from 3ebaccb to 2378959
library/core/src/intrinsics/mod.rs
Outdated
#[rustc_nounwind]
#[unstable(feature = "dynamic_shared_memory", issue = "135513")]
#[cfg(any(target_arch = "amdgpu", target_arch = "nvptx64"))]
pub fn dynamic_shared_memory<T: ?Sized>() -> *mut T;
Note that outside the GPU world, "shared memory" typically refers to memory shared between processes. So I would suggest using a name that's less likely to be confused, like something that explicitly involves "GPU" or so.
This sounds like a form of "global" memory (similar to a static item), but then apparently OpenCL calls it "local" which is very confusing...
Does it make sense to add a mod gpu?
I think there are more intrinsics for GPUs that can be added (although more in the traditional intrinsic sense, relating to an instruction; edit: re-exposing intrinsics from core::arch::nvptx and the amdgpu equivalent).
Or should it be in core::arch::gpu?
(From #135516 (comment), cc @workingjubilee)
Rust intrinsic names are not namespaced. They are exposed in a module, but inside the compiler they are identified entirely by their name. So moving them into a different module doesn't alleviate the need for a clear name that will be understandable to non-GPU people working in the compiler (which is the vast majority of compiler devs).
If there are more GPU intrinsics to come, moving them into a gpu.rs file here might still make sense.
I don't have a strong opinion on how the eventually stable public API is organized, I am commenting entirely as someone who has an interest in keeping the set of intrinsics the Rust compiler offers understandable and well-defined (the ones in this folder, not the ones in core::arch which you call "more traditional" but that's very dependent on your background ;). These intrinsics are just an implementation detail, but every intrinsic we add here is a new language primitive -- it's like adding a new keyword, just without the syntax discussions and perma-unstable. In the past we used to have intrinsics that entirely break the internal consistency of the language, and we used to have intrinsics whose safety requirements were very poorly documented.
library/core/src/intrinsics/mod.rs
Outdated
/// All pointers returned by `dynamic_shared_memory` point to the same address,
/// so alias the same memory.
Does this mean that all dynamic_shared_memory::<T> for the same T return the same pointer, or do even dynamic_shared_memory::<T> and dynamic_shared_memory::<U> point to the same memory?
All of them alias, independent of the type.
It’s probably worth explaining the concept of shared memory in the comment?
Maybe this makes it clearer:
Returns a pointer to dynamic shared memory.

Shared memory is a memory region that is shared between all threads in
the same block/work-group. It is usually faster than global memory, which is
shared between all threads on a GPU.

Dynamic shared memory is in the shared memory region, though the allocated
size is specified late, when launching a gpu-kernel. The pointer returned by
`dynamic_shared_memory()` is the start of the dynamic shared memory region.
All calls to `dynamic_shared_memory` in a block/work-group, independent of the
generic type, return the same address, so alias the same memory.
The returned pointer is aligned by at least the alignment of `T`.

Other APIs

CUDA and HIP call this shared memory, shared between threads in a block.
OpenCL and SYCL call this local memory, shared between threads in a work-group.
GLSL calls this shared memory, shared between invocations in a work group.
DirectX calls this groupshared memory, shared between threads in a thread-group.
It’s probably worth explaining the concept of shared memory in the comment?
It's probably worth using a more specific term ("GPU shared memory" or so) since many people reading this will think "shared memory" refers to its more standard meaning of memory shared across different processes (often set up via mmap). It's unfortunate that GPU vendors chose to overload this term, but when working in a more general-purpose codebase you can't just assume everyone to know the conventions of the GPU community, and you can't give general-purpose terms a GPU-specific meaning without risking confusion.
Other APIs
CUDA and HIP call this shared memory, shared between threads in a block.
OpenCL and SYCL call this local memory, shared between threads in a work-group.
GLSL calls this shared memory, shared between invocations in a work group.
DirectX calls this groupshared memory, shared between threads in a thread-group.
This sort of "translation guide" is not actually useful if you are not familiar with any of these things, so I would just leave it out as it is a distraction from the actual description. Especially since it's very easy to go look up a definition of, say, OpenCL's local memory, see it referred to as "this is GLSL's shared memory", look up a definition of that and see it referred to as basically the same idea as groupshared memory in DirectX, then look up a definition of that and... you get the idea. Our definition should stand on its own.
The translation guide might be useful to people that are familiar with these things and wondering why we are making up our own terms.
Then I will merely continue to insist that the description should make sense without such references, to prevent infinite regress.
Exceedingly fine details, of course, can be handled elsewhere by other sources, but the concepts should be usefully clear here.
For sure, we need something reasonably concise such that a rust compiler dev with zero GPU knowledge has a rough idea of what this does after reading the description. I don't think that's that hard, GPUs aren't that special, but they use a lot of "weird" (read: grown-over-time) terminology that presents a big barrier to entry.
/// The returned pointer is the start of the dynamic shared memory region.
/// All pointers returned by `dynamic_shared_memory` point to the same address,
/// so alias the same memory.
/// The returned pointer is aligned by at least the alignment of `T`.
Is there some prior discussion of the design decision to determine the alignment by giving a type parameter? It could also be a const generic parameter, for instance. I don't have an opinion on the matter since I am an outsider to the GPU world, but as a compiler team member it'd be good to know if this is something you thought about for 5 minutes or whether there's some sort of larger design by a team that has a vision of how all these things will fit together.
There is some discussion in #135516. I don’t mind either way, I thought (for 5 minutes ;)) that specifying the type of the returned pointer makes sense.
I’m not much of a GPU programmer, but I think in most cases, you would store an array in dynamic shared memory, or maybe a struct followed by a dynamically sized array (or maybe two/n arrays of different types).
For just a struct, static shared memory would make more sense, though we don’t support that yet (there’s some discussion in the tracking issue, but I think that’s more complicated to design and implement).
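To make the two shapes easier to compare, here is a sketch; these are placeholder free functions (not the actual intrinsic declarations), purely to contrast deriving the alignment from a type parameter with taking it as a const generic:

```rust
// Shape used in this PR: the alignment (and pointee type) come from `T`.
pub fn by_type_parameter<T>() -> *mut T {
    unimplemented!()
}

// Alternative mentioned above: the alignment is an explicit const generic,
// and the caller casts the returned byte pointer as needed.
pub fn by_const_generic<const ALIGN: usize>() -> *mut u8 {
    unimplemented!()
}

fn callers() {
    let _a: *mut [f32; 256] = by_type_parameter::<[f32; 256]>();
    let _b: *mut u8 = by_const_generic::<16>();
}
```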
Sorry for drowning you in questions here, but extending the core language with new operations (as in, adding a new intrinsic doing things that couldn't be done before) is a big deal, and we had a bad experience in the past when this was done without wider discussion in the team to ensure that the intrinsics actually make sense in the context of Rust. Not everything that exists in the hardware can be 1:1 exposed in Rust, sometimes this requires a lot of work and sometimes it's just basically impossible. It can be a lot of work to clean these things up later, and as someone who did a bunch of that work, I'd rather not have to do it again. :)
I agree that it makes a lot of sense to have the discussion now. Thanks for taking a look and helping to design something useful!
Heh, yes, that’s something that should be mentioned in the doc comment as well. (Especially comments on how to safely use it.)
Depends on the size specified on the CPU side when launching the gpu-kernel. It may or it may not.
There are “higher-level APIs” like “do a fast matrix-matrix multiplication”, but not much in-between. I’d assume that people usually use this in its raw form.
Two general use cases are: 1) All threads in a group load a part from global memory (the RAM/VRAM) and store it in shared memory. Then all threads read from the collaboratively loaded data. 2) All threads in a group do some work and collaborate on shared memory (with atomics or so) to aggregate results. Then one of the threads stores the final result to global memory. So, shared memory is meant to be accessed collaboratively and the developer must ensure proper synchronization. It is hard to provide a safe abstraction for this and tbh, I don’t want to try 😅 (though I can see 3rd party crates doing this – at least to some extent). From Rust’s perspective, guarantees should be the same as with memory that’s shared between processes.
I agree, it would be nice to have good documentation for the intrinsics in Rust!
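As a minimal sketch of use case 2 above, assuming the `core::intrinsics::dynamic_shared_memory` intrinsic from the diff in this PR and the existing `core::arch::nvptx` intrinsics (the kernel name, its parameters, and the exact feature gates are hypothetical and only illustrate the collaborate-then-write-out pattern):

```rust
#![no_std]
#![feature(abi_ptx, core_intrinsics, stdarch_nvptx, dynamic_shared_memory)]

use core::sync::atomic::{AtomicU32, Ordering};

// Use case 2: every thread in the work-group contributes to a shared counter,
// then one thread writes the aggregated result back to global memory.
#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn count_nonzero(input: *const u32, out: *mut u32, len: usize) {
    // All calls in the work-group return the same address; the launch must
    // have allocated at least size_of::<AtomicU32>() bytes of shared memory.
    let counter = core::intrinsics::dynamic_shared_memory::<AtomicU32>();
    let t = core::arch::nvptx::_thread_idx_x() as usize;
    if t == 0 {
        (*counter).store(0, Ordering::Relaxed);
    }
    // Make the initialization visible to all threads before they start counting.
    core::arch::nvptx::_syncthreads();
    if t < len && *input.add(t) != 0 {
        (*counter).fetch_add(1, Ordering::Relaxed);
    }
    // Wait until every thread has contributed before reading the total.
    core::arch::nvptx::_syncthreads();
    if t == 0 {
        *out = (*counter).load(Ordering::Relaxed);
    }
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}
```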
Wait, there's a single static size set when launching the kernel? Why is it called "dynamic" memory? "dynamic" memory usually means memory that is allocated (and freed) dynamically at runtime. Are you saying dynamic shared memory is neither dynamic in the normal sense nor shared in the normal sense? ;)
r? @RalfJung
I won't be able to do the final approval here, I can just help with ensuring that the intrinsics are documented well enough that they can be understood without GPU expertise, and that the LLVM codegen looks vaguely reasonable. I don't know if we have anyone who actually knows what the generated LLVM IR should look like and can ensure it makes sense. r? @nikic maybe?
Force-pushed from a9f43d5 to 3390a11
This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.
I slightly changed the implementation to generate a new (unnamed) global per intrinsic call. I also added a summary and considered alternatives to the commit/PR description. @workingjubilee, @RalfJung, does this read well to you now?
I have looked at the intrinsic docs; that is now much better, thanks! I did not look at the compiler parts -- we should get someone who knows a little more about LLVM APIs and in particular LLVM-for-GPUs to take a look at that.
The LLVM bits look fine in terms of API, but I have no familiarity with how the GPU lowering works.
Our sole amdgpu maintainer seems to be the PR author, so that's no way to get a second opinion. :)
Force-pushed from 3390a11 to 61e5213
Thanks for the reviews! I incorporated all your comments.
Sorry for coming late to the party. I have some time right now to catch up on the Rust issues and will take a look at this PR and test it out using the nvptx backend.

I see there's already a bunch of comments related to the naming. This feature is highly important and wanted for the nvptx backend, and I certainly don't want to rip up these things close to the finish line, but I still feel the need to quickly comment on it.

To make it complete, I can start by saying that my intuition would be to align closer to the more open APIs and call it "workgroup memory" instead of "groupshared memory". On the other hand, I also see that "groupshared memory" is a very descriptive and concise name, while being less burdened by other specs, so I agree that it's arguably even better. Most importantly, I can see from the previous discussion that "workgroup" was considered and "groupshared" was deliberately selected for these reasons, and the whole discussion of this seemed productive.

When it comes to "dynamic", I see the discussion basically rounds off with a "we can probably live with dynamic" sentiment. It seems intuitive to me that anchoring the naming to the property that this memory is specific to a launch/dispatch (like @kulst basically suggests) is strictly better, and I am therefore wondering if there's any downside to calling it something along those lines. When all that is said, I also think that we can live well with the current name.
I was going to suggest merging this and addressing any later comments afterwards, but here we go, thanks for taking a look!

Getting this name right is somewhat important, as it also affects intrinsic names (do we want …). I noticed that this PR currently doesn't use work-group vs group consistently; I suppose we should decide between the two. To make the name not too long and unwieldy, maybe we can drop the …

For reference, the GPU hierarchy from largest to tiniest unit is …
That sounds like this function will launch a workgroup memory. Not that I know what that could mean, but "launch" very much feels like a verb here.

In a full sentence (and as someone who has not used any of these GPU APIs) I would describe this as "per-launch memory" or maybe "per-kernel-instance memory". Though it's only the size that is per-launch, right? The contents are per-workgroup. "Per-launch-sized workgroup memory". What a mouthful. That doesn't really have a parallel outside the GPU world so we may just have to make up a new adjective. ;)

I presume the use of some adjective here is important, i.e. there are other forms of per-workgroup memory that this needs to be distinguished from?
I'm completely fine with either "launch" or "dispatch". They are both descriptive and concise. When choosing the names, it will be good to be explicit whether we're defining the concepts for kernel or shader execution or both. Using our best "guess" as a placeholder and completely deciding on a name later is probably what this will be in practice anyway. Same as the above with "groupshared_mem" vs "group_mem" vs "workgroup_mem". They're all sufficiently descriptive.
If I get a vote then +1 for workgroup. On the other hand, all of these are fine so I don't particularly mind either way.
There's the per-workgroup memory that is defined by the "gpu code" rather than the "cpu code" (really: launch site) definition. In CUDA this is called "static shared memory". I don't believe it makes sense to have intrinsics to fetch this pointer, so it might not be that important to make the distinction here.
You cannot re-use the memory content between launches. You also do not need to manually manage issues with overlapping launches. So if I interpret your question literally, then the answer is that the actual memory content is also (per-workgroup-)per-launch. But I think it's important that this info is semantically contained in the …
I fully agree; even with the knowledge of what this function does, it sounds wrong. I tried to "sneak in" a "sized" part. I think adding something like that can make it sufficiently express that it's "defined" or "sized" at launch. I think … Also, if we agree that this "static" kind of memory is not relevant for an intrinsic, we can simply get away with …
Oh yes, I was blind to the launch as a verb 😄
Side-note for everyone participating here: As some naming discussion already happened here, I created a Discourse topic to discuss GPU naming more generally, also in light of naming intrinsics: https://internals.rust-lang.org/t/naming-gpu-things-in-the-rust-compiler-and-standard-library/23833
I think … As it's more important to get rid of a misleading name than to pick the absolute best between two fine names, I don't have any strong opinions. I think "launch_sized" is my slight favorite, since I believe the most important property of this memory is that the size can differ between launches.
I ported the most basic CUDA dynamic shared memory example to Rust today and discovered an issue. The Rust function looks like this:

```rust
#![no_std]
#![feature(
    abi_ptx,
    core_intrinsics,
    stdarch_nvptx,
    intrinsics,
    gpu_dynamic_groupshared_mem
)]

use core::intrinsics::gpu_dynamic_groupshared_mem;

#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn dynamic_reverse(d: *mut i32, len: usize) {
    let s = gpu_dynamic_groupshared_mem::<i32>();
    let t = core::arch::nvptx::_thread_idx_x() as usize;
    let tr = len - t - 1;
    s.add(t).write(d.add(t).read());
    core::arch::nvptx::_syncthreads();
    d.add(t).write(s.add(tr).read());
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}
```

while the ptx looks like this … In case you want to reproduce it, I built it using the command … (where …). To see the actual error it can be convenient to run the ptx in …, which indicates it's the name of the extern binding that contains a dot (…). It all reminds me a lot of the problems in #99248, but it might be easier to fix, as we might have the necessary control of the name from Rust?
That sounds more like an LLVM problem to me... if Rust has to name its symbols in a way that all the (more-or-less silly) restrictions of each backend are honored, we'll be restricted to a-zA-Z0-9 and make debugging and reading LLVM IR harder for everyone.
Good catch @kjetilkjeka. I had the same issue testing a previous version of the intrinsic. The fix was what you also suggest: giving the global a fixed but NVPTX-conforming name.
I agree that it's an LLVM problem. Similar things have been fixed before, like the debug symbols in this PR. Before doing any actual exploration into LLVM, it seems like the right solution is catching that the … and, based on that, running the same logic. I can look into that and either make a PR or at least report my findings.

For this PR we probably don't want to leave it hanging on an LLVM upstream fix. So we need a decision on whether we want to: …

I managed to get it working by extending the cases where this "valid ptx identifier" function is called to also cover … This is basically the new conditional: …
Force-pushed from 61e5213 to c9c04f0
I pushed the rename to gpu_launch_sized_workgroup_mem. Sorry for breaking nvptx names again. I tried to add a test for this, but this only fails in ptxas, so I didn't manage to.
I made the LLVM PR: llvm/llvm-project#173018. I also tested your branch with the fix and that works great! The ptx is arguably nicer, as the name of the …

I also think the name in the current revision is a nice improvement, but in case the reviewers don't like it, the other alternatives are fine as well. All in all, this PR looks pretty great from my point of view. I wouldn't expect @RDambrosio016 to take a look, so there's at least no reason to block it on nvptx-related things.
Thanks for testing and for opening the LLVM PR. (You probably need a lit test for people to approve it.)
Just for context on why I removed the name and why I think it should eventually be removed for nvptx as well: with the named global, the maximum alignment is enforced for all kernels in the IR module, which is unnecessarily conservative. With the unnamed global, different kernels in the same module can end up with different minimum alignments, depending on the calls they make. IMO the latter behavior is what we want.
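To illustrate why one unnamed global per call is less conservative, here is a sketch; the kernels and the aligned type are made up, and only the renamed intrinsic from this PR is assumed (its own unstable feature gate is omitted):

```rust
#![no_std]
#![feature(abi_ptx, core_intrinsics)]

use core::intrinsics::gpu_launch_sized_workgroup_mem;

#[repr(align(16))]
pub struct Tile([f32; 4]);

// Only requires 4-byte alignment for its workgroup memory.
#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn kernel_a() {
    let p = gpu_launch_sized_workgroup_mem::<u32>();
    p.write(42);
}

// Requires 16-byte alignment. With one unnamed global per call, this does not
// leak into kernel_a; with a single named global, it would raise the minimum
// alignment for every kernel in the module.
#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn kernel_b() {
    let p = gpu_launch_sized_workgroup_mem::<Tile>();
    p.write(Tile([0.0; 4]));
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}
```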
Workgroup memory is a memory region that is shared between all
threads in a workgroup on GPUs. Workgroup memory can be allocated
statically or after compilation, when launching a gpu-kernel.
The intrinsic added here returns the pointer to the memory that is
allocated at launch-time.

Interface

With this change, workgroup memory can be accessed in Rust by
calling the new `gpu_launch_sized_workgroup_mem<T>() -> *mut T` intrinsic.
It returns the pointer to workgroup memory, guaranteeing that it is
aligned to at least the alignment of `T`. The pointer is dereferenceable
for the size specified when launching the current gpu-kernel (which may
be the size of `T` but can also be larger or smaller or zero).
All calls to this intrinsic return a pointer to the same address.
See the intrinsic documentation for more details.

Alternative Interfaces

It was also considered to expose dynamic workgroup memory as extern
static variables in Rust, like they are represented in LLVM IR.
However, due to the pointer not being guaranteed to be dereferenceable
(that depends on the allocated size at runtime), such a global must be
zero-sized, which makes global variables a bad fit.

Implementation Details

Workgroup memory in amdgpu and nvptx lives in address space 3.
Workgroup memory from a launch is implemented by creating an
external global variable in address space 3. The global is declared with
size 0, as the actual size is only known at runtime. It is defined
behavior in LLVM to access an external global outside the defined size.
There is no similar way to get the allocated size of launch-sized
workgroup memory on amdgpu and nvptx, so users have to pass this
out-of-band or rely on target-specific ways for now.
Tracking issue: #135516
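Since the allocated size is not queryable on amdgpu and nvptx, a kernel has to learn it out-of-band. A sketch of one such convention (the kernel and parameter names are made up; the intrinsic is the one introduced by this PR, the nvptx intrinsics come from core::arch::nvptx, and crate attributes and the panic handler are as in the earlier example and omitted here):

```rust
use core::intrinsics::gpu_launch_sized_workgroup_mem;

// The host passes the number of f32 elements it allocated as workgroup memory
// when launching the kernel; the kernel uses it to bound its accesses.
#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn scale_in_place(data: *mut f32, workgroup_elems: usize) {
    let buf = gpu_launch_sized_workgroup_mem::<f32>();
    let t = core::arch::nvptx::_thread_idx_x() as usize;
    if t < workgroup_elems {
        // Stage the element in workgroup memory...
        buf.add(t).write(data.add(t).read());
    }
    core::arch::nvptx::_syncthreads();
    if t < workgroup_elems {
        // ...and write back a scaled copy.
        data.add(t).write(buf.add(t).read() * 2.0);
    }
}
```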