Add intrinsic for launch-sized workgroup memory on GPUs #146181
Conversation
rustbot has assigned @petrochenkov. Use r? to explicitly pick a reviewer.

Some changes occurred in src/tools/compiletest (cc @jieyouxu). Some changes occurred in compiler/rustc_codegen_ssa. Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter is adapted for the changes, if necessary.
Force-pushed from 0aa0e58 to 3ebaccb
Force-pushed from 3ebaccb to 2378959
library/core/src/intrinsics/mod.rs
Outdated
#[rustc_nounwind]
#[unstable(feature = "dynamic_shared_memory", issue = "135513")]
#[cfg(any(target_arch = "amdgpu", target_arch = "nvptx64"))]
pub fn dynamic_shared_memory<T: ?Sized>() -> *mut T;
Note that outside the GPU world, "shared memory" typically refers to memory shared between processes. So I would suggest using a name that's less likely to be confused, like something that explicitly involves "GPU" or so.
This sounds like a form of "global" memory (similar to a static item), but then apparently OpenCL calls it "local" which is very confusing...
Does it make sense to add a mod gpu?
I think there are more intrinsics for GPUs that can be added (although more in the traditional intrinsic sense, relating to an instruction; edit: re-exposing intrinsics from core::arch::nvptx and the amdgpu equivalent).
Or should it be in core::arch::gpu?
(From #135516 (comment), cc @workingjubilee)
Rust intrinsic names are not namespaced. They are exposed in a module, but inside the compiler they are identified entirely by their name. So moving them into a different module doesn't alleviate the need for a clear name that will be understandable to non-GPU people working in the compiler (which is the vast majority of compiler devs).
If there are more GPU intrinsics to come, moving them into a gpu.rs file here might still make sense.
I don't have a strong opinion on how the eventually stable public API is organized, I am commenting entirely as someone who has an interest in keeping the set of intrinsics the Rust compiler offers understandable and well-defined (the ones in this folder, not the ones in core::arch which you call "more traditional" but that's very dependent on your background ;). These intrinsics are just an implementation detail, but every intrinsic we add here is a new language primitive -- it's like adding a new keyword, just without the syntax discussions and perma-unstable. In the past we used to have intrinsics that entirely break the internal consistency of the language, and we used to have intrinsics whose safety requirements were very poorly documented.
library/core/src/intrinsics/mod.rs
Outdated
/// All pointers returned by `dynamic_shared_memory` point to the same address,
/// so alias the same memory.
Does this mean that all dynamic_shared_memory::<T> for the same T return the same pointer, or do even dynamic_shared_memory::<T> and dynamic_shared_memory::<U> point to the same memory?
All of them alias, independent of the type.
It’s probably worth explaining the concept of shared memory in the comment?
Maybe this makes it clearer:
Returns a pointer to dynamic shared memory.

Shared memory is a memory region that is shared between all threads in
the same block/work-group. It is usually faster than global memory, which is
shared between all threads on a GPU.

Dynamic shared memory is in the shared memory region, though the allocated
size is specified late, when launching a gpu-kernel. The pointer returned by
`dynamic_shared_memory()` is the start of the dynamic shared memory region.
All calls to `dynamic_shared_memory` in a block/work-group, independent of the
generic type, return the same address, so alias the same memory.
The returned pointer is aligned by at least the alignment of `T`.

Other APIs

CUDA and HIP call this shared memory, shared between threads in a block.
OpenCL and SYCL call this local memory, shared between threads in a work-group.
GLSL calls this shared memory, shared between invocations in a work group.
DirectX calls this groupshared memory, shared between threads in a thread-group.
It’s probably worth explaining the concept of shared memory in the comment?
It's probably worth using a more specific term ("GPU shared memory" or so) since many people reading this will think "shared memory" refers to its more standard meaning of memory shared across different processes (often set up via mmap). It's unfortunate that GPU vendors chose to overload this term, but when working in a more general-purpose codebase you can't just assume everyone to know the conventions of the GPU community, and you can't give general-purpose terms a GPU-specific meaning without risking confusion.
Other APIs
CUDA and HIP call this shared memory, shared between threads in a block.
OpenCL and SYCL call this local memory, shared between threads in a work-group.
GLSL calls this shared memory, shared between invocations in a work group.
DirectX calls this groupshared memory, shared between threads in a thread-group.
This sort of "translation guide" is not actually useful if you are not familiar with any of these things, so I would just leave it out as it is a distraction from the actual description. Especially since it's very easy to go look up a definition of, say, OpenCL's local memory, see it referred to as "this is GLSL's shared memory", look up a definition of that and see it referred to as basically the same idea as groupshared memory in DirectX, then look up a definition of that and... you get the idea. Our definition should stand on its own.
The translation guide might be useful to people that are familiar with these things and wondering why we are making up our own terms.
Then I will merely continue to insist that the description should make sense without such references, to prevent infinite regress.
Exceedingly fine details, of course, can be handled elsewhere by other sources, but the concepts should be usefully clear here.
For sure, we need something reasonably concise such that a rust compiler dev with zero GPU knowledge has a rough idea of what this does after reading the description. I don't think that's that hard, GPUs aren't that special, but they use a lot of "weird" (read: grown-over-time) terminology that presents a big barrier to entry.
/// The returned pointer is the start of the dynamic shared memory region.
/// All pointers returned by `dynamic_shared_memory` point to the same address,
/// so alias the same memory.
/// The returned pointer is aligned by at least the alignment of `T`.
Is there some prior discussion of the design decision to determine the alignment by giving a type parameter? It could also be a const generic parameter, for instance. I don't have an opinion on the matter since I am an outsider to the GPU world, but as a compiler team member it'd be good to know if this is something you thought about for 5 minutes or whether there's some sort of larger design by a team that has a vision of how all these things will fit together.
There is some discussion in #135516. I don’t mind either way, I thought (for 5 minutes ;)) that specifying the type of the returned pointer makes sense.
I’m not much of a GPU programmer, but I think in most cases, you would store an array in dynamic shared memory, or maybe a struct followed by a dynamically sized array (or maybe two/n arrays of different types).
For just a struct, static shared memory would make more sense, though we don’t support that yet (there’s some discussion in the tracking issue, but I think that’s more complicated to design and implement).
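To make the two shapes easier to compare, here is a sketch; these are placeholder free functions (not the actual intrinsic declarations), purely to contrast deriving the alignment from a type parameter with taking it as a const generic:

```rust
// Shape used in this PR: the alignment (and pointee type) come from `T`.
pub fn by_type_parameter<T>() -> *mut T {
    unimplemented!()
}

// Alternative mentioned above: the alignment is an explicit const generic,
// and the caller casts the returned byte pointer as needed.
pub fn by_const_generic<const ALIGN: usize>() -> *mut u8 {
    unimplemented!()
}

fn callers() {
    let _a: *mut [f32; 256] = by_type_parameter::<[f32; 256]>();
    let _b: *mut u8 = by_const_generic::<16>();
}
```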
Sorry for drowning you in questions here, but extending the core language with new operations (as in, adding a new intrinsic doing things that couldn't be done before) is a big deal, and we had a bad experience in the past when this was done without wider discussion in the team to ensure that the intrinsics actually make sense in the context of Rust. Not everything that exists in the hardware can be 1:1 exposed in Rust, sometimes this requires a lot of work and sometimes it's just basically impossible. It can be a lot of work to clean these things up later, and as someone who did a bunch of that work, I'd rather not have to do it again. :)
I agree that it makes a lot of sense to have the discussion now. Thanks for taking a look and helping to design something useful!
Heh, yes, that’s something that should be mentioned in the doc comment as well. (Especially comments on how to safely use it.)
Depends on the size specified on the CPU side when launching the gpu-kernel. It may or it may not.
There are “higher-level APIs” like “do a fast matrix-matrix multiplication”, but not much in-between. I’d assume that people usually use this in its raw form.
Two general use cases are: 1) All threads in a group load a part from global memory (the RAM/VRAM) and store it in shared memory. Then all threads read from the collaboratively loaded data. 2) All threads in a group do some work and collaborate on shared memory (with atomics or so) to aggregate results. Then one of the threads stores the final result to global memory. So, shared memory is meant to be accessed collaboratively and the developer must ensure proper synchronization. It is hard to provide a safe abstraction for this and tbh, I don’t want to try 😅 (though I can see 3rd party crates doing this – at least to some extent). From Rust’s perspective, guarantees should be the same as with memory that’s shared between processes.
I agree, it would be nice to have good documentation for the intrinsics in Rust!
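As a minimal sketch of use case 2 above, assuming the `core::intrinsics::dynamic_shared_memory` intrinsic from the diff in this PR and the existing `core::arch::nvptx` intrinsics (the kernel name, its parameters, and the exact feature gates are hypothetical and only illustrate the collaborate-then-write-out pattern):

```rust
#![no_std]
#![feature(abi_ptx, core_intrinsics, stdarch_nvptx, dynamic_shared_memory)]

use core::sync::atomic::{AtomicU32, Ordering};

// Use case 2: every thread in the work-group contributes to a shared counter,
// then one thread writes the aggregated result back to global memory.
#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn count_nonzero(input: *const u32, out: *mut u32, len: usize) {
    // All calls in the work-group return the same address; the launch must
    // have allocated at least size_of::<AtomicU32>() bytes of shared memory.
    let counter = core::intrinsics::dynamic_shared_memory::<AtomicU32>();
    let t = core::arch::nvptx::_thread_idx_x() as usize;
    if t == 0 {
        (*counter).store(0, Ordering::Relaxed);
    }
    // Make the initialization visible to all threads before they start counting.
    core::arch::nvptx::_syncthreads();
    if t < len && *input.add(t) != 0 {
        (*counter).fetch_add(1, Ordering::Relaxed);
    }
    // Wait until every thread has contributed before reading the total.
    core::arch::nvptx::_syncthreads();
    if t == 0 {
        *out = (*counter).load(Ordering::Relaxed);
    }
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}
```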
Wait, there's a single static size set when launching the kernel? Why is it called "dynamic" memory? "dynamic" memory usually means memory that is allocated (and freed) dynamically at runtime. Are you saying dynamic shared memory is neither dynamic in the normal sense nor shared in the normal sense? ;)
r? @RalfJung
I won't be able to do the final approval here, I can just help with ensuring that the intrinsics are documented well enough that they can be understood without GPU expertise, and that the LLVM codegen looks vaguely reasonable. I don't know if we have anyone who actually knows what the generated LLVM IR should look like and can ensure it makes sense. r? @nikic maybe?
Force-pushed from a9f43d5 to 3390a11
This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.
I slightly changed the implementation to generate a new (unnamed) global per intrinsic call. I also added a summary and considered alternatives to the commit/PR description. @workingjubilee, @RalfJung, does this read well to you now?
I have looked at the intrinsic docs; that is now much better, thanks! I did not look at the compiler parts -- we should get someone who knows a little more about LLVM APIs and in particular LLVM-for-GPUs to take a look at that.
The LLVM bits look fine in terms of API, but I have no familiarity with how the GPU lowering works.
Our sole amdgpu maintainer seems to be the PR author, so that's no way to get a second opinion. :)
Force-pushed from 3390a11 to 61e5213
Thanks for the reviews! I incorporated all your comments.
Sorry for coming late to the party. I have some time right now to catch up on the Rust issues and will take a look at this PR and test it out using the nvptx backend.

I see there's already a bunch of comments related to the naming. This feature is highly important and wanted for the nvptx backend, and I certainly don't want to rip up these things close to the finish line, but I still feel the need to quickly comment on it.

To make it complete, I can start by saying that my intuition would be to align closer to the more open APIs and call it "workgroup memory" instead of "groupshared memory". On the other hand, I also see that "groupshared memory" is a very descriptive and concise name, while being less burdened by other specs, so I agree that it's arguably even better. Most importantly, I can see from the previous discussion that "workgroup" was considered and "groupshared" was deliberately selected for these reasons, and the whole discussion of this seemed productive.

When it comes to "dynamic", I see the discussion basically rounds off with a "we can probably live with dynamic" sentiment. It seems intuitive to me that anchoring the naming to the property that this memory is specific to a launch/dispatch (like @kulst basically suggests) is strictly better, and I am therefore wondering if there's any downside to calling it something along those lines. When all that is said, I also think that we can live well with the current name.
I was going to suggest merging this and addressing any later comments afterwards, but here we go, thanks for taking a look!

Getting this name right is somewhat important, as it also affects intrinsic names (do we want …). I noticed that this PR currently doesn't use work-group vs group consistently; I suppose we should decide between the two. To make the name not too long and unwieldy, maybe we can drop the …

For reference, the GPU hierarchy from largest to tiniest unit is …
That sounds like this function will launch a workgroup memory. Not that I know what that could mean, but "launch" very much feels like a verb here.

In a full sentence (and as someone who has not used any of these GPU APIs) I would describe this as "per-launch memory" or maybe "per-kernel-instance memory". Though it's only the size that is per-launch, right? The contents are per-workgroup. "Per-launch-sized workgroup memory". What a mouthful. That doesn't really have a parallel outside the GPU world so we may just have to make up a new adjective. ;)

I presume the use of some adjective here is important, i.e. there are other forms of per-workgroup memory that this needs to be distinguished from?
I'm completely fine with either "launch" or "dispatch". They are both descriptive and concise. When choosing the names, it will be good to be explicit whether we're defining the concepts for kernel or shader execution or both. Using our best "guess" as a placeholder and completely deciding on a name later is probably what this will be in practice anyway. Same as the above with "groupshared_mem" vs "group_mem" vs "workgroup_mem". They're all sufficiently descriptive.
If I get a vote then +1 for workgroup. On the other hand, all of these are fine so I don't particularly mind either way.
There's the per-workgroup memory that is defined by the "gpu code" rather than the "cpu code" (really: launch site) definition. In CUDA this is called "static shared memory". I don't believe it makes sense to have intrinsics to fetch this pointer, so it might not be that important to make the distinction here.
You cannot re-use the memory content between launches. You also do not need to manually manage issues with overlapping launches. So if I interpret your question literally, then the answer is that the actual memory content is also (per-workgroup-)per-launch. But I think it's important that this info is semantically contained in the …
I fully agree; even with the knowledge of what this function does, it sounds wrong. I tried to "sneak in" a "sized" part. I think adding something like that can make it sufficiently express that it's "defined" or "sized" at launch. I think … Also, if we agree that this "static" kind of memory is not relevant for an intrinsic, we can simply get away with …
Oh yes, I was blind to the launch as a verb 😄
Side-note for everyone participating here: As some naming discussion already happened here, I created a Discourse topic to discuss GPU naming more generally, also in light of naming intrinsics: https://internals.rust-lang.org/t/naming-gpu-things-in-the-rust-compiler-and-standard-library/23833
I think … As it's more important to get rid of a misleading name than to pick the absolute best between two fine names, I don't have any strong opinions. I think "launch_sized" is my slight favorite, since I believe the most important property of this memory is that the size can differ between launches.
I ported the most basic CUDA dynamic shared memory example to Rust today and discovered an issue. The Rust function looks like this:

```rust
#![no_std]
#![feature(
    abi_ptx,
    core_intrinsics,
    stdarch_nvptx,
    intrinsics,
    gpu_dynamic_groupshared_mem
)]

use core::intrinsics::gpu_dynamic_groupshared_mem;

#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn dynamic_reverse(d: *mut i32, len: usize) {
    let s = gpu_dynamic_groupshared_mem::<i32>();
    let t = core::arch::nvptx::_thread_idx_x() as usize;
    let tr = len - t - 1;
    s.add(t).write(d.add(t).read());
    core::arch::nvptx::_syncthreads();
    d.add(t).write(s.add(tr).read());
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}
```

while the ptx looks like this … In case you want to reproduce it, I built it using the command … (where …). To see the actual error it can be convenient to run the ptx in …, which indicates it's the name of the extern binding that contains a dot (…). It all reminds me a lot of the problems in #99248, but it might be easier to fix, as we might have the necessary control of the name from Rust?
That sounds more like an LLVM problem to me... if Rust has to name its symbols in a way that all the (more-or-less silly) restrictions of each backend are honored, we'll be restricted to a-zA-Z0-9 and make debugging and reading LLVM IR harder for everyone.
Good catch @kjetilkjeka. I had the same issue testing a previous version of the intrinsic. The fix was what you also suggest: giving the global a fixed but NVPTX-conforming name.
I agree that it's an LLVM problem. Similar things have been fixed before, like the debug symbols in this PR. Before doing any actual exploration into LLVM, it seems like the right solution is catching that the … and, based on that, running the same logic. I can look into that and either make a PR or at least report my findings.

For this PR we probably don't want to leave it hanging on an LLVM upstream fix. So we need a decision on whether we want to: …

I managed to get it working by extending the cases where this "valid ptx identifier" function is called to also cover … This is basically the new conditional: …
Force-pushed from 61e5213 to c9c04f0
I pushed the rename to gpu_launch_sized_workgroup_mem. Sorry for breaking nvptx names again. I tried to add a test for this, but this only fails in ptxas, so I didn't manage to.
I made the LLVM PR: llvm/llvm-project#173018. I also tested your branch with the fix and that works great! The ptx is arguably nicer, as the name of the …

I also think the name in the current revision is a nice improvement, but in case the reviewers don't like it, the other alternatives are fine as well. All in all, this PR looks pretty great from my point of view. I wouldn't expect @RDambrosio016 to take a look, so there's at least no reason to block it on nvptx-related things.
Thanks for testing and for opening the LLVM PR. (You probably need a lit test for people to approve it.)
Just for context on why I removed the name and why I think it should eventually be removed for nvptx as well: with the named global, the maximum alignment is enforced for all kernels in the IR module, which is unnecessarily conservative. With the unnamed global, different kernels in the same module can end up with different minimum alignments, depending on the calls they make. IMO the latter behavior is what we want.
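To illustrate why one unnamed global per call is less conservative, here is a sketch; the kernels and the aligned type are made up, and only the renamed intrinsic from this PR is assumed (its own unstable feature gate is omitted):

```rust
#![no_std]
#![feature(abi_ptx, core_intrinsics)]

use core::intrinsics::gpu_launch_sized_workgroup_mem;

#[repr(align(16))]
pub struct Tile([f32; 4]);

// Only requires 4-byte alignment for its workgroup memory.
#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn kernel_a() {
    let p = gpu_launch_sized_workgroup_mem::<u32>();
    p.write(42);
}

// Requires 16-byte alignment. With one unnamed global per call, this does not
// leak into kernel_a; with a single named global, it would raise the minimum
// alignment for every kernel in the module.
#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn kernel_b() {
    let p = gpu_launch_sized_workgroup_mem::<Tile>();
    p.write(Tile([0.0; 4]));
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}
```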
Workgroup memory is a memory region that is shared between all
threads in a workgroup on GPUs. Workgroup memory can be allocated
statically or after compilation, when launching a gpu-kernel.
The intrinsic added here returns the pointer to the memory that is
allocated at launch-time.

Interface

With this change, workgroup memory can be accessed in Rust by
calling the new `gpu_launch_sized_workgroup_mem<T>() -> *mut T` intrinsic.
It returns the pointer to workgroup memory, guaranteeing that it is
aligned to at least the alignment of `T`. The pointer is dereferenceable
for the size specified when launching the current gpu-kernel (which may
be the size of `T` but can also be larger or smaller or zero).
All calls to this intrinsic return a pointer to the same address.
See the intrinsic documentation for more details.

Alternative Interfaces

It was also considered to expose dynamic workgroup memory as extern
static variables in Rust, like they are represented in LLVM IR.
However, due to the pointer not being guaranteed to be dereferenceable
(that depends on the allocated size at runtime), such a global must be
zero-sized, which makes global variables a bad fit.

Implementation Details

Workgroup memory in amdgpu and nvptx lives in address space 3.
Workgroup memory from a launch is implemented by creating an
external global variable in address space 3. The global is declared with
size 0, as the actual size is only known at runtime. It is defined
behavior in LLVM to access an external global outside the defined size.
There is no similar way to get the allocated size of launch-sized
workgroup memory on amdgpu and nvptx, so users have to pass this
out-of-band or rely on target-specific ways for now.
Tracking issue: #135516
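Since the allocated size is not queryable on amdgpu and nvptx, a kernel has to learn it out-of-band. A sketch of one such convention (the kernel and parameter names are made up; the intrinsic is the one introduced by this PR, the nvptx intrinsics come from core::arch::nvptx, and crate attributes and the panic handler are as in the earlier example and omitted here):

```rust
use core::intrinsics::gpu_launch_sized_workgroup_mem;

// The host passes the number of f32 elements it allocated as workgroup memory
// when launching the kernel; the kernel uses it to bound its accesses.
#[unsafe(no_mangle)]
pub unsafe extern "ptx-kernel" fn scale_in_place(data: *mut f32, workgroup_elems: usize) {
    let buf = gpu_launch_sized_workgroup_mem::<f32>();
    let t = core::arch::nvptx::_thread_idx_x() as usize;
    if t < workgroup_elems {
        // Stage the element in workgroup memory...
        buf.add(t).write(data.add(t).read());
    }
    core::arch::nvptx::_syncthreads();
    if t < workgroup_elems {
        // ...and write back a scaled copy.
        data.add(t).write(buf.add(t).read() * 2.0);
    }
}
```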