Metal 4 backend for Apple Silicon #532
Open
valtterivalo wants to merge 2 commits into PufferAI:4.0 from
alright, we've talked about this work a couple of times on Discord and elsewhere. while this PR isn't a likely candidate to merge and maintain in upstream puffer, it's still something that might interest some people and let more of them enjoy puffer-style opinionated RL on hardware that isn't great for it.
the problem here is that CUDA gives you a lot for free from the ecosystem, so this backend will by default be at least ~2x more LOC (i haven't code golfed it, so i don't know exactly) for something you can sidestep by telling new devs to simply get better hardware or rent it. performance-wise, it obviously gets blown out of the water by something like a basic 4090 machine, even if you're running top-end apple silicon like an M3 Ultra or M5 Max. i haven't done apples-to-apples perf benchmarks since that's neither really possible nor a good use of my time.
what i can say for sure, though, is that this is a LOT better an alternative than MPS. on MPS you might as well be running on CPU for how slow and poorly optimized it is for this kind of workload. the silly benchmark i did: solving breakout at 3.2M SPS on this build, while puffer 3.0's MPS integration learns it at around 600k. i call it silly because there are many paradigm shifts between 3.0 and 4.0 beyond this backend; it's not an ablated result, and if someone wants to be rigorous about it, go ahead.
in a nutshell, what this aimed to solve for me (and will solve for others interested) is being able to build, train, and eval the upstream 4.0 stack on a Metal 4 Mac, then move the same code and checkpoints to CUDA for the actual compute needs.
with that rambling out of the way, a few apple-specific knobs and instructions for people who actually have a MacBook and want to iterate faster locally with this backend: cpu_inference, overlap, and train_fp16. i'll talk briefly about why each is a thing in the first place.
cpu_inference means we do the rollout forward pass on the CPU instead of the GPU. here, the CPU is actually strong enough to make this plausible with accelerate/AMX. CPU and GPU also see the same unified memory, so the CPU can read obs/state and write actions/logprobs directly into the shared buffers; there's none of the usual tax from copying tensors from GPU to CPU and back over PCIe. in my testing this can often help with rollout stalls that would otherwise force GPU syncs, and the GPU gets to spend more time training while the CPU works too. my current impl is what i think is the cleanest way to get real CPU/GPU parallelism on one chip.
however, it's not always a win: small models just suffer from moving inference to CPU because rollouts get slower. the CPU and GPU paths are also not numerically identical (this could be worked on further), so training recomputes the old logprobs at the start of each training phase.
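to make the shared-buffer idea concrete, here's a minimal numpy sketch (buffer and function names are made up for illustration; on unified memory the same allocations would be visible to the GPU, and the real policy is a network, not a matmul):

```python
import numpy as np

# hypothetical sketch: rollout inference on CPU writing into shared buffers.
# on unified memory, obs_buf/act_buf/logp_buf would be the same allocations
# the GPU trains from; here they're plain numpy arrays standing in for that.
rng = np.random.default_rng(0)
num_envs, obs_dim, num_actions = 8, 4, 3
W = rng.normal(size=(obs_dim, num_actions)).astype(np.float32)  # stand-in policy

obs_buf = rng.normal(size=(num_envs, obs_dim)).astype(np.float32)
act_buf = np.empty(num_envs, dtype=np.int64)
logp_buf = np.empty(num_envs, dtype=np.float32)

def cpu_inference(obs, W):
    logits = obs @ W                        # forward pass on CPU (accelerate/AMX territory)
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    acts = probs.argmax(axis=1)             # greedy here just to keep the sketch deterministic
    logp = np.log(probs[np.arange(len(acts)), acts])
    return acts, logp.astype(np.float32)

# CPU writes actions/logprobs directly into the shared buffers; no device copy
act_buf[:], logp_buf[:] = cpu_inference(obs_buf, W)

# at train start, the training path recomputes old logprobs from the same obs,
# since the CPU and GPU kernels are not numerically identical
_, logp_recomputed = cpu_inference(obs_buf, W)
```

the recompute costs one extra forward pass per batch, but it means the PPO ratio is computed against logprobs from the same numerical path as the new ones.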
overlap means we pipeline rollout and training instead of running them strictly one after the other. Metal 4 gives you a cleaner queue/command-buffer model for this kind of pipelining, and unified memory makes shared allocator state and double-buffered weights much simpler than on a split host/device setup. env stepping is often CPU-heavy, so if there's slack to hide GPU training behind it, that's good: total wall-clock can improve even if the individual rollout/train phases don't.
this also isn't a free lunch: if the rollout uses the GPU, then rollout and training contend for the same GPU anyway, and there's a one-iteration policy lag. the premise is perverse in that both cpu_inference and overlap benefit most from heavier workloads that you probably wouldn't want running on your laptop. but you can. and that's the point.
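a toy sketch of the pipelining (names are hypothetical, and plain python threads stand in for Metal queues/command buffers): rollout for iteration i+1 runs while training consumes iteration i, which is exactly where the one-iteration policy lag comes from.

```python
import threading
import queue

# stand-ins for env stepping + inference, and for the GPU training step
def rollout(i):
    return f"batch_{i}"

def train(batch):
    return f"trained_on_{batch}"

results = []
batches = queue.Queue(maxsize=1)     # double buffer: at most one batch in flight

def rollout_worker(n_iters):
    for i in range(n_iters):
        batches.put(rollout(i))      # blocks if training hasn't consumed the last batch
    batches.put(None)                # sentinel: rollout done

t = threading.Thread(target=rollout_worker, args=(3,))
t.start()
while (batch := batches.get()) is not None:
    results.append(train(batch))     # trains on iter i while iter i+1 rolls out
t.join()
# results == ['trained_on_batch_0', 'trained_on_batch_1', 'trained_on_batch_2']
```

the `maxsize=1` queue is the whole trick: rollout can run at most one iteration ahead, which bounds both memory and policy staleness.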
train_fp16 keeps a separate fp16 training copy of the weights and runs the training path in fp16 (activations/grads where possible) while keeping the optimizer weights in fp32. memory bandwidth on these apple machines is a big problem in my experience because of unified memory, so this aims to alleviate it; it also drops the total memory footprint, which is shared by GPU and CPU. this has been pretty finicky in my own work, and it's easy to NaN the shit out of everything, so further work would be required to make it robust.
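roughly the fp16-copy-plus-fp32-master pattern, as a hedged numpy sketch (the loss scale, names, and toy gradient are illustrative, not the PR's actual code):

```python
import numpy as np

# hypothetical sketch of the train_fp16 knob: compute in fp16 against an fp16
# weight copy, keep fp32 master weights for the optimizer update, and use a
# loss scale so small grads don't flush to zero in fp16.
w_fp32 = np.array([0.5, 0.25], dtype=np.float32)    # master/optimizer weights
loss_scale = 1024.0

def step(w_fp32, x, grad_fn, lr=0.01):
    w_fp16 = w_fp32.astype(np.float16)              # training copy: half the bandwidth
    g_fp16 = grad_fn(w_fp16, x) * np.float16(loss_scale)
    if not np.isfinite(g_fp16).all():               # NaN/inf guard: skip this step
        return w_fp32
    g = g_fp16.astype(np.float32) / loss_scale      # unscale in fp32
    return w_fp32 - lr * g                          # update stays in fp32

grad_fn = lambda w, x: 2 * (w @ x) * x              # grad of (w.x)^2 w.r.t. w
x = np.array([1.0, 2.0], dtype=np.float16)
w_fp32 = step(w_fp32, x, grad_fn)
```

the finicky part the text mentions lives in the guard and the scale: too low a scale underflows grads, too high overflows them, and a fixed scale like this is exactly what dynamic loss scaling exists to replace.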
because of all those tradeoffs, the three knobs are off in the default config, but i'd recommend sweeping them: at least for me they've all helped in my own envs, just not in breakout or g2048, which i've smoke tested.
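if you do sweep them, it's just the 2^3 grid over the three booleans (the flag names mirror the knobs above, but the config shape here is illustrative, not puffer's actual API):

```python
from itertools import product

# hypothetical sweep: every combination of the three backend knobs
knobs = ["cpu_inference", "overlap", "train_fp16"]
configs = [dict(zip(knobs, vals)) for vals in product([False, True], repeat=3)]
# 8 combos; the all-False entry is the default described above
```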
so ultimately what this PR is:
what it (probably) isn't: