Metal 4 backend for Apple Silicon #532
Open
valtterivalo wants to merge 2 commits into PufferAI:4.0 from
alright, we've talked about this work a couple of times on Discord and elsewhere. while this PR isn't a likely candidate to merge and maintain in upstream puffer, it's still something that might interest some people and let more of them enjoy puffer-style opinionated RL on hardware that isn't great for it.
the problem here is that CUDA gives you a lot for free from the ecosystem, so this backend will by default be at least ~2x more LOC (i haven't code golfed it, so i don't know exactly) for something you can sidestep by telling new devs to simply get better hardware or rent it. performance-wise, it obviously gets blown out of the water by something like a basic 4090 machine, even if you're running top-end apple silicon like an M3 Ultra or M5 Max. i haven't done apples-to-apples perf benchmarks since that's neither really possible nor a good use of my time.
what i can say for sure, though, is that this is a LOT better an alternative than MPS. on MPS you might as well be running on CPU for how slow and poorly optimized it is for this kind of workload. the silly benchmark i did: solving breakout at 3.2M SPS on this build, while puffer 3.0's MPS integration learns it at around 600k. i call it silly because there are many paradigm shifts between 3.0 and 4.0 beyond this backend; it's not an ablated result, and if someone wants to be rigorous about it, go ahead.
in a nutshell, what this aimed to solve for me (and will solve for others interested) is being able to build, train, and eval the upstream 4.0 stack on a Metal 4 Mac, then move the same code and checkpoints to CUDA for the actual compute needs.
with that rambling out of the way, a few apple-specific knobs and instructions for people who actually have a MacBook and want to iterate faster locally with this backend: cpu_inference, overlap, and train_fp16. i'll talk briefly about why each is a thing in the first place.
cpu_inference means we do the rollout forward pass on the CPU instead of the GPU. here, the CPU is actually strong enough to make this plausible with accelerate/AMX. CPU and GPU also see the same unified memory, so the CPU can read obs/state and write actions/logprobs directly into the shared buffers; there's none of the usual tax from copying tensors from GPU to CPU and back over PCIe. in my testing this can often help with rollout stalls that would otherwise force GPU syncs, and the GPU gets to spend more time training while the CPU works too. my current impl is what i think is the cleanest way to get real CPU/GPU parallelism on one chip.
however, it's not always a win: small models just suffer from moving inference to CPU because rollouts get slower. the CPU and GPU paths are also not numerically identical (this could be worked on further), so training recomputes the old logprobs at the start of each training phase.
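to make the shared-buffer idea concrete, here's a minimal numpy sketch (buffer and function names are made up for illustration; on unified memory the same allocations would be visible to the GPU, and the real policy is a network, not a matmul):

```python
import numpy as np

# hypothetical sketch: rollout inference on CPU writing into shared buffers.
# on unified memory, obs_buf/act_buf/logp_buf would be the same allocations
# the GPU trains from; here they're plain numpy arrays standing in for that.
rng = np.random.default_rng(0)
num_envs, obs_dim, num_actions = 8, 4, 3
W = rng.normal(size=(obs_dim, num_actions)).astype(np.float32)  # stand-in policy

obs_buf = rng.normal(size=(num_envs, obs_dim)).astype(np.float32)
act_buf = np.empty(num_envs, dtype=np.int64)
logp_buf = np.empty(num_envs, dtype=np.float32)

def cpu_inference(obs, W):
    logits = obs @ W                        # forward pass on CPU (accelerate/AMX territory)
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    acts = probs.argmax(axis=1)             # greedy here just to keep the sketch deterministic
    logp = np.log(probs[np.arange(len(acts)), acts])
    return acts, logp.astype(np.float32)

# CPU writes actions/logprobs directly into the shared buffers; no device copy
act_buf[:], logp_buf[:] = cpu_inference(obs_buf, W)

# at train start, the training path recomputes old logprobs from the same obs,
# since the CPU and GPU kernels are not numerically identical
_, logp_recomputed = cpu_inference(obs_buf, W)
```

the recompute costs one extra forward pass per batch, but it means the PPO ratio is computed against logprobs from the same numerical path as the new ones.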
overlap means we pipeline rollout and training instead of running them strictly one after the other. Metal 4 gives you a cleaner queue/command-buffer model for this kind of pipelining, and unified memory makes shared allocator state and double-buffered weights much simpler than on a split host/device setup. env stepping is often CPU-heavy, so if there's slack to hide GPU training behind it, that's good: total wall-clock can improve even if the individual rollout/train phases don't.
this also isn't a free lunch: if the rollout uses the GPU, then rollout and training contend for the same GPU anyway, and there's a one-iteration policy lag. the premise is perverse in that both cpu_inference and overlap benefit most from heavier workloads that you probably wouldn't want running on your laptop. but you can. and that's the point.
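a toy sketch of the pipelining (names are hypothetical, and plain python threads stand in for Metal queues/command buffers): rollout for iteration i+1 runs while training consumes iteration i, which is exactly where the one-iteration policy lag comes from.

```python
import threading
import queue

# stand-ins for env stepping + inference, and for the GPU training step
def rollout(i):
    return f"batch_{i}"

def train(batch):
    return f"trained_on_{batch}"

results = []
batches = queue.Queue(maxsize=1)     # double buffer: at most one batch in flight

def rollout_worker(n_iters):
    for i in range(n_iters):
        batches.put(rollout(i))      # blocks if training hasn't consumed the last batch
    batches.put(None)                # sentinel: rollout done

t = threading.Thread(target=rollout_worker, args=(3,))
t.start()
while (batch := batches.get()) is not None:
    results.append(train(batch))     # trains on iter i while iter i+1 rolls out
t.join()
# results == ['trained_on_batch_0', 'trained_on_batch_1', 'trained_on_batch_2']
```

the `maxsize=1` queue is the whole trick: rollout can run at most one iteration ahead, which bounds both memory and policy staleness.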
train_fp16 keeps a separate fp16 training copy of the weights and runs the training path in fp16 (activations/grads where possible) while keeping the optimizer weights in fp32. memory bandwidth on these apple machines is a big problem in my experience because of unified memory, so this aims to alleviate it; it also drops the total memory footprint, which is shared by GPU and CPU. this has been pretty finicky in my own work, and it's easy to NaN the shit out of everything, so further work would be required to make it robust.
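roughly the fp16-copy-plus-fp32-master pattern, as a hedged numpy sketch (the loss scale, names, and toy gradient are illustrative, not the PR's actual code):

```python
import numpy as np

# hypothetical sketch of the train_fp16 knob: compute in fp16 against an fp16
# weight copy, keep fp32 master weights for the optimizer update, and use a
# loss scale so small grads don't flush to zero in fp16.
w_fp32 = np.array([0.5, 0.25], dtype=np.float32)    # master/optimizer weights
loss_scale = 1024.0

def step(w_fp32, x, grad_fn, lr=0.01):
    w_fp16 = w_fp32.astype(np.float16)              # training copy: half the bandwidth
    g_fp16 = grad_fn(w_fp16, x) * np.float16(loss_scale)
    if not np.isfinite(g_fp16).all():               # NaN/inf guard: skip this step
        return w_fp32
    g = g_fp16.astype(np.float32) / loss_scale      # unscale in fp32
    return w_fp32 - lr * g                          # update stays in fp32

grad_fn = lambda w, x: 2 * (w @ x) * x              # grad of (w.x)^2 w.r.t. w
x = np.array([1.0, 2.0], dtype=np.float16)
w_fp32 = step(w_fp32, x, grad_fn)
```

the finicky part the text mentions lives in the guard and the scale: too low a scale underflows grads, too high overflows them, and a fixed scale like this is exactly what dynamic loss scaling exists to replace.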
because of all those tradeoffs, the three knobs are off in the default config, but i'd recommend sweeping them: at least for me they've all helped in my own envs, just not in breakout or g2048, which i've smoke tested.
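if you do sweep them, it's just the 2^3 grid over the three booleans (the flag names mirror the knobs above, but the config shape here is illustrative, not puffer's actual API):

```python
from itertools import product

# hypothetical sweep: every combination of the three backend knobs
knobs = ["cpu_inference", "overlap", "train_fp16"]
configs = [dict(zip(knobs, vals)) for vals in product([False, True], repeat=3)]
# 8 combos; the all-False entry is the default described above
```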
so ultimately what this PR is:
what it (probably) isn't: