feat(runtime_profile): :emily_fast profile applies FastKernels rewrites #2
Draft
ausimian wants to merge 1 commit into
Adds an explicit :emily_fast runtime profile that mirrors :emily's
Apple-shaped flags and Emily-empirical margin floors but additionally
runs `Emily.Bumblebee.FastKernels.apply/1` on the loaded Bumblebee
model at `Coordinator.load/1` time. The rewrite swaps RMSNorm /
LayerNorm / RoPE / SDPA Axon layers for `Emily.Fast.*` helpers that
dispatch to fused `mx::fast::*` kernels under Emily.Backend (and fall
through to composed-defn equivalents on any other backend, so the
rewritten model stays evaluable on Nx.BinaryBackend / EXLA for
conformance).
Validation pass (Apple Silicon, Emily 0.4.0, against the existing
adapted bundle, `--determinism-runs 2`):
* 37/37 decision-stable agreement with the CUDA snapshot
(agent_id / role_id / token_count / transcript_hash).
* In-process determinism stable across both runs.
* Margin floors: 37/37 pass with the seeded 0.33 / 0.82 inherited
from :emily (no overrides required). The escalate_to_human role
margin is bitwise identical between bare :emily and :emily_fast
(1.0291), so the same floor absorbs the near-miss in both lanes.
* Wall-clock ~10.8s on 37 cases × 2 determinism runs, vs ~12.4s
bare :emily: ~13% less wall-clock time (~1.15x throughput) on this
single-forward-pass workload. Larger relative win expected on
generative workloads.
What changed:
* `lib/trinity_coordinator/runtime_profile.ex`: new `resolve(:emily_fast)`
clause; `:emily_fast` added to `builtin_names/0` and the
unknown-profile error message; moduledoc updated.
* `lib/trinity_coordinator/sakana/coordinator.ex`: public-with-@doc-false
`maybe_apply_fast_kernels/2` keyed off `profile.name == :emily_fast`,
dispatching via `Module.concat` + `apply/3` so this module compiles
cleanly without the optional :emily dep.
* `guides/runtime_profiles.md`: new `### :emily_fast` section, bare-vs-fast
comparison table, choosing-a-profile entry.
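
As a rough illustration of the first item, a `resolve(:emily_fast)` clause that inherits from `:emily` might look like the sketch below. The module name, the map shape, the field names, the tuple used for the floors, and the shape of the unknown-profile error are all assumptions made for this example; the real `runtime_profile.ex` is not shown here and may use a struct and more built-in names.

```elixir
defmodule RuntimeProfileSketch do
  @moduledoc false

  # Hypothetical subset of built-in names; the real list is likely longer.
  @builtin_names [:emily, :emily_fast]

  def builtin_names, do: @builtin_names

  # Hypothetical profile shape. Emily.Backend is only used as an atom here,
  # so building this tuple does not require the Emily dep to be installed.
  def resolve(:emily) do
    {:ok,
     %{
       name: :emily,
       backend: {Emily.Backend, []},
       margin_floors: {0.33, 0.82},
       notes: "bare Emily lane, no model rewrite"
     }}
  end

  # :emily_fast reuses everything from :emily and overrides only the name
  # and the notes, so the flags and floors cannot drift apart.
  def resolve(:emily_fast) do
    {:ok, base} = resolve(:emily)
    {:ok, %{base | name: :emily_fast, notes: "Emily lane plus FastKernels rewrite at load time"}}
  end

  # Assumed error shape for unknown profiles, listing the built-in names.
  def resolve(other), do: {:error, {:unknown_profile, other, builtin_names()}}
end
```

Deriving the fast profile from the bare one keeps the "same floors, same backend" property true by construction rather than by copy-paste.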
Tests (mirror the existing :emily TDD pattern):
* :emily_fast resolves with the right Apple-shaped flags + Emily backend
* :emily_fast inherits :emily's empirical margin floors (0.33 / 0.82)
* :emily_fast accepts Emily.Backend labels via accepts_backend_label?/2
* :emily_fast put_default_backend!/1 raises the same Emily-naming error
as :emily when Emily.Backend is not loaded
* maybe_apply_fast_kernels/2: non-emily_fast profiles pass through;
:emily_fast returns {:error, {:emily_fast_kernels_unavailable, _}}
on hosts without the Emily dep; :emily_fast invokes the rewriter
on hosts with Emily loaded (no-op pass-through on the test's
minimal Axon graph since it contains none of the patterns the
rewriter recognises).
mix.exs is intentionally unchanged — Emily stays an optional dep that
hosts add to their parent application's mix.exs, same convention as
:emlx and :emily.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What this is
Adds an explicit `:emily_fast` runtime profile that mirrors `:emily`'s Apple-shaped flags and Emily-empirical margin floors but additionally runs `Emily.Bumblebee.FastKernels.apply/1` on the loaded Bumblebee model at `Coordinator.load/1` time. The rewrite swaps RMSNorm / LayerNorm / RoPE / SDPA Axon layers for `Emily.Fast.*` helpers that dispatch to fused `mx::fast::*` kernels under `Emily.Backend`. On any other backend the rewritten helpers fall through to composed-defn equivalents, so the rewritten model stays evaluable on `Nx.BinaryBackend` / EXLA for conformance.

Same optional-dep convention as `:emily`: `trinity_coordinator`'s own `mix.exs` is intentionally unchanged. Consumers add `{:emily, "~> 0.4", only: [:dev, :test]}` to their parent app.

Validation result
Apple Silicon, Emily 0.4.0, against the existing adapted bundle exported on the same host, `--determinism-runs 2`, default floors (no margin overrides):

[Table comparing the `:emily` and `:emily_fast` lanes. Rows: decision-stable fields (`agent_id` / `role_id` / `token_count` / `transcript_hash`), `route_hash` drift vs CUDA, `route_hash` drift fast-vs-bare, determinism (`--determinism-runs 2`), `escalate_to_human` role_margin.]

The `escalate_to_human` role_margin being bitwise identical between the two lanes is the key signal that the same per-profile floor absorbs both: the rewrite moves logits on 25 of the other 36 cases but happens to leave the borderline case untouched. Larger relative wall-clock wins are expected on generative workloads where attention / RoPE / RMSNorm dominate per-token cost; this eval is a single forward pass and is the worst case for showcasing fusion.

What's in the diff
6 files changed, 297 insertions(+), 2 deletions(-). No `mix.exs` change.

* `lib/trinity_coordinator/runtime_profile.ex`: new `resolve(:emily_fast)` clause inheriting from `:emily`'s shape with a distinct name and FastKernels-aware notes; `:emily_fast` added to `builtin_names/0` and to the unknown-profile error message; moduledoc updated.
* `lib/trinity_coordinator/sakana/coordinator.ex`: public-with-`@doc false` `maybe_apply_fast_kernels/2` keyed off `profile.name == :emily_fast`. Reference to `Emily.Bumblebee.FastKernels` is dynamic (`Module.concat` + `apply/3`) so this module compiles cleanly on hosts without the optional `:emily` dep; the missing-dep branch returns `{:error, {:emily_fast_kernels_unavailable, msg}}` with operator-facing guidance pointing at the setup guide.
* `guides/runtime_profiles.md`: new `### :emily_fast` section after the existing `### :emily` section, bare-vs-fast comparison table, choosing-a-profile entry.

Tests (mirror the existing `:emily` TDD pattern):

* `test/trinity_coordinator/runtime_profile_test.exs`: `:emily_fast` resolves with the right Apple-shaped flags and `Emily.Backend` tuple; inherits Emily margin floors (0.33 / 0.82); accepts `Emily.Backend` labels; `put_default_backend!/1` raises the same Emily-naming error as `:emily` when `Emily.Backend` is not loaded.
* `test/trinity_coordinator/runtime_profile_margins_test.exs`: `:emily_fast` inherits 0.33 / 0.82 floors.
* `test/trinity_coordinator/sakana/coordinator_fast_kernels_test.exs` (new): all three routing branches of `maybe_apply_fast_kernels/2`: non-`:emily_fast` profiles pass through unchanged; `:emily_fast` returns the documented error tuple on hosts without the Emily dep; `:emily_fast` invokes the rewriter on hosts with Emily loaded (gated on `Code.ensure_loaded?` so the happy-path test runs on Apple Silicon and is a no-op elsewhere).

All 40 tests in those three files pass locally on this Apple Silicon host with Emily loaded.
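
The three routing branches described above can be sketched roughly as follows. This is a minimal illustration, not the code in `coordinator.ex`: the module name, the `{:ok, _}` success shape, the error message text, and the assumption that the rewrite is applied to the `:model` key of the loaded model map are all guesses made for the example. Only `Module.concat/1`, `Code.ensure_loaded?/1`, `function_exported?/3` and `apply/3` are standard Elixir calls.

```elixir
defmodule FastKernelsRoutingSketch do
  @moduledoc false

  # Built at runtime so the compiler never sees a hard reference to the
  # optional module and emits no "module not available" warning.
  defp rewriter, do: Module.concat([Emily, Bumblebee, FastKernels])

  # Branches 1 and 2: the :emily_fast profile, with or without Emily loaded.
  def maybe_apply_fast_kernels(model_info, %{name: :emily_fast}) do
    mod = rewriter()

    if Code.ensure_loaded?(mod) and function_exported?(mod, :apply, 1) do
      # Assumed success shape: rewrite only the Axon graph under :model.
      {:ok, %{model_info | model: apply(mod, :apply, [model_info.model])}}
    else
      # Hypothetical message text; the error tag matches the PR description.
      {:error,
       {:emily_fast_kernels_unavailable,
        "Emily.Bumblebee.FastKernels is not loaded; add :emily to the parent application's deps"}}
    end
  end

  # Branch 3: every other profile passes through untouched.
  def maybe_apply_fast_kernels(model_info, _profile), do: {:ok, model_info}
end
```

Because the routing only pattern-matches on the profile name, a sentinel such as `%{model: :sentinel}` is enough to exercise the pass-through and missing-dep branches on a host without Emily installed.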
Design notes
* `:emily`? Two reasons. First, it keeps the bare `:emily` lane available as the no-rewrite reference baseline, which is useful if anyone needs to isolate fused-vs-composed numerical differences from other factors. Second, it matches the existing `:emlx` / `:emily` split: both are Apple-shaped, both differ at one well-defined layer of the stack (here it's the Bumblebee model rewrite layer; there it was the Nx backend module). Operators choose the lane that matches the contract they want.
* `@doc false` public, not `defp`? So each of the three routing branches can be unit-tested without staging a full Bumblebee model. The rewriter's contract is "takes an Axon graph, returns an Axon graph," so a sentinel map with a `:model` key is enough to assert the routing logic. Same pattern the `Runtime.BackendLabel` extraction in 21c3088 used for testability.
* `mix.exs` change? Same convention you established for `:emily` and `:emlx`: the Emily dep lives in the consumer's parent application, not in `trinity_coordinator` itself. `Emily.Bumblebee.FastKernels` ships inside the Emily package and is only defined when both `:axon` and `:bumblebee` are loaded; trinity already brings those in, so the rewriter is available as soon as the consumer adds Emily.

Test plan

* `:emily` (in which case the diff collapses to roughly the `coordinator.ex` hunk plus a renamed clause).
* `mix test test/trinity_coordinator/runtime_profile_test.exs test/trinity_coordinator/runtime_profile_margins_test.exs test/trinity_coordinator/sakana/coordinator_fast_kernels_test.exs` on CUDA to confirm the missing-Emily branch is exercised exactly as documented.
* `:emily_fast` (or whatever you call it): feel free to lift the bare-vs-fast numbers and the structural-not-artefact framing of the `escalate_to_human` near-miss into your own commit message.

🤖 Generated with Claude Code