feat(runtime_profile): :emily_fast profile applies FastKernels rewrites by ausimian · Pull Request #2 · nshkrdotcom/trinity_coordinator

ausimian · 2026-05-22T04:21:20Z

What this is

Adds an explicit :emily_fast runtime profile that mirrors :emily's Apple-shaped flags and Emily-empirical margin floors but additionally runs Emily.Bumblebee.FastKernels.apply/1 on the loaded Bumblebee model at Coordinator.load/1 time. The rewrite swaps RMSNorm / LayerNorm / RoPE / SDPA Axon layers for Emily.Fast.* helpers that dispatch to fused mx::fast::* kernels under Emily.Backend. On any other backend the rewritten helpers fall through to composed-defn equivalents, so the rewritten model stays evaluable on Nx.BinaryBackend / EXLA for conformance.

Same optional-dep convention as :emily — trinity_coordinator's own mix.exs is intentionally unchanged. Consumers add {:emily, "~> 0.4", only: [:dev, :test]} to their parent app.
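As a rough illustration of that convention, a consumer's `mix.exs` might look like the sketch below. The app name `MyApp`, the Elixir version requirement, and the way `:trinity_coordinator` itself is sourced are assumptions for the example; only the `:emily` tuple is taken from the text above.

```elixir
# mix.exs of the consuming (parent) application, not trinity_coordinator's own.
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [app: :my_app, version: "0.1.0", elixir: "~> 1.15", deps: deps()]
  end

  defp deps do
    [
      # Hypothetical source for the coordinator; use whatever the parent app already uses.
      {:trinity_coordinator, github: "nshkrdotcom/trinity_coordinator"},
      # Optional Emily dep, exactly as quoted in the PR text.
      {:emily, "~> 0.4", only: [:dev, :test]}
    ]
  end
end
```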

Validation result

Apple Silicon, Emily 0.4.0, against the existing adapted bundle exported on the same host, --determinism-runs 2, default floors (no margin overrides):

$ mix run examples/qwen_router_prompt_eval.exs \
    --runtime-profile emily_fast \
    --artifact-dir tmp/emily_adapted_qwen3_0_6b_layer26 \
    --snapshot examples/fixtures/qwen_router_prompt_eval_logits.json \
    --determinism-runs 2

Summary
  passed: 37
  failed: 0
  roles selected: Worker=18, Thinker=4, Verifier=15
  agent slots selected: 0=17, 1=1, 4=19

PASS qwen_router_prompt_eval

| Axis | bare `:emily` | `:emily_fast` |
| --- | --- | --- |
| Decisions vs CUDA snapshot (`agent_id` / `role_id` / `token_count` / `transcript_hash`) | 37/37 match | 37/37 match |
| `route_hash` drift vs CUDA | 37/37 differ | 37/37 differ |
| `route_hash` drift fast-vs-bare | n/a | 25/37 differ |
| In-process determinism (`--determinism-runs 2`) | 37/37 stable | 37/37 stable |
| `escalate_to_human` role_margin | 1.0291 | 1.0291 (bitwise identical) |
| Margin floors (0.33 / 0.82, seeded by profile) | 37/37 pass | 37/37 pass |
| Wall-clock, 37 cases × 2 determinism runs, warm cache | ~12.4s | ~10.8s (~15% faster) |

The escalate_to_human role_margin being bitwise identical between the two lanes is the key signal that the same per-profile floor absorbs both — the rewrite moves logits on 25 of the other 36 cases but happens to leave the borderline case untouched. Larger relative wall-clock wins are expected on generative workloads where attention / RoPE / RMSNorm dominate per-token cost; this eval is a single forward pass and is the worst case for showcasing fusion.
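For readers checking the "~15% faster" figure, the two usual ways of comparing the approximate wall-clock numbers can be worked through as below. This is only arithmetic on the rounded values from the table; the module name `WallClockSketch` is made up for the example.

```elixir
defmodule WallClockSketch do
  @moduledoc false

  # Compares two wall-clock times (in seconds) in both common senses.
  def compare(bare_s, fast_s) do
    %{
      # Share of the bare run's time that the fast run saves.
      time_saved_pct: Float.round((bare_s - fast_s) / bare_s * 100, 1),
      # Throughput gain: how much more work per second the fast run does.
      speedup_pct: Float.round((bare_s / fast_s - 1) * 100, 1)
    }
  end
end

WallClockSketch.compare(12.4, 10.8)
# => %{time_saved_pct: 12.9, speedup_pct: 14.8}
```

Read as a throughput gain, the numbers give roughly 15%; read as time saved, roughly 13%.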

What's in the diff

6 files changed, 297 insertions(+), 2 deletions(-). No mix.exs change.

lib/trinity_coordinator/runtime_profile.ex — new resolve(:emily_fast) clause inheriting from :emily's shape with a distinct name and FastKernels-aware notes; :emily_fast added to builtin_names/0 and to the unknown-profile error message; moduledoc updated.
lib/trinity_coordinator/sakana/coordinator.ex — public-with-@doc false maybe_apply_fast_kernels/2 keyed off profile.name == :emily_fast. Reference to Emily.Bumblebee.FastKernels is dynamic (Module.concat + apply/3) so this module compiles cleanly on hosts without the optional :emily dep; the missing-dep branch returns {:error, {:emily_fast_kernels_unavailable, msg}} with operator-facing guidance pointing at the setup guide.
guides/runtime_profiles.md — new ### :emily_fast section after the existing ### :emily section, bare-vs-fast comparison table, choosing-a-profile entry.
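To make the `runtime_profile.ex` item concrete, here is a minimal sketch of a clause that inherits from `:emily` and changes only the name and notes. The struct fields, the note strings, the error shape, and the two-entry `builtin_names/0` are all assumptions for illustration; the real module may well be shaped differently.

```elixir
defmodule RuntimeProfileSketch do
  @moduledoc false

  # Hypothetical field names; the real struct in runtime_profile.ex may differ.
  defstruct [:name, :backend, :margin_floors, :notes]

  # The real list presumably contains more profiles than these two.
  def builtin_names, do: [:emily, :emily_fast]

  def resolve(:emily) do
    {:ok,
     %__MODULE__{
       name: :emily,
       # Emily.Backend is only an atom here; nothing is called on it.
       backend: {Emily.Backend, []},
       # The 0.33 / 0.82 floors quoted in the validation table.
       margin_floors: {0.33, 0.82},
       notes: "bare Emily lane, no model rewrite"
     }}
  end

  def resolve(:emily_fast) do
    # Reuse the :emily shape and change only what distinguishes the lane.
    {:ok, base} = resolve(:emily)
    {:ok, %{base | name: :emily_fast, notes: "Emily lane plus FastKernels rewrite"}}
  end

  def resolve(other) do
    {:error, {:unknown_profile, other, builtin_names()}}
  end
end
```

Deriving the `:emily_fast` clause from the `:emily` result keeps the flags and floors from drifting apart if the base profile is later retuned.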

Tests (mirror the existing :emily TDD pattern):

test/trinity_coordinator/runtime_profile_test.exs — :emily_fast resolves with the right Apple-shaped flags and Emily.Backend tuple; inherits Emily margin floors (0.33 / 0.82); accepts Emily.Backend labels; put_default_backend!/1 raises the same Emily-naming error as :emily when Emily.Backend is not loaded.
test/trinity_coordinator/runtime_profile_margins_test.exs — :emily_fast inherits 0.33 / 0.82 floors.
test/trinity_coordinator/sakana/coordinator_fast_kernels_test.exs (new) — all three routing branches of maybe_apply_fast_kernels/2: non-:emily_fast profiles pass through unchanged; :emily_fast returns the documented error tuple on hosts without the Emily dep; :emily_fast invokes the rewriter on hosts with Emily loaded (gated on Code.ensure_loaded? so the happy-path test runs on Apple Silicon and is a no-op elsewhere).

All 40 tests in those three files pass locally on this Apple Silicon host with Emily loaded.

Design notes

Why a separate profile and not a flag on :emily? Two reasons. First, it keeps the bare :emily lane available as the no-rewrite reference baseline — useful if anyone needs to isolate fused-vs-composed numerical differences from other factors. Second, it matches the existing :emlx / :emily split: both are Apple-shaped, both differ at one well-defined layer of the stack (here it's the Bumblebee model rewrite layer; there it was the Nx backend module). Operators choose the lane that matches the contract they want.
Why @doc false public, not defp? So each of the three routing branches can be unit-tested without staging a full Bumblebee model. The rewriter's contract is "takes an Axon graph, returns an Axon graph," so a sentinel map with a :model key is enough to assert the routing logic. Same pattern the Runtime.BackendLabel extraction in 21c3088 used for testability.
Why no mix.exs change? Same convention you established for :emily and :emlx: the Emily dep lives in the consumer's parent application, not in trinity_coordinator itself. Emily.Bumblebee.FastKernels ships inside the Emily package and is only defined when both :axon and :bumblebee are loaded; trinity_coordinator already brings those in, so the rewriter is available as soon as the consumer adds Emily.
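The routing described in these notes can be sketched roughly as follows. This is not the code from the diff: the `{:ok, _}` success shape, the map shapes for `loaded` and `profile`, and the placeholder error message are assumptions. Only the module name `Emily.Bumblebee.FastKernels`, its `apply/1` function, the `Module.concat` + `apply/3` dispatch, and the `:emily_fast_kernels_unavailable` tag come from the description.

```elixir
defmodule FastKernelsRoutingSketch do
  @moduledoc false

  # Kept as strings and concatenated at runtime, so the compiler never sees
  # a hard reference to the optional module.
  @fast_kernels_parts ["Emily", "Bumblebee", "FastKernels"]

  # Hypothetical shapes: `loaded` is any map with a :model key and
  # `profile` is any map with a :name key.
  def maybe_apply_fast_kernels(%{model: model} = loaded, %{name: :emily_fast}) do
    mod = Module.concat(@fast_kernels_parts)

    if Code.ensure_loaded?(mod) and function_exported?(mod, :apply, 1) do
      # Dynamic call: equivalent to Emily.Bumblebee.FastKernels.apply(model).
      {:ok, %{loaded | model: apply(mod, :apply, [model])}}
    else
      # Placeholder text; the real message points operators at the setup guide.
      {:error, {:emily_fast_kernels_unavailable, "add :emily to the parent application"}}
    end
  end

  # Every other profile passes the loaded bundle through untouched.
  def maybe_apply_fast_kernels(loaded, _profile), do: {:ok, loaded}
end

# A sentinel map is enough to exercise the pass-through branch.
sentinel = %{model: :not_a_real_axon_graph}
{:ok, ^sentinel} = FastKernelsRoutingSketch.maybe_apply_fast_kernels(sentinel, %{name: :emily})
```

Because the function is public, a test can call it directly with a sentinel like the one above instead of loading a full Bumblebee model.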

Test plan

You read the diff and decide whether the two-profile shape is right, or whether you'd rather absorb FastKernels into bare :emily (in which case the diff collapses to roughly the coordinator.ex hunk plus a renamed clause).
Run the existing mix test test/trinity_coordinator/runtime_profile_test.exs test/trinity_coordinator/runtime_profile_margins_test.exs test/trinity_coordinator/sakana/coordinator_fast_kernels_test.exs on CUDA to confirm the missing-Emily branch is exercised exactly as documented.
If you'd rather not take this as-is, the result table above is still load-bearing for the upstream :emily_fast (or whatever you call it) — feel free to lift the bare-vs-fast numbers and the structural-not-artefact framing of the escalate_to_human near-miss into your own commit message.

🤖 Generated with Claude Code

Adds an explicit :emily_fast runtime profile that mirrors :emily's Apple-shaped flags and Emily-empirical margin floors but additionally runs `Emily.Bumblebee.FastKernels.apply/1` on the loaded Bumblebee model at `Coordinator.load/1` time. The rewrite swaps RMSNorm / LayerNorm / RoPE / SDPA Axon layers for `Emily.Fast.*` helpers that dispatch to fused `mx::fast::*` kernels under Emily.Backend (and fall through to composed-defn equivalents on any other backend, so the rewritten model stays evaluable on Nx.BinaryBackend / EXLA for conformance).

Validation pass (Apple Silicon, Emily 0.4.0, against the existing adapted bundle, `--determinism-runs 2`):

* 37/37 decision-stable agreement with the CUDA snapshot (agent_id / role_id / token_count / transcript_hash).
* In-process determinism stable across both runs.
* Margin floors: 37/37 pass with the seeded 0.33 / 0.82 inherited from :emily (no overrides required). The escalate_to_human role margin is bitwise identical between bare :emily and :emily_fast (1.0291), so the same floor absorbs the near-miss in both lanes.
* Wall-clock ~10.8s on 37 cases × 2 determinism runs, vs ~12.4s bare :emily (~15% faster on this single-forward-pass workload). Larger relative win expected on generative workloads.

What changed:

* `lib/trinity_coordinator/runtime_profile.ex`: new `resolve(:emily_fast)` clause; `:emily_fast` added to `builtin_names/0` and the unknown-profile error message; moduledoc updated.
* `lib/trinity_coordinator/sakana/coordinator.ex`: public-with-@doc-false `maybe_apply_fast_kernels/2` keyed off `profile.name == :emily_fast`, dispatching via `Module.concat` + `apply/3` so this module compiles cleanly without the optional :emily dep.
* `guides/runtime_profiles.md`: new `### :emily_fast` section, bare-vs-fast comparison table, choosing-a-profile entry.

Tests (mirror the existing :emily TDD pattern):

* :emily_fast resolves with the right Apple-shaped flags + Emily backend
* :emily_fast inherits :emily's empirical margin floors (0.33 / 0.82)
* :emily_fast accepts Emily.Backend labels via accepts_backend_label?/2
* :emily_fast put_default_backend!/1 raises the same Emily-naming error as :emily when Emily.Backend is not loaded
* maybe_apply_fast_kernels/2: non-emily_fast profiles pass through; :emily_fast returns {:error, {:emily_fast_kernels_unavailable, _}} on hosts without the Emily dep; :emily_fast invokes the rewriter on hosts with Emily loaded (no-op pass-through on the test's minimal Axon graph since it contains none of the patterns the rewriter recognises).

mix.exs is intentionally unchanged: Emily stays an optional dep that hosts add to their parent application's mix.exs, same convention as :emlx and :emily.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ausimian wants to merge 1 commit into nshkrdotcom:main from ausimian:emily-fast-profile
