feat(runtime_profile): :emily_fast profile applies FastKernels rewrites #2

Draft
ausimian wants to merge 1 commit into nshkrdotcom:main from ausimian:emily-fast-profile


What this is

Adds an explicit :emily_fast runtime profile that mirrors :emily's Apple-shaped flags and Emily-empirical margin floors but additionally runs Emily.Bumblebee.FastKernels.apply/1 on the loaded Bumblebee model at Coordinator.load/1 time. The rewrite swaps RMSNorm / LayerNorm / RoPE / SDPA Axon layers for Emily.Fast.* helpers that dispatch to fused mx::fast::* kernels under Emily.Backend. On any other backend the rewritten helpers fall through to composed-defn equivalents, so the rewritten model stays evaluable on Nx.BinaryBackend / EXLA for conformance.

Same optional-dep convention as :emily: trinity_coordinator's own mix.exs is intentionally unchanged. Consumers add {:emily, "~> 0.4", only: [:dev, :test]} to their parent app.
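For orientation, the consumer-side change could look roughly like the fragment below. The `MyApp.MixProject` module, the `:my_app` name, and the version string are placeholders; only the `:emily` tuple comes from this PR.

```elixir
# mix.exs of the consuming (parent) application, NOT trinity_coordinator.
# MyApp / :my_app / "0.1.0" are hypothetical placeholders for illustration.
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [app: :my_app, version: "0.1.0", deps: deps()]
  end

  defp deps do
    [
      # ... the app's existing deps, including trinity_coordinator ...
      # Optional: makes Emily.Backend and the FastKernels rewriter available.
      {:emily, "~> 0.4", only: [:dev, :test]}
    ]
  end
end
```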

Validation result

Apple Silicon, Emily 0.4.0, against the existing adapted bundle exported on the same host, --determinism-runs 2, default floors (no margin overrides):

$ mix run examples/qwen_router_prompt_eval.exs \
    --runtime-profile emily_fast \
    --artifact-dir tmp/emily_adapted_qwen3_0_6b_layer26 \
    --snapshot examples/fixtures/qwen_router_prompt_eval_logits.json \
    --determinism-runs 2

Summary
  passed: 37
  failed: 0
  roles selected: Worker=18, Thinker=4, Verifier=15
  agent slots selected: 0=17, 1=1, 4=19

PASS qwen_router_prompt_eval

| Axis | bare :emily | :emily_fast |
| --- | --- | --- |
| Decisions vs CUDA snapshot (agent_id / role_id / token_count / transcript_hash) | 37/37 match | 37/37 match |
| route_hash drift vs CUDA | 37/37 differ | 37/37 differ |
| route_hash drift fast-vs-bare | n/a | 25/37 differ |
| In-process determinism (--determinism-runs 2) | 37/37 stable | 37/37 stable |
| escalate_to_human role_margin | 1.0291 | 1.0291 (bitwise identical) |
| Margin floors (0.33 / 0.82, seeded by profile) | 37/37 pass | 37/37 pass |
| Wall-clock, 37 cases × 2 determinism runs, warm cache | ~12.4s | ~10.8s (~15% faster) |

The escalate_to_human role_margin being bitwise identical between the two lanes is the key signal that the same per-profile floor absorbs both — the rewrite moves logits on 25 of the other 36 cases but happens to leave the borderline case untouched. Larger relative wall-clock wins are expected on generative workloads where attention / RoPE / RMSNorm dominate per-token cost; this eval is a single forward pass and is the worst case for showcasing fusion.
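A quick check of the wall-clock arithmetic, using only the two approximate timings reported above (nothing is re-measured here). It suggests the "~15% faster" figure is best read as a throughput ratio, while the reduction in elapsed time is closer to 13%.

```elixir
# Approximate timings taken from the results above, in seconds.
bare_s = 12.4
fast_s = 10.8

# Fraction of wall-clock time removed by the rewrite.
time_saved_pct = (bare_s - fast_s) / bare_s * 100

# Throughput ratio: how many times faster the fast lane completes.
speedup = bare_s / fast_s

IO.puts("time saved: #{Float.round(time_saved_pct, 1)}%")
# prints "time saved: 12.9%"
IO.puts("speed-up: #{Float.round(speedup, 2)}x")
# prints "speed-up: 1.15x"
```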

What's in the diff

6 files changed, 297 insertions(+), 2 deletions(-). No mix.exs change.

  • lib/trinity_coordinator/runtime_profile.ex — new resolve(:emily_fast) clause inheriting from :emily's shape with a distinct name and FastKernels-aware notes; :emily_fast added to builtin_names/0 and to the unknown-profile error message; moduledoc updated.
  • lib/trinity_coordinator/sakana/coordinator.ex — public-with-@doc false maybe_apply_fast_kernels/2 keyed off profile.name == :emily_fast. Reference to Emily.Bumblebee.FastKernels is dynamic (Module.concat + apply/3) so this module compiles cleanly on hosts without the optional :emily dep; the missing-dep branch returns {:error, {:emily_fast_kernels_unavailable, msg}} with operator-facing guidance pointing at the setup guide.
  • guides/runtime_profiles.md — new ### :emily_fast section after the existing ### :emily section, bare-vs-fast comparison table, choosing-a-profile entry.
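The dynamic-reference pattern described for coordinator.ex can be sketched as follows. This is not the code in the diff: the module name `FastKernelsDispatchSketch`, the argument order, the `{:ok, _}` success shape, and the message text are assumptions made for illustration. Only the `:emily_fast` key, the `Module.concat` + `apply/3` dispatch, and the `{:error, {:emily_fast_kernels_unavailable, msg}}` tuple come from the description above.

```elixir
defmodule FastKernelsDispatchSketch do
  @moduledoc false

  # The module name is assembled at runtime, so the compiler never sees a
  # literal reference to the optional Emily module.
  @rewriter_parts [:Emily, :Bumblebee, :FastKernels]

  # Assumed shapes: `model_info` is a map with the Axon graph under :model,
  # `profile` is anything with a :name key.
  def maybe_apply_fast_kernels(model_info, %{name: :emily_fast}) do
    rewriter = Module.concat(@rewriter_parts)

    if Code.ensure_loaded?(rewriter) and function_exported?(rewriter, :apply, 1) do
      # Kernel.apply/3 calls Emily.Bumblebee.FastKernels.apply/1 dynamically.
      {:ok, %{model_info | model: apply(rewriter, :apply, [model_info.model])}}
    else
      {:error,
       {:emily_fast_kernels_unavailable,
        "Emily.Bumblebee.FastKernels is not loaded; add :emily to the parent app's deps"}}
    end
  end

  # Every other profile leaves the loaded model untouched.
  def maybe_apply_fast_kernels(model_info, _profile), do: {:ok, model_info}
end

# The pass-through branch can be exercised with a sentinel map;
# no Bumblebee model is needed.
sentinel = %{model: :placeholder_graph}
{:ok, ^sentinel} =
  FastKernelsDispatchSketch.maybe_apply_fast_kernels(sentinel, %{name: :emily})
```

Checking `Code.ensure_loaded?/1` before dispatching is what turns a missing optional dep into a returned error tuple rather than an `UndefinedFunctionError` at load time.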

Tests (mirror the existing :emily TDD pattern):

  • test/trinity_coordinator/runtime_profile_test.exs: :emily_fast resolves with the right Apple-shaped flags and Emily.Backend tuple; inherits Emily margin floors (0.33 / 0.82); accepts Emily.Backend labels; put_default_backend!/1 raises the same Emily-naming error as :emily when Emily.Backend is not loaded.
  • test/trinity_coordinator/runtime_profile_margins_test.exs: :emily_fast inherits 0.33 / 0.82 floors.
  • test/trinity_coordinator/sakana/coordinator_fast_kernels_test.exs (new) — all three routing branches of maybe_apply_fast_kernels/2: non-:emily_fast profiles pass through unchanged; :emily_fast returns the documented error tuple on hosts without the Emily dep; :emily_fast invokes the rewriter on hosts with Emily loaded (gated on Code.ensure_loaded? so the happy-path test runs on Apple Silicon and is a no-op elsewhere).

All 40 tests in those three files pass locally on this Apple Silicon host with Emily loaded.

Design notes

  • Why a separate profile and not a flag on :emily? Two reasons. First, it keeps the bare :emily lane available as the no-rewrite reference baseline — useful if anyone needs to isolate fused-vs-composed numerical differences from other factors. Second, it matches the existing :emlx / :emily split: both are Apple-shaped, both differ at one well-defined layer of the stack (here it's the Bumblebee model rewrite layer; there it was the Nx backend module). Operators choose the lane that matches the contract they want.
  • Why @doc false public, not defp? So each of the three routing branches can be unit-tested without staging a full Bumblebee model. The rewriter's contract is "takes an Axon graph, returns an Axon graph," so a sentinel map with a :model key is enough to assert the routing logic. Same pattern the Runtime.BackendLabel extraction in 21c3088 used for testability.
  • Why no mix.exs change? Same convention you established for :emily and :emlx: the Emily dep lives in the consumer's parent application, not in trinity_coordinator itself. Emily.Bumblebee.FastKernels ships inside the Emily package and is only defined when both :axon and :bumblebee are loaded; trinity already brings those in, so the rewriter is available as soon as the consumer adds Emily.

Test plan

  • You read the diff and decide whether the two-profile shape is right, or whether you'd rather absorb FastKernels into bare :emily (in which case the diff collapses to roughly the coordinator.ex hunk plus a renamed clause).
  • Run the existing mix test test/trinity_coordinator/runtime_profile_test.exs test/trinity_coordinator/runtime_profile_margins_test.exs test/trinity_coordinator/sakana/coordinator_fast_kernels_test.exs on CUDA to confirm the missing-Emily branch is exercised exactly as documented.
  • If you'd rather not take this as-is, the result table above is still load-bearing for the upstream :emily_fast (or whatever you call it) — feel free to lift the bare-vs-fast numbers and the structural-not-artefact framing of the escalate_to_human near-miss into your own commit message.

🤖 Generated with Claude Code

Adds an explicit :emily_fast runtime profile that mirrors :emily's
Apple-shaped flags and Emily-empirical margin floors but additionally
runs `Emily.Bumblebee.FastKernels.apply/1` on the loaded Bumblebee
model at `Coordinator.load/1` time. The rewrite swaps RMSNorm /
LayerNorm / RoPE / SDPA Axon layers for `Emily.Fast.*` helpers that
dispatch to fused `mx::fast::*` kernels under Emily.Backend (and fall
through to composed-defn equivalents on any other backend, so the
rewritten model stays evaluable on Nx.BinaryBackend / EXLA for
conformance).

Validation pass (Apple Silicon, Emily 0.4.0, against the existing
adapted bundle, `--determinism-runs 2`):

  * 37/37 decision-stable agreement with the CUDA snapshot
    (agent_id / role_id / token_count / transcript_hash).
  * In-process determinism stable across both runs.
  * Margin floors: 37/37 pass with the seeded 0.33 / 0.82 inherited
    from :emily (no overrides required). The escalate_to_human role
    margin is bitwise identical between bare :emily and :emily_fast
    (1.0291), so the same floor absorbs the near-miss in both lanes.
  * Wall-clock ~10.8s on 37 cases × 2 determinism runs, vs ~12.4s
    bare :emily — ~15% faster on this single-forward-pass workload.
    Larger relative win expected on generative workloads.

What changed:

  * `lib/trinity_coordinator/runtime_profile.ex`: new `resolve(:emily_fast)`
    clause; `:emily_fast` added to `builtin_names/0` and the
    unknown-profile error message; moduledoc updated.
  * `lib/trinity_coordinator/sakana/coordinator.ex`: public-with-@doc-false
    `maybe_apply_fast_kernels/2` keyed off `profile.name == :emily_fast`,
    dispatching via `Module.concat` + `apply/3` so this module compiles
    cleanly without the optional :emily dep.
  * `guides/runtime_profiles.md`: new `### :emily_fast` section, bare-vs-fast
    comparison table, choosing-a-profile entry.

Tests (mirror the existing :emily TDD pattern):

  * :emily_fast resolves with the right Apple-shaped flags + Emily backend
  * :emily_fast inherits :emily's empirical margin floors (0.33 / 0.82)
  * :emily_fast accepts Emily.Backend labels via accepts_backend_label?/2
  * :emily_fast put_default_backend!/1 raises the same Emily-naming error
    as :emily when Emily.Backend is not loaded
  * maybe_apply_fast_kernels/2: non-emily_fast profiles pass through;
    :emily_fast returns {:error, {:emily_fast_kernels_unavailable, _}}
    on hosts without the Emily dep; :emily_fast invokes the rewriter
    on hosts with Emily loaded (no-op pass-through on the test's
    minimal Axon graph since it contains none of the patterns the
    rewriter recognises).

mix.exs is intentionally unchanged — Emily stays an optional dep that
hosts add to their parent application's mix.exs, same convention as
:emlx and :emily.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>