
fix(ruvLLM): P4.1 wire trainer backprop endpoint gradients #416

Open
Stricttype wants to merge 5 commits into ruvnet:main from Stricttype:feat/ruvllm-p4-1-backprop

Conversation

@Stricttype

Summary

Follow-up to #414. Fixes the structural blocker on acceptance criterion #4 ("perplexity better than random-init baseline") by wiring actual gradient updates into Trainer::train_epoch.

Bug: train_epoch called compute_loss but never applied optimizer.step to any weight matrices — model parameters never changed regardless of how long training ran.

Approach: endpoint-only backprop (Option A)

Hand-rolled analytical gradients flow through cross-entropy → lm_head → final RMSNorm, with a coarse identity-bypass for embeddings. Body (attention QKV, FFN, per-layer norms) remains random-init, treated as a fixed feature extractor.
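Roughly, the endpoint path looks like the following single-position sketch written with plain `Vec<f32>` math. The function name and signature are illustrative, not the crate's API, and the identity-bypass embedding update is omitted:

```rust
/// Illustrative endpoint-only gradients for one token position.
/// `hidden` is the frozen body's output, `gain` the final RMSNorm gain,
/// `lm_head` the V x d output projection, `target` the gold next-token id.
fn endpoint_grads(
    hidden: &[f32],
    gain: &[f32],
    lm_head: &[Vec<f32>],
    target: usize,
) -> (Vec<Vec<f32>>, Vec<f32>) {
    let d = hidden.len();
    let eps = 1e-5_f32;

    // Forward through the endpoint: final RMSNorm, then logits = lm_head · normed.
    let rms = (hidden.iter().map(|x| x * x).sum::<f32>() / d as f32 + eps).sqrt();
    let normed: Vec<f32> = (0..d).map(|i| hidden[i] / rms * gain[i]).collect();
    let logits: Vec<f32> = lm_head
        .iter()
        .map(|row| row.iter().zip(&normed).map(|(w, x)| w * x).sum())
        .collect();

    // Softmax cross-entropy: dL/dlogits = softmax(logits) - one_hot(target).
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let z: f32 = exps.iter().sum();
    let mut dlogits: Vec<f32> = exps.iter().map(|e| e / z).collect();
    dlogits[target] -= 1.0;

    // dL/d(lm_head)[v][i] = dlogits[v] * normed[i]; dL/d(normed) = lm_head^T · dlogits.
    let grad_lm_head: Vec<Vec<f32>> = dlogits
        .iter()
        .map(|&dl| normed.iter().map(|&n| dl * n).collect())
        .collect();
    let mut grad_normed = vec![0.0_f32; d];
    for (row, &dl) in lm_head.iter().zip(&dlogits) {
        for (gn, &w) in grad_normed.iter_mut().zip(row) {
            *gn += dl * w;
        }
    }

    // dL/d(gain_i) = grad_normed_i * hidden_i / rms; the body below the norm is frozen.
    let grad_gain: Vec<f32> = (0..d).map(|i| grad_normed[i] * hidden[i] / rms).collect();

    (grad_lm_head, grad_gain)
}
```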

~150 LOC, no new deps. Public API unchanged: Trainer::{new, train, train_epoch, into_model, metrics_history, save_checkpoint_periodic, model} and TrainableModel::{save/load_checkpoint, from/to_checkpoint, to_q4, forward, compute_loss} keep their signatures. forward_with_cache is purely additive.

Acceptance evidence

| Test | Result |
| --- | --- |
| cargo check (default) | pass |
| cargo check --features real-inference | pass |
| cargo test --lib --features real-inference | 105 / 105 |
| cargo test --test wiki_pipeline_test --features real-inference | 6 / 6 |
| Unit test_train_epoch_decreases_loss (≥10 steps, synthetic) | pass |
| Unit test_train_epoch_updates_lm_head (mutation sanity) | pass |
| Integration test_perplexity_5pct_floor_with_backprop | pass |

Perplexity (fixture, 2 epochs, lr=1e-2): 51.736 → 45.504 (ratio 0.880, 12% reduction — well past 5% floor).
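A quick sanity check on those numbers (standalone snippet, not project code):

```rust
fn main() {
    let (initial_ppl, final_ppl) = (51.736_f64, 45.504_f64);
    let ratio = final_ppl / initial_ppl;    // ≈ 0.880
    let reduction = (1.0 - ratio) * 100.0;  // ≈ 12%, well past the 5% floor
    assert!(ratio < 0.95, "ratio {ratio:.3}, reduction {reduction:.1}%");
}
```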

Caveats / future work

  • Embedding gradient uses an identity-bypass approximation (descent direction correct, biased magnitude) — adequate for small corpora. If scaling to full Wikipedia exposes the bias, follow-up could either backprop through the body or drop embedding updates.
  • Body weights (wq/wk/wv/wo/w1/w2/w3 + per-layer norms) intentionally remain at random-init. Endpoint-only by design.
  • output_norm gradient recovers hidden/rms from normed/g; a defensive guard handles the rare zero-gain case (a sketch follows this list).
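A minimal sketch of that recovery, assuming the cached LM-head input (`normed`), the gain vector, and the gradient at `normed` are available as slices (local names are hypothetical, not the actual implementation):

```rust
/// Illustrative only: given the cached LM-head input (`normed`) and the
/// RMSNorm gain, recover hidden/rms = normed/g and form the gain gradient.
fn output_norm_grad(normed: &[f32], gain: &[f32], grad_normed: &[f32]) -> Vec<f32> {
    normed
        .iter()
        .zip(gain)
        .zip(grad_normed)
        .map(|((&n, &g), &gn)| {
            // Defensive guard: if a gain element is (near-)zero, hidden/rms
            // cannot be recovered from normed/g, so skip that coordinate.
            if g.abs() > 1e-8 { gn * (n / g) } else { 0.0 }
        })
        .collect()
}
```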

Test plan

  • CI green default features
  • CI green --features real-inference
  • No regression on --features persistence (orthogonal patch)

Depends on / built on top of #414.

🤖 Generated with Claude Code

Crew Worker and others added 5 commits May 2, 2026 20:41
- examples/ruvLLM/Cargo.lock
- examples/ruvLLM/Cargo.toml
- examples/ruvLLM/docs/api-reference.md
- examples/ruvLLM/docs/code-standards.md
- examples/ruvLLM/docs/codebase-summary.md
- examples/ruvLLM/docs/configuration-guide.md
- examples/ruvLLM/docs/deployment-guide.md
- examples/ruvLLM/docs/handoffs/2026-05-02-1943-auto.md
- examples/ruvLLM/docs/project-overview-pdr.md
- examples/ruvLLM/docs/system-architecture.md
- examples/ruvLLM/docs/testing-guide.md
- examples/ruvLLM/learn/260502-1900-init-ruvllm/learn-results.tsv
- examples/ruvLLM/learn/260502-1900-init-ruvllm/summary.md

Co-Authored-By: Pi Coding Agent <pi@localhost>
- examples/ruvLLM/Cargo.lock
- examples/ruvLLM/config/example.toml
- examples/ruvLLM/config/pretrain.toml
- examples/ruvLLM/scripts/fetch-simple-wiki.sh
- examples/ruvLLM/src/bin/pretrain.rs
- examples/ruvLLM/src/bin/sidecar.rs
- examples/ruvLLM/src/config.rs
- examples/ruvLLM/src/lib.rs
- examples/ruvLLM/src/sona/mod.rs
- examples/ruvLLM/src/sona/persist.rs
- examples/ruvLLM/src/training.rs
- examples/ruvLLM/tests/persist_integration.rs

Co-Authored-By: Pi Coding Agent <pi@localhost>
- rename src/data/ → src/corpus/ (gitignore conflict: data/ pattern blocks Rust source)
- add corpus module: wiki corpus iter, tokenizer wrapper, tokenized dataset
- add tests/wiki_pipeline_test.rs (5/5 PASS)
- surgical fixes for pre-existing candle 0.8 API drift in src/inference_real.rs
- add From<candle_core::Error> shim in src/error.rs (unblocks --features real-inference)
- extend src/training.rs: DatasetSource trait, ModelCheckpoint serde, save_checkpoint, measure_baseline_perplexity

P4 status: DONE_WITH_CONCERNS — pre-existing issues surfaced:
- Trainer::train computes loss but does not call optimizer.step (no backprop) → perplexity-delta is structurally 0% until follow-up patch
- SmallTransformer lacks from_checkpoint() constructor → trained checkpoints are saved but to_q4_weights() re-randomizes; follow-up needed to load saved checkpoint into inference path
- TokenizerWrapper::from_pretrained stubbed (requires tokenizers/http feature, not currently enabled); an inline whitespace WordLevel fallback works for the offline pilot (sketched below)
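A minimal sketch of what such a whitespace word-level fallback can look like (illustrative only; not the crate's actual TokenizerWrapper):

```rust
use std::collections::HashMap;

/// Illustrative whitespace word-level vocab: built from the corpus, with
/// unseen words mapped to an <unk> id. Names here are hypothetical.
struct WhitespaceVocab {
    token_to_id: HashMap<String, u32>,
}

impl WhitespaceVocab {
    fn build(corpus: &str) -> Self {
        let mut token_to_id = HashMap::new();
        token_to_id.insert("<unk>".to_string(), 0);
        for word in corpus.split_whitespace() {
            let next = token_to_id.len() as u32;
            token_to_id.entry(word.to_string()).or_insert(next);
        }
        Self { token_to_id }
    }

    fn encode(&self, text: &str) -> Vec<u32> {
        text.split_whitespace()
            .map(|w| *self.token_to_id.get(w).unwrap_or(&0))
            .collect()
    }
}
```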

Smoke tests:
- cargo check: PASS
- cargo check --features persistence: PASS
- cargo check --features real-inference: PASS
- cargo check --features persistence,real-inference: PASS
- cargo test --features persistence --test persist_integration: 4/4 PASS
- cargo test --features real-inference --test wiki_pipeline_test: 5/5 PASS
Co-Authored-By: Pi Coding Agent <pi@localhost>
Trainer::train_epoch now computes analytical gradients for the output
endpoint (cross-entropy → lm_head → RMSNorm) and applies optimizer.step
to lm_head, output_norm, and embeddings. Transformer body remains a
fixed feature extractor (endpoint-only approximation, Option A).

- TrainableModel::forward_with_cache exposes the LM-head input (normed)
  required for analytical gradients without changing forward()'s signature.
- Per-batch gradient accumulation with averaging + global L2 clipping (sketched after this list).
- TrainingMetrics.grad_norm now populated from accumulated batch norms.
- Public Trainer API (new/train/train_epoch/into_model/metrics_history/
  save_checkpoint_periodic/model) unchanged.
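A rough sketch of the accumulate-average-clip step, assuming per-batch gradients have already been flattened into `Vec<f32>` buffers (helper name and layout are hypothetical, not the crate's code):

```rust
/// Hypothetical helper: average per-batch gradients and apply one global
/// L2 clip across all parameters. `batch_grads[b][p]` is parameter p's
/// flattened gradient from batch b.
fn average_and_clip(batch_grads: &[Vec<Vec<f32>>], max_norm: f32) -> Vec<Vec<f32>> {
    assert!(!batch_grads.is_empty(), "need at least one batch");
    // Accumulate, starting from the first batch's gradients.
    let mut acc = batch_grads[0].clone();
    for batch in &batch_grads[1..] {
        for (a, g) in acc.iter_mut().zip(batch) {
            for (ai, gi) in a.iter_mut().zip(g) {
                *ai += gi;
            }
        }
    }
    // Average over batches.
    let n = batch_grads.len() as f32;
    for ai in acc.iter_mut().flat_map(|p| p.iter_mut()) {
        *ai /= n;
    }
    // One global L2 norm over every accumulated gradient, then rescale if it
    // exceeds the cap, so relative directions between parameters are preserved.
    let norm = acc
        .iter()
        .flat_map(|p| p.iter())
        .map(|x| x * x)
        .sum::<f32>()
        .sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;
        for ai in acc.iter_mut().flat_map(|p| p.iter_mut()) {
            *ai *= scale;
        }
    }
    acc
}
```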

Tests:
- src/training.rs::tests::test_train_epoch_decreases_loss — ≥10 steps
  on synthetic data, asserts final < initial loss.
- src/training.rs::tests::test_train_epoch_updates_lm_head — verifies
  optimizer actually mutates lm_head between epochs.
- tests/wiki_pipeline_test.rs::test_perplexity_5pct_floor_with_backprop —
  fixture corpus, ≥1 epoch, asserts final_ppl < 0.95 * initial_ppl (assertion shape sketched after this list).
  Observed: 51.74 → 45.50 (ratio 0.880, 12% improvement).
- Existing test_perplexity_better_than_random tightened from non-
  regression (≤2.0) to improvement (<0.95).
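Roughly the shape of that assertion (placeholder numbers and test name; not the actual test body):

```rust
#[test]
fn perplexity_floor_sketch() {
    // Placeholder values standing in for measured perplexities; the real test
    // trains on the fixture corpus for >= 1 epoch and measures both.
    let initial_ppl: f64 = 51.74;
    let final_ppl: f64 = 45.50;
    assert!(
        final_ppl < 0.95 * initial_ppl,
        "expected at least a 5% perplexity reduction, got ratio {:.3}",
        final_ppl / initial_ppl
    );
}
```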

cargo check (default + --features real-inference): pass
cargo test --lib --features real-inference: 105 passed
cargo test --test wiki_pipeline_test --features real-inference: 6 passed