Developer Notes

This document describes the design philosophy behind the pilot codebase.

Philosophy

The core principle is modularity combined with explicit, readable code. Do not bury Hugging Face generation calls under four layers of class inheritance when a clean functional wrapper can express the same thing.

Abstraction Choices

  1. Model Wrapper (hf_model.py): Rather than coupling the experiment to a specific model, we wrap AutoModelForCausalLM only to cache the tokenizer, the device, and layer reference mappings (e.g. resolving .model.layers[X] in a way that does not depend on architecture differences between model families).
  2. Hooking System (hooks.py): Built entirely functionally. apply_steering() is a context manager that registers PyTorch forward hooks via .register_forward_hook. Because the hooks are removed when the context exits, steering state should not leak from one run into the next.
  3. Data Schemas (schemas.py): Kept as pure data definitions so that JSON loading does not bleed into ML code.
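
The hooking pattern in item 2 can be sketched roughly as follows. This is a minimal illustration, not the actual contents of hooks.py: the signature of apply_steering (layer, vector, alpha) is an assumption, and the sketch is duck-typed so it works with any object that exposes register_forward_hook and returns a handle with remove(), as torch.nn.Module does.

```python
from contextlib import contextmanager


@contextmanager
def apply_steering(layer, vector, alpha=1.0):
    """Add alpha * vector to a layer's output while the context is active.

    Hypothetical signature: the real apply_steering() may take different
    arguments. `layer` is anything exposing register_forward_hook, such as
    a torch.nn.Module.
    """

    def hook(module, inputs, output):
        # Some decoder layers return a tuple whose first item is the hidden
        # states; others return the hidden states directly. Handle both.
        if isinstance(output, tuple):
            return (output[0] + alpha * vector,) + output[1:]
        return output + alpha * vector

    handle = layer.register_forward_hook(hook)
    try:
        yield handle
    finally:
        # Runs on normal exit and on exceptions, so the hook never outlives
        # the with-block.
        handle.remove()
```

With a PyTorch model, usage would look like `with apply_steering(layers[X], steering_vector): model.generate(...)`. The try/finally is what makes cleanup reliable: the hook is removed even if generation raises inside the block.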

Design Limitations / Assumptions Built In

  • Last Token Strategy: By default, vector extraction captures the difference only at the final token of each input prompt. An average_over_assistant strategy is theoretically stronger, but it requires parsing and masking the assistant span, which adds unnecessary friction to this pilot.
  • Evaluations (eval/*.py): To keep dependencies light and the pilot runnable immediately, evaluations are heuristics (REFUSAL_PREFIXES). Before scaling up the real research effort, eval/harmful.py should be overridden to invoke a structured LLM-as-a-judge prompt or API.
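  • Metrics Computation: superadditivity.py is standalone and expects risk probabilities as input. The score is the gap between the combined rate and the additive expectation: rAB - (rA + rB - r0). This assumes the individual effects add linearly on the probability scale, which holds only approximately because probabilities are bounded to [0, 1].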
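
The superadditivity formula above can be illustrated with a short sketch. The function names and the reading of the symbols (r0 as the baseline rate, rA and rB as the rates under each intervention alone, rAB as the rate under both) are assumptions made for illustration; superadditivity.py may be organized differently.

```python
def additive_expectation(r0, r_a, r_b):
    """Rate expected under both interventions if their effects simply add.

    (r_a - r0) + (r_b - r0) + r0 simplifies to r_a + r_b - r0.
    """
    return r_a + r_b - r0


def superadditivity_score(r0, r_a, r_b, r_ab):
    """Return r_ab - (r_a + r_b - r0); positive means more than additive."""
    for name, value in (("r0", r0), ("r_a", r_a), ("r_b", r_b), ("r_ab", r_ab)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1], got {value}")
    return r_ab - additive_expectation(r0, r_a, r_b)
```

For example, with r0 = 0.05, rA = 0.20, rB = 0.30, and rAB = 0.60, the additive expectation is 0.45 and the score is about 0.15. The linearity assumption shows its limits near the bounds: with r0 = 0.1 and rA = rB = 0.7 the additive expectation is 1.3, which no observed rate can reach, so the score is forced negative regardless of how the interventions interact.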

Extending the Codebase

When ready to plug in HarmBench or StrongReject:

  • Subclass BaseMetric in eval.
  • Create a loaders.py adapter that converts their more complex schemas into our EvalExample schema and streams examples on demand. Keep the downstream hooks agnostic to dataset format.
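
A loader adapter along these lines might look like the sketch below. Everything here is hypothetical: the EvalExample fields are a stand-in for whatever schemas.py actually defines, stream_examples is an invented name, and the field names of the external datasets are deliberately passed in by the caller rather than assumed.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, Mapping, Optional


@dataclass(frozen=True)
class EvalExample:
    """Hypothetical stand-in for the EvalExample schema in schemas.py."""

    example_id: str
    prompt: str
    category: str = "unknown"


def stream_examples(
    records: Iterable[Mapping[str, Any]],
    prompt_key: str,
    id_key: Optional[str] = None,
    category_key: Optional[str] = None,
) -> Iterator[EvalExample]:
    """Lazily map raw dataset records onto EvalExample, one at a time."""
    for index, record in enumerate(records):
        yield EvalExample(
            # Fall back to the running index when the dataset has no id field.
            example_id=str(record[id_key]) if id_key else str(index),
            prompt=str(record[prompt_key]),
            category=(
                str(record.get(category_key, "unknown"))
                if category_key
                else "unknown"
            ),
        )
```

Because stream_examples is a generator, records are converted only as the evaluation loop consumes them, and downstream code sees EvalExample objects no matter which dataset they came from.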