This document describes the design philosophy behind the pilot codebase.
The core principle is modularity combined with readable, explicit code. Do not hide Hugging Face generation blocks beneath four layers of class inheritance if they can be represented as a clean functional wrapper.
- Model Wrapper (`hf_model.py`): Rather than tying the experiment to a specific model, we wrap `AutoModelForCausalLM` only to cache the `tokenizer`, the `device`, and reference mappings (e.g. resolving `.model.layers[X]` in one place, so the rest of the code is independent of architecture differences between model series).
- Hooking System (`hooks.py`): Built entirely functionally. We use the context manager `with apply_steering()` to attach PyTorch forward hooks via `.register_forward_hook`. Because the hooks are removed automatically when the context exits, steering state does not bleed from one run into the next.
- Data Schemas (`schemas.py`): Kept pure, so that JSON loading does not bleed into ML code.
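
As a rough illustration of the hooking pattern described above, the sketch below shows a context manager that registers a forward hook and removes it on exit. It is not the pilot's actual `hooks.py`: the parameters `layer`, `vector`, and `alpha` are assumptions, and `ToyLayer` is a hypothetical stand-in that imitates only the `register_forward_hook` / `handle.remove()` protocol of `torch.nn.Module`, so the example runs without PyTorch installed.

```python
from contextlib import contextmanager


@contextmanager
def apply_steering(layer, vector, alpha=1.0):
    """Add alpha * vector to the layer's output while the context is open.

    Assumed signature for illustration; the real apply_steering() may differ.
    """

    def hook(module, inputs, output):
        # Decoder layers often return a tuple whose first item is the hidden state.
        if isinstance(output, tuple):
            return (output[0] + alpha * vector,) + output[1:]
        return output + alpha * vector

    handle = layer.register_forward_hook(hook)
    try:
        yield handle
    finally:
        # Runs on normal exit and on exceptions, so the hook never outlives the block.
        handle.remove()


class ToyLayer:
    """Hypothetical stand-in for a torch.nn.Module (demonstration only)."""

    def __init__(self):
        self._hooks = {}
        self._next_id = 0

    def register_forward_hook(self, hook):
        hook_id = self._next_id
        self._next_id += 1
        self._hooks[hook_id] = hook
        layer = self

        class Handle:
            def remove(self):
                layer._hooks.pop(hook_id, None)

        return Handle()

    def __call__(self, x):
        output = 2.0 * x
        # A hook that returns a value replaces the output, as in PyTorch.
        for hook in list(self._hooks.values()):
            result = hook(self, (x,), output)
            if result is not None:
                output = result
        return output


layer = ToyLayer()
baseline = layer(1.0)                      # no hook attached
with apply_steering(layer, vector=0.5, alpha=4.0):
    steered = layer(1.0)                   # hook shifts the output
after = layer(1.0)                         # hook has been removed again
```

With a real model, `layer` would be one of the modules resolved by the model wrapper and `vector` a tensor that broadcasts against the hidden state; the context manager itself would not need to change.
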
- Last Token Strategy: By default, vector extraction captures the difference only at the final token of each input prompt. While an `average_over_assistant` strategy is theoretically stronger, it requires parsing token masks, which adds unnecessary friction to this pilot.
- Evaluations (`eval/*.py`): To keep dependencies light and the code runnable immediately, evaluations are heuristics (`REFUSAL_PREFIXES`). Before scaling up the real research effort, `eval/harmful.py` should be overridden to invoke a structured LLM-as-a-judge prompt or API.
- Metrics Computation: `superadditivity.py` is independent and expects risk probabilities. The score is the linear combination `rAB - (rA + rB - r0)`. It assumes the risk probabilities combine linearly and stay within bounds.
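
The last two points can be sketched together as follows. This is a minimal illustration under stated assumptions, not the pilot's code: the prefix strings are invented placeholders (the real `REFUSAL_PREFIXES` may differ), `is_refusal` and `compliance_rate` are hypothetical helper names, and treating the non-refusal rate as the "risk" probability is an assumption made only for this example. Only the formula `rAB - (rA + rB - r0)` comes from the text above.

```python
# Placeholder prefixes for illustration; the pilot's REFUSAL_PREFIXES may differ.
REFUSAL_PREFIXES = ("I'm sorry", "I cannot", "I can't", "As an AI")


def is_refusal(response: str) -> bool:
    """Heuristic: a response is a refusal if it starts with a known prefix."""
    return response.strip().startswith(REFUSAL_PREFIXES)


def compliance_rate(responses: list[str]) -> float:
    """Fraction of responses that are not refusals (assumed 'risk' here)."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(not is_refusal(r) for r in responses) / len(responses)


def superadditivity(r0: float, r_a: float, r_b: float, r_ab: float) -> float:
    """Return rAB - (rA + rB - r0); a positive value means more than additive."""
    for name, value in (("r0", r0), ("rA", r_a), ("rB", r_b), ("rAB", r_ab)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1], got {value}")
    # Additive expectation: baseline risk plus each intervention's own increase.
    additive_baseline = r_a + r_b - r0
    return r_ab - additive_baseline


# Made-up numbers: the combined risk exceeds the additive expectation of 0.45.
score = superadditivity(r0=0.05, r_a=0.20, r_b=0.30, r_ab=0.70)
```

Note that the additive baseline `rA + rB - r0` is not clipped, so it can exceed 1 when both individual risks are high; that is one place where the linearity assumption can break down.
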
When ready to plug in HarmBench or StrongReject:
- Subclass `BaseMetric` in `eval/`.
- Create a `loaders.py` adapter that converts their complex schemas into our `EvalExample` schema and streams examples dynamically. Keep the downstream hooks agnostic to dataset format.
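
One possible shape for such an adapter is sketched below. Everything here is hypothetical: the `EvalExample` fields, the `adapt_rows` function, and the column names in the sample CSV are placeholders chosen for illustration, not the pilot's schema or the actual HarmBench or StrongReject formats. The point is only that a generator can map dataset-specific rows onto one schema lazily.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Iterable, Iterator, Mapping, Optional


@dataclass
class EvalExample:
    # Hypothetical fields; the real definition lives in schemas.py.
    prompt: str
    category: str = "unknown"
    metadata: dict = field(default_factory=dict)


def adapt_rows(
    rows: Iterable[Mapping[str, str]],
    prompt_key: str,
    category_key: Optional[str] = None,
) -> Iterator[EvalExample]:
    """Lazily convert dataset-specific rows into EvalExample objects."""
    for row in rows:
        # Keep any columns we do not map explicitly, so no information is lost.
        extras = {k: v for k, v in row.items() if k not in (prompt_key, category_key)}
        category = row.get(category_key, "unknown") if category_key else "unknown"
        yield EvalExample(prompt=row[prompt_key], category=category, metadata=extras)


# Placeholder data standing in for a benchmark file.
raw_csv = "behavior,category\nExample behavior one,cat_a\nExample behavior two,cat_b\n"
examples = adapt_rows(
    csv.DictReader(io.StringIO(raw_csv)),
    prompt_key="behavior",
    category_key="category",
)
first = next(examples)  # rows are converted one at a time, on demand
```

Because `adapt_rows` only sees mappings and key names, supporting another dataset would mean passing different keys rather than touching the hooks or metrics.
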