Developer Notes

This document describes the design philosophy behind the pilot codebase.

Philosophy

The core principle is modularity combined with readable explicitness. Do not bury Hugging Face generation calls under four layers of class inheritance when a clean functional wrapper can express the same thing.

Abstraction Choices

Model Wrapper (hf_model.py): Rather than tying the experiment to a specific model, we wrap AutoModelForCausalLM only to cache the tokenizer, the device, and reference mappings (e.g. resolving .model.layers[X] access so that it does not depend on architecture differences between model series).
Hooking System (hooks.py): Built entirely functionally. apply_steering() is a context manager that registers PyTorch forward hooks via .register_forward_hook and removes them when the context exits, so steering state should not bleed between runs as long as every hook is registered through the context manager.
Data Schemas (schemas.py): Kept pure (plain data definitions only) so that JSON loading concerns do not bleed into ML code.
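
The hook lifecycle described above can be sketched as follows. This is a minimal illustration, not the code in hooks.py: the signature of apply_steering() (a layer, a vector, and a scale) is an assumption, and a small nn.Linear stands in for a transformer layer.

```python
from contextlib import contextmanager

import torch
from torch import nn


@contextmanager
def apply_steering(layer: nn.Module, vector: torch.Tensor, scale: float = 1.0):
    """Add `scale * vector` to the layer output while the context is active.

    Hypothetical signature; the real apply_steering() may differ.
    """

    def _hook(module, inputs, output):
        # Some layers return a tuple whose first element is the hidden
        # states; plain modules return a single tensor.
        if isinstance(output, tuple):
            return (output[0] + scale * vector,) + tuple(output[1:])
        return output + scale * vector

    handle = layer.register_forward_hook(_hook)
    try:
        yield handle
    finally:
        # Runs on normal exit and on exceptions, so the hook is always removed.
        handle.remove()


layer = nn.Linear(4, 4)  # stand-in for a transformer layer
x = torch.ones(1, 4)

baseline = layer(x)
with apply_steering(layer, torch.full((4,), 2.0)):
    steered = layer(x)
restored = layer(x)
```

Inside the block the output is shifted by the steering vector; after the block the layer behaves as before. The try/finally is what guarantees removal even when generation raises.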

Design Limitations / Assumptions Built In

Last Token Strategy: By default, vector extraction captures the activation difference only at the final token of each input prompt. Averaging over the assistant tokens (average_over_assistant) is theoretically stronger, but it requires parsing token masks, which adds unnecessary friction to this pilot.
Evaluations (eval/*.py): To keep dependencies light and the code runnable immediately, evaluations are heuristic (matching against REFUSAL_PREFIXES). Before scaling up the actual research effort, eval/harmful.py should be replaced with a structured LLM-as-a-judge prompt or API call.
Metrics Computation: superadditivity.py is standalone and expects risk probabilities as input. The score is the gap between the combined risk and an additive baseline: rAB - (rA + rB - r0). It assumes the risks combine linearly on the probability scale and stay within bounds; note that the additive baseline rA + rB - r0 can itself fall outside [0, 1] when the individual risks are large.
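
As a worked illustration of the formula above, the following sketch computes the score from four risk probabilities. The function name, argument names, and example values are assumptions made for illustration and are not taken from superadditivity.py.

```python
def superadditivity(r_ab: float, r_a: float, r_b: float, r_0: float) -> float:
    """Return rAB - (rA + rB - r0).

    r_0  : risk with no intervention
    r_a  : risk with intervention A only
    r_b  : risk with intervention B only
    r_ab : risk with both interventions applied
    A positive score means the combination is riskier than the additive baseline.
    """
    for name, value in (("r_ab", r_ab), ("r_a", r_a), ("r_b", r_b), ("r_0", r_0)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1], got {value}")

    # Additive baseline: each intervention contributes its own lift over r_0.
    # This value is not clamped and may leave [0, 1] for large inputs.
    additive_baseline = r_a + r_b - r_0
    return r_ab - additive_baseline


# Made-up numbers: baseline is 0.20 + 0.30 - 0.05 = 0.45, so the score is about 0.25.
score = superadditivity(r_ab=0.70, r_a=0.20, r_b=0.30, r_0=0.05)

# With large individual risks the baseline exceeds 1 (0.8 + 0.7 - 0.1 = 1.4),
# so the score is negative even when the combined risk is at its maximum.
saturated = superadditivity(r_ab=1.0, r_a=0.8, r_b=0.7, r_0=0.1)
```

The second call shows why the linearity assumption matters: near saturation the score can only be negative, which says more about the bounded scale than about the interaction itself.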

Extending the Codebase

When ready to plug in HarmBench or StrongReject:

Subclass BaseMetric in eval.
Create a loaders.py adapter that converts their more complex schemas into our EvalExample schema and streams examples lazily. Keep the downstream hooks agnostic to dataset format.
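
A loader adapter along these lines might look like the sketch below. The EvalExample fields shown here are a hypothetical stand-in for the real schema in schemas.py, and the record keys ("text", "label") are made up; the actual column names of HarmBench or StrongReject would need to be checked against those datasets.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator


@dataclass(frozen=True)
class EvalExample:
    # Hypothetical stand-in for the schema defined in schemas.py.
    prompt: str
    category: str
    source: str


def iter_examples(
    records: Iterable[Dict[str, Any]],
    *,
    prompt_key: str,
    category_key: str,
    source: str,
) -> Iterator[EvalExample]:
    """Lazily map dataset-specific records onto EvalExample, one at a time."""
    for record in records:
        yield EvalExample(
            prompt=str(record[prompt_key]),
            category=str(record.get(category_key, "unknown")),
            source=source,
        )


# Made-up records; real datasets use their own column names.
raw_records = [
    {"text": "example prompt one", "label": "category_a"},
    {"text": "example prompt two"},
]
examples = list(
    iter_examples(raw_records, prompt_key="text", category_key="label", source="demo")
)
```

Because the adapter is a generator, a large benchmark can be streamed without loading it fully into memory, and downstream code only ever sees EvalExample instances.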
