A longitudinal reliability benchmark foundation for agent lifespan engineering.
7+ scenarios | 4 aging mechanisms | AgingCard v1.0.0 | Compression · Interference · Revision · Maintenance.
Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability is therefore a lifespan property of the full agent harness, not a snapshot property of the base model.
AgingBench is a longitudinal reliability benchmark for agent lifespan engineering (ALE) — measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. It organizes agent aging into four mechanisms (compression, interference, revision, maintenance) and uses temporal dependency graphs + paired counterfactual probes to produce stage-level diagnostic profiles (write, retrieval, utilization). Both scenario mode (controlled scenarios against your model) and telemetry mode (production-trace analysis) emit the same AgingCard schema, so the same vocabulary covers both pre-deployment evaluation and post-deployment observability.
We are committed to actively maintaining this repository as a foundation for ALE research and engineering, and welcome contributions from the community.
- v0.3.1 — We are currently refactoring our codebase for cleaning and easy-to-use.
- v0.3.0 — Initial public release (2026-05-25). Seven core scenarios (S1–S7) across the four aging mechanisms, plus the S8 SWE-bench-Aging community extension. AgingCard schema v1.0.0,
agingbench+agingbench-liteCLIs in a singlepip install, behavioral-DAG telemetry-mode post-hoc trace analysis. Full log will be updated in docs/CHANGELOG.md.
A single AgingBench run simulates an agent's operational lifetime in a controlled loop:
For each session t = 0, 1, 2, …, N:
1. Agent reads its compressed memory M_t
2. Agent receives tasks and answers them
3. Benchmark scores the answers (keyword match / probe match)
4. Session history H_t is folded into memory:
M_{t+1} = compress(M_t + H_t) ← information leaks here
Over time M_t loses facts → scores drop → AgingBench fits the aging curve.
Key concepts
- Session — one round of tasks + memory update. ≈ "one day" in the agent's life.
- Memory policy — how the agent stores / compresses its history. The headline policy is
summarize_store, paired with one of the compaction prompts underprototype/experiments/prompts/—compact_lossy.txtfor the aggressive "lossy" style,compact_medium.txtfor the higher-fidelity "careful" style. Baselines likegrowing_history(no compression) andno_memory(frozen) live alongside several other policies (episodic, chain-compress, typed-state, workspace, observer, …) inprototype/agingbench/core/memory/— browse the folder for the full set. - Aging curve
m(t)— score vs. session. Half-life = sessions until 50% of capability is lost. - Memory is the independent variable — same model with different policies produces different aging curves.
| ID | Name | Tier | Sessions | What it tests |
|---|---|---|---|---|
| S1 | Research Literature | T1 | 8–20 | Fact survival under compression |
| S2 | Lifestyle Assistant | T1 | 8–10 | Constraint adherence + revision (forget, accumulator) |
| S3 | Knowledge Base | T1 | 8–12 | Decision fidelity under accumulation |
| S4 | Software Engineering | T1 | 8–12 | Code planning with retractions |
| S5 | Self-Planning Notebook | T1 | 8–20 | Agent manages its own workspace files |
| S6 | Naturalistic | T1 | 10–15 | Multi-domain recall with corrections |
| S7 | Research-Notes Coding Task | T2 | 10 | Production CLI (OpenHands / Claude Code) building a notes-app codebase |
| S8 | SWE-bench-Aging (newly added) | T2 | 8 | Production CLI on a curated chain of real Django GitHub issues |
Tier 1 = benchmark-driven loop; Tier 2 = external agent driving its own loop, wrapped via an adapter. Per-scenario READMEs at prototype/agingbench/scenarios/sN_*/README.md cover data design, scoring pipeline, and example invocations.
Want a new scenario? S8 was added in v0.3.0 and we're actively welcoming further scenario contributions — production agent deployments, domain-specific failure modes, anything that exercises a memory-aging axis we haven't covered yet. See docs/CONTRIBUTING.md#adding-a-scenario for the protocol (scenario manifest → generator → runner → tests).
Out-of-the-box AgingBench ships four Tier-2 adapters (claude_code, openhands, codex, cursor) and eleven Tier-1 memory policies. Custom agents and memory backbones plug in by subclassing one ABC each — no edits to AgingBench internals:
| Track | Subclass | SUT YAML hook | Runnable template |
|---|---|---|---|
| Tier-2 (S5, S7, S8) — agent framework / harness | AgentAdapter |
adapter: { type: custom, class: my_pkg.mod:MyAgent } |
prototype/examples/byo_agent_minimal.py |
| Tier-1 (S1–S4, S6) — memory backbone / RAG / KV store | MemoryPolicy |
memory_policy: { type: custom, class: my_pkg.mod:MyMemory } |
prototype/examples/byo_memory_minimal.py |
Both templates self-test under python examples/byo_*_minimal.py and are short enough (≈150 lines each, mostly comments) to read in one sitting. The type: custom dispatch loads any importable module:ClassName; extra YAML keys are forwarded as kwargs. If you don't want the SUT-YAML round-trip, both flags also work directly:
# No --generated flag → these run on the scenario's curated dataset (fixed session count).
# Add --generated to switch to the programmatic generator and set the count with --sessions.
agingbench run --scenario s7_research_notes \
--sut examples/sut_byo_agent.yaml \
--adapter my_pkg.my_agent:MyAgent --seeds 3 --card
agingbench run --scenario s1_research_literature \
--sut examples/sut_byo_memory.yaml \
--memory-policy my_pkg.my_memory:MyMemory --seeds 3 --cardAgentAdapter / MemoryPolicy are also usable standalone in your own driver code — they aren't coupled to the CLI.
Opaque-agent caveat. For Tier-2 BYO agents, the optional
get_workspace_stateandget_memory_texthooks are how AgingBench peeks at what your agent persists to disk. Leaving them at the default{}/""is fine — the run still produces a valid AgingCard — but file-survival and entity-recall probes will only credit information your agent recites back in its reply. If your agent writes notes / scratchpads / memos to a known directory, returning them from these hooks meaningfully sharpens the diagnosis.
pip install "git+https://github.com/VITA-Group/AgingBench.git@v0.3.0#subdirectory=prototype"Registers the agingbench and agingbench-lite CLIs and bundles the prompt templates, profile YAMLs, and JSON schemas the runner needs at runtime.
Local clone alternative (for development):
git clone https://github.com/VITA-Group/AgingBench.git && cd AgingBench && uv sync --extra api --project prototype.
API keys are required for any run that calls an API model (Anthropic / OpenAI / Gemini). Set ANTHROPIC_API_KEY / OPENAI_API_KEY / GEMINI_API_KEY / HF_TOKEN in your shell or in a .env at the repo root (auto-loaded by the CLI). Local-GPU SUTs (e.g. qwen3_8b_lossy_compress.yaml) need no API key.
Docker is required only for S8 (SWE-bench-Aging). Pre-pull the 8 images with the snippet in
prototype/agingbench/scenarios/s8_swe_bench/README.md. S1–S7 do not need Docker.
uv run --project prototype agingbench run \
--scenario s6_naturalistic \
--sut agingbench/registry/suts/qwen3_8b/qwen3_8b_lossy_compress.yaml \
--generated --sessions 10 --cardRuns the naturalistic multi-domain scenario for 10 sessions on Qwen3-8B locally (open weights, no API key). To run on an API model instead, swap the --sut to one under agingbench/registry/suts/haiku45/ (needs ANTHROPIC_API_KEY) or gpt4omini/ (needs OPENAI_API_KEY) — API runs are faster, billed at your provider's per-token rate.
Success looks like: m0=… m_final=… half_life=… slope=… printed at the end, plus aging_curve.png + aging_card.json written under experiments/results/....
Each run writes to prototype/experiments/results/<scenario>/<sut_id>/:
| File | Contents |
|---|---|
metrics.json |
m0, m_final, half_life, decay_slope, per-session checkpoints |
dependency_metrics.json |
DAG metrics: chain_recall_by_depth, version_accuracy, interference_resistance |
trace.jsonl |
OpenInference-style event log (every LLM call, tool call, probe) |
aging_curve.png |
Headline aging curve plot |
aging_card.json |
Consolidated cross-scenario card (v1.0.0 schema) — emitted with --card |
# Lite — S1, S2, S7 × 3 seeds × Haiku-class. ~30 min, no Docker.
uv run --project prototype agingbench run --suite lite --generated --seeds 3 --card
# Full — all scenarios (S1–S7 + S8 extension) × default SUTs × 3 seeds. ~6 hr. S8 needs Docker.
uv run --project prototype agingbench run --suite full --generated --seeds 3 --card
# Pressure sweep — S1+S2+S5 at light/medium/heavy PressureConfig presets.
uv run --project prototype agingbench run --suite pressure_sweep --generated --seeds 3 --cardOverride the default SUT for any suite with --sut <yaml> (browse prototype/agingbench/registry/suts/). Compare two run directories with agingbench compare <run_a> <run_b>. Product teams that want lite as a pre-deployment check on every PR can copy prototype/examples/ci/agingbench-lite-template.yml into their own .github/workflows/.
Every run with --card emits an aging_card.json. One card per (scenario × SUT × seed).
# Validate a card against the v1.0.0 JSON Schema
python -m agingbench.metrics.aging_card_validate \
experiments/results/*/seed_*/aging_card.jsonSchema: prototype/agingbench/metrics/aging_card_schema.json. Sample cards: prototype/examples/sample_cards/. Submission process: docs/LEADERBOARD.md.
Scenario mode runs constructed scenarios against your model. Telemetry mode is the inverse: feed it a JSONL trace from a deployed agent and it emits the same AgingCard schema — no probes, no gold answers, just per-mechanism inferences from what already happened. v0.3.0 verifies one production format end-to-end: Claude Code session files (~/.claude/projects/<id>/*.jsonl). The pipeline also accepts a generic JSONL shape (any custom log with session_id / role / content / token fields) for bring-your-own traces; adapters for OpenAI Assistants, OpenHands, Langfuse, LangSmith, and OpenTelemetry parse-test successfully but their extraction recipes against current third-party SDKs are not yet validated and will land in subsequent releases.
Telemetry mode is a Python library:
from agingbench.telemetry import trace_to_card_v11
result = trace_to_card_v11(
trace_jsonl="path/to/your_trace.jsonl",
trace_format="claude_code", # v0.3.0 verifies claude_code + generic; openai_assistants / openhands / langfuse / langsmith / otlp are parse-tested and will be validated in subsequent releases
profile="code_assistant", # or: generic
extract_outcomes=["claude_session_flags", "record_patterns"],
sut_hint={"sut_id": "prod_agent", "model_id": "claude-sonnet-4-5"},
)
print(result.card["headline"]) # same AgingCard fields as scenario runs
print(result.card["trace_audit"]) # per-mechanism trajectories + saturation-aware verdictsEach of the four aging mechanisms emits a per-session trajectory and a <metric>_verdict field (rising_degradation, floor_healthy, etc.). Full pipeline (trace formats, deployment profiles, OutcomeEvent extractor specs, privacy scrubbing): prototype/agingbench/telemetry/README.md.
For library use, set API keys as ordinary env vars (os.environ["ANTHROPIC_API_KEY"] = …) — the .env auto-loader documented above is CLI-only.
- Compression — write-before-query barrier destroys facts at compaction time
- Interference — growing state buries relevant facts behind stale ones
- Revision — system fails to track changing truth (latent state, selective forgetting)
- Maintenance — operational events (recompaction, model swap) cause silent regression
AgingBench treats memory policy as the independent variable: same model, different policies produce different aging curves.
| If you want to… | Read |
|---|---|
| Plug in your agent (Tier 2) | prototype/examples/byo_agent_minimal.py |
| Plug in your memory backbone (Tier 1) | prototype/examples/byo_memory_minimal.py |
| Add a new model / memory policy / scenario / adapter | docs/CONTRIBUTING.md |
| Submit an AgingCard to the public leaderboard | docs/LEADERBOARD.md |
| Use telemetry mode on production traces | prototype/agingbench/telemetry/README.md |
| See per-version release notes | docs/CHANGELOG.md |
If you find this work useful, please cite:
@inproceedings{agingbench2026,
title = {Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems},
author = {Zhu, Jianing and Ro, Yeonju and Robertson, John T and Wang, Kevin and
Li, Junbo and Vikalo, Haris and Akella, Aditya and Wang, Zhangyang},
booktitle = {arXiv preprint arXiv:2605.26302},
year = {2026}
}This project is released under the MIT License.
