feat(email): follow-up tracking — flag sent mail awaiting a reply by kovtcharov · Pull Request #1916 · amd/gaia

kovtcharov · 2026-07-01T21:41:36Z

The unanswered email you forgot you sent is the inbox's biggest silent failure mode — before this, a dropped thread simply disappeared. Now you can ask the email agent "who hasn't replied to me?" and the new read-only find_awaiting_reply tool scans the Sent folder and surfaces every thread still waiting on a response past a configurable window (default 3 days): message id, recipient, subject, and age, most overdue first. Detection only — it never drafts or sends a nudge (autonomous follow-up sending stays with #555, confirmation-gated), and the tests assert the detector touches no send path at all.
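For readers skimming the thread, the detection rule described above amounts to roughly the following. This is a minimal sketch, not the code in this PR: the `Message` record, the `find_awaiting` function, and the result keys are hypothetical stand-ins. Only the rule itself comes from the description (newest message in the thread is the user's own, older than the window, most overdue first).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Message:
    """Hypothetical, simplified message record (not the PR's data model)."""
    message_id: str
    thread_id: str
    sender: str
    recipient: str
    subject: str
    sent_at: datetime


def find_awaiting(messages, user_email, now, window_days=3):
    """Return threads whose newest message is the user's own and past the window."""
    if not user_email:
        raise ValueError("user_email must be non-empty")
    if window_days < 0:
        raise ValueError("window_days must be >= 0")

    # Keep only the newest message of each thread.
    newest = {}
    for msg in messages:
        current = newest.get(msg.thread_id)
        if current is None or msg.sent_at > current.sent_at:
            newest[msg.thread_id] = msg

    cutoff = now - timedelta(days=window_days)
    me = user_email.lower()
    # A thread is awaiting a reply when its newest message was sent by the
    # user and that message is strictly older than the window.
    overdue = [
        m for m in newest.values()
        if m.sender.lower() == me and m.sent_at < cutoff
    ]
    overdue.sort(key=lambda m: m.sent_at)  # oldest (most overdue) first
    return [
        {
            "message_id": m.message_id,
            "recipient": m.recipient,
            "subject": m.subject,
            "age_days": (now - m.sent_at).days,
        }
        for m in overdue
    ]
```

A thread in which anyone else wrote last is never flagged, regardless of age, because only the newest message per thread is considered.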

Scope notes for the reviewer:

Gmail-only, refused loudly otherwise: the Graph backend serves the inbox folder for unrecognized labels, so scanning Outlook would be silently wrong; a Microsoft-only setup gets a clear error, and a mixed setup names the skipped mailbox (same boundary as quarantine; see #1738, "bug(email): archive-undo fails for Outlook messages: folder-move changes the id", a #1707 follow-up).
Agent-loop surface only (Agent UI POST /v1/email/query); no new REST route, so the REST contract and SCHEMA_VERSION are unchanged. A deterministic fixed-function route following the #1883 pattern ("feat(email): expose search, pre-scan, archive/quarantine & calendar on the REST contract (schema 2.1)") can ride a follow-up if wanted.
Capability tracker: in #1691 ("Email Agent: capability completion tracker (live status vs. spec)"), row 15 (Follow-up tracking) moves ⬜ → ✅ agent-wired; specification.html capability matrix updated to match.
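The Gmail-only boundary in the first scope note could look something like the sketch below. Everything named here is an assumption for illustration: the `select_scannable_mailboxes` helper, the `UnsupportedBackendError` type, and the address-to-provider mapping are not taken from the PR. The behaviour it illustrates is the one stated above: a Microsoft-only setup is refused with a clear error, and a mixed setup reports which mailbox was skipped.

```python
class UnsupportedBackendError(RuntimeError):
    """Hypothetical error type; the PR's real exception name is not shown."""


def select_scannable_mailboxes(mailboxes):
    """Split connected mailboxes into (scannable, skipped).

    `mailboxes` maps an address to a provider string. Both the mapping shape
    and the "gmail"/"microsoft" labels are assumptions for this sketch.
    """
    scannable = [addr for addr, provider in mailboxes.items() if provider == "gmail"]
    skipped = [addr for addr, provider in mailboxes.items() if provider != "gmail"]
    if not scannable:
        # Refuse loudly rather than scan a folder that may not be Sent.
        raise UnsupportedBackendError(
            "Follow-up tracking is Gmail-only; no Gmail mailbox is connected "
            f"(skipped: {', '.join(skipped) or 'none'})."
        )
    return scannable, skipped
```

Returning the skipped list alongside the scannable one lets the caller name the mailbox it did not scan instead of dropping it silently.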

Test plan

python -m pytest tests/unit/agents/email/ tests/unit/email/ hub/agents/python/email/tests/ — 729 passed (the one failure, test_agent_version_matches_package_metadata, is pre-existing local-venv metadata skew and fails identically on clean main)
New tests/unit/agents/email/test_followup_tracking.py (14 tests) locks the acceptance criteria of #1606 (the follow-up tracking issue this PR closes): replied thread NOT flagged; unreplied flagged only past the window; latest-send-only flagging; no send_*/draft side effects (transport log + module source); fail-loud on empty user email, bad window, unparseable internalDate; Microsoft-only refusal; config-window wiring through the registered tool
python util/lint.py --all --fix clean
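The "no send_*/draft side effects (transport log)" criterion can be checked along these lines. This is a sketch under assumptions, not the test in this PR: `RecordingTransport`, `scan_sent`, `write_calls`, and the `get_thread` method are hypothetical; only the `list_messages(label_ids=["SENT"], max_results=...)` call shape appears in this thread.

```python
class RecordingTransport:
    """Hypothetical fake backend that logs the name of every call made."""

    def __init__(self):
        self.calls = []

    def list_messages(self, **kwargs):
        self.calls.append("list_messages")
        return {"messages": []}

    def get_thread(self, thread_id):
        self.calls.append("get_thread")
        return {"id": thread_id, "messages": []}

    def send_message(self, **kwargs):
        self.calls.append("send_message")

    def create_draft(self, **kwargs):
        self.calls.append("create_draft")


def write_calls(transport):
    """Return logged calls that look like send or draft operations."""
    return [c for c in transport.calls if c.startswith("send_") or "draft" in c]


def scan_sent(transport):
    """Stand-in for the detector: it issues read calls only."""
    listing = transport.list_messages(label_ids=["SENT"], max_results=100)
    return [transport.get_thread(stub["threadId"]) for stub in listing["messages"]]
```

A test then runs the scan against the fake and asserts that `write_calls(transport)` is empty.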

Closes #1606

The dropped thread is the inbox's biggest silent failure mode: you send a question, nobody answers, and nothing resurfaces it. Via the new read-only find_awaiting_reply tool, the agent can now scan the Sent folder and flag every thread whose newest message is still the user's own once that message is older than a configurable window (followup_window_days, default 3 days, or set per call). Each result carries message id, recipient, subject, and age, most overdue first. Detection only, per the #555 boundary: the module imports no send path (the unit tests assert both the module source and the backend transport calls stay read-only), and any actual chaser goes through the confirmation-gated reply tools at the user's request. Gmail-only for now: the Graph backend serves the inbox folder for unrecognized labels, so a Microsoft-only setup gets a loud refusal instead of a silently wrong scan.

github-actions · 2026-07-01T21:44:30Z

Verdict: Approve with suggestions — clean, well-tested, well-documented feature.

This adds read-only follow-up tracking to the email agent: a new find_awaiting_reply tool scans your Sent folder and surfaces threads still waiting on a reply past a configurable window (default 3 days), most overdue first. Detection only — it never drafts or sends a nudge, and the tests prove it touches no send path. Gmail-only, and a Microsoft-only setup gets a loud error rather than a silently wrong scan. The change is genuinely bulletproof on the fundamentals: fail-loud invariants, no silent fallbacks, and every companion doc (README / SPEC / SKILL / CHANGELOG / guide / capability matrix) updated in lockstep.

One thing worth a look before merge: the scan only inspects the newest 100 sent messages. Because that list is newest-first, a thread whose last send was months ago and has since been buried past the 100 most-recent sends will never be inspected — which is exactly the "dropped thread you forgot you sent" the feature is meant to catch. For a heavy sender the answer to "what am I still waiting on?" could be quietly incomplete. Not a blocker (the bound is a reasonable interactive ceiling and threads_scanned is returned), but consider making the truncation visible to the user, or scanning older sends too.

🔍 Technical details

🟡 Important

Newest-100 truncation can hide the most-overdue threads (followup_tools.py:305-314)
list_messages(label_ids=["SENT"], max_results=DEFAULT_SENT_SCAN_CEILING) fetches only 100 stubs, and Gmail (and the fake, tests/fixtures/email/fake_gmail.py:344-347) returns them newest-first. The oldest, most-overdue sends — the feature's headline use case — are the ones most likely to fall outside that window for a user with a busy Sent folder, and the result gives no signal that the scan was partial. threads_scanned is returned but there's no "there may be more" hint. Suggest surfacing the truncation when the listing hits the ceiling, e.g. add a scan_truncated: true field when len(listing["messages"]) >= DEFAULT_SENT_SCAN_CEILING, so the agent can tell the user the answer isn't exhaustive. (Doc'ing the limit in the tool docstring/guide would be a lighter-weight alternative.)

🟢 Minor

max_threads cap is silently DEFAULT_SENT_SCAN_CEILING-bounded too (followup_tools.py:407, :305): the tool caps max_threads at 100, but list_messages also only pulls 100 stubs, so max_threads=100 can still inspect fewer threads than requested. Fine as-is; just note the two ceilings are coupled — a one-line comment tying max_threads's 100 cap to DEFAULT_SENT_SCAN_CEILING would prevent a future drift where someone raises one but not the other.

Strengths

Docs updated in lockstep: README, SPEC, SKILL, CHANGELOG, docs/guides/email.mdx, and the specification.html capability matrix all move capability 15 Planned→Wired with consistent Gmail-only / read-only / #555-vs-#1606 framing (#555: "Autonomous mode: agent schedules follow-up messages and acts proactively without waiting for user input"; #1606: the follow-up tracking issue). This is exactly the multi-doc-sync discipline CLAUDE.md calls for.
Fail-loud throughout, no silent fallbacks — empty user email (:290), unparseable internalDate (:243), negative window, and empty-thread views all raise with actionable messages; the Microsoft-only path refuses loudly instead of serving the wrong folder. The tool's outer except Exception is a legitimate agent-boundary translation to a structured error envelope, not a swallow.
Test suite locks the acceptance criteria well (test_followup_tracking.py) — including a nice belt-and-suspenders read-only proof (transport-log assertion and source-level test_module_references_no_send_path), correct newest-message-only flagging, config-window wiring through the registered tool, and the confirmation-gating check (test_tool_is_not_confirmation_gated).
Clean mixin composition — registered after _register_read_tools() alongside the other read tools, system prompt updated to mark it detection-only, no new REST route so SCHEMA_VERSION correctly stays put.
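To illustrate the fail-loud pattern praised above, here is a minimal sketch of timestamp and window validation. The function names are hypothetical and this is not the code at the cited line numbers. It relies on one fact about the Gmail API, namely that `internalDate` is epoch milliseconds delivered as a string; the rest is assumption.

```python
from datetime import datetime, timezone


def parse_internal_date(raw):
    """Convert a Gmail internalDate (epoch milliseconds, as a string) to UTC.

    Raises ValueError instead of substituting a default timestamp.
    """
    try:
        millis = int(raw)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"unparseable internalDate: {raw!r}") from exc
    if millis < 0:
        raise ValueError(f"negative internalDate: {raw!r}")
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


def validate_window_days(days):
    """Reject non-numeric or negative windows with an actionable message."""
    if isinstance(days, bool) or not isinstance(days, (int, float)):
        raise ValueError(f"window must be a number of days, got {days!r}")
    if days < 0:
        raise ValueError(f"window must be >= 0 days, got {days!r}")
    return days
```

Raising here, rather than treating a bad timestamp as "now" or "epoch", keeps a malformed message from being silently reported as either never overdue or maximally overdue.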

…haustiveness The Sent listing is newest-first and capped at 100 stubs, so a heavy sender's oldest — most overdue — threads can fall outside one scan. The result now carries scan_truncated whenever a ceiling was hit (next page token, full listing page, or more threads than max_threads), the tool docstring tells the LLM to relay the incompleteness, and the max_threads cap is tied to DEFAULT_SENT_SCAN_CEILING so the two limits can't drift apart. Raised in the PR #1916 review.

kovtcharov · 2026-07-01T21:59:06Z

Both review points addressed in 6e16e60. The result now carries scan_truncated: true whenever a ceiling was hit (listing page full, next-page token present, or more sent threads than max_threads), the tool docstring instructs the agent to relay the incompleteness to the user, and the guide documents the 100-thread ceiling. The max_threads cap is now expressed as DEFAULT_SENT_SCAN_CEILING with a comment noting the coupling, so the two limits can't drift apart. New unit test covers both truncation paths.
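The three truncation conditions and the coupled cap described above can be sketched as follows. This is an illustration only, not the code in 6e16e60: `plan_scan` and `MAX_THREADS_CAP` are hypothetical names, and the `nextPageToken` key assumes the listing mirrors the Gmail API's list response. `DEFAULT_SENT_SCAN_CEILING`, its value of 100, `threads_scanned`, and `scan_truncated` come from this thread.

```python
DEFAULT_SENT_SCAN_CEILING = 100  # ceiling stated in the review

# Tie the per-call cap to the listing ceiling so the two cannot drift apart.
MAX_THREADS_CAP = DEFAULT_SENT_SCAN_CEILING


def plan_scan(listing, max_threads):
    """Pick thread ids to inspect and report whether the scan was partial."""
    if max_threads < 1:
        raise ValueError(f"max_threads must be >= 1, got {max_threads!r}")
    limit = min(max_threads, MAX_THREADS_CAP)
    stubs = listing.get("messages", [])
    # Deduplicate thread ids while preserving the newest-first order.
    thread_ids = list(dict.fromkeys(stub["threadId"] for stub in stubs))
    truncated = (
        len(stubs) >= DEFAULT_SENT_SCAN_CEILING   # listing page is full
        or bool(listing.get("nextPageToken"))     # backend reports more pages
        or len(thread_ids) > limit                # more threads than allowed
    )
    return {
        "thread_ids": thread_ids[:limit],
        "threads_scanned": min(len(thread_ids), limit),
        "scan_truncated": truncated,
    }
```

When `scan_truncated` is true, the agent can tell the user that older sends may exist beyond what was scanned.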

kovtcharov requested a review from kovtcharov-amd as a code owner July 1, 2026 21:41

github-actions Bot added the documentation (Documentation changes), tests (Test changes), and agent::email (Email agent changes) labels Jul 1, 2026

kovtcharov added 2 commits July 1, 2026 14:50

test(email): add find_awaiting_reply to the expected-tool allowlist

298b34f

TestToolRegistry.test_no_unexpected_tool_set guards against tools that bypass confirmation logic; the new read-only follow-up tracker belongs in its expected set.

itomek approved these changes Jul 2, 2026

kovtcharov wants to merge 3 commits into main from claudia/task-bfd8c261
