
feat(email): follow-up tracking — flag sent mail awaiting a reply #1916

Open
kovtcharov wants to merge 3 commits into main from claudia/task-bfd8c261


@kovtcharov

Collaborator

The unanswered email you forgot you sent is the inbox's biggest silent failure mode: until now, a dropped thread simply disappeared. With this change you can ask the email agent "who hasn't replied to me?", and the new read-only find_awaiting_reply tool scans the Sent folder and surfaces every thread still waiting on a response past a configurable window (default 3 days), listing message id, recipient, subject, and age, most overdue first. It is detection only: it never drafts or sends a nudge (autonomous follow-up sending stays with #555, confirmation-gated), and the tests assert that the detector touches no send path at all.
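The detection rule described above can be sketched in a few lines. This is a minimal illustration, not the PR's find_awaiting_reply: it assumes threads have already been fetched and flattened into a hypothetical Message record (sender, recipient, subject, timezone-aware send time), and it flags a thread only when the thread's newest message is the user's own and older than the window.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Message:
    # Hypothetical shape; the real tool reads Gmail message resources.
    id: str
    sender: str
    recipient: str
    subject: str
    sent_at: datetime  # timezone-aware


@dataclass(frozen=True)
class AwaitingReply:
    message_id: str
    recipient: str
    subject: str
    age_days: float


def find_awaiting_reply(
    threads: list[list[Message]],
    user_email: str,
    now: datetime,
    window_days: float = 3,
) -> list[AwaitingReply]:
    """Flag threads whose newest message is the user's own and older than the window."""
    if not user_email:
        raise ValueError("user email is required to tell own sends from replies")
    if window_days < 0:
        raise ValueError(f"window_days must be >= 0, got {window_days}")
    window = timedelta(days=window_days)
    me = user_email.strip().lower()
    flagged: list[AwaitingReply] = []
    for thread in threads:
        if not thread:
            raise ValueError("empty thread view")
        # Only the newest message matters: a send followed by a reply is answered.
        newest = max(thread, key=lambda m: m.sent_at)
        if newest.sender.strip().lower() != me:
            continue
        age = now - newest.sent_at
        if age > window:
            flagged.append(
                AwaitingReply(newest.id, newest.recipient, newest.subject,
                              age / timedelta(days=1))
            )
    # Most overdue first.
    flagged.sort(key=lambda r: r.age_days, reverse=True)
    return flagged
```

Because only the newest message is checked, a thread where the recipient replied drops out, while a thread whose last word is still the user's own stays flagged once it ages past the window.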

Scope notes for the reviewer:

Test plan

  • python -m pytest tests/unit/agents/email/ tests/unit/email/ hub/agents/python/email/tests/ — 729 passed (the one failure, test_agent_version_matches_package_metadata, is pre-existing local-venv metadata skew and fails identically on clean main)
  • New tests/unit/agents/email/test_followup_tracking.py (14 tests) locks the #1606 acceptance criteria: replied thread NOT flagged; unreplied thread flagged only past the window; only the latest send in a thread is flagged; no send_*/draft side effects (checked via both the transport log and the module source); fail-loud on empty user email, bad window, or unparseable internalDate; Microsoft-only refusal; config-window wiring through the registered tool
  • python util/lint.py --all --fix: clean
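One of the fail-loud cases in the test plan, the unparseable internalDate, can be illustrated on its own. A minimal sketch, assuming the scanner reads the raw Gmail message resource as a dict; parse_internal_date is a hypothetical helper name, not the PR's code. The Gmail API serializes internalDate as epoch milliseconds in a string.

```python
from datetime import datetime, timezone


def parse_internal_date(message: dict) -> datetime:
    """Convert a Gmail message's internalDate (epoch ms, as a string) to UTC.

    Raises instead of guessing, so a bad timestamp can never make an old
    thread look fresh and silently drop it from the results.
    """
    raw = message.get("internalDate")
    try:
        millis = int(raw)
    except (TypeError, ValueError):  # missing (None) or non-numeric text
        raise ValueError(
            f"message {message.get('id', '?')!r} has unparseable internalDate {raw!r}"
        ) from None
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
```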

Closes #1606


The dropped thread is the inbox's biggest silent failure mode: you send a
question, nobody answers, and nothing resurfaces it. The agent can now scan
the Sent folder and flag every thread whose newest message is still the
user's own once it is older than a configurable window
(followup_window_days, default 3 days, or overridden per call) via the new read-only
find_awaiting_reply tool — message id, recipient, subject, and age, most
overdue first.

Detection only, per the #555 boundary: the module imports no send path (the
unit tests assert both the module source and the backend transport calls
stay read-only), and any actual chaser goes through the confirmation-gated
reply tools at the user's request. Gmail-only for now — the Graph backend
serves the inbox folder for unrecognized labels, so a Microsoft-only setup
gets a loud refusal instead of a silently wrong scan.
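The Gmail-only refusal is a deliberate choice of a loud error over a quiet fallback. A minimal sketch of that gate, assuming backends are held in a dict keyed by provider; select_sent_backend, FollowupUnsupportedError, and the "gmail"/"microsoft" keys are hypothetical names for illustration only.

```python
from __future__ import annotations


class FollowupUnsupportedError(RuntimeError):
    """Raised when no configured backend can scan the Sent folder correctly."""


def select_sent_backend(backends: dict[str, object]) -> object:
    """Return the Gmail backend, refusing loudly when it is not configured.

    Per the commit message, the Graph backend serves the inbox folder for
    unrecognized labels, so falling back to it could scan the wrong folder.
    """
    gmail = backends.get("gmail")
    if gmail is None:
        configured = ", ".join(sorted(backends)) or "none"
        raise FollowupUnsupportedError(
            "find_awaiting_reply is Gmail-only for now; "
            f"configured backends: {configured}"
        )
    return gmail
```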
github-actions added the documentation (Documentation changes), tests (Test changes), and agent::email (Email agent changes) labels on Jul 1, 2026
@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

Verdict: Approve with suggestions — clean, well-tested, well-documented feature.

This adds read-only follow-up tracking to the email agent: a new find_awaiting_reply tool scans your Sent folder and surfaces threads still waiting on a reply past a configurable window (default 3 days), most overdue first. It is detection only: it never drafts or sends a nudge, and the tests check that it touches no send path. It is Gmail-only, and a Microsoft-only setup gets a loud error rather than a silently wrong scan. The fundamentals are solid: fail-loud invariants, no silent fallbacks, and every companion doc (README / SPEC / SKILL / CHANGELOG / guide / capability matrix) updated in lockstep.

One thing worth a look before merge: the scan only inspects the newest 100 sent messages. Because that list is newest-first, a thread whose last send was months ago and has since been buried past the 100 most-recent sends will never be inspected — which is exactly the "dropped thread you forgot you sent" the feature is meant to catch. For a heavy sender the answer to "what am I still waiting on?" could be quietly incomplete. Not a blocker (the bound is a reasonable interactive ceiling and threads_scanned is returned), but consider making the truncation visible to the user, or scanning older sends too.

🔍 Technical details

🟡 Important

Newest-100 truncation can hide the most-overdue threads (followup_tools.py:305-314)
list_messages(label_ids=["SENT"], max_results=DEFAULT_SENT_SCAN_CEILING) fetches only 100 stubs, and Gmail (like the fake in tests/fixtures/email/fake_gmail.py:344-347) returns them newest-first. The oldest, most-overdue sends, which are the feature's headline use case, are the ones most likely to fall outside that window for a user with a busy Sent folder, and the result gives no signal that the scan was partial: threads_scanned is returned, but nothing hints that there may be more. Suggest surfacing the truncation when the listing hits the ceiling, e.g. add a scan_truncated: true field when len(listing["messages"]) >= DEFAULT_SENT_SCAN_CEILING, so the agent can tell the user the answer isn't exhaustive. (Documenting the limit in the tool docstring/guide would be a lighter-weight alternative.)

🟢 Minor

  • max_threads cap is silently DEFAULT_SENT_SCAN_CEILING-bounded too (followup_tools.py:407, :305): the tool caps max_threads at 100, but list_messages also only pulls 100 stubs, so max_threads=100 can still inspect fewer threads than requested. Fine as-is; just note the two ceilings are coupled — a one-line comment tying max_threads's 100 cap to DEFAULT_SENT_SCAN_CEILING would prevent a future drift where someone raises one but not the other.
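The coupling the minor note asks for can be made structural by deriving the cap from the same constant. A minimal sketch, assuming max_threads is clamped rather than rejected above the ceiling; effective_max_threads is a hypothetical name, and only DEFAULT_SENT_SCAN_CEILING (100) comes from the review.

```python
DEFAULT_SENT_SCAN_CEILING = 100  # single constant for both limits


def effective_max_threads(max_threads: int) -> int:
    """Validate max_threads and cap it at the Sent-listing ceiling.

    list_messages never returns more than DEFAULT_SENT_SCAN_CEILING stubs,
    so a larger max_threads could not be honored; deriving the cap from the
    same constant keeps the two limits from drifting apart.
    """
    if max_threads < 1:
        raise ValueError(f"max_threads must be >= 1, got {max_threads}")
    return min(max_threads, DEFAULT_SENT_SCAN_CEILING)
```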

Strengths

  • Docs updated in lockstep: README, SPEC, SKILL, CHANGELOG, docs/guides/email.mdx, and the specification.html capability matrix all move capability 15 from Planned to Wired with consistent Gmail-only / read-only / #555-vs-#1606 framing. This is exactly the multi-doc-sync discipline CLAUDE.md calls for.
  • Fail-loud throughout, no silent fallbacks — empty user email (:290), unparseable internalDate (:243), negative window, and empty-thread views all raise with actionable messages; the Microsoft-only path refuses loudly instead of serving the wrong folder. The tool's outer except Exception is a legitimate agent-boundary translation to a structured error envelope, not a swallow.
  • Test suite locks the acceptance criteria well (test_followup_tracking.py) — including a nice belt-and-suspenders read-only proof (transport-log assertion and source-level test_module_references_no_send_path), correct newest-message-only flagging, config-window wiring through the registered tool, and the confirmation-gating check (test_tool_is_not_confirmation_gated).
  • Clean mixin composition — registered after _register_read_tools() alongside the other read tools, system prompt updated to mark it detection-only, no new REST route so SCHEMA_VERSION correctly stays put.
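The source-level half of that read-only proof can be approximated with the standard-library ast module. A minimal sketch; the PR's test_module_references_no_send_path may work differently (for example, a plain substring check), and the send_/draft prefixes are taken from the test plan's wording, not from the module itself.

```python
from __future__ import annotations

import ast

# Mirrors the "no send_*/draft" rule from the test plan.
FORBIDDEN_PREFIXES = ("send_", "draft")


def _identifiers(node: ast.AST) -> list[str]:
    if isinstance(node, ast.Name):
        return [node.id]
    if isinstance(node, ast.Attribute):
        return [node.attr]
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        return [node.name]
    if isinstance(node, ast.alias):
        # Check every dotted part of an import and any "as" alias.
        return node.name.split(".") + ([node.asname] if node.asname else [])
    return []


def send_path_references(source: str) -> list[str]:
    """Return sorted identifiers in `source` that look like send or draft paths."""
    hits = {
        name
        for node in ast.walk(ast.parse(source))
        for name in _identifiers(node)
        if name.lower().startswith(FORBIDDEN_PREFIXES)
    }
    return sorted(hits)
```

In a unit test, the input would be inspect.getsource(module) for the follow-up module, and the assertion would be that the returned list is empty.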

TestToolRegistry.test_no_unexpected_tool_set guards against tools that
bypass confirmation logic; the new read-only follow-up tracker belongs in
its expected set.
…haustiveness

The Sent listing is newest-first and capped at 100 stubs, so a heavy
sender's oldest — most overdue — threads can fall outside one scan. The
result now carries scan_truncated whenever a ceiling was hit (next page
token, full listing page, or more threads than max_threads), the tool
docstring tells the LLM to relay the incompleteness, and the max_threads
cap is tied to DEFAULT_SENT_SCAN_CEILING so the two limits can't drift
apart. Raised in the PR #1916 review.
@kovtcharov

Collaborator Author

Both review points addressed in 6e16e60. The result now carries scan_truncated: true whenever a ceiling was hit (listing page full, next-page token present, or more sent threads than max_threads), the tool docstring instructs the agent to relay the incompleteness to the user, and the guide documents the 100-thread ceiling. The max_threads cap is now expressed as DEFAULT_SENT_SCAN_CEILING with a comment noting the coupling, so the two limits can't drift apart. New unit test covers both truncation paths.
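The three truncation conditions described above can be sketched as follows. A minimal sketch, assuming the Sent listing is a Gmail-style messages.list response ("messages" stubs carrying id and threadId, plus an optional "nextPageToken"); plan_scan is a hypothetical name, and the actual logic in followup_tools.py may be structured differently.

```python
from __future__ import annotations

DEFAULT_SENT_SCAN_CEILING = 100  # ceiling on Sent stubs fetched per scan


def plan_scan(listing: dict, max_threads: int) -> tuple[list[str], bool]:
    """Pick thread ids to inspect and report whether the scan is partial."""
    stubs = listing.get("messages", [])
    thread_ids: list[str] = []
    seen: set[str] = set()
    # Stubs arrive newest-first, so dedupe while preserving that order.
    for stub in stubs:
        tid = stub["threadId"]
        if tid not in seen:
            seen.add(tid)
            thread_ids.append(tid)
    scan_truncated = (
        bool(listing.get("nextPageToken"))          # Gmail has more pages
        or len(stubs) >= DEFAULT_SENT_SCAN_CEILING  # listing page came back full
        or len(thread_ids) > max_threads            # more threads than we inspect
    )
    return thread_ids[:max_threads], scan_truncated
```

When scan_truncated is true, the agent can say "these are the overdue threads among your most recent sends" instead of implying the list is complete.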


Labels

agent::email Email agent changes documentation Documentation changes tests Test changes


Development

Successfully merging this pull request may close these issues.

feat(email): follow-up tracking — flag sent mail awaiting a reply

2 participants