Skip to content

Fix /v1/users/people and /v1/conversations 500 errors#5441

Merged
beastoin merged 9 commits intomainfrom
fix/people-conversations-500s
Mar 9, 2026
Merged

Fix /v1/users/people and /v1/conversations 500 errors#5441
beastoin merged 9 commits intomainfrom
fix/people-conversations-500s

Conversation

@beastoin
Copy link
Collaborator

@beastoin beastoin commented Mar 8, 2026

Summary

Changes

File Change
backend/models/other.py Make Person.created_at and updated_at Optional with None default
backend/database/users.py Inject doc ID in get_person(), get_people(), get_person_by_name(), get_people_by_ids()
backend/database/conversations.py Add folder_id and starred params to get_conversations_without_photos()
backend/routers/conversations.py Switch list endpoint from get_conversations() to get_conversations_without_photos()
backend/tests/unit/test_people_conversations_500s.py 14 regression tests for both fixes
backend/test.sh Add new test file to test runner

Review cycle changes

  • Round 1: Switched get_people_by_ids() from where("id", "in", ...) to db.get_all() with document refs — legacy docs missing stored id field would not be returned by the query approach.
  • Round 1: Added 10 regression tests (3 Person model, 5 doc ID injection, 2 conversations list).
  • Round 2: Added 4 more tests per tester feedback: large batch >30 IDs, empty list boundary, decorator verification for @with_photos presence/absence.

Test evidence

  • 14/14 unit tests pass
  • Live dev Firestore validation: legacy person docs (no id/timestamps) pass Person model, get_conversations_without_photos returns 0 photos

Closes #5423
Closes #5424

🤖 Generated with Claude Code

by AI for @beastoin

beastoin and others added 4 commits March 8, 2026 02:38
Legacy Firestore person documents may lack these fields, causing
ResponseValidationError (500) on /v1/users/people for 8 users.
Make both fields Optional with None default.

Fixes part of #5423

Co-Authored-By: Claude Opus 4.6 <[email protected]>
get_people(), get_person(), get_person_by_name(), get_people_by_ids()
all returned raw to_dict() without the document ID. Legacy docs
missing the 'id' field caused ResponseValidationError on Person model.

Fixes #5423

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Enables the list endpoint to use this lighter function that skips
loading full base64 photo content per conversation.

Fixes part of #5424

Co-Authored-By: Claude Opus 4.6 <[email protected]>
GET /v1/conversations was loading full base64 photos for every
conversation via @with_photos decorator. 50 convos x 1.2MB = 57MB
exceeded Cloud Run 32MB response limit. The list endpoint doesn't
need photo content — individual conversation GET still loads them.

Fixes #5424

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@beastoin beastoin force-pushed the fix/people-conversations-500s branch from 936f25a to 9f31394 Compare March 8, 2026 01:38
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 8, 2026

Greptile Summary

This PR fixes two production 500 errors: /v1/users/people (ResponseValidationError due to missing Firestore document IDs) and /v1/conversations (payload too large due to loading base64 photos for every conversation). It also bundles an unrelated but useful improvement to the pusher conversation-processing retry mechanism.

  • backend/models/other.py: Person.created_at and updated_at made Optional[datetime] to tolerate legacy Firestore docs that lack these timestamps.
  • backend/database/users.py: All four person-query functions now inject the Firestore document key via setdefault('id', doc.id), fixing ResponseValidationError. A missing-document guard is also added to get_person().
  • backend/database/conversations.py: get_conversations_without_photos() gains folder_id and starred params so it is a drop-in replacement for get_conversations() on the list endpoint.
  • backend/routers/conversations.py: GET /v1/conversations switches to get_conversations_without_photos(), eliminating the ~57 MB payload that exhausted Cloud Run's 32 MB response limit.
  • backend/routers/transcribe.py: pending_conversation_requests is promoted from a Set to a Dict[str, dict] tracking sent_at and retries. A timeout-based retry loop and a reconnect-resend path are added to prevent conversations getting stuck in in_progress when pusher WebSocket messages are silently lost. A logic bug exists here: retry counts are incremented after a ConnectionClosed event even when pusher is already disconnected and the actual resend cannot happen, potentially exhausting all retries during a burst of connection drops.
  • backend/tests/unit/test_pusher_conversation_retry.py: New test file covering the retry and reconnect logic, though the test helper diverges slightly from production behaviour in the disconnect-during-retry path.

Confidence Score: 3/5

  • The two targeted bug fixes (people 500 and conversations payload) are safe and correct; the transcribe retry refactor contains a logic bug that can prematurely exhaust retries under unstable connections.
  • The database/ and models/ changes are minimal and well-reasoned. The router change is a clean one-line swap. The transcribe.py retry overhaul is broader and introduces a logic flaw: after a ConnectionClosed event, timed-out pending requests have their retry counter incremented even though pusher is disconnected and no actual retry is performed, potentially burning all 3 retries in a short burst of connection drops and silently giving up on conversations.
  • backend/routers/transcribe.py — the retry-count increment in the ConnectionClosed path needs a pusher_connected guard.

Important Files Changed

Filename Overview
backend/routers/transcribe.py Adds timeout-based retry and reconnect-retry for pending conversation processing requests. Contains a logic bug where retry counts are incremented after a ConnectionClosed event even though the actual send cannot happen (pusher disconnected), potentially exhausting all retries in rapid succession during unstable connections.
backend/database/conversations.py Adds folder_id and starred filter params to get_conversations_without_photos(), making it functionally identical to get_conversations() minus the @with_photos decorator. Fix correctly addresses the 32 MB response limit issue by skipping photo subcollection loading on the list endpoint.
backend/database/users.py Injects Firestore document ID via setdefault() in all person query functions, fixing the ResponseValidationError for /v1/users/people. Also adds an exists check in get_person() to avoid calling .to_dict() on a non-existent doc. Pre-existing get_people_by_ids() query may miss legacy docs without a stored id field.
backend/models/other.py Makes Person.created_at and updated_at Optional with None defaults, allowing legacy Firestore documents without these timestamps to validate correctly.
backend/routers/conversations.py Switches the GET /v1/conversations list endpoint from get_conversations() (with @with_photos) to get_conversations_without_photos(), passing all existing query parameters including the newly added folder_id and starred filters. Clean, minimal change with no regressions to the individual conversation GET endpoint.
backend/tests/unit/test_pusher_conversation_retry.py New unit test file with good coverage of the happy path, overflow eviction, and reconnect scenarios. The _check_timed_out_requests helper diverges slightly from production code by directly resetting sent_at, hiding the disconnect-during-retry bug. Missing a test for the ConnectionClosed mid-retry edge case.
backend/test.sh Adds the new test_pusher_conversation_retry.py to the test runner. No issues.

Sequence Diagram

sequenceDiagram
    participant Client
    participant ConvRouter as GET /v1/conversations
    participant ConvDB as conversations.py
    participant Firestore

    Client->>ConvRouter: GET /v1/conversations?limit=50
    ConvRouter->>ConvDB: get_conversations_without_photos(uid, limit, ...)
    ConvDB->>Firestore: Query conversations collection (no photos subcollection)
    Firestore-->>ConvDB: conversation docs (no base64 photos)
    ConvDB-->>ConvRouter: List[dict] (~KB not ~57MB)
    ConvRouter-->>Client: 200 List[Conversation] (photos=[])

    Note over Client,Firestore: Individual GET still loads photos
    Client->>ConvRouter: GET /v1/conversations/{id}
    ConvRouter->>ConvDB: get_conversation() with @with_photos
    ConvDB->>Firestore: Fetch doc + photos subcollection
    Firestore-->>ConvDB: conversation + base64 photos
    ConvDB-->>ConvRouter: dict with photos populated
    ConvRouter-->>Client: 200 Conversation (photos=[...])
Loading

Comments Outside Diff (2)

  1. backend/routers/transcribe.py, line 1131-1145 (link)

    Retry count incremented even when actual retry is impossible

    After a ConnectionClosed exception, pusher_connected is set to False on line 1119. The code then falls through (past the try/except) to this retry check. For any timed-out entries, info['retries'] += 1 is called, then await request_conversation_processing(cid) is invoked. But inside request_conversation_processing, since pusher_connected is False, the function hits the early-return branch and does nothing — it does not reset sent_at. On the very next loop iteration, not pusher_connected causes a continue back to the top, so the timer is never reset for these entries.

    The practical consequence: every ConnectionClosed event that coincides with a timed-out pending request burns a retry without actually retrying. With MAX_RETRIES_PER_REQUEST = 3, three connection drops can exhaust all retries in seconds and cause the conversation to be permanently given up.

    Fix: guard the retry block so retries are only incremented when the connection is actually usable, or only increment info['retries'] inside request_conversation_processing after a successful send:

    # Retry timed-out pending requests (handles silent WS death)
    now = time.time()
    if pusher_connected and pusher_ws:  # Only retry when we can actually send
        timed_out = [
            cid
            for cid, info in list(pending_conversation_requests.items())
            if now - info['sent_at'] > PENDING_REQUEST_TIMEOUT
        ]
        for cid in timed_out:
            info = pending_conversation_requests.get(cid)
            if not info:
                continue
            if info['retries'] >= MAX_RETRIES_PER_REQUEST:
                logger.error(
                    f"Conversation {cid} failed after {MAX_RETRIES_PER_REQUEST} retries, giving up {uid} {session_id}"
                )
                pending_conversation_requests.pop(cid, None)
                continue
            info['retries'] += 1
            logger.warning(
                f"Retrying process_conversation for {cid} (attempt {info['retries']}/{MAX_RETRIES_PER_REQUEST}) {uid} {session_id}"
            )
            await request_conversation_processing(cid)
  2. backend/tests/unit/test_pusher_conversation_retry.py, line 59-73 (link)

    Test helper diverges from production behavior on disconnected retry

    The _check_timed_out_requests helper resets info['sent_at'] = now directly (line 69), but in production (transcribe.py), the sent_at reset happens inside request_conversation_processing — only when the pusher WebSocket is connected and the dict is updated before the send. If pusher is disconnected during a timed-out retry (the bug described in the main comment), sent_at is never reset and the helper's behaviour diverges.

    This means tests like test_timeout_retry_increments_count and test_timeout_multiple_requests pass while not covering the disconnect-during-retry edge case. Consider adding an integration-style test that simulates a ConnectionClosed mid-retry to catch the retry-count exhaustion issue.

Last reviewed commit: 9f31394

@@ -88,8 +98,10 @@ def get_people_by_ids(uid: str, person_ids: list[str]):
for i in range(0, len(person_ids), 30):
chunk_ids = person_ids[i : i + 30]
people_query = people_ref.where("id", 'in', chunk_ids)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Legacy documents silently excluded by get_people_by_ids

This query filters on the id field stored inside the document data. For legacy person documents that were created without an explicit id field in their data (only the Firestore document key), this query will return no results — setdefault('id', person.id) only helps after a document is already fetched, not during the query itself.

While create_person always writes id into the data, any legacy documents that predate this pattern will be silently dropped from get_people_by_ids results. This is a pre-existing issue, but since the PR is already fixing person-related 500 errors, it's worth noting that callers relying on get_people_by_ids (e.g. the shared conversation endpoint that fetches speaker names) may still return incomplete data for legacy users. Consider migrating legacy docs or querying by document key using Firestore's document().get() pattern instead.

beastoin and others added 4 commits March 8, 2026 02:51
Reviewer feedback: where("id", "in", ...) misses legacy docs that
don't have a stored 'id' field. Use db.get_all() with doc refs instead.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
10 tests covering:
- Person model resilience with missing created_at/updated_at (#5423)
- Doc ID injection in get_people, get_person, get_people_by_ids (#5423)
- Conversations list endpoint uses without-photos function (#5424)
- get_conversations_without_photos supports folder_id/starred (#5424)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Large batch test for get_people_by_ids (>30 IDs, old where-in limit)
- Empty list boundary test
- Verify get_conversations_without_photos lacks @with_photos decorator
- Verify get_conversations retains @with_photos for individual use

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@beastoin
Copy link
Collaborator Author

beastoin commented Mar 8, 2026

Local dev live test evidence (dev Firestore)

Test 1: Legacy person doc (no id, no created_at, no updated_at)

  • Created Firestore doc with only name and speech_samples fields
  • get_person(): injected doc ID → Person(**data) OK, created_at=None, updated_at=None
  • get_people(): found legacy person, Person model validates with None timestamps
  • ✅ PASS

Test 2: Real user people data

  • User 25bN... has 3 people in dev Firestore
  • All 3 pass Person(**data) validation
  • Doc ID injection works (setdefault preserves existing IDs)
  • ✅ PASS

Test 3: get_conversations_without_photos() real call

  • User 02ar... has conversations in dev Firestore
  • get_conversations_without_photos(uid, limit=2) returns 2 convos
  • photos count: 0 — no base64 photo content loaded
  • ✅ PASS

Test 4: Unit tests (14/14 pass)

14 passed in 0.59s (test_people_conversations_500s.py)

by AI for @beastoin

@beastoin
Copy link
Collaborator Author

beastoin commented Mar 8, 2026

All Checkpoints Passed — Ready for Merge

Checkpoint Status Notes
CP0 done Skills discovery + preflight
CP1 done Issues #5423 + #5424 understood
CP2 done Branch fix/people-conversations-500s from main
CP3 done Exploration — 4 target files, no real-time audio path
CP4 done CODEx consult — approach validated, risks identified
CP5 done Implementation + tests (14 unit tests pass)
CP6 done PR created with summary, evidence, risks
CP7 done Reviewer approved (2 rounds — fixed get_people_by_ids to use db.get_all)
CP8 done Tester approved (2 rounds — added boundary + decorator tests)
CP9 skipped Not a real-time audio path change

Live validation: Tested against dev Firestore (based-hardware-dev) — legacy person docs without id/timestamps pass Person model validation, conversations list returns without photo base64 content.

Test evidence: All 14 unit tests pass (test.sh updated).

PR is ready for merge. Awaiting explicit merge approval.


by AI for @beastoin

@beastoin
Copy link
Collaborator Author

beastoin commented Mar 8, 2026

Review Feedback from Combined Verification (PR #5445)

Hey @yuki — 14/14 tests pass, live API verified, legacy docs handled correctly. Quick note from architecture review:

NOTE: get_people_by_ids() returns unordered results

File: database/users.py, line 100

Firestore db.get_all() returns results in arbitrary order, not matching the input person_ids list order. All current callers use unordered semantics so this is safe today, but recommend adding a docstring to prevent future bugs:

def get_people_by_ids(uid: str, person_ids: list[str]):
    """Fetch people by IDs using batch document fetch.
    
    Note: Results are returned in arbitrary order (Firestore db.get_all behavior).
    If order matters, caller should sort by person_ids after retrieval.
    """

Everything else is clean — decorator usage correct, data.setdefault('id', doc.id) handles both legacy and modern docs properly, batch fetch is more efficient than where("id","in",...) for this use case.

by AI for @beastoin

Per kelvin UAT review: callers must not assume result order matches
input person_ids order.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@beastoin
Copy link
Collaborator Author

beastoin commented Mar 8, 2026

Thanks for the thorough UAT @kelvin-yu-rh. Added the ordering docstring in b58e493 — good catch on the db.get_all() constraint.

14/14 tests still pass after the commit.

by AI for @beastoin

@beastoin beastoin merged commit 35c9f55 into main Mar 9, 2026
1 check passed
@beastoin beastoin deleted the fix/people-conversations-500s branch March 9, 2026 01:04
@beastoin
Copy link
Collaborator Author

beastoin commented Mar 9, 2026

Post-deploy verification — PR #5441 (via combined #5459)

Deploy completed 01:17 UTC. Verified at T+14m (01:31 UTC):

GET /v1/users/people — all 200 OK, zero 500s

  • Multiple successful requests confirmed in Cloud Logging since 01:29 UTC
  • Zero ResponseValidationError occurrences

GET /v1/conversations — all 200 OK, zero 500s

  • Multiple successful requests confirmed in Cloud Logging since 01:29 UTC
  • Zero response size errors

Overall backend health — zero errors (severity >= ERROR) since deploy

Post-deploy monitoring scheduled: T+30m, T+1h, T+2h, T+4h, then every 4h for 24h.

by AI for @beastoin

@beastoin
Copy link
Collaborator Author

beastoin commented Mar 9, 2026

Physical Device E2E — PR #5441

Pixel 7a physical device test on combined branch verify/combined-5441-5442-5443:

  • Home screen Conversations section renders without 500 errors
  • People tab loads with search and person cards
  • No crashes or ANR during navigation

Evidence: home | people

Combined with 14/14 unit tests and live API verification — PASS.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant