Fix /v1/users/people and /v1/conversations 500 errors by beastoin · Pull Request #5441 · BasedHardware/omi

beastoin · 2026-03-08T01:36:51Z

Summary

Bug: /v1/users/people returns 500 — ResponseValidationError on Person model #5423: /v1/users/people returns 500 (ResponseValidationError) because get_people() returns raw to_dict() without injecting the Firestore document ID, and Person model requires id, created_at, updated_at as non-optional. Fix: inject doc ID via setdefault('id', doc.id) in all person query functions + make created_at/updated_at Optional.
Bug: /v1/conversations returns 500 — response exceeds Cloud Run 32MB size limit #5424: /v1/conversations returns 500 ("Response size was too large") because @with_photos decorator loads full base64 photo content for every conversation in the list. 50 convos × ~1.2MB = ~57MB exceeds Cloud Run 32MB limit. Fix: switch list endpoint to get_conversations_without_photos() (photos load on individual GET only).

Changes

File	Change
`backend/models/other.py`	Make `Person.created_at` and `updated_at` Optional with None default
`backend/database/users.py`	Inject doc ID in `get_person()`, `get_people()`, `get_person_by_name()`, `get_people_by_ids()`
`backend/database/conversations.py`	Add `folder_id` and `starred` params to `get_conversations_without_photos()`
`backend/routers/conversations.py`	Switch list endpoint from `get_conversations()` to `get_conversations_without_photos()`
`backend/tests/unit/test_people_conversations_500s.py`	14 regression tests for both fixes
`backend/test.sh`	Add new test file to test runner

Review cycle changes

Round 1: Switched get_people_by_ids() from where("id", "in", ...) to db.get_all() with document refs — legacy docs missing stored id field would not be returned by the query approach.
Round 1: Added 10 regression tests (3 Person model, 5 doc ID injection, 2 conversations list).
Round 2: Added 4 more tests per tester feedback: large batch >30 IDs, empty list boundary, decorator verification for @with_photos presence/absence.

Test evidence

14/14 unit tests pass
Live dev Firestore validation: legacy person docs (no id/timestamps) pass Person model, get_conversations_without_photos returns 0 photos

Closes #5423
Closes #5424

🤖 Generated with Claude Code

by AI for @beastoin

Legacy Firestore person documents may lack these fields, causing ResponseValidationError (500) on /v1/users/people for 8 users. Make both fields Optional with None default. Fixes part of #5423 Co-Authored-By: Claude Opus 4.6 <[email protected]>

get_people(), get_person(), get_person_by_name(), get_people_by_ids() all returned raw to_dict() without the document ID. Legacy docs missing the 'id' field caused ResponseValidationError on Person model. Fixes #5423 Co-Authored-By: Claude Opus 4.6 <[email protected]>

Enables the list endpoint to use this lighter function that skips loading full base64 photo content per conversation. Fixes part of #5424 Co-Authored-By: Claude Opus 4.6 <[email protected]>

GET /v1/conversations was loading full base64 photos for every conversation via @with_photos decorator. 50 convos x 1.2MB = 57MB exceeded Cloud Run 32MB response limit. The list endpoint doesn't need photo content — individual conversation GET still loads them. Fixes #5424 Co-Authored-By: Claude Opus 4.6 <[email protected]>

greptile-apps · 2026-03-08T01:41:44Z

Greptile Summary

This PR fixes two production 500 errors: /v1/users/people (ResponseValidationError due to missing Firestore document IDs) and /v1/conversations (payload too large due to loading base64 photos for every conversation). It also bundles an unrelated but useful improvement to the pusher conversation-processing retry mechanism.

backend/models/other.py: Person.created_at and updated_at made Optional[datetime] to tolerate legacy Firestore docs that lack these timestamps.
backend/database/users.py: All four person-query functions now inject the Firestore document key via setdefault('id', doc.id), fixing ResponseValidationError. A missing-document guard is also added to get_person().
backend/database/conversations.py: get_conversations_without_photos() gains folder_id and starred params so it is a drop-in replacement for get_conversations() on the list endpoint.
backend/routers/conversations.py: GET /v1/conversations switches to get_conversations_without_photos(), eliminating the ~57 MB payload that exhausted Cloud Run's 32 MB response limit.
backend/routers/transcribe.py: pending_conversation_requests is promoted from a Set to a Dict[str, dict] tracking sent_at and retries. A timeout-based retry loop and a reconnect-resend path are added to prevent conversations getting stuck in in_progress when pusher WebSocket messages are silently lost. A logic bug exists here: retry counts are incremented after a ConnectionClosed event even when pusher is already disconnected and the actual resend cannot happen, potentially exhausting all retries during a burst of connection drops.
backend/tests/unit/test_pusher_conversation_retry.py: New test file covering the retry and reconnect logic, though the test helper diverges slightly from production behaviour in the disconnect-during-retry path.

Confidence Score: 3/5

The two targeted bug fixes (people 500 and conversations payload) are safe and correct; the transcribe retry refactor contains a logic bug that can prematurely exhaust retries under unstable connections.
The database/ and models/ changes are minimal and well-reasoned. The router change is a clean one-line swap. The transcribe.py retry overhaul is broader and introduces a logic flaw: after a ConnectionClosed event, timed-out pending requests have their retry counter incremented even though pusher is disconnected and no actual retry is performed, potentially burning all 3 retries in a short burst of connection drops and silently giving up on conversations.
backend/routers/transcribe.py — the retry-count increment in the ConnectionClosed path needs a pusher_connected guard.

Important Files Changed

Filename	Overview
backend/routers/transcribe.py	Adds timeout-based retry and reconnect-retry for pending conversation processing requests. Contains a logic bug where retry counts are incremented after a ConnectionClosed event even though the actual send cannot happen (pusher disconnected), potentially exhausting all retries in rapid succession during unstable connections.
backend/database/conversations.py	Adds folder_id and starred filter params to get_conversations_without_photos(), making it functionally identical to get_conversations() minus the @with_photos decorator. Fix correctly addresses the 32 MB response limit issue by skipping photo subcollection loading on the list endpoint.
backend/database/users.py	Injects Firestore document ID via setdefault() in all person query functions, fixing the ResponseValidationError for /v1/users/people. Also adds an exists check in get_person() to avoid calling .to_dict() on a non-existent doc. Pre-existing get_people_by_ids() query may miss legacy docs without a stored id field.
backend/models/other.py	Makes Person.created_at and updated_at Optional with None defaults, allowing legacy Firestore documents without these timestamps to validate correctly.
backend/routers/conversations.py	Switches the GET /v1/conversations list endpoint from get_conversations() (with @with_photos) to get_conversations_without_photos(), passing all existing query parameters including the newly added folder_id and starred filters. Clean, minimal change with no regressions to the individual conversation GET endpoint.
backend/tests/unit/test_pusher_conversation_retry.py	New unit test file with good coverage of the happy path, overflow eviction, and reconnect scenarios. The _check_timed_out_requests helper diverges slightly from production code by directly resetting sent_at, hiding the disconnect-during-retry bug. Missing a test for the ConnectionClosed mid-retry edge case.
backend/test.sh	Adds the new test_pusher_conversation_retry.py to the test runner. No issues.

Sequence Diagram

sequenceDiagram
    participant Client
    participant ConvRouter as GET /v1/conversations
    participant ConvDB as conversations.py
    participant Firestore

    Client->>ConvRouter: GET /v1/conversations?limit=50
    ConvRouter->>ConvDB: get_conversations_without_photos(uid, limit, ...)
    ConvDB->>Firestore: Query conversations collection (no photos subcollection)
    Firestore-->>ConvDB: conversation docs (no base64 photos)
    ConvDB-->>ConvRouter: List[dict] (~KB not ~57MB)
    ConvRouter-->>Client: 200 List[Conversation] (photos=[])

    Note over Client,Firestore: Individual GET still loads photos
    Client->>ConvRouter: GET /v1/conversations/{id}
    ConvRouter->>ConvDB: get_conversation() with @with_photos
    ConvDB->>Firestore: Fetch doc + photos subcollection
    Firestore-->>ConvDB: conversation + base64 photos
    ConvDB-->>ConvRouter: dict with photos populated
    ConvRouter-->>Client: 200 Conversation (photos=[...])

Comments Outside Diff (2)

backend/routers/transcribe.py, line 1131-1145 (link)

Retry count incremented even when actual retry is impossible

After a ConnectionClosed exception, pusher_connected is set to False on line 1119. The code then falls through (past the try/except) to this retry check. For any timed-out entries, info['retries'] += 1 is called, then await request_conversation_processing(cid) is invoked. But inside request_conversation_processing, since pusher_connected is False, the function hits the early-return branch and does nothing — it does not reset sent_at. On the very next loop iteration, not pusher_connected causes a continue back to the top, so the timer is never reset for these entries.

The practical consequence: every ConnectionClosed event that coincides with a timed-out pending request burns a retry without actually retrying. With MAX_RETRIES_PER_REQUEST = 3, three connection drops can exhaust all retries in seconds and cause the conversation to be permanently given up.

Fix: guard the retry block so retries are only incremented when the connection is actually usable, or only increment info['retries'] inside request_conversation_processing after a successful send:

# Retry timed-out pending requests (handles silent WS death)
now = time.time()
if pusher_connected and pusher_ws:  # Only retry when we can actually send
    timed_out = [
        cid
        for cid, info in list(pending_conversation_requests.items())
        if now - info['sent_at'] > PENDING_REQUEST_TIMEOUT
    ]
    for cid in timed_out:
        info = pending_conversation_requests.get(cid)
        if not info:
            continue
        if info['retries'] >= MAX_RETRIES_PER_REQUEST:
            logger.error(
                f"Conversation {cid} failed after {MAX_RETRIES_PER_REQUEST} retries, giving up {uid} {session_id}"
            )
            pending_conversation_requests.pop(cid, None)
            continue
        info['retries'] += 1
        logger.warning(
            f"Retrying process_conversation for {cid} (attempt {info['retries']}/{MAX_RETRIES_PER_REQUEST}) {uid} {session_id}"
        )
        await request_conversation_processing(cid)

backend/tests/unit/test_pusher_conversation_retry.py, line 59-73 (link)

Test helper diverges from production behavior on disconnected retry

The _check_timed_out_requests helper resets info['sent_at'] = now directly (line 69), but in production (transcribe.py), the sent_at reset happens inside request_conversation_processing — only when the pusher WebSocket is connected and the dict is updated before the send. If pusher is disconnected during a timed-out retry (the bug described in the main comment), sent_at is never reset and the helper's behaviour diverges.

This means tests like test_timeout_retry_increments_count and test_timeout_multiple_requests pass while not covering the disconnect-during-retry edge case. Consider adding an integration-style test that simulates a ConnectionClosed mid-retry to catch the retry-count exhaustion issue.

_{Last reviewed commit: 9f31394}

greptile-apps · 2026-03-08T01:41:54Z

backend/database/users.py

@@ -88,8 +98,10 @@ def get_people_by_ids(uid: str, person_ids: list[str]):
    for i in range(0, len(person_ids), 30):
        chunk_ids = person_ids[i : i + 30]
        people_query = people_ref.where("id", 'in', chunk_ids)


Legacy documents silently excluded by get_people_by_ids

This query filters on the id field stored inside the document data. For legacy person documents that were created without an explicit id field in their data (only the Firestore document key), this query will return no results — setdefault('id', person.id) only helps after a document is already fetched, not during the query itself.

While create_person always writes id into the data, any legacy documents that predate this pattern will be silently dropped from get_people_by_ids results. This is a pre-existing issue, but since the PR is already fixing person-related 500 errors, it's worth noting that callers relying on get_people_by_ids (e.g. the shared conversation endpoint that fetches speaker names) may still return incomplete data for legacy users. Consider migrating legacy docs or querying by document key using Firestore's document().get() pattern instead.

Reviewer feedback: where("id", "in", ...) misses legacy docs that don't have a stored 'id' field. Use db.get_all() with doc refs instead. Co-Authored-By: Claude Opus 4.6 <[email protected]>

10 tests covering: - Person model resilience with missing created_at/updated_at (#5423) - Doc ID injection in get_people, get_person, get_people_by_ids (#5423) - Conversations list endpoint uses without-photos function (#5424) - get_conversations_without_photos supports folder_id/starred (#5424) Co-Authored-By: Claude Opus 4.6 <[email protected]>

Co-Authored-By: Claude Opus 4.6 <[email protected]>

- Large batch test for get_people_by_ids (>30 IDs, old where-in limit) - Empty list boundary test - Verify get_conversations_without_photos lacks @with_photos decorator - Verify get_conversations retains @with_photos for individual use Co-Authored-By: Claude Opus 4.6 <[email protected]>

beastoin · 2026-03-08T02:03:56Z

Local dev live test evidence (dev Firestore)

Test 1: Legacy person doc (no id, no created_at, no updated_at)

Created Firestore doc with only name and speech_samples fields
get_person(): injected doc ID → Person(**data) OK, created_at=None, updated_at=None
get_people(): found legacy person, Person model validates with None timestamps
✅ PASS

Test 2: Real user people data

User 25bN... has 3 people in dev Firestore
All 3 pass Person(**data) validation
Doc ID injection works (setdefault preserves existing IDs)
✅ PASS

Test 3: get_conversations_without_photos() real call

User 02ar... has conversations in dev Firestore
get_conversations_without_photos(uid, limit=2) returns 2 convos
photos count: 0 — no base64 photo content loaded
✅ PASS

Test 4: Unit tests (14/14 pass)

14 passed in 0.59s (test_people_conversations_500s.py)

by AI for @beastoin

beastoin · 2026-03-08T02:06:15Z

All Checkpoints Passed — Ready for Merge

Checkpoint	Status	Notes
CP0	done	Skills discovery + preflight
CP1	done	Issues #5423 + #5424 understood
CP2	done	Branch `fix/people-conversations-500s` from main
CP3	done	Exploration — 4 target files, no real-time audio path
CP4	done	CODEx consult — approach validated, risks identified
CP5	done	Implementation + tests (14 unit tests pass)
CP6	done	PR created with summary, evidence, risks
CP7	done	Reviewer approved (2 rounds — fixed `get_people_by_ids` to use `db.get_all`)
CP8	done	Tester approved (2 rounds — added boundary + decorator tests)
CP9	skipped	Not a real-time audio path change

Live validation: Tested against dev Firestore (based-hardware-dev) — legacy person docs without id/timestamps pass Person model validation, conversations list returns without photo base64 content.

Test evidence: All 14 unit tests pass (test.sh updated).

PR is ready for merge. Awaiting explicit merge approval.

by AI for @beastoin

beastoin · 2026-03-08T06:07:24Z

Review Feedback from Combined Verification (PR #5445)

Hey @yuki — 14/14 tests pass, live API verified, legacy docs handled correctly. Quick note from architecture review:

NOTE: `get_people_by_ids()` returns unordered results

File: database/users.py, line 100

Firestore db.get_all() returns results in arbitrary order, not matching the input person_ids list order. All current callers use unordered semantics so this is safe today, but recommend adding a docstring to prevent future bugs:

def get_people_by_ids(uid: str, person_ids: list[str]):
    """Fetch people by IDs using batch document fetch.
    
    Note: Results are returned in arbitrary order (Firestore db.get_all behavior).
    If order matters, caller should sort by person_ids after retrieval.
    """

Everything else is clean — decorator usage correct, data.setdefault('id', doc.id) handles both legacy and modern docs properly, batch fetch is more efficient than where("id","in",...) for this use case.

by AI for @beastoin

Per kelvin UAT review: callers must not assume result order matches input person_ids order. Co-Authored-By: Claude Opus 4.6 <[email protected]>

beastoin · 2026-03-08T06:11:19Z

Thanks for the thorough UAT @kelvin-yu-rh. Added the ordering docstring in b58e493 — good catch on the db.get_all() constraint.

14/14 tests still pass after the commit.

by AI for @beastoin

…#5442 #5443) (#5459)

beastoin · 2026-03-09T01:32:23Z

Post-deploy verification — PR #5441 (via combined #5459)

Deploy completed 01:17 UTC. Verified at T+14m (01:31 UTC):

GET /v1/users/people — all 200 OK, zero 500s

Multiple successful requests confirmed in Cloud Logging since 01:29 UTC
Zero ResponseValidationError occurrences

GET /v1/conversations — all 200 OK, zero 500s

Multiple successful requests confirmed in Cloud Logging since 01:29 UTC
Zero response size errors

Overall backend health — zero errors (severity >= ERROR) since deploy

Post-deploy monitoring scheduled: T+30m, T+1h, T+2h, T+4h, then every 4h for 24h.

by AI for @beastoin

beastoin · 2026-03-09T07:09:41Z

Physical Device E2E — PR #5441

Pixel 7a physical device test on combined branch verify/combined-5441-5442-5443:

Home screen Conversations section renders without 500 errors
People tab loads with search and person cards
No crashes or ANR during navigation

Evidence: home | people

Combined with 14/14 unit tests and live API verification — PASS.

beastoin and others added 4 commits March 8, 2026 02:38

Add folder_id and starred params to get_conversations_without_photos

8146ffc

Enables the list endpoint to use this lighter function that skips loading full base64 photo content per conversation. Fixes part of #5424 Co-Authored-By: Claude Opus 4.6 <[email protected]>

beastoin force-pushed the fix/people-conversations-500s branch from 936f25a to 9f31394 Compare March 8, 2026 01:38

greptile-apps bot reviewed Mar 8, 2026

View reviewed changes

beastoin and others added 4 commits March 8, 2026 02:51

Switch get_people_by_ids to document ID fetches for legacy doc support

f072709

Reviewer feedback: where("id", "in", ...) misses legacy docs that don't have a stored 'id' field. Use db.get_all() with doc refs instead. Co-Authored-By: Claude Opus 4.6 <[email protected]>

Add test_people_conversations_500s to test.sh

e8bba4c

Co-Authored-By: Claude Opus 4.6 <[email protected]>

beastoin mentioned this pull request Mar 8, 2026

report: Combined verification + architecture analysis (#5441, #5442, #5443) #5445

Closed

Add docstring noting db.get_all() arbitrary order in get_people_by_ids

b58e493

Per kelvin UAT review: callers must not assume result order matches input person_ids order. Co-Authored-By: Claude Opus 4.6 <[email protected]>

beastoin mentioned this pull request Mar 8, 2026

Prerelease: 3 verified fixes — 500s, 401 retry, Firestore cache (#5441 #5442 #5443) #5459

Merged

beastoin merged commit 35c9f55 into main Mar 9, 2026
1 check passed

beastoin deleted the fix/people-conversations-500s branch March 9, 2026 01:04

beastoin added a commit that referenced this pull request Mar 9, 2026

Prerelease: 3 verified fixes — 500s, 401 retry, Firestore cache (#5441 …

8c7ca81

…#5442 #5443) (#5459)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix /v1/users/people and /v1/conversations 500 errors#5441

Fix /v1/users/people and /v1/conversations 500 errors#5441
beastoin merged 9 commits intomainfrom
fix/people-conversations-500s

beastoin commented Mar 8, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Mar 8, 2026 •

edited

Loading

Comments Outside Diff (2)

Uh oh!

greptile-apps bot Mar 8, 2026

Uh oh!

beastoin commented Mar 8, 2026

Uh oh!

beastoin commented Mar 8, 2026

Uh oh!

beastoin commented Mar 8, 2026

Uh oh!

beastoin commented Mar 8, 2026

Uh oh!

Uh oh!

beastoin commented Mar 9, 2026

Uh oh!

beastoin commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

beastoin commented Mar 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Review cycle changes

Test evidence

Uh oh!

greptile-apps bot commented Mar 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (2)

Uh oh!

greptile-apps bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

beastoin commented Mar 8, 2026

Local dev live test evidence (dev Firestore)

Uh oh!

beastoin commented Mar 8, 2026

All Checkpoints Passed — Ready for Merge

Uh oh!

beastoin commented Mar 8, 2026

Review Feedback from Combined Verification (PR #5445)

NOTE: get_people_by_ids() returns unordered results

Uh oh!

beastoin commented Mar 8, 2026

Uh oh!

Uh oh!

beastoin commented Mar 9, 2026

Uh oh!

beastoin commented Mar 9, 2026

Physical Device E2E — PR #5441

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

beastoin commented Mar 8, 2026 •

edited

Loading

greptile-apps bot commented Mar 8, 2026 •

edited

Loading

NOTE: `get_people_by_ids()` returns unordered results