
Pipeline resilience #128

Closed
howethomas wants to merge 12 commits into main from pipeline-resilience

Conversation


@howethomas howethomas commented Mar 2, 2026

Note

Medium Risk
Medium risk: this PR changes the vCon ingest indexing path and adds a new unauthenticated /stats/queue endpoint that exposes Redis list names and depths, which can affect search/monitoring behavior and surface operational data.

Overview
Adds a new transcription pipeline path centered on vfun integration: a wtf_transcribe link that transcribes recording dialogs, supports multiple vfun URLs with health-aware failover, and caches transcription results in Redis; plus a keyword_tagger link that tags vCons based on keyword matches in transcriptions.
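The keyword_tagger idea can be sketched roughly as follows. This is an illustrative sketch only: the field layout loosely follows the vCon JSON shape (`analysis` entries of type `transcript`), and the function and parameter names are assumptions, not the link's actual code.

```python
# Hypothetical sketch of keyword tagging: scan transcript text in a vCon's
# analysis entries and attach any tag whose keywords appear in them.
def keyword_tag(vcon_dict: dict, keyword_map: dict) -> list:
    """Return tags whose keywords appear in any transcript body."""
    transcripts = [
        a.get("body", "")
        for a in vcon_dict.get("analysis", [])
        if a.get("type") == "transcript"
    ]
    text = " ".join(transcripts).lower()
    tags = [
        tag for tag, words in keyword_map.items()
        if any(w.lower() in text for w in words)
    ]
    vcon_dict.setdefault("tags", []).extend(tags)
    return tags
```

A link like this would typically run after transcription so the transcript bodies are already present on the vCon.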

Improves operational resilience/interop by forcing outbound webhook payloads (links.webhook and new storage.webhook) to emit vcon: "0.3.0", adding a public /stats/queue endpoint for Redis queue depth monitoring, and optimizing ingest-time indexing by indexing parties directly (index_vcon_parties) to avoid redundant Redis reads.

Adds optional SigNoz/OpenTelemetry docker-compose stack and configs, updates the Dockerfile to use HTTPS apt sources, and introduces NAS/vfun performance testing docs and helper scripts for stress testing and auto-restart runs.

Written by Cursor Bugbot for commit 631ca58.

howethomas and others added 12 commits January 27, 2026 22:24
Updates webhook link to set vcon version to 0.3.0 for
compatibility with vcon-mcp REST API.
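The version patch this commit describes can be sketched as a small pure function. The helper name and the deep-copy approach are assumptions; the actual link code in the repo may patch in place or differ in detail.

```python
# Hypothetical sketch of the outbound version patch: copy the vCon dict and
# force its version string to "0.3.0" before POSTing, so the payload is
# accepted by the vcon-mcp REST API.
import copy

def patch_vcon_version(vcon_dict: dict) -> dict:
    """Return a copy of the vCon dict with its version forced to 0.3.0."""
    patched = copy.deepcopy(vcon_dict)
    patched["vcon"] = "0.3.0"
    return patched
```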

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Configure apt to use HTTPS sources for environments
where HTTP port 80 is blocked.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Includes docker-compose and config files for SigNoz
observability stack with OpenTelemetry collector.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Public endpoint (no auth) that returns the depth of any Redis list,
used by the audio adapter for backpressure control.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The post_vcon and external_ingress_vcon paths called index_vcon() which
re-read the vCon from Redis (JSON.GET) and duplicated the sorted set add
(ZADD) that was already done by the caller. This added 2 unnecessary
Redis round-trips per ingest.

Extract index_vcon_parties() that takes the vCon dict directly, and use
it in both POST paths. The original index_vcon() is preserved for the
bulk re-indexing endpoint. Reduces ingest from 11 to 9 Redis ops per
vCon; measured a 4.9x improvement in adapter posting throughput.
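The shape of index_vcon_parties() described above can be sketched like this. The key pattern, score field, and party fields are assumptions for illustration; the point is that the function takes the vCon dict already in hand instead of re-reading it from Redis.

```python
# Hedged sketch of index_vcon_parties(): index party identifiers straight
# from the dict the caller already holds, avoiding the extra JSON.GET and
# the duplicate ZADD that the original index_vcon() performed.
def index_vcon_parties(redis_client, vcon_dict: dict) -> int:
    """Add each party's tel/mailto to per-party sorted sets; return op count."""
    uuid = vcon_dict["uuid"]
    created = float(vcon_dict.get("created_at_ts", 0))
    ops = 0
    for party in vcon_dict.get("parties", []):
        for field in ("tel", "mailto"):
            value = party.get(field)
            if value:
                redis_client.zadd(f"parties:{field}:{value}", {uuid: created})
                ops += 1
    return ops
```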

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The supabase_webhook was running as a sequential chain link, blocking
each worker for ~560ms per vCon. By moving it to a storage slot, the
webhook now executes post-chain in parallel via ThreadPoolExecutor,
reducing per-vCon P50 latency from 617ms to 123ms (5x improvement).

New module server/storage/webhook/ wraps the existing HTTP POST logic
with the storage save() interface.
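The latency win comes from the dispatch model, which can be sketched as below. The function and variable names are illustrative, not the server's actual storage-runner code; the idea is that storage save() calls run in a ThreadPoolExecutor after the chain finishes, so a slow HTTP POST no longer blocks a chain worker.

```python
# Sketch of parallel storage dispatch: submit every configured storage
# save() to a shared thread pool and wait for all of them, so total time
# is roughly the slowest save rather than the sum of all saves.
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)

def run_storages(vcon_uuid: str, storages: list) -> list:
    """Dispatch every storage save() in parallel; return their results."""
    futures = [_executor.submit(save, vcon_uuid) for save in storages]
    return [f.result() for f in futures]
```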

Co-Authored-By: Claude Opus 4.6 <[email protected]>
The wtf_transcribe link had no retry logic: a single vfun failure
silently dropped the transcription. This adds:

- _VfunHealthTracker: thread-safe singleton tracking instance health
  across all workers, with 30-second self-healing recovery window
- get_vfun_urls(): returns URLs in priority order (healthy shuffled,
  then recovering oldest-first, then down instances)
- Fallback loop: tries all configured vfun instances before giving up
- Redis transcription cache: skips vfun calls for previously transcribed
  audio files (7-day TTL)

On failure, instances are marked DOWN and bypassed until the recovery
window expires, then automatically retried and restored on success.
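The health-aware ordering above can be condensed into a sketch like this. The real _VfunHealthTracker is described as a thread-safe singleton; locking, the cache, and the exact state machine are omitted here, and only the 30-second window and the healthy/recovering/down ordering from the commit message are kept. All names are illustrative.

```python
# Condensed sketch of health-aware vfun failover: healthy URLs shuffled
# first, recovering instances oldest-first, freshly-down instances last
# (the fallback loop still tries everything before giving up).
import random
import time

RECOVERY_WINDOW = 30.0  # seconds an instance stays DOWN before retry

class VfunHealthTracker:
    def __init__(self, urls):
        self.urls = list(urls)
        self.down_since = {}  # url -> timestamp it was marked DOWN

    def mark_down(self, url):
        self.down_since[url] = time.monotonic()

    def mark_up(self, url):
        self.down_since.pop(url, None)

    def get_vfun_urls(self):
        now = time.monotonic()
        healthy = [u for u in self.urls if u not in self.down_since]
        recovering = sorted(
            (u for u, t in self.down_since.items()
             if now - t >= RECOVERY_WINDOW),
            key=lambda u: self.down_since[u])  # oldest failure first
        down = [u for u, t in self.down_since.items()
                if now - t < RECOVERY_WINDOW]
        random.shuffle(healthy)  # spread load across healthy instances
        return healthy + recovering + down
```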

Co-Authored-By: Claude Opus 4.6 <[email protected]>
- SigNoz OTEL collector config and docker-compose integration
- Performance testing and vfun crash/stress test reports
- Utility scripts for NAS pipeline operations and debugging

Co-Authored-By: Claude Opus 4.6 <[email protected]>

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 4 potential issues.


- DASHBOARDS_PATH=/root/config/dashboards
- STORAGE=clickhouse
- GODEBUG=netdns=go
- TELEMETRY_ENABLED=true


TELEMETRY_ENABLED=true contradicts README fix for SigNoz crash

Medium Severity

TELEMETRY_ENABLED is set to true in the compose file, but the README explicitly documents that this causes the SigNoz query service to panic (nil pointer crash) because the telemetry cron checks TTL for signoz_logs.logs, a table that doesn't exist after schema migration. The README even states the fix (TELEMETRY_ENABLED=false) is "already set in this repo," but it isn't.


    return JSONResponse(content={"list_name": list_name, "depth": depth})
except Exception as e:
    logger.error(f"Error getting queue depth for '{list_name}': {str(e)}")
    raise HTTPException(status_code=500, detail="Failed to get queue depth")


Unauthenticated endpoint allows arbitrary Redis key probing

Medium Severity

The /stats/queue endpoint is public (no auth) and accepts any user-supplied list_name directly passed to redis_async.llen(). While the intention is monitoring known queue names, nothing restricts the parameter to valid queue names. An unauthenticated caller can probe any Redis key to learn whether it exists and its list length, leaking internal infrastructure details.
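One possible mitigation, sketched here as a standalone validator: check the requested name against a fixed allowlist of known queue names before calling LLEN, so arbitrary Redis keys cannot be probed. The allowlist contents and function name are placeholders, not the project's actual queue names.

```python
# Hypothetical allowlist guard for the /stats/queue endpoint: reject any
# list_name that is not a known queue before it reaches redis.llen().
KNOWN_QUEUES = {"ingress_vcons", "transcribe_queue"}  # placeholder names

def validated_queue_name(list_name: str) -> str:
    """Return list_name if it is a known queue, else raise ValueError."""
    if list_name not in KNOWN_QUEUES:
        raise ValueError(f"unknown queue: {list_name!r}")
    return list_name
```

The endpoint handler would call this first and map the ValueError to a 400/404 response instead of passing the raw parameter to Redis.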


--server 1 \
--workers "$WORKERS" \
--limit "$BATCH_SIZE" \
--store-vcons 2>&1)


Restart script reprocesses same files without offset tracking

Medium Severity

Each batch iteration calls the pipeline with the same --date, --server, and --limit but no offset or checkpoint. Since find_audio_files always returns the first N files, every batch reprocesses the same files. With --store-vcons, this creates duplicate vCons (each gets a new uuid.uuid4()). The loop runs until the safety limit of 100 batches instead of progressing through all files.
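The missing checkpointing could look roughly like this: persist how many files have already been processed and skip that many on the next batch. The checkpoint filename and helper names are assumptions; the real fix would thread the offset through find_audio_files (or pass an --offset flag to the pipeline).

```python
# Sketch of batch checkpointing so each restart resumes where the previous
# batch stopped instead of reprocessing the first N files every time.
import json
import os

CHECKPOINT = "pipeline_offset.json"  # hypothetical checkpoint file

def load_offset() -> int:
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["offset"]
    return 0

def next_batch(all_files: list, batch_size: int) -> list:
    """Return the next unprocessed slice and advance the checkpoint."""
    offset = load_offset()
    batch = all_files[offset:offset + batch_size]
    with open(CHECKPOINT, "w") as f:
        json.dump({"offset": offset + len(batch)}, f)
    return batch
```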


resp = requests.post(url, json=json_dict, headers=headers)
logger.info(
    f"webhook storage response for {vcon_uuid}: {resp.status_code} {resp.text}"
)


Storage webhook duplicates links webhook implementation

Low Severity

The save function in server/storage/webhook/__init__.py is nearly identical to the run function in server/links/webhook/__init__.py. Both fetch the vCon from Redis, convert to dict, patch the version from 0.0.1 to 0.3.0, and POST to each configured webhook URL. This duplicated logic means a bug fix in one location could easily be missed in the other.
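One way to deduplicate, sketched: hoist the shared patch-and-POST loop into a helper that both the link and the storage module import. The function name and signature here are hypothetical.

```python
# Hypothetical shared helper: patch the vCon version once, then deliver
# the same payload to every configured webhook URL via a caller-supplied
# post function, so link and storage code share one implementation.
def deliver_to_webhooks(post_fn, webhook_urls, vcon_dict):
    """POST the version-patched vCon to each URL; return per-URL results."""
    payload = dict(vcon_dict, vcon="0.3.0")  # patch 0.0.1 -> 0.3.0
    return [post_fn(url, payload) for url in webhook_urls]
```

links.webhook's run() and storage.webhook's save() would then both reduce to fetching the vCon and calling this helper, so a fix lands in one place.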


Contributor

The details here are very specific to one environment; we should remove them from the open source repo.



pavanputhra commented Mar 4, 2026

We don't need this PR. It's a client-specific implementation.

@pavanputhra pavanputhra closed this Mar 4, 2026