Optimize ingest indexing #129

Closed
howethomas wants to merge 8 commits into main from optimize-ingest-indexing

howethomas commented Mar 2, 2026

Note

Medium Risk
Moderate risk: adds a new unauthenticated monitoring endpoint and changes ingest-time indexing behavior; also introduces new optional link modules and a sizeable observability docker-compose stack that could affect local deployments if enabled.

Overview
Adds an optional SigNoz observability stack via docker-compose.signoz.yml plus ClickHouse/collector configs and docs under signoz/.

Updates the server to expose a new public GET /stats/queue endpoint for Redis list depth, and optimizes ingest-time indexing by indexing parties directly (index_vcon_parties) rather than re-reading the vCon from Redis.

Introduces two new link modules: links/wtf_transcribe (sends audio dialogs to an external vfun service and stores results as wtf_transcription analysis) and links/keyword_tagger (tags vCons based on keyword matches in transcription/WTF analysis). The webhook link now normalizes the vcon version to 0.3.0 for downstream compatibility.

Also tweaks the Docker image build to force Debian APT sources to HTTPS.

Written by Cursor Bugbot for commit bd0715f.

howethomas and others added 8 commits January 27, 2026 22:24
Updates webhook link to set vcon version to 0.3.0 for
compatibility with vcon-mcp REST API.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Configure apt to use HTTPS sources for environments
where HTTP port 80 is blocked.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Includes docker-compose and config files for SigNoz
observability stack with OpenTelemetry collector.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

Public endpoint (no auth) that returns the depth of any Redis list,
used by the audio adapter for backpressure control.

Co-Authored-By: Claude Opus 4.6 <[email protected]>

The post_vcon and external_ingress_vcon paths called index_vcon() which
re-read the vCon from Redis (JSON.GET) and duplicated the sorted set add
(ZADD) that was already done by the caller. This added 2 unnecessary
Redis round-trips per ingest.

Extract index_vcon_parties() that takes the vCon dict directly, and use
it in both POST paths. The original index_vcon() is preserved for the
bulk re-indexing endpoint. Reduces ingest from 11 to 9 Redis ops per
vCon, measured 4.9x improvement in adapter posting throughput.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
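
For readers without the diff at hand, the shape of the change described in the last commit can be sketched roughly as follows. This is only an illustration under assumptions: the `FakeRedis` stand-in, the `vcon:<uuid>` key format, the indexed party fields, and the synchronous signature are all hypothetical and are not taken from the PR. The point it shows is that the helper receives the vCon dict the caller already holds, so it needs no extra read from Redis and no second sorted-set add.

```python
from collections import defaultdict


class FakeRedis:
    """In-memory stand-in for the one Redis set command used below (hypothetical)."""

    def __init__(self):
        self.sets = defaultdict(set)
        self.calls = 0  # counts simulated round-trips

    def sadd(self, key, member):
        self.calls += 1
        self.sets[key].add(member)


def index_vcon_parties(redis, vcon: dict) -> None:
    """Index party identifiers straight from the in-memory vCon dict.

    No JSON.GET and no ZADD happen here; the caller is assumed to have
    stored the vCon and updated the sorted set already.
    """
    vcon_key = f"vcon:{vcon['uuid']}"  # assumed key format
    for party in vcon.get("parties", []):
        for field in ("tel", "mailto", "name"):  # assumed indexed fields
            value = party.get(field)
            if value:
                redis.sadd(f"{field}:{value}", vcon_key)


redis = FakeRedis()
example_vcon = {
    "uuid": "abc",
    "parties": [{"tel": "+1555", "name": "Ann"}, {"mailto": "b@example.com"}],
}
index_vcon_parties(redis, example_vcon)
```

In this sketch the only Redis traffic is one set add per party identifier, which is the part of the ingest path the commit message says was kept.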
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 4 potential issues.

        return JSONResponse(content={"list_name": list_name, "depth": depth})
    except Exception as e:
        logger.error(f"Error getting queue depth for '{list_name}': {str(e)}")
        raise HTTPException(status_code=500, detail="Failed to get queue depth")

Unauthenticated endpoint allows arbitrary Redis key querying

Medium Severity

The /stats/queue endpoint is mounted directly on app (no auth, like /health and /version) but accepts an arbitrary list_name parameter that's passed directly to redis_async.llen(). Unlike the health endpoint, this allows unauthenticated callers to probe any Redis key — discovering queue names, measuring queue depths, and distinguishing key types (list keys return a count while non-list keys trigger a WRONGTYPE error, returned as a 500). Consider restricting the list_name to a known allowlist of queue names or placing this endpoint behind api_router authentication.
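
A minimal sketch of the allowlist option, assuming the set of legitimate queue names is known at configuration time. The names in `ALLOWED_QUEUES` and the helper `is_allowed_queue` are hypothetical and do not come from the PR; the handler would presumably call such a check before `redis_async.llen()` and reject anything else with a 4xx response.

```python
# Hypothetical allowlist; real queue names would come from the server config.
ALLOWED_QUEUES = frozenset({"ingress", "egress"})


def is_allowed_queue(list_name: str, allowed: frozenset = ALLOWED_QUEUES) -> bool:
    """Return True only for queue names that are explicitly allowlisted.

    An exact membership test means arbitrary Redis keys can never reach
    LLEN, so callers cannot probe key existence or key types.
    """
    return list_name in allowed
```

Rejecting unknown names before touching Redis also removes the WRONGTYPE-versus-count signal described above, because non-list keys are never queried.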


    # Other Content
    "other": {
        "profanity": ["fuck", "shit", "damn", "ass"],

Substring matching causes false positives for short keywords

Medium Severity

The find_keywords function uses Python's in operator for substring matching. The profanity keyword "ass" will match extremely common words like "class", "assist", "pass", "mass", "assessment", "classic", etc. This will produce a high volume of false positive profanity tags on virtually any transcription of meaningful length. While some partial keywords (like "enforc") are intentionally designed for substring matching, "ass" is too short and appears in too many benign words.
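
One possible fix is to match whole words by default and keep prefix matching as an opt-in. The sketch below is an assumption-laden illustration, not the link's actual code: the function name `find_keywords_whole_word`, its signature, and the trailing `*` convention for prefix keywords are all hypothetical.

```python
import re


def find_keywords_whole_word(text, keywords, case_sensitive=False):
    """Return the keywords that occur in text as whole words.

    A keyword ending in "*" (hypothetical convention) is treated as a
    prefix, so "enforc*" still matches "enforcement".
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    matched = []
    for keyword in keywords:
        if keyword.endswith("*"):
            # Prefix match: word boundary, the stem, then any word characters.
            pattern = rf"\b{re.escape(keyword[:-1])}\w*"
        else:
            # Whole-word match: word boundaries on both sides.
            pattern = rf"\b{re.escape(keyword)}\b"
        if re.search(pattern, text, flags):
            matched.append(keyword)
    return matched
```

With this approach short keywords no longer fire inside words such as "class" or "assist", while the intentionally partial keywords can be migrated by adding the explicit marker.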

### Start with SigNoz

```bash
cd /home/thomas/bds/vcon-dev/vcon-server
```

Hardcoded personal filesystem path in documentation

Low Severity

The README contains a hardcoded personal development path /home/thomas/bds/vcon-dev/vcon-server in its usage examples. This is a developer-specific local filesystem path that won't work for other contributors and leaks information about a developer's local environment.

"categories": None, # None means all categories
"custom_keywords": {}, # Additional tag -> keywords mappings
"case_sensitive": False,
"min_confidence": 0.0, # Minimum transcription confidence to process

Declared min_confidence option is never used

Low Severity

The min_confidence option is declared in default_options with a comment suggesting it controls minimum transcription confidence to process, but it's never read or checked anywhere in the run function. Users who configure this value will see no effect—low-confidence transcriptions will still be processed and tagged.
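
If the option is meant to stay, a check along these lines could gate processing. This is a sketch under assumptions: the helper name `meets_min_confidence` is hypothetical, and so is the location of the score (a `confidence` field inside the analysis `body`); the real analysis layout may differ.

```python
def meets_min_confidence(analysis: dict, min_confidence: float) -> bool:
    """Decide whether a transcription analysis is confident enough to tag.

    Assumes the score lives at analysis["body"]["confidence"]. Analyses
    that carry no score are processed, so the default of 0.0 keeps the
    current behavior.
    """
    body = analysis.get("body")
    if not isinstance(body, dict):
        return True  # no structured body, nothing to compare against
    confidence = body.get("confidence")
    if confidence is None:
        return True
    return confidence >= min_confidence
```

The alternative is simply to drop `min_confidence` from `default_options` until it is implemented, so that configuring it cannot silently do nothing.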

pavanputhra closed this Mar 4, 2026

pavanputhra (Contributor) commented:

We are going to remove the party indexing feature in the future, so this change is not needed.
