feat: knowledge records and topics implementation #7

mattinannt wants to merge 27 commits into main from
Conversation
- Add Knowledge Records endpoints and schemas for contextual AI enrichment
- Add hierarchical Topics (taxonomy) endpoints and schemas
- Add documentation for AI enrichment architecture and design decisions
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dbcdf66277
openapi.yaml (outdated diff)

```yaml
parent_id:
  type: string
  format: uuid
  nullable: true
  description: Parent topic ID (null for top-level topics)
```
Model null parent_id with JSON Schema, not nullable
The spec declares openapi: 3.1.0, which uses JSON Schema 2020-12, but parent_id is marked with nullable: true. In OAS 3.1, nullable isn’t part of the schema vocabulary, so many validators and code generators ignore it and treat parent_id as a plain string. That will reject or mis-deserialize responses where top-level topics return parent_id: null (as shown in the examples). To avoid validation/client breakage, model this as type: [string, "null"] or anyOf for both TopicData.parent_id and CreateTopicInputBody.parent_id.
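Concretely, the reviewer's suggested fix would look something like this (a sketch of the union-type form; the anyOf variant is equivalent):

```yaml
parent_id:
  type: ["string", "null"]
  format: uuid
  description: Parent topic ID (null for top-level topics)
```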
Implements the full backend for knowledge records and topics:

- Database schema migration (sql/002_knowledge_and_topics.sql)
- Models with validation (internal/models/)
- Repository layer with CRUD operations (internal/repository/)
- Service layer with business logic (internal/service/)
- HTTP handlers with proper error handling (internal/api/handlers/)
- ConflictError type for 409 responses (internal/errors/)
- Route registration in main.go
- Comprehensive integration tests

Key features:

- Topics: hierarchical with auto-calculated levels, cascade delete
- Knowledge Records: bulk delete by tenant_id
- Cross-tenant validation for parent topics
- Title uniqueness within parent+tenant scope

A sketch of the level calculation and cross-tenant check appears below.
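As a rough illustration of auto-calculated levels plus cross-tenant validation (a minimal sketch: the type and method names are assumptions, as is UUID handling via github.com/google/uuid):

```go
package service

import (
	"context"
	"errors"

	"github.com/google/uuid"
)

// Illustrative stand-ins for the real types in internal/models/
// and internal/repository/.
type Topic struct {
	ID       uuid.UUID
	TenantID string
	ParentID *uuid.UUID
	Level    int
}

type TopicsRepository interface {
	GetByID(ctx context.Context, id uuid.UUID) (*Topic, error)
}

// resolveLevel derives a new topic's level from its parent and rejects
// parents that belong to another tenant.
func resolveLevel(ctx context.Context, repo TopicsRepository, parentID *uuid.UUID, tenantID string) (int, error) {
	if parentID == nil {
		return 1, nil // top-level topic
	}
	parent, err := repo.GetByID(ctx, *parentID)
	if err != nil {
		return 0, err
	}
	if parent.TenantID != tenantID {
		return 0, errors.New("parent topic belongs to a different tenant")
	}
	return parent.Level + 1, nil
}
```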
This commit adds the foundation for AI-powered feedback enrichment:

## New Features
- Automatic embedding generation for knowledge records, topics, and text feedback
- pgvector integration for vector similarity search
- OpenAI text-embedding-3-small model support (1536 dimensions)

## Changes

### Database Schema (sql/003_embeddings.sql)
- Added embedding vector columns to knowledge_records, topics, and feedback_records
- Added AI enrichment fields to feedback_records: topic_id, classification_confidence, sentiment, sentiment_score, emotion
- Created HNSW indexes for fast vector similarity search
- Added indexes for sentiment/emotion filtering

### New Package: internal/embeddings/
- client.go: Embedding client interface
- openai.go: OpenAI implementation using text-embedding-3-small
- mock.go: Deterministic mock client for testing

### Configuration
- Added OPENAI_API_KEY to config (optional; AI enrichment is disabled if not set)

### Services
- KnowledgeRecordsService: auto-generates embedding on create/update
- TopicsService: auto-generates embedding on create/update
- FeedbackRecordsService: auto-generates embedding for text feedback

### Repositories
- Added UpdateEmbedding methods to all repositories
- Added UpdateEnrichment for feedback records with full AI fields
- Extended GetByID/List queries to include new AI enrichment fields

## Dependencies
- github.com/sashabaranov/go-openai v1.36.1
- github.com/pgvector/pgvector-go v0.3.0

## Usage
Set the OPENAI_API_KEY environment variable to enable AI enrichment. When enabled, embeddings are generated asynchronously after record creation. A sketch of the embedding client interface is shown below.
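The commit message doesn't show the interface itself; a minimal sketch of what internal/embeddings/client.go could define, given the features above:

```go
package embeddings

import "context"

// Client generates vector embeddings for text. This is an illustrative
// sketch; the PR's actual interface may differ.
type Client interface {
	// Embed returns an embedding vector for the given text; with OpenAI's
	// text-embedding-3-small this is 1536 dimensions.
	Embed(ctx context.Context, text string) ([]float32, error)
}
```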
Removes sentiment, sentiment_score, and emotion fields from feedback_records. These fields were added prematurely: they require separate LLM API calls (not just embeddings), which adds cost and complexity beyond the original requirements.

Keeping only embedding-based enrichment:
- embedding: vector for semantic search
- topic_id: classification via vector similarity
- classification_confidence: confidence score for topic match

Changes:
- sql/003_embeddings.sql: Removed sentiment/emotion columns and indexes
- sql/004_remove_sentiment_fields.sql: Migration to drop existing columns
- internal/models/feedback_records.go: Removed Sentiment, SentimentScore, Emotion
- internal/repository/feedback_records_repository.go: Updated queries
Implements automatic topic classification using vector similarity search. When feedback is created, it's now automatically classified against existing topics based on embedding similarity.

## New Features

### Topic Classification
- Feedback records are automatically matched to the most similar topic
- Uses cosine similarity with a configurable threshold (default: 0.5)
- Classification happens asynchronously after embedding generation
- Results stored in topic_id and classification_confidence fields

### Filter by Topic
- Added topic_id filter to the GET /v1/feedback-records endpoint
- Allows querying all feedback classified under a specific topic

## Changes

### Models
- Added TopicMatch struct to models/topics.go (shared type)
- Added TopicID filter to ListFeedbackRecordsFilters

### Repository
- Added FindSimilarTopic method to TopicsRepository
- Uses the pgvector cosine distance operator (<=>)
- Added topic_id condition to the feedback list query

### Service
- Added TopicClassifier interface
- Added NewFeedbackRecordsServiceWithClassification constructor
- Updated enrichRecord to classify after embedding generation
- Logs classification results at debug level

### Main
- Reordered initialization (topics repo before feedback service)
- Wired topics repo as classifier for feedback service

## Usage
1. Create topics with embeddings (auto-generated on create)
2. Create feedback records; they auto-classify to the best matching topic
3. Query feedback by topic: GET /v1/feedback-records?topic_id=<uuid>

A sketch of what FindSimilarTopic could look like with pgvector follows.
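A minimal sketch, assuming a database/sql-backed repository and the table/column names implied above (the real implementation may differ):

```go
package repository

import (
	"context"
	"database/sql"
	"errors"

	pgvector "github.com/pgvector/pgvector-go"
)

// TopicMatch mirrors the shared type the commit adds to models/topics.go;
// the fields here are assumptions.
type TopicMatch struct {
	TopicID    string
	Similarity float64
}

type TopicsRepository struct{ db *sql.DB }

// FindSimilarTopic returns the closest topic by cosine similarity, or nil
// when nothing clears the threshold (default 0.5 per the commit message).
func (r *TopicsRepository) FindSimilarTopic(ctx context.Context, tenantID string, embedding []float32, threshold float64) (*TopicMatch, error) {
	const q = `
		SELECT id, 1 - (embedding <=> $1) AS similarity
		FROM topics
		WHERE tenant_id = $2 AND embedding IS NOT NULL
		ORDER BY embedding <=> $1
		LIMIT 1`
	var m TopicMatch
	err := r.db.QueryRowContext(ctx, q, pgvector.NewVector(embedding), tenantID).
		Scan(&m.TopicID, &m.Similarity)
	if errors.Is(err, sql.ErrNoRows) {
		return nil, nil // tenant has no topics with embeddings yet
	}
	if err != nil {
		return nil, err
	}
	if m.Similarity < threshold {
		return nil, nil // best match is below the configurable threshold
	}
	return &m, nil
}
```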
…upport

- Add theme_id column to feedback_records for hierarchical taxonomy
- Implement threshold-based classification (0.30 for themes, 0.40 for subtopics)
- Update FindSimilarTopic to support level filtering
- Add TopicMatch model for classification results
- Update OpenAPI spec with theme_id field and filter
- Add pgAdmin to docker-compose for database visualization
- Add CSV ingestion script for testing with sample data
- Include sample feedback data in testdata/
…ification only
- Add parent_id column back to topics table for explicit Level 1 → Level 2 hierarchy
- Update /topics/{id}/similar to /topics/{id}/children endpoint
- Level 2 topics now require parent_id when created
- Embeddings are used only for feedback → topic classification
- Update OpenAPI spec with parent_id field and children endpoint
- Add embedding-classification documentation
- Update ingestion script to create topics with parent_id
- Simplify classification to only classify to Level 2 topics
This adds a new ClassificationWorker that periodically retries classification of feedback records that have embeddings but no topic classification.

- Add configuration options for the classification retry interval and batch size.
- Update the feedback records model with an UnclassifiedRecord type for records needing re-classification.
- Implement repository methods to list unclassified records and update their classification.
- Modify the feedback records service to support retrying classification in batches.
- Add a new CLI tool for ingesting feedback from CSV files into the system.

A sketch of the worker loop follows the list.
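A minimal sketch of such a retry loop; the service interface and config field names here are assumptions, not the PR's actual code:

```go
package worker

import (
	"context"
	"log/slog"
	"time"
)

// Retrier is an illustrative stand-in for the feedback records service's
// batch re-classification method.
type Retrier interface {
	RetryClassification(ctx context.Context, batchSize int) (int, error)
}

// ClassificationWorker periodically retries classification for records
// that have embeddings but no topic.
type ClassificationWorker struct {
	svc       Retrier
	interval  time.Duration // configurable retry interval
	batchSize int           // configurable batch size
}

func (w *ClassificationWorker) Run(ctx context.Context) {
	ticker := time.NewTicker(w.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			n, err := w.svc.RetryClassification(ctx, w.batchSize)
			if err != nil {
				slog.Error("classification retry failed", "error", err)
				continue
			}
			slog.Debug("classification retry completed", "classified", n)
		}
	}
}
```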
Force-pushed from f7ca4be to dec18a0.
…T-4o labeling
This commit introduces a complete taxonomy generation pipeline for automatically
categorizing feedback records into hierarchical topics.
## Python Microservice (services/taxonomy-generator/)
- FastAPI service for ML-intensive clustering operations
- UMAP dimensionality reduction (1536 → 10 dimensions)
- HDBSCAN clustering for automatic cluster discovery
- GPT-4o labeling to generate human-readable topic names
- Supports Level 1 (broad categories) and Level 2 (sub-topics)
- Level 2 topics generated only for dense clusters (500+ items)
## Go API Integration
- TaxonomyClient: HTTP client to communicate with Python service
- TaxonomyHandler: REST endpoints for taxonomy operations
- POST /v1/taxonomy/{tenant_id}/generate (async)
- POST /v1/taxonomy/{tenant_id}/generate/sync (blocking)
- GET /v1/taxonomy/{tenant_id}/status
- Schedule management endpoints
## Periodic Re-clustering
- clustering_jobs table for scheduling taxonomy regeneration
- TaxonomyScheduler worker polls for due jobs
- Supports daily, weekly, monthly intervals per tenant
## Infrastructure
- Dockerfile for Go API (multi-stage build)
- Docker Compose orchestration for all services
- Environment configuration for taxonomy service URL
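For the TaxonomyClient listed under Go API Integration above, a rough sketch of the async generate call; the request struct, the error handling, and the assumption that the Python service mirrors the /v1/taxonomy/{tenant_id}/generate path are all illustrative:

```go
package taxonomy

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// GenerateRequest is an assumed request body; the actual schema is not
// shown in this PR excerpt.
type GenerateRequest struct {
	MaxLevels int `json:"max_levels,omitempty"`
}

// Client talks to the Python taxonomy-generator service.
type Client struct {
	baseURL string // taxonomy service URL from config
	httpc   *http.Client
}

// Generate kicks off asynchronous taxonomy generation for a tenant.
func (c *Client) Generate(ctx context.Context, tenantID string, req GenerateRequest) error {
	body, err := json.Marshal(req)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("%s/v1/taxonomy/%s/generate", c.baseURL, tenantID)
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	httpReq.Header.Set("Content-Type", "application/json")
	resp, err := c.httpc.Do(httpReq)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= http.StatusMultipleChoices {
		return fmt.Errorf("taxonomy service returned %s", resp.Status)
	}
	return nil
}
```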
…nt scheduling

- Resolved conflicts with origin/feat/taxonomies
- Kept taxonomy scheduler (required for per-tenant periodic clustering)
- Removed classification retry worker (was removed in remote)
- Added TaxonomyServiceURL, TaxonomySchedulerEnabled, TaxonomyPollInterval config
- Scheduler disabled by default (TAXONOMY_SCHEDULER_ENABLED=false)
- Added ListByTopicWithDescendants repository method for direct topic_id lookup
- Modified service to use direct lookup by default instead of similarity search
- Added UseSimilarity filter option to explicitly request vector similarity search
- Direct lookup uses pre-computed topic assignments from taxonomy generation
- Includes descendant topics (Level 1 shows all Level 2 feedback)

Benefits:
- Much faster queries (simple WHERE clause vs. vector similarity)
- Accurate cluster-based results matching taxonomy generation
- Falls back to similarity search with the use_similarity=true query param

A sketch of the descendant lookup follows.
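A hypothetical query behind ListByTopicWithDescendants: a recursive CTE walks the topic subtree, then feedback is filtered with a plain WHERE clause instead of a vector search. Table and column names are assumptions:

```go
package repository

// Sketch: collect the topic and all descendants, then match feedback
// against the subtree using the pre-computed topic_id assignments.
const listByTopicWithDescendantsSQL = `
WITH RECURSIVE subtree AS (
    SELECT id FROM topics WHERE id = $1
    UNION ALL
    SELECT t.id
    FROM topics t
    JOIN subtree s ON t.parent_id = s.id
)
SELECT f.*
FROM feedback_records f
WHERE f.tenant_id = $2
  AND f.topic_id IN (SELECT id FROM subtree)
ORDER BY f.created_at DESC`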
…omies

Resolved conflicts:
- .env.example: Combined River job queue and Taxonomy service settings
- internal/config/config.go: Combined both configurations with all helper functions
- api: Removed binary from tracking, added to .gitignore

Fixed linting issues in incoming code:
- taxonomy_client.go: Check resp.Body.Close() error returns
- taxonomy_scheduler.go: Check UpdateAfterRun() error returns
- feedback_records_repository.go: Remove duplicate query assignment
- Changed level2_min_cluster_size from 500 to 50 to allow for more flexible clustering.
This adds support for configurable taxonomy hierarchy depth, allowing
users to generate taxonomies with 1 to 4+ levels without code changes.
## Changes
### Python Taxonomy Service
- Refactored clustering to use recursive algorithm supporting N levels
- Added `max_levels` config parameter (default: 4)
- Added per-level cluster size configurations
- Updated GPT-4o prompts with level-aware context
### Configuration Options
- `max_levels`: Maximum taxonomy depth (1-10, default: 4)
- `level_min_cluster_sizes`: Min items needed to create children per level
- `level_hdbscan_min_cluster_sizes`: HDBSCAN cluster size per level
### CSV Ingestion Script
- Added semicolon delimiter support for normalized CSV format
- Fixed column mapping for hub-combined-test-data format
## Usage
Change depth via API request (no code changes needed):
```bash
# 2 levels
curl -X POST "http://localhost:8080/v1/taxonomy/TENANT/generate" \
-H "Authorization: Bearer API_KEY" \
-d '{"max_levels": 2}'
# 4 levels (default)
curl -X POST "http://localhost:8080/v1/taxonomy/TENANT/generate" \
-H "Authorization: Bearer API_KEY" \
-d '{"max_levels": 4}'
```
## Sample 4-Level Hierarchy
```
Account Testing
└─ Email Errors
   └─ Email Change Issues
      └─ Workspace Creation Errors
```
After embedding generation completes for feedback records, automatically assign the most similar topic based on vector similarity. This provides immediate topic classification without waiting for batch clustering.

- Add TenantID to EmbeddingJobArgs for tenant-isolated topic lookup
- Add FindMostSpecificTopic() to find the highest-level topic above the threshold
- Add AssignTopic() with idempotent behavior (preserves manual overrides)
- Extend EmbeddingWorker with topic assignment after embedding success
- Wire TopicMatcher and FeedbackAssigner dependencies in main.go

Failures during topic assignment are logged but don't fail the embedding job, ensuring graceful degradation when no topics exist yet. A sketch of the assignment step appears below.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
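A minimal sketch of that assignment step; the TopicMatcher and FeedbackAssigner names come from the commit message, but every signature here is an assumption:

```go
package worker

import (
	"context"
	"log/slog"
)

// Illustrative interfaces for the dependencies wired in main.go.
type TopicMatcher interface {
	FindMostSpecificTopic(ctx context.Context, tenantID string, embedding []float32) (*Match, error)
}

type FeedbackAssigner interface {
	AssignTopic(ctx context.Context, recordID, topicID string, confidence float64) error
}

type Match struct {
	TopicID    string
	Confidence float64
}

type EmbeddingWorker struct {
	matcher  TopicMatcher
	assigner FeedbackAssigner
}

// assignTopic runs after a successful embedding job. Errors are logged,
// never returned: assignment failures must not fail the embedding job.
func (w *EmbeddingWorker) assignTopic(ctx context.Context, recordID, tenantID string, embedding []float32) {
	match, err := w.matcher.FindMostSpecificTopic(ctx, tenantID, embedding)
	if err != nil {
		slog.Warn("topic matching failed", "record", recordID, "error", err)
		return
	}
	if match == nil {
		return // no topics exist yet; degrade gracefully
	}
	// AssignTopic is idempotent and preserves manual overrides.
	if err := w.assigner.AssignTopic(ctx, recordID, match.TopicID, match.Confidence); err != nil {
		slog.Warn("topic assignment failed", "record", recordID, "error", err)
	}
}
```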
- Add FeedbackCount field to Topic model for API response
- Reduce HDBSCAN min_cluster_size thresholds for smaller datasets
- Lower level_min_cluster_sizes for more granular subdivision
- Change default max_levels from 4 to 3 for <10k datasets
- Suppress expected UMAP n_jobs warning when random_state is set
- Remove unused strPtr function from ingest script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
This PR implements the Knowledge Records and Topics (taxonomy) feature for AI Enrichment in Formbricks Hub.
Changes
API Specification:
- Knowledge Records endpoints (/v1/knowledge-records) for contextual AI enrichment data
- Topics endpoints (/v1/topics) for hierarchical feedback classification
- openapi.yaml with full CRUD support, schema definitions, and comprehensive examples

Go Implementation:
- Database schema migration (sql/002_knowledge_and_topics.sql)
- Models with validation (internal/models/)
- Repository layer with CRUD operations (internal/repository/)
- Service layer with business logic (internal/service/)
- HTTP handlers with proper error handling (internal/api/handlers/)
- ConflictError type for 409 responses (internal/errors/)
- Route registration in main.go

Documentation:
- docs/enrichment.md detailing the architecture, design decisions, and roadmap

Key Features
- Hierarchical topics with auto-calculated levels and cascade delete
- Knowledge records with bulk delete by tenant_id
- Cross-tenant validation for parent topics
- Title uniqueness within parent+tenant scope

Test Plan
- make lint (0 issues)
- go build ./...
- Integration tests pass (make tests)

To run tests locally: