feat: knowledge records and topics implementation #7

mattinannt wants to merge 27 commits into main from
Conversation
- Add Knowledge Records endpoints and schemas for contextual AI enrichment
- Add hierarchical Topics (taxonomy) endpoints and schemas
- Add documentation for AI enrichment architecture and design decisions
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dbcdf66277
openapi.yaml (outdated diff)

```yaml
parent_id:
  type: string
  format: uuid
  nullable: true
  description: Parent topic ID (null for top-level topics)
```
Model null parent_id with JSON Schema, not nullable
The spec declares openapi: 3.1.0, which uses JSON Schema 2020-12, but parent_id is marked with nullable: true. In OAS 3.1, nullable isn’t part of the schema vocabulary, so many validators and code generators ignore it and treat parent_id as a plain string. That will reject or mis-deserialize responses where top-level topics return parent_id: null (as shown in the examples). To avoid validation/client breakage, model this as type: [string, "null"] or anyOf for both TopicData.parent_id and CreateTopicInputBody.parent_id.
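Concretely, the reviewer's suggested fix would look something like this (a sketch of the union-type form; the anyOf variant is equivalent):

```yaml
parent_id:
  type: ["string", "null"]
  format: uuid
  description: Parent topic ID (null for top-level topics)
```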
Implements the full backend for knowledge records and topics:

- Database schema migration (sql/002_knowledge_and_topics.sql)
- Models with validation (internal/models/)
- Repository layer with CRUD operations (internal/repository/)
- Service layer with business logic (internal/service/)
- HTTP handlers with proper error handling (internal/api/handlers/)
- ConflictError type for 409 responses (internal/errors/)
- Route registration in main.go
- Comprehensive integration tests

Key features:

- Topics: hierarchical with auto-calculated levels, cascade delete
- Knowledge Records: bulk delete by tenant_id
- Cross-tenant validation for parent topics
- Title uniqueness within parent+tenant scope

A sketch of the level calculation and cross-tenant check appears below.
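As a rough illustration of auto-calculated levels plus cross-tenant validation (a minimal sketch: the type and method names are assumptions, as is UUID handling via github.com/google/uuid):

```go
package service

import (
	"context"
	"errors"

	"github.com/google/uuid"
)

// Illustrative stand-ins for the real types in internal/models/
// and internal/repository/.
type Topic struct {
	ID       uuid.UUID
	TenantID string
	ParentID *uuid.UUID
	Level    int
}

type TopicsRepository interface {
	GetByID(ctx context.Context, id uuid.UUID) (*Topic, error)
}

// resolveLevel derives a new topic's level from its parent and rejects
// parents that belong to another tenant.
func resolveLevel(ctx context.Context, repo TopicsRepository, parentID *uuid.UUID, tenantID string) (int, error) {
	if parentID == nil {
		return 1, nil // top-level topic
	}
	parent, err := repo.GetByID(ctx, *parentID)
	if err != nil {
		return 0, err
	}
	if parent.TenantID != tenantID {
		return 0, errors.New("parent topic belongs to a different tenant")
	}
	return parent.Level + 1, nil
}
```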
This commit adds the foundation for AI-powered feedback enrichment:

## New Features
- Automatic embedding generation for knowledge records, topics, and text feedback
- pgvector integration for vector similarity search
- OpenAI text-embedding-3-small model support (1536 dimensions)

## Changes

### Database Schema (sql/003_embeddings.sql)
- Added embedding vector columns to knowledge_records, topics, and feedback_records
- Added AI enrichment fields to feedback_records: topic_id, classification_confidence, sentiment, sentiment_score, emotion
- Created HNSW indexes for fast vector similarity search
- Added indexes for sentiment/emotion filtering

### New Package: internal/embeddings/
- client.go: Embedding client interface
- openai.go: OpenAI implementation using text-embedding-3-small
- mock.go: Deterministic mock client for testing

### Configuration
- Added OPENAI_API_KEY to config (optional; AI enrichment is disabled if not set)

### Services
- KnowledgeRecordsService: auto-generates embedding on create/update
- TopicsService: auto-generates embedding on create/update
- FeedbackRecordsService: auto-generates embedding for text feedback

### Repositories
- Added UpdateEmbedding methods to all repositories
- Added UpdateEnrichment for feedback records with full AI fields
- Extended GetByID/List queries to include new AI enrichment fields

## Dependencies
- github.com/sashabaranov/go-openai v1.36.1
- github.com/pgvector/pgvector-go v0.3.0

## Usage
Set the OPENAI_API_KEY environment variable to enable AI enrichment. When enabled, embeddings are generated asynchronously after record creation. A sketch of the embedding client interface is shown below.
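The commit message doesn't show the interface itself; a minimal sketch of what internal/embeddings/client.go could define, given the features above:

```go
package embeddings

import "context"

// Client generates vector embeddings for text. This is an illustrative
// sketch; the PR's actual interface may differ.
type Client interface {
	// Embed returns an embedding vector for the given text; with OpenAI's
	// text-embedding-3-small this is 1536 dimensions.
	Embed(ctx context.Context, text string) ([]float32, error)
}
```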
Removes sentiment, sentiment_score, and emotion fields from feedback_records. These fields were added prematurely: they require separate LLM API calls (not just embeddings), which adds cost and complexity beyond the original requirements.

Keeping only embedding-based enrichment:
- embedding: vector for semantic search
- topic_id: classification via vector similarity
- classification_confidence: confidence score for topic match

Changes:
- sql/003_embeddings.sql: Removed sentiment/emotion columns and indexes
- sql/004_remove_sentiment_fields.sql: Migration to drop existing columns
- internal/models/feedback_records.go: Removed Sentiment, SentimentScore, Emotion
- internal/repository/feedback_records_repository.go: Updated queries
Implements automatic topic classification using vector similarity search. When feedback is created, it's now automatically classified against existing topics based on embedding similarity.

## New Features

### Topic Classification
- Feedback records are automatically matched to the most similar topic
- Uses cosine similarity with a configurable threshold (default: 0.5)
- Classification happens asynchronously after embedding generation
- Results stored in topic_id and classification_confidence fields

### Filter by Topic
- Added topic_id filter to the GET /v1/feedback-records endpoint
- Allows querying all feedback classified under a specific topic

## Changes

### Models
- Added TopicMatch struct to models/topics.go (shared type)
- Added TopicID filter to ListFeedbackRecordsFilters

### Repository
- Added FindSimilarTopic method to TopicsRepository
- Uses the pgvector cosine distance operator (<=>)
- Added topic_id condition to the feedback list query

### Service
- Added TopicClassifier interface
- Added NewFeedbackRecordsServiceWithClassification constructor
- Updated enrichRecord to classify after embedding generation
- Logs classification results at debug level

### Main
- Reordered initialization (topics repo before feedback service)
- Wired topics repo as classifier for feedback service

## Usage
1. Create topics with embeddings (auto-generated on create)
2. Create feedback records; they auto-classify to the best matching topic
3. Query feedback by topic: GET /v1/feedback-records?topic_id=<uuid>

A sketch of what FindSimilarTopic could look like with pgvector follows.
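A minimal sketch, assuming a database/sql-backed repository and the table/column names implied above (the real implementation may differ):

```go
package repository

import (
	"context"
	"database/sql"
	"errors"

	pgvector "github.com/pgvector/pgvector-go"
)

// TopicMatch mirrors the shared type the commit adds to models/topics.go;
// the fields here are assumptions.
type TopicMatch struct {
	TopicID    string
	Similarity float64
}

type TopicsRepository struct{ db *sql.DB }

// FindSimilarTopic returns the closest topic by cosine similarity, or nil
// when nothing clears the threshold (default 0.5 per the commit message).
func (r *TopicsRepository) FindSimilarTopic(ctx context.Context, tenantID string, embedding []float32, threshold float64) (*TopicMatch, error) {
	const q = `
		SELECT id, 1 - (embedding <=> $1) AS similarity
		FROM topics
		WHERE tenant_id = $2 AND embedding IS NOT NULL
		ORDER BY embedding <=> $1
		LIMIT 1`
	var m TopicMatch
	err := r.db.QueryRowContext(ctx, q, pgvector.NewVector(embedding), tenantID).
		Scan(&m.TopicID, &m.Similarity)
	if errors.Is(err, sql.ErrNoRows) {
		return nil, nil // tenant has no topics with embeddings yet
	}
	if err != nil {
		return nil, err
	}
	if m.Similarity < threshold {
		return nil, nil // best match is below the configurable threshold
	}
	return &m, nil
}
```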
…upport

- Add theme_id column to feedback_records for hierarchical taxonomy
- Implement threshold-based classification (0.30 for themes, 0.40 for subtopics)
- Update FindSimilarTopic to support level filtering
- Add TopicMatch model for classification results
- Update OpenAPI spec with theme_id field and filter
- Add pgAdmin to docker-compose for database visualization
- Add CSV ingestion script for testing with sample data
- Include sample feedback data in testdata/
…ification only
- Add parent_id column back to topics table for explicit Level 1 → Level 2 hierarchy
- Update /topics/{id}/similar to /topics/{id}/children endpoint
- Level 2 topics now require parent_id when created
- Embeddings are used only for feedback → topic classification
- Update OpenAPI spec with parent_id field and children endpoint
- Add embedding-classification documentation
- Update ingestion script to create topics with parent_id
- Simplify classification to only classify to Level 2 topics
This adds a new ClassificationWorker that periodically retries classification of feedback records that have embeddings but no topic classification.

- Add configuration options for the classification retry interval and batch size.
- Update the feedback records model with an UnclassifiedRecord type for records needing re-classification.
- Implement repository methods to list unclassified records and update their classification.
- Modify the feedback records service to support retrying classification in batches.
- Add a new CLI tool for ingesting feedback from CSV files into the system.

A sketch of the worker loop follows the list.
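A minimal sketch of such a retry loop; the service interface and config field names here are assumptions, not the PR's actual code:

```go
package worker

import (
	"context"
	"log/slog"
	"time"
)

// Retrier is an illustrative stand-in for the feedback records service's
// batch re-classification method.
type Retrier interface {
	RetryClassification(ctx context.Context, batchSize int) (int, error)
}

// ClassificationWorker periodically retries classification for records
// that have embeddings but no topic.
type ClassificationWorker struct {
	svc       Retrier
	interval  time.Duration // configurable retry interval
	batchSize int           // configurable batch size
}

func (w *ClassificationWorker) Run(ctx context.Context) {
	ticker := time.NewTicker(w.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			n, err := w.svc.RetryClassification(ctx, w.batchSize)
			if err != nil {
				slog.Error("classification retry failed", "error", err)
				continue
			}
			slog.Debug("classification retry completed", "classified", n)
		}
	}
}
```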
Force-pushed from f7ca4be to dec18a0.
…T-4o labeling
This commit introduces a complete taxonomy generation pipeline for automatically
categorizing feedback records into hierarchical topics.
## Python Microservice (services/taxonomy-generator/)
- FastAPI service for ML-intensive clustering operations
- UMAP dimensionality reduction (1536 → 10 dimensions)
- HDBSCAN clustering for automatic cluster discovery
- GPT-4o labeling to generate human-readable topic names
- Supports Level 1 (broad categories) and Level 2 (sub-topics)
- Level 2 topics generated only for dense clusters (500+ items)
## Go API Integration
- TaxonomyClient: HTTP client to communicate with Python service
- TaxonomyHandler: REST endpoints for taxonomy operations
- POST /v1/taxonomy/{tenant_id}/generate (async)
- POST /v1/taxonomy/{tenant_id}/generate/sync (blocking)
- GET /v1/taxonomy/{tenant_id}/status
- Schedule management endpoints
## Periodic Re-clustering
- clustering_jobs table for scheduling taxonomy regeneration
- TaxonomyScheduler worker polls for due jobs
- Supports daily, weekly, monthly intervals per tenant
## Infrastructure
- Dockerfile for Go API (multi-stage build)
- Docker Compose orchestration for all services
- Environment configuration for taxonomy service URL
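For the TaxonomyClient listed under Go API Integration above, a rough sketch of the async generate call; the request struct, the error handling, and the assumption that the Python service mirrors the /v1/taxonomy/{tenant_id}/generate path are all illustrative:

```go
package taxonomy

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// GenerateRequest is an assumed request body; the actual schema is not
// shown in this PR excerpt.
type GenerateRequest struct {
	MaxLevels int `json:"max_levels,omitempty"`
}

// Client talks to the Python taxonomy-generator service.
type Client struct {
	baseURL string // taxonomy service URL from config
	httpc   *http.Client
}

// Generate kicks off asynchronous taxonomy generation for a tenant.
func (c *Client) Generate(ctx context.Context, tenantID string, req GenerateRequest) error {
	body, err := json.Marshal(req)
	if err != nil {
		return err
	}
	url := fmt.Sprintf("%s/v1/taxonomy/%s/generate", c.baseURL, tenantID)
	httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	httpReq.Header.Set("Content-Type", "application/json")
	resp, err := c.httpc.Do(httpReq)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= http.StatusMultipleChoices {
		return fmt.Errorf("taxonomy service returned %s", resp.Status)
	}
	return nil
}
```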
…nt scheduling

- Resolved conflicts with origin/feat/taxonomies
- Kept taxonomy scheduler (required for per-tenant periodic clustering)
- Removed classification retry worker (was removed in remote)
- Added TaxonomyServiceURL, TaxonomySchedulerEnabled, TaxonomyPollInterval config
- Scheduler disabled by default (TAXONOMY_SCHEDULER_ENABLED=false)
- Added ListByTopicWithDescendants repository method for direct topic_id lookup
- Modified service to use direct lookup by default instead of similarity search
- Added UseSimilarity filter option to explicitly request vector similarity search
- Direct lookup uses pre-computed topic assignments from taxonomy generation
- Includes descendant topics (Level 1 shows all Level 2 feedback)

Benefits:
- Much faster queries (simple WHERE clause vs. vector similarity)
- Accurate cluster-based results matching taxonomy generation
- Falls back to similarity search with the use_similarity=true query param

A sketch of the descendant lookup follows.
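A hypothetical query behind ListByTopicWithDescendants: a recursive CTE walks the topic subtree, then feedback is filtered with a plain WHERE clause instead of a vector search. Table and column names are assumptions:

```go
package repository

// Sketch: collect the topic and all descendants, then match feedback
// against the subtree using the pre-computed topic_id assignments.
const listByTopicWithDescendantsSQL = `
WITH RECURSIVE subtree AS (
    SELECT id FROM topics WHERE id = $1
    UNION ALL
    SELECT t.id
    FROM topics t
    JOIN subtree s ON t.parent_id = s.id
)
SELECT f.*
FROM feedback_records f
WHERE f.tenant_id = $2
  AND f.topic_id IN (SELECT id FROM subtree)
ORDER BY f.created_at DESC`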
…omies

Resolved conflicts:
- .env.example: Combined River job queue and Taxonomy service settings
- internal/config/config.go: Combined both configurations with all helper functions
- api: Removed binary from tracking, added to .gitignore

Fixed linting issues in incoming code:
- taxonomy_client.go: Check resp.Body.Close() error returns
- taxonomy_scheduler.go: Check UpdateAfterRun() error returns
- feedback_records_repository.go: Remove duplicate query assignment
- Changed level2_min_cluster_size from 500 to 50 to allow for more flexible clustering.
This adds support for configurable taxonomy hierarchy depth, allowing
users to generate taxonomies with 1 to 4+ levels without code changes.
## Changes
### Python Taxonomy Service
- Refactored clustering to use recursive algorithm supporting N levels
- Added `max_levels` config parameter (default: 4)
- Added per-level cluster size configurations
- Updated GPT-4o prompts with level-aware context
### Configuration Options
- `max_levels`: Maximum taxonomy depth (1-10, default: 4)
- `level_min_cluster_sizes`: Min items needed to create children per level
- `level_hdbscan_min_cluster_sizes`: HDBSCAN cluster size per level
### CSV Ingestion Script
- Added semicolon delimiter support for normalized CSV format
- Fixed column mapping for hub-combined-test-data format
## Usage
Change depth via API request (no code changes needed):
```bash
# 2 levels
curl -X POST "http://localhost:8080/v1/taxonomy/TENANT/generate" \
-H "Authorization: Bearer API_KEY" \
-d '{"max_levels": 2}'
# 4 levels (default)
curl -X POST "http://localhost:8080/v1/taxonomy/TENANT/generate" \
-H "Authorization: Bearer API_KEY" \
-d '{"max_levels": 4}'
```
## Sample 4-Level Hierarchy
```
Account Testing
└─ Email Errors
   └─ Email Change Issues
      └─ Workspace Creation Errors
```
After embedding generation completes for feedback records, automatically assign the most similar topic based on vector similarity. This provides immediate topic classification without waiting for batch clustering.

- Add TenantID to EmbeddingJobArgs for tenant-isolated topic lookup
- Add FindMostSpecificTopic() to find the highest-level topic above the threshold
- Add AssignTopic() with idempotent behavior (preserves manual overrides)
- Extend EmbeddingWorker with topic assignment after embedding success
- Wire TopicMatcher and FeedbackAssigner dependencies in main.go

Failures during topic assignment are logged but don't fail the embedding job, ensuring graceful degradation when no topics exist yet. A sketch of the assignment step appears below.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
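A minimal sketch of that assignment step; the TopicMatcher and FeedbackAssigner names come from the commit message, but every signature here is an assumption:

```go
package worker

import (
	"context"
	"log/slog"
)

// Illustrative interfaces for the dependencies wired in main.go.
type TopicMatcher interface {
	FindMostSpecificTopic(ctx context.Context, tenantID string, embedding []float32) (*Match, error)
}

type FeedbackAssigner interface {
	AssignTopic(ctx context.Context, recordID, topicID string, confidence float64) error
}

type Match struct {
	TopicID    string
	Confidence float64
}

type EmbeddingWorker struct {
	matcher  TopicMatcher
	assigner FeedbackAssigner
}

// assignTopic runs after a successful embedding job. Errors are logged,
// never returned: assignment failures must not fail the embedding job.
func (w *EmbeddingWorker) assignTopic(ctx context.Context, recordID, tenantID string, embedding []float32) {
	match, err := w.matcher.FindMostSpecificTopic(ctx, tenantID, embedding)
	if err != nil {
		slog.Warn("topic matching failed", "record", recordID, "error", err)
		return
	}
	if match == nil {
		return // no topics exist yet; degrade gracefully
	}
	// AssignTopic is idempotent and preserves manual overrides.
	if err := w.assigner.AssignTopic(ctx, recordID, match.TopicID, match.Confidence); err != nil {
		slog.Warn("topic assignment failed", "record", recordID, "error", err)
	}
}
```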
- Add FeedbackCount field to Topic model for API response
- Reduce HDBSCAN min_cluster_size thresholds for smaller datasets
- Lower level_min_cluster_sizes for more granular subdivision
- Change default max_levels from 4 to 3 for <10k datasets
- Suppress expected UMAP n_jobs warning when random_state is set
- Remove unused strPtr function from ingest script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
This PR implements the Knowledge Records and Topics (taxonomy) feature for AI Enrichment in Formbricks Hub.
Changes
API Specification:
- Knowledge Records endpoints (/v1/knowledge-records) for contextual AI enrichment data
- Topics endpoints (/v1/topics) for hierarchical feedback classification
- openapi.yaml with full CRUD support, schema definitions, and comprehensive examples

Go Implementation:
- Database schema migration (sql/002_knowledge_and_topics.sql)
- Models with validation (internal/models/)
- Repository layer with CRUD operations (internal/repository/)
- Service layer with business logic (internal/service/)
- HTTP handlers with proper error handling (internal/api/handlers/)
- ConflictError type for 409 responses (internal/errors/)
- Route registration in main.go

Documentation:
- docs/enrichment.md detailing the architecture, design decisions, and roadmap

Key Features
- Hierarchical topics with auto-calculated levels and cascade delete
- Knowledge records with bulk delete by tenant_id
- Cross-tenant validation for parent topics
- Title uniqueness within parent+tenant scope

Test Plan
- make lint (0 issues)
- go build ./...
- Integration tests pass (make tests)

To run tests locally: