Problem
Users often search with abbreviations while code uses full names:
- Search: auth → Code: authenticate, authentication
- Search: db → Code: database, DatabaseConnection
- Search: config → Code: configuration, settings
Current hybrid search (queries.py:hybrid_search) passes queries directly to BM25 and the embedding model without expansion, causing missed results.
Proposed Solution
Add a query expansion layer that handles developer abbreviations and synonyms dynamically, without hard-coded mappings.
Approaches to Consider
Option A: Learn from Codebase (Recommended)
Extract abbreviation patterns from symbol names in the indexed codebase:
AuthService → learns auth ↔ authentication
DBConnection → learns db ↔ database
ConfigManager → learns config ↔ configuration
Build a dynamic abbreviation map during indexing by detecting common prefixes and compound words.
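A minimal sketch of how such a map could be built, assuming the indexer can supply a flat list of symbol names. The function names (`split_symbol`, `build_abbreviation_map`) and the length thresholds are hypothetical and not part of the existing code. It implements only the prefix heuristic: a token counts as an abbreviation when it is a strict prefix of a noticeably longer token seen elsewhere in the codebase.

```python
import re
from collections import defaultdict

# Splits camelCase, PascalCase, snake_case and acronym runs (e.g. "DBConnection").
_TOKEN_RE = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_symbol(name: str) -> list[str]:
    """Break a symbol name into lowercase word tokens."""
    return [tok.lower() for tok in _TOKEN_RE.findall(name)]


def build_abbreviation_map(
    symbols: list[str],
    min_abbrev_len: int = 3,   # assumption: shorter prefixes are too ambiguous
    min_extra_chars: int = 3,  # assumption: the long form must be clearly longer
) -> dict[str, list[str]]:
    """Map each short token to the longer tokens it is a prefix of."""
    vocab = {tok for s in symbols for tok in split_symbol(s) if tok.isalpha()}
    mapping: dict[str, set[str]] = defaultdict(set)
    # Quadratic in vocabulary size; acceptable for a sketch, a trie would scale better.
    for short in vocab:
        if len(short) < min_abbrev_len:
            continue
        for long in vocab:
            if len(long) - len(short) >= min_extra_chars and long.startswith(short):
                mapping[short].add(long)
    return {short: sorted(longs) for short, longs in mapping.items()}


symbols = [
    "AuthService", "authenticate_user", "AuthenticationError",
    "ConfigManager", "load_configuration", "DBConnection", "database_url",
]
abbreviations = build_abbreviation_map(symbols)
print(abbreviations.get("auth"))    # ['authenticate', 'authentication']
print(abbreviations.get("config"))  # ['configuration']
```

Note that a pure prefix rule does not recover db ↔ database, since "db" is not a prefix of "database". Pairs like that would need an additional heuristic (for example, consonant-skeleton or subsequence matching), which is left out here because it also raises the false-positive rate.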
Option B: Embedding-Based Expansion
Use the existing embedding model to find semantically similar terms:
- Embed the query
- Find similar words/phrases from the indexed symbol names
- Expand query with high-similarity matches
Leverages the existing jina-code-embeddings-0.5b model already in use.
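The steps above could look roughly like the following sketch. The `embed` callable is a placeholder for whatever wrapper the project has around its embedding model; its signature, the `top_k` and `min_similarity` parameters, and the tiny hand-written vectors in the demo are all assumptions made for illustration.

```python
import math
from typing import Callable, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_with_embeddings(
    query: str,
    candidates: Sequence[str],
    embed: Embedder,
    top_k: int = 3,
    min_similarity: float = 0.5,  # assumption: would need tuning per model
) -> list[str]:
    """Return the query followed by its most similar candidate terms."""
    query_vec = embed(query)
    scored = [
        (cosine(query_vec, embed(term)), term)
        for term in candidates
        if term != query
    ]
    kept = [(score, term) for score, term in scored if score >= min_similarity]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by name
    return [query] + [term for _, term in kept[:top_k]]


# Placeholder vectors standing in for real model output.
fake_vectors = {
    "auth": [1.0, 0.0],
    "authenticate": [0.9, 0.1],
    "authentication": [0.8, 0.2],
    "render_template": [0.0, 1.0],
}
print(expand_with_embeddings("auth", list(fake_vectors), fake_vectors.__getitem__))
# ['auth', 'authenticate', 'authentication']
```

In a real integration the candidate vectors would presumably be computed once at index time rather than on every query, which is where most of the latency concern in this option comes from.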
Option C: NLP Stemming/Lemmatization
Use a lightweight NLP library (e.g., nltk, spaCy, or porter2stemmer) to handle:
- Plural/singular: users ↔ user
- Verb forms: authenticate ↔ authenticating ↔ authenticated
- Common English morphological variants
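To illustrate the idea without pulling in a dependency, the sketch below uses a deliberately naive suffix stripper as a stand-in for a real stemmer. It is much cruder than the Porter algorithm and is only meant to show the mechanism: group indexed terms by stem, then expand each query token to the terms that share its stem. All function names here are hypothetical.

```python
from collections import defaultdict


def naive_stem(word: str) -> str:
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s", "e"):
        if suffix == "s" and w.endswith("ss"):
            continue  # keep words like "class" intact
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def build_stem_index(terms: list[str]) -> dict[str, list[str]]:
    """Group indexed terms by their stem."""
    groups: dict[str, set[str]] = defaultdict(set)
    for term in terms:
        groups[naive_stem(term)].add(term.lower())
    return {stem: sorted(members) for stem, members in groups.items()}


def expand_by_stem(query: str, stem_index: dict[str, list[str]]) -> list[str]:
    """Return query tokens plus indexed terms sharing a stem, in stable order."""
    expanded: list[str] = []
    for token in query.lower().split():
        for term in [token, *stem_index.get(naive_stem(token), [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded


index = build_stem_index(
    ["user", "users", "authenticate", "authenticating", "authenticated"]
)
print(expand_by_stem("users", index))           # ['users', 'user']
print(expand_by_stem("authenticating", index))  # ['authenticating', 'authenticate', 'authenticated']
```

Swapping `naive_stem` for a library stemmer would be a one-line change, since the rest of the code only requires that related word forms map to the same string.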
Implementation Location
queries.py:99-194 (hybrid_search function) - add expansion step before BM25/vector search.
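As a rough illustration of what that expansion step might do, the sketch below turns a raw query into an expanded term string using a precomputed abbreviation map. The `expand_query` helper and its parameters are hypothetical; how `hybrid_search` would consume the result (for instance, expanded terms for BM25 and the original query for the vector search) is not specified in this issue and is left open.

```python
def expand_query(
    query: str,
    abbreviation_map: dict[str, list[str]],
    max_expansions_per_term: int = 3,  # assumption: cap to limit query drift
) -> str:
    """Append known long forms after each query token, without duplicates."""
    seen: set[str] = set()
    terms: list[str] = []
    for token in query.split():
        long_forms = abbreviation_map.get(token.lower(), [])
        for term in [token, *long_forms[:max_expansions_per_term]]:
            if term.lower() not in seen:
                seen.add(term.lower())
                terms.append(term)
    return " ".join(terms)


abbreviation_map = {"auth": ["authenticate", "authentication"]}
print(expand_query("auth middleware", abbreviation_map))
# auth authenticate authentication middleware
```

Keeping the original token first means an exact match on the abbreviation itself is still possible, and queries with no known abbreviations pass through unchanged.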
Trade-offs
| Approach | Pro | Con |
| --- | --- | --- |
| Learn from codebase | Domain-specific, no maintenance | Requires re-scan on significant changes |
| Embedding-based | Reuses existing model | May add latency, less precise |
| Stemming only | Fast, deterministic | Misses domain-specific abbreviations |
Acceptance Criteria