Skip to content

fix: PGVector.add_embeddings silently truncates on length mismatch#310

Open
Humphrey (HumphreySun98) wants to merge 1 commit into
langchain-ai:mainfrom
HumphreySun98:fix/v1-pgvector-add-embeddings-silent-truncation
Open

fix: PGVector.add_embeddings silently truncates on length mismatch#310
Humphrey (HumphreySun98) wants to merge 1 commit into
langchain-ai:mainfrom
HumphreySun98:fix/v1-pgvector-add-embeddings-silent-truncation

Conversation

@HumphreySun98

Copy link
Copy Markdown

Description

`PGVector.add_embeddings` (and `aadd_embeddings`) build the SQL insert payload via:

```python
data = [
{...}
for text, metadata, embedding, id in zip(texts, metadatas, embeddings, ids_)
]
return ids_
```

Plain `zip` truncates to the shortest argument, then the method returns the full `ids_` list regardless. So when an upstream embedder bug returns fewer embeddings than texts — e.g. langchain-ai/langchain-google#1704's `gemini-embedding-2` regression — the caller sees N IDs but only M rows actually land in Postgres, with no exception. This silent data loss is exactly the symptom reported in #300.

This PR validates the input lengths up-front via a new `_validate_lengths_match` helper that raises a clear `ValueError` before any DB session is opened, and adds `strict=True` to the trailing `zip` calls as a backstop. Both sync and async paths are covered.

Relevant issues

Fixes #300

Changes

  • `langchain_postgres/vectorstores.py`:
    • Add module-level `_validate_lengths_match(texts, embeddings, metadatas, ids)` raising `ValueError` with the actual lengths in the message.
    • Call it at the top of `add_embeddings` and `aadd_embeddings` (after the sync/async early-init blocks, before any DB session opens).
    • Add `strict=True` to the trailing `zip` in both methods so any future internal mismatch turns into an exception instead of a silent truncation.
  • `tests/unit_tests/v1/test_pgvector_length_validation.py`:
    • Five focused tests that bypass `PGVector.init` (no Postgres needed):
      • too few embeddings → `ValueError`
      • mismatched metadatas length → `ValueError`
      • mismatched ids length → `ValueError`
      • matching lengths → validation passes through (test asserts it reaches the DB-session line)
      • async `aadd_embeddings` analog

Testing

```
$ python -m pytest tests/unit_tests/v1/test_pgvector_length_validation.py -v
5 passed in 0.64s
```

Reverting only the `_validate_lengths_match` call and `strict=True` flag while keeping the new tests makes the mismatch tests fail back to silent truncation, confirming the regression coverage pins the buggy behavior. `ruff check` and `ruff format --check` pass.

Note

The same anti-pattern exists in `langchain_postgres/v2/async_vectorstore.py:297` (`for id, content, embedding, metadata in zip(ids, texts, embeddings, metadatas):`). v1 is the path explicitly called out in #300 and the path users hit through the deprecated `PGVector` import; happy to fix the v2 analog in a follow-up if you'd like it bundled differently.

Disclaimer: this PR was prepared with the assistance of an AI agent (Claude Code). All code and test changes were reviewed by the author before submission.

`PGVector.add_embeddings` and `aadd_embeddings` built the SQL insert
payload via `zip(texts, metadatas, embeddings, ids_)`, which silently
truncates to the shortest argument. They then returned the full `ids_`
list regardless of how many rows were actually written, so an upstream
embedder bug that returned fewer embeddings than texts saw N IDs
returned to the caller but only M rows persisted, with no exception
raised. As called out in langchain-ai#300 (and observed downstream of
langchain-google-genai's `gemini-embedding-2` regression in
langchain-ai/langchain-google#1704).

This change validates lengths up-front via `_validate_lengths_match` —
raising a clear `ValueError` before any DB session is opened — and
adds `strict=True` to the trailing `zip` calls as a backstop. Adds
unit-test coverage that bypasses `PGVector.__init__` (no Postgres
needed) for both sync and async paths.

Fixes langchain-ai#300
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: PGVector.add_documents/add_embeddings silently truncates documents via zip and falsely reports all IDs as inserted

1 participant