fix: `PGVector.add_embeddings` silently truncates on length mismatch by HumphreySun98 · Pull Request #310 · langchain-ai/langchain-postgres

Humphrey (HumphreySun98) · 2026-06-04T20:59:29Z

Description

`PGVector.add_embeddings` (and `aadd_embeddings`) build the SQL insert payload via:

```python
data = [
{...}
for text, metadata, embedding, id in zip(texts, metadatas, embeddings, ids_)
]
return ids_
```

Plain `zip` truncates to the shortest argument, then the method returns the full `ids_` list regardless. So when an upstream embedder bug returns fewer embeddings than texts — e.g. langchain-ai/langchain-google#1704's `gemini-embedding-2` regression — the caller sees N IDs but only M rows actually land in Postgres, with no exception. This silent data loss is exactly the symptom reported in #300.

This PR validates the input lengths up-front via a new `_validate_lengths_match` helper that raises a clear `ValueError` before any DB session is opened, and adds `strict=True` to the trailing `zip` calls as a backstop. Both sync and async paths are covered.

Relevant issues

Fixes #300

Changes

`langchain_postgres/vectorstores.py`:
- Add module-level `_validate_lengths_match(texts, embeddings, metadatas, ids)` raising `ValueError` with the actual lengths in the message.
- Call it at the top of `add_embeddings` and `aadd_embeddings` (after the sync/async early-init blocks, before any DB session opens).
- Add `strict=True` to the trailing `zip` in both methods so any future internal mismatch turns into an exception instead of a silent truncation.
`tests/unit_tests/v1/test_pgvector_length_validation.py`:
- Five focused tests that bypass `PGVector.init` (no Postgres needed):
  - too few embeddings → `ValueError`
  - mismatched metadatas length → `ValueError`
  - mismatched ids length → `ValueError`
  - matching lengths → validation passes through (test asserts it reaches the DB-session line)
  - async `aadd_embeddings` analog

Testing

```
$ python -m pytest tests/unit_tests/v1/test_pgvector_length_validation.py -v
5 passed in 0.64s
```

Reverting only the `_validate_lengths_match` call and `strict=True` flag while keeping the new tests makes the mismatch tests fail back to silent truncation, confirming the regression coverage pins the buggy behavior. `ruff check` and `ruff format --check` pass.

Note

The same anti-pattern exists in `langchain_postgres/v2/async_vectorstore.py:297` (`for id, content, embedding, metadata in zip(ids, texts, embeddings, metadatas):`). v1 is the path explicitly called out in #300 and the path users hit through the deprecated `PGVector` import; happy to fix the v2 analog in a follow-up if you'd like it bundled differently.

Disclaimer: this PR was prepared with the assistance of an AI agent (Claude Code). All code and test changes were reviewed by the author before submission.

`PGVector.add_embeddings` and `aadd_embeddings` built the SQL insert payload via `zip(texts, metadatas, embeddings, ids_)`, which silently truncates to the shortest argument. They then returned the full `ids_` list regardless of how many rows were actually written, so an upstream embedder bug that returned fewer embeddings than texts saw N IDs returned to the caller but only M rows persisted, with no exception raised. As called out in langchain-ai#300 (and observed downstream of langchain-google-genai's `gemini-embedding-2` regression in langchain-ai/langchain-google#1704). This change validates lengths up-front via `_validate_lengths_match` — raising a clear `ValueError` before any DB session is opened — and adds `strict=True` to the trailing `zip` calls as a backstop. Adds unit-test coverage that bypasses `PGVector.__init__` (no Postgres needed) for both sync and async paths. Fixes langchain-ai#300

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: `PGVector.add_embeddings` silently truncates on length mismatch#310

fix: `PGVector.add_embeddings` silently truncates on length mismatch#310
Humphrey (HumphreySun98) wants to merge 1 commit into
langchain-ai:mainfrom
HumphreySun98:fix/v1-pgvector-add-embeddings-silent-truncation

Humphrey (HumphreySun98) commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Humphrey (HumphreySun98) commented Jun 4, 2026

Description

Relevant issues

Changes

Testing

Note

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant