perf(ivf): tiled-GEMM k-means assignment (~1.6× IVF-PQ training)#101

Draft
konjoinfinity wants to merge 2 commits into main from claude/continue-optimizations-hbl7wb

@konjoinfinity

Summary

Replaces the Lloyd assignment step in IVF / IVF-PQ k-means with a tiled [chunk, d]·[d, k] GEMM plus a per-row argmax, in a new shared rust/vectro_lib/src/index/kmeans.rs::assign_nearest. The old step was a parallel-over-points, serial-over-k centroid scan (k separate dot loops per point, with poor centroid-matrix reuse). This targets the CHANGELOG-flagged "~3.5× slower than FAISS IVF training at high n_lists" gap.

Both distance metrics reduce to a per-row argmax, so the GEMM output is consumed identically:

  • Cosine (unit-norm vectors) → argmax of the raw dot product.
  • L2 (non-unit PQ-style sub-vectors) → argmax of dot − ½‖c‖² (standard centroid-norm trick).
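Both reductions follow from expanding the squared distance. A short derivation, where $x$ is a point and $c_j$ the $j$-th centroid (symbols introduced here for illustration, not taken from the code):

```latex
\arg\min_j \lVert x - c_j \rVert^2
  = \arg\min_j \left( \lVert x \rVert^2 - 2\, x \cdot c_j + \lVert c_j \rVert^2 \right)
  = \arg\max_j \left( x \cdot c_j - \tfrac{1}{2} \lVert c_j \rVert^2 \right)
```

The $\lVert x \rVert^2$ term is constant within a row, so it drops out of the argmin. For the cosine case, assuming both $x$ and $c_j$ are unit-norm, $\cos(x, c_j) = x \cdot c_j$ and the raw dot product is already the score.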

Wired into IvfPqIndex's coarse k-means (cosine) and IvfFlat's (L2). PQ codebook training (quant/pq.rs) already uses a LUT/SIMD-across-K assignment and is left unchanged.

Also refreshes docs: PLAN.md had a stale version header (5.6.0 / 8.1.0 → 5.24.0 / 8.17.0) and gains sprint sections for the NEON PQ4 fast-scan and this work; CHANGELOG.md gains the GEMM entry, the konjo-gates CI toolchain note (#100), and confirmation that the NEON path is green on real Apple Silicon.

Type of change

  • Performance improvement

Checklist

  • cargo test passes locally — full vectro_lib lib suite (238 passed), incl. new index::kmeans parity tests
  • cargo clippy -- -D warnings clean; new kmeans.rs also clean under -W clippy::pedantic (the kiban/G1 flags)
  • cargo fmt --check clean on all changed/new lines (pre-existing hand-formatted lines untouched)
  • pytest / ruff — n/a (no Python changed)
  • No hardcoded absolute paths; no embedding/model files staged
  • Changes scoped to one concern (k-means assignment + the doc refresh)
  • Performance-sensitive change includes a before/after measurement (below)

Measurement (honest A/B on this x86_64 host)

IVF-PQ training, n=50k · d=128 · n_lists=512, 25 iters, best-of-3:

| Variant | Time |
|---|---|
| old scalar assignment (main) | 2.47s |
| tiled GEMM (this PR) | 1.52s (~1.6×) |
| naive monolithic single-.dot() GEMM | 5.2s (2× slower, rejected) |

The monolithic GEMM was slower because ndarray's matrixmultiply is single-threaded (losing the old par_iter-over-points parallelism) and materialises the full [n, k] (~102 MB here) each iteration. Tiling across rayon workers — each running its own small GEMM, mirroring IvfPqIndex::search_batch_flat — recovers the parallelism and caps live memory at [chunk, k]. Recall is unchanged (full index/recall suite green); assign_nearest is validated byte-for-byte against a scalar oracle for both metrics on x86_64 and under qemu-aarch64.
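The tiling idea can be sketched in std-only Rust. This is an illustration under stated assumptions, not the code in kmeans.rs: `assign_nearest_sketch` and `Metric` are hypothetical names, `std::thread::scope` stands in for the rayon workers, and a plain nested loop stands in for the per-tile GEMM.

```rust
use std::thread;

/// Scoring rule for a point against a centroid (illustrative names).
#[derive(Clone, Copy)]
enum Metric {
    Cosine,
    L2,
}

/// Plain dot product over two equal-length slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Assign each row of `points` (row-major, n * d) to its best centroid
/// (row-major, k * d), handling `chunk` rows per tile.
fn assign_nearest_sketch(
    points: &[f32],
    centroids: &[f32],
    d: usize,
    chunk: usize,
    metric: Metric,
) -> Vec<usize> {
    assert!(d > 0 && chunk > 0);
    assert!(points.len() % d == 0 && centroids.len() % d == 0);
    let k = centroids.len() / d;
    assert!(k > 0);

    // Per-centroid bias: 0 for cosine, -0.5 * ||c||^2 for L2.
    let bias: Vec<f32> = centroids
        .chunks(d)
        .map(|c| match metric {
            Metric::Cosine => 0.0,
            Metric::L2 => -0.5 * dot(c, c),
        })
        .collect();

    let mut labels = vec![0usize; points.len() / d];
    // One scoped thread per tile keeps the sketch dependency-free;
    // a real implementation would hand tiles to a worker pool instead.
    thread::scope(|s| {
        for (tile, out) in points.chunks(chunk * d).zip(labels.chunks_mut(chunk)) {
            let bias = &bias;
            s.spawn(move || {
                // `scores` plays the role of the [rows, k] GEMM output for this tile.
                let rows = tile.len() / d;
                let mut scores = vec![0.0f32; rows * k];
                for (r, x) in tile.chunks(d).enumerate() {
                    for (j, c) in centroids.chunks(d).enumerate() {
                        scores[r * k + j] = dot(x, c) + bias[j];
                    }
                }
                // Per-row argmax; the first maximum wins on ties.
                for (r, label) in out.iter_mut().enumerate() {
                    let row = &scores[r * k..(r + 1) * k];
                    let mut best = 0;
                    for j in 1..k {
                        if row[j] > row[best] {
                            best = j;
                        }
                    }
                    *label = best;
                }
            });
        }
    });
    labels
}
```

Only one `[rows, k]` score buffer is live per worker, which is where the memory cap comes from. At the sizes measured above, and assuming f32 scores, the full [n, k] matrix is 50,000 × 512 × 4 B ≈ 102.4 MB, whereas a tile of 1,024 rows (a hypothetical chunk size) is 1,024 × 512 × 4 B ≈ 2.1 MB.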

Notes

  • No new dependency — reuses ndarray (rayon feature) already in the crate.
  • No unsafe added.
  • The ~1.6× is on this CI-class x86_64 host; the headline bench-darwin-arm64 number should be captured on the reference bench hardware.

🤖 Generated with Claude Code

https://claude.ai/code/session_01F3xg2gYTLXVue2TaEoC5So



claude and others added 2 commits July 1, 2026 12:21
Replace the Lloyd assignment step's parallel-over-points/serial-over-k
centroid scan (k dot loops per point) with a tiled [chunk,d]·[d,k] GEMM
plus per-row argmax, in a new shared index/kmeans.rs::assign_nearest.
Both metrics reduce to a per-row argmax — Cosine on the raw dot (unit-norm
vectors), L2 on dot − ½‖c‖² (centroid-norm trick, for PQ-style non-unit
sub-vectors). Wired into IvfPqIndex (cosine) and IvfFlat (L2) k-means; PQ
codebook training already used a LUT/SIMD-across-K assignment and is left
unchanged.

Measured on this x86_64 host (n=50k, d=128, n_lists=512, 25 iters):
IVF-PQ train 2.47s → 1.52s (~1.6x) at unchanged recall. A first cut using
one monolithic .dot() was 2x SLOWER (ndarray matrixmultiply is
single-threaded and materialises the full [n,k]); tiling across rayon
workers — mirroring search_batch_flat — recovers the parallelism and caps
live memory at [chunk,k]. assign_nearest is validated byte-for-byte against
a scalar oracle for both metrics, x86_64 + qemu-aarch64.

Also refreshes PLAN.md (stale 5.6.0/8.1.0 header → 5.24.0/8.17.0, plus
sprint sections for the NEON PQ4 fast-scan and this work) and adds the
CHANGELOG entries, including the konjo-gates CI toolchain note (PR #100)
and confirmation the NEON path is green on real Apple Silicon.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01F3xg2gYTLXVue2TaEoC5So