perf(ivf): tiled-GEMM k-means assignment (~1.6× IVF-PQ training) by konjoinfinity · Pull Request #101 · konjoai/vectro

konjoinfinity · 2026-07-01T12:22:08Z

Summary

Replaces the Lloyd assignment step in IVF / IVF-PQ k-means with a tiled [chunk, d]·[d, k] GEMM followed by a per-row argmax, implemented in a new shared rust/vectro_lib/src/index/kmeans.rs::assign_nearest. The old step was parallel over points but serial over k: each point ran k separate dot loops, with poor reuse of the centroid matrix. This closes the CHANGELOG-flagged "~3.5× slower than FAISS IVF training at high n_lists" gap.

Both distance metrics reduce to a per-row argmax, so the GEMM output is consumed identically:

Cosine (unit-norm vectors) → argmax of the raw dot product.
L2 (non-unit PQ-style sub-vectors) → argmax of dot − ½‖c‖² (standard centroid-norm trick).
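
For reference, the centroid-norm trick follows from expanding the squared distance. The symbols here are illustrative: x is a point and c_j is the j-th centroid. The ‖x‖² term does not depend on j, so it drops out of the argmin:

```latex
\begin{aligned}
\arg\min_j \lVert x - c_j \rVert^2
  &= \arg\min_j \left( \lVert x \rVert^2 - 2\, x \cdot c_j + \lVert c_j \rVert^2 \right) \\
  &= \arg\max_j \left( x \cdot c_j - \tfrac{1}{2} \lVert c_j \rVert^2 \right)
\end{aligned}
```

In the cosine case every c_j has unit norm, so the ½‖c_j‖² term is the same for all j and the argmax of the raw dot product x · c_j gives the same assignment.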

Wired into IvfPqIndex's coarse k-means (cosine) and IvfFlat's (L2). PQ codebook training (quant/pq.rs) already uses a LUT/SIMD-across-K assignment and is left unchanged.

Also refreshes the docs. PLAN.md had a stale version header (5.6.0 / 8.1.0, now 5.24.0 / 8.17.0) and gains sprint sections for the NEON PQ4 fast-scan and this work. CHANGELOG.md gains the GEMM entry, the konjo-gates CI toolchain note (#100), and confirmation that the NEON path is green on real Apple Silicon.

Type of change

Performance improvement

Checklist

cargo test passes locally — full vectro_lib lib suite (238 passed), incl. new index::kmeans parity tests
cargo clippy -- -D warnings clean; new kmeans.rs also clean under -W clippy::pedantic (the kiban/G1 flags)
cargo fmt --check clean on all changed/new lines (pre-existing hand-formatted lines untouched)
pytest / ruff — n/a (no Python changed)
No hardcoded absolute paths; no embedding/model files staged
Changes scoped to one concern (k-means assignment + the doc refresh)
Performance-sensitive change includes a before/after measurement (below)

Measurement (honest A/B on this x86_64 host)

IVF-PQ training, n=50k · d=128 · n_lists=512, 25 iters, best-of-3:

| Variant | Time |
| --- | --- |
| old scalar assignment (`main`) | 2.47s |
| tiled GEMM (this PR) | 1.52s (~1.6×) |
| naive monolithic single-`.dot()` GEMM | 5.2s (2× slower; rejected) |

The monolithic GEMM was slower for two reasons: ndarray's matrixmultiply is single-threaded, so it loses the old par_iter-over-points parallelism, and it materialises the full [n, k] score matrix (~102 MB here: 50k × 512 × 4 bytes) each iteration. Tiling across rayon workers, each running its own small GEMM (mirroring IvfPqIndex::search_batch_flat), recovers the parallelism and caps live memory at [chunk, k]. Recall is unchanged (full index/recall suite green); assign_nearest is validated byte-for-byte against a scalar oracle for both metrics on x86_64 and under qemu-aarch64.
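
The tiling idea and the scalar-oracle check can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in kmeans.rs: std::thread::scope stands in for rayon, a plain nested loop stands in for ndarray's GEMM, and the names Metric, assign_scalar, assign_tiled and the chunk parameter are hypothetical. Points and centroids are assumed to be row-major f32 slices of width d.

```rust
use std::thread;

/// Hypothetical metric selector (not the crate's actual type).
#[derive(Clone, Copy)]
enum Metric {
    Cosine, // inputs assumed unit-norm: score = x·c
    L2,     // score = x·c - 0.5 * ||c||^2, same argmax as the nearest centroid in L2
}

/// Scalar oracle: every point scans all k centroids one at a time.
fn assign_scalar(points: &[f32], centroids: &[f32], d: usize, metric: Metric) -> Vec<usize> {
    points
        .chunks_exact(d)
        .map(|x| {
            let mut best = (0usize, f32::NEG_INFINITY);
            for (j, c) in centroids.chunks_exact(d).enumerate() {
                let dot: f32 = x.iter().zip(c).map(|(a, b)| a * b).sum();
                let score = match metric {
                    Metric::Cosine => dot,
                    Metric::L2 => dot - 0.5 * c.iter().map(|v| v * v).sum::<f32>(),
                };
                if score > best.1 {
                    best = (j, score);
                }
            }
            best.0
        })
        .collect()
}

/// Tiled variant: each worker computes a [rows, k] score tile, then a per-row argmax.
/// One thread per tile keeps the sketch short; a real implementation would use a pool.
fn assign_tiled(points: &[f32], centroids: &[f32], d: usize, metric: Metric, chunk: usize) -> Vec<usize> {
    assert!(d > 0 && chunk > 0 && points.len() % d == 0 && centroids.len() % d == 0);
    let (n, k) = (points.len() / d, centroids.len() / d);
    assert!(k > 0);
    // Per-centroid bias, computed once and shared by all tiles.
    let bias: Vec<f32> = centroids
        .chunks_exact(d)
        .map(|c| match metric {
            Metric::Cosine => 0.0,
            Metric::L2 => 0.5 * c.iter().map(|v| v * v).sum::<f32>(),
        })
        .collect();
    let mut out = vec![0usize; n];
    thread::scope(|s| {
        for (tile_pts, tile_out) in points.chunks(chunk * d).zip(out.chunks_mut(chunk)) {
            let bias = &bias;
            s.spawn(move || {
                // Live memory per worker is bounded by rows * k scores.
                let mut scores = vec![0.0f32; tile_out.len() * k];
                for (i, x) in tile_pts.chunks_exact(d).enumerate() {
                    for (j, c) in centroids.chunks_exact(d).enumerate() {
                        let dot: f32 = x.iter().zip(c).map(|(a, b)| a * b).sum();
                        scores[i * k + j] = dot - bias[j];
                    }
                }
                for (i, row) in scores.chunks_exact(k).enumerate() {
                    let mut best = 0;
                    for (j, &v) in row.iter().enumerate() {
                        if v > row[best] {
                            best = j; // strict '>' keeps the first maximum, like the oracle
                        }
                    }
                    tile_out[i] = best;
                }
            });
        }
    });
    out
}
```

Calling assign_tiled(&points, &centroids, d, Metric::L2, chunk) and comparing the result with assign_scalar on the same inputs is the kind of parity check the description refers to. The sketch keeps the same summation order in both paths so the comparison is exact; a real GEMM kernel may reorder floating-point sums, so how the actual tests achieve byte-for-byte agreement is not something this example shows.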

Notes

No new dependency — reuses ndarray (rayon feature) already in the crate.
No unsafe added.
The ~1.6× is on this CI-class x86_64 host; the headline bench-darwin-arm64 number should be captured on the reference bench hardware.

🤖 Generated with Claude Code

https://claude.ai/code/session_01F3xg2gYTLXVue2TaEoC5So


Replace the Lloyd assignment step's parallel-over-points/serial-over-k centroid scan (k dot loops per point) with a tiled [chunk,d]·[d,k] GEMM plus per-row argmax, in a new shared index/kmeans.rs::assign_nearest. Both metrics reduce to a per-row argmax: Cosine on the raw dot (unit-norm vectors), L2 on dot − ½‖c‖² (centroid-norm trick, for PQ-style non-unit sub-vectors). Wired into IvfPqIndex (cosine) and IvfFlat (L2) k-means; PQ codebook training already used a LUT/SIMD-across-K assignment and is left unchanged.

Measured on this x86_64 host (n=50k, d=128, n_lists=512, 25 iters): IVF-PQ train 2.47s → 1.52s (~1.6x) at unchanged recall. A first cut using one monolithic .dot() was 2x SLOWER (ndarray matrixmultiply is single-threaded and materialises the full [n,k]); tiling across rayon workers, mirroring search_batch_flat, recovers the parallelism and caps live memory at [chunk,k]. assign_nearest is validated byte-for-byte against a scalar oracle for both metrics, x86_64 + qemu-aarch64.

Also refreshes PLAN.md (stale 5.6.0/8.1.0 header → 5.24.0/8.17.0, plus sprint sections for the NEON PQ4 fast-scan and this work) and adds the CHANGELOG entries, including the konjo-gates CI toolchain note (PR #100) and confirmation the NEON path is green on real Apple Silicon.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01F3xg2gYTLXVue2TaEoC5So

claude and others added 2 commits July 1, 2026 12:21

Merge branch 'main' into claude/continue-optimizations-hbl7wb

304a021

konjoinfinity mentioned this pull request Jul 1, 2026

ci(konjo-gates): bump KIBAN_REF to v1.1.2 #103

Draft

11 tasks

konjoinfinity wants to merge 2 commits into main from claude/continue-optimizations-hbl7wb
