perf(ivf): tiled-GEMM k-means assignment (~1.6× IVF-PQ training) #101
Draft
konjoinfinity wants to merge 2 commits into
Replace the Lloyd assignment step's parallel-over-points/serial-over-k centroid scan (k dot loops per point) with a tiled [chunk,d]·[d,k] GEMM plus per-row argmax, in a new shared index/kmeans.rs::assign_nearest. Both metrics reduce to a per-row argmax: Cosine on the raw dot (unit-norm vectors), L2 on dot − ½‖c‖² (centroid-norm trick, for PQ-style non-unit sub-vectors). Wired into IvfPqIndex (cosine) and IvfFlat (L2) k-means; PQ codebook training already used a LUT/SIMD-across-K assignment and is left unchanged.

Measured on this x86_64 host (n=50k, d=128, n_lists=512, 25 iters): IVF-PQ train 2.47s → 1.52s (~1.6x) at unchanged recall. A first cut using one monolithic .dot() was 2x SLOWER (ndarray matrixmultiply is single-threaded and materialises the full [n,k]); tiling across rayon workers, mirroring search_batch_flat, recovers the parallelism and caps live memory at [chunk,k]. assign_nearest is validated byte-for-byte against a scalar oracle for both metrics, x86_64 + qemu-aarch64.

Also refreshes PLAN.md (stale 5.6.0/8.1.0 header → 5.24.0/8.17.0, plus sprint sections for the NEON PQ4 fast-scan and this work) and adds the CHANGELOG entries, including the konjo-gates CI toolchain note (PR #100) and confirmation the NEON path is green on real Apple Silicon.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01F3xg2gYTLXVue2TaEoC5So
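The reduction of both metrics to a per-row argmax can be illustrated with a small std-only sketch. For L2, ‖x − c‖² = ‖x‖² − 2(x·c − ½‖c‖²), and ‖x‖² is constant within a row, so the nearest centroid is the one maximizing x·c − ½‖c‖². Everything named here is an assumption for illustration: `assign_nearest_sketch`, the `Metric` enum, and the signature are hypothetical and not taken from `index/kmeans.rs`, and naive loops stand in for the ndarray GEMM and the rayon tiling the commit describes.

```rust
/// Which score the per-row argmax is taken over (hypothetical type).
#[derive(Clone, Copy)]
enum Metric {
    /// Unit-norm inputs: nearest centroid = largest raw dot product.
    Cosine,
    /// argmin ‖x − c‖² = argmax (x·c − ½‖c‖²).
    L2,
}

/// Hypothetical stand-in for `assign_nearest`; the signature is assumed.
/// `points` is row-major [n, d], `centroids` is row-major [k, d].
fn assign_nearest_sketch(
    points: &[f32],
    centroids: &[f32],
    d: usize,
    chunk: usize,
    metric: Metric,
) -> Vec<usize> {
    assert!(d > 0 && chunk > 0);
    assert!(points.len() % d == 0 && centroids.len() % d == 0);
    let k = centroids.len() / d;
    assert!(k > 0);

    // Per-centroid bias, computed once: 0 for cosine, -½‖c‖² for L2.
    let bias: Vec<f32> = centroids
        .chunks_exact(d)
        .map(|c| match metric {
            Metric::Cosine => 0.0,
            Metric::L2 => -0.5 * c.iter().map(|v| v * v).sum::<f32>(),
        })
        .collect();

    let mut labels = Vec::with_capacity(points.len() / d);
    // One [chunk, k] score tile is reused; the full [n, k] never exists.
    let mut scores = vec![0.0f32; chunk * k];

    // Tiles are independent, so each could be handed to its own worker.
    for tile in points.chunks(chunk * d) {
        let rows = tile.len() / d;
        // Naive [rows, d]·[d, k] product; a real GEMM call would go here.
        for (r, x) in tile.chunks_exact(d).enumerate() {
            for (j, c) in centroids.chunks_exact(d).enumerate() {
                let dot: f32 = x.iter().zip(c).map(|(a, b)| a * b).sum();
                scores[r * k + j] = dot + bias[j];
            }
        }
        // Per-row argmax over the k scores (first maximum wins on ties).
        for row in scores[..rows * k].chunks_exact(k) {
            let mut best = 0;
            for j in 1..k {
                if row[j] > row[best] {
                    best = j;
                }
            }
            labels.push(best);
        }
    }
    labels
}
```

As a usage example, with `d = 2`, centroids `(0,0)` and `(10,0)`, and points `(0,0)`, `(10,0)`, `(9,1)`, `(1,1)`, calling the sketch with `Metric::L2` and `chunk = 3` yields labels `[0, 1, 1, 0]`. The same input under `Metric::Cosine` would assign `(1,1)` to centroid 1, which shows why the raw dot is only a valid score for unit-norm vectors.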
Summary
Replaces the Lloyd assignment step in IVF / IVF-PQ k-means (a parallel-over-points, serial-over-`k` centroid scan: `k` separate dot loops per point, poor centroid-matrix reuse) with a tiled `[chunk, d]·[d, k]` GEMM + per-row argmax, in a new shared `rust/vectro_lib/src/index/kmeans.rs::assign_nearest`. This closes the CHANGELOG-flagged "~3.5× slower than FAISS IVF training at high `n_lists`" gap.

Both distance metrics reduce to a per-row argmax, so the GEMM output is consumed identically:

- Cosine (unit-norm vectors): argmax of the raw dot product.
- L2: argmax of `dot − ½‖c‖²` (standard centroid-norm trick).

Wired into `IvfPqIndex`'s coarse k-means (cosine) and `IvfFlat`'s (L2). PQ codebook training (`quant/pq.rs`) already uses a LUT/SIMD-across-K assignment and is left unchanged.

Also refreshes docs: `PLAN.md` was stale (`5.6.0 / 8.1.0` → `5.24.0 / 8.17.0`), with sprint sections added for the NEON PQ4 fast-scan and this work; `CHANGELOG.md` gains the GEMM entry, the `konjo-gates` CI toolchain note (#100), and confirmation the NEON path is green on real Apple Silicon.

Type of change
Checklist
- `cargo test` passes locally: full `vectro_lib` lib suite (238 passed), incl. new `index::kmeans` parity tests
- `cargo clippy -- -D warnings` clean; new `kmeans.rs` also clean under `-W clippy::pedantic` (the kiban/G1 flags)
- `cargo fmt --check` clean on all changed/new lines (pre-existing hand-formatted lines untouched)
- `pytest` / `ruff`: n/a (no Python changed)

Measurement (honest A/B on this x86_64 host)
IVF-PQ training, n=50k · d=128 · n_lists=512, 25 iters, best-of-3:
- baseline (`main`): 2.47 s
- monolithic `.dot()` GEMM: about 2× slower than baseline
- tiled GEMM: 1.52 s (~1.6× faster than baseline)

The monolithic GEMM was slower because ndarray's `matrixmultiply` is single-threaded (losing the old `par_iter`-over-points parallelism) and materialises the full `[n, k]` (~102 MB here) each iteration. Tiling across rayon workers, each running its own small GEMM and mirroring `IvfPqIndex::search_batch_flat`, recovers the parallelism and caps live memory at `[chunk, k]`.

Recall is unchanged (full index/recall suite green); `assign_nearest` is validated byte-for-byte against a scalar oracle for both metrics on x86_64 and under `qemu-aarch64`.

Notes
- `ndarray` (`rayon` feature) is already in the crate.
- No `unsafe` added.
- The `~1.6×` figure is from this CI-class x86_64 host; the headline `bench-darwin-arm64` number should be captured on the reference bench hardware.

🤖 Generated with Claude Code
https://claude.ai/code/session_01F3xg2gYTLXVue2TaEoC5So