Curated subset of #90 + K-side norm accounting fix by TheTom · Pull Request #91 · TheTom/turboquant_plus

TheTom · 2026-05-09T15:39:13Z

Selectively merging from @brosequist's #90 — keeping the actual fixes, deferring API surface, plus an additional K-side fix that #90 missed.

Thanks @brosequist for the bundle — review surface was much easier this way. Re-bundling further per the curation below.

What's included

commit	author	what
`b75813b`	@brosequist	fix: V-norm in `memory_stats`, SeedSequence PRNG, `TurboQuantMSE.compressed_size_bits`
`1074625`	@brosequist	test: QJL regression-guard in `test_turboquant_improves_over_polarquant`
`f23570e`	@brosequist	test: correctness + round-trip tests for fast rotation functions
`3e37572`	@brosequist	chore: ruff config in pyproject.toml
`0ca5bcc`	@brosequist	ci: drop lint workflow (kept ruff config only)
`8afc4bf`	@TheTom	fix: K-side norm accounting — count vector_norm AND residual_norm

Brett's first commit was split — credit preserved

The original 0fd5de9 bundle (5 changes) was cherry-picked with the streaming + serialization API deferred. What's kept:

✅ V-norm fix in KVCacheCompressor.memory_stats()
✅ TurboQuantMSE.compressed_size_bits() (was missing; TurboQuant already had it)
✅ SeedSequence.spawn(2) replacing seed + 1000 magic offset

What's deferred (no caller yet, want to design for the production integration):

KVCacheCompressor.compress_token() / get_compressed_cache() streaming API
CompressedVector.to_bytes() / from_bytes() binary serialization
CompressedKVCache.save() / load() npz serialization

The split commit retains @brosequist as author.

What's added

8afc4bf is a parallel fix to the V-norm fix in b75813b. TurboQuant.CompressedVector stores two float32 norms (vector_norms = ||x||_2 and residual_norms = ||residual||_2), but TurboQuant.compressed_size_bits and KVCacheCompressor.memory_stats only counted one. V uses TurboQuantMSE (single norm — 32 is correct). K uses full TurboQuant (two norms — 64 is correct).

Numerical effect (verified live):

TurboQuant(d=128, b=3).compression_ratio(): 4.92× → 4.57× (true)
KVCacheCompressor(d=128, k=3, v=3).memory_stats(...)['compression_ratio']: ~2.46× → 2.37× (true)

No quantization-output changes — accounting only.

What's not included from #90

feat: add calibrate() to OutlierTurboQuant — OutlierTurboQuant is a deprecated path per docs/turboquant-plus-experiments.md ("Outlier channeling doesn't work… kurtosis stays 8-50… WHT rotation gets it to 2.9"). Calibrate code is well-written, just on a dead module.
docs: HIP/AMD NaN warning — root-cause story ("large K norms → NaN") contradicts docs/papers/asymmetric-kv-compression.md:218, which finds extreme K norms compress better (more Gaussian after normalization). Real cause is HIP-kernel-specific. Will revisit after kernel triage.

Consistency with `docs/papers/why-mse-fails-for-kv-quantization.md`

The new MSE paper argues MSE is a broken proxy for K cache quantization in deployment because attention is non-linear and sparse. Brett's QJL regression-guard test (1074625) measures inner-product distortion on synthetic Gaussian pairs (d=256) — the linear-operator regime where the new paper explicitly says IP/MSE does proxy quality (alongside RaBitQ-style top-k IP search). The test is a regression-guard for IP distortion only; the production decision to drop QJL is justified separately by docs/papers/turbo4-resurrection.md's PPL ablation. No conflict.

Test plan

pytest tests/ refract/tests/ — 982 passed, 1 skipped (7 fewer than Bundled fixes & tests: V-norm accounting, OutlierTurboQuant.calibrate, rotation tests, ruff CI, HIP/AMD NaN docs #90 baseline because streaming/serialization tests deferred with their code)
Smoke-tested TurboQuant.compression_ratio() and KVCacheCompressor.memory_stats() numerics against expected values

🤖 Generated with Claude Code

@brosequist

Subset of @brosequist's #90 commit 0fd5de9 — keeping the actual fixes, deferring the streaming + serialization API surface until a production caller exists. Included: - KVCacheCompressor.memory_stats() was omitting the float32 norm stored per V vector, inflating reported compression ratio. Adds v_bits_total += n_vectors * 32. - TurboQuantMSE.compressed_size_bits() — was missing (TurboQuant already had it). - Replaces seed + 1000 magic offset with np.random.SeedSequence(seed).spawn(2) for true PRNG independence between PolarQuant and QJL stages, and between K and V quantizers. Deferred (not in this commit): - compress_token() / get_compressed_cache() streaming API - CompressedVector.to_bytes() / from_bytes() binary serialization - CompressedKVCache.save() / load() npz serialization

…uant The existing test ended with a print() and no assertion, silently allowing QJL to be worse than PolarQuant. This updates the test to assert the known finding: QJL (TurboQuant 2-bit) is actively worse than MSE-only PolarQuant at the same bit budget. The assertion will alert if QJL is ever fixed and starts winning, prompting re-evaluation of the production path. See turbo4-resurrection.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

TestFastRotationExtended covers: round-trip invertibility (x → rotate → unrotate = x), batch vs single-vector consistency, and energy distribution uniformity after rotation. All three property tests were previously untested. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds a [tool.ruff] section to pyproject.toml (line-length=120, E/W/F rules, ignoring E501/E741) and a GitHub Actions workflow (.github/workflows/lint.yml) that runs ruff check on every push and pull request. Replaces ad-hoc style discussions with an enforced, zero-config lint gate. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The lint workflow added in 46efe26 ran 'ruff check .' against the whole repo and failed immediately because the existing codebase has 233 pre-existing ruff violations (78 F401 unused imports, 68 I001 import sorting, 40 F541 empty f-strings, 32 F841 unused vars, etc.) across benchmarks/ and scripts/. Adding a CI gate that the legacy code doesn't pass is unhelpful, so remove .github/workflows/lint.yml. Keep the [tool.ruff] block in pyproject.toml as opt-in documentation: anyone running 'ruff check' locally still gets the configured rules, and the workflow can be re-enabled later once the legacy violations are addressed (most are auto-fixable via 'ruff check --fix' across 187 of the 233).

TurboQuant.CompressedVector stores TWO float32 norms per vector (vector_norms = ||x||_2, residual_norms = ||residual||_2), but compressed_size_bits and KVCacheCompressor.memory_stats only counted one (32 bits instead of 64). Pre-existing on main, parallel to the V-side undercount fixed in the previous commit. V uses TurboQuantMSE which stores a single norm — 32 is correct there. K uses full TurboQuant which stores two norms. Effect: K compressed size was understated by 32 bits per vector, inflating reported compression ratio. With d=128 b=3 the TurboQuant ratio drops from 4.92× → 4.57× (true value), and the combined KV ratio at d=128 k=v=3 drops from ~2.46× → ~2.37×. No quantization-output changes, accounting only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…JL fix Pulls in 3 upstream commits since merge-base 1224fef: - c46f6b9 docs(papers): block-selector sparse attention WIP log - 0cb20bc fix(qjl): orthogonal projection + sqrt(d) scale (TheTom TheTom#93) - 280b466 README: mark QJL as reference-only Clean auto-merge. Only file touched by both sides was turboquant.py; upstream added a `shrinkage` kwarg to TurboQuant.dequantize that slots in alongside our V-norm/MSE accounting fix without conflict. Our fork-local commits retained: V-norm in memory_stats, SeedSequence PRNG, MSE compressed_size_bits, QJL regression test, rotation tests, ruff config + CI drop, OutlierTurboQuant.calibrate, HIP/AMD NaN doc. PR TheTom#91 (ship/pr-90-curated) — TheTom's curated cherry-pick of 5 of these — remains open; once it merges to upstream/main we'll want to rebase/reset to drop redundant commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…annel split" This reverts 3215eb3. Dropping per fork-curation review: - TheTom deferred this from PR TheTom#91 because OutlierTurboQuant is on a deprecated path (turboquant-plus-experiments.md — outlier channeling loses to WHT rotation, which drives kurtosis 8-50 → 2.9). - Our llama-cpp-turboquant implementation never wired in OutlierTurboQuant — production kernels (q8_0-turbo3_0, turbo4_0-turbo2_0) use the standard TurboQuant / TurboQuantMSE path. Zero consumers downstream. Keeping the commit hash on record in docs/fork-notes/upstream-pr-90-status.md for credit; the code itself has no reason to survive future upstream churn on a dead module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Upstream renamed pyproject.toml to package refract-llm only; the [dev] extra was reduced to pytest/coverage/build/twine. The in-repo `turboquant` Python package (dev-only, not shipped) still depends on numpy + scipy, so `python -m pytest tests/` fails at import in CI with ModuleNotFoundError. Adding numpy>=1.21 and scipy>=1.7 to [dev] keeps `pip install -e ".[dev]"` self-contained for contributors and gets the GitHub Actions matrix green. Fork-local for now; candidate upstream PR after TheTom#91 merges, since the same failure is hitting `TheTom/turboquant_plus` main (every run since 2026-05-09 fails the same way). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… models" This reverts f74cb67. Dropping per fork-curation review + live validation on 2026-05-28: - TheTom deferred this from PR TheTom#91 because the int8-overflow root-cause story contradicts asymmetric-kv-compression.md:218 (large K norms compress *better* post-normalization, not worse). - Validation on llama-cpp-turboquant b1-5aeb2fd (the production HIP build on rog-mega-pc) ran the exact failing config -ctk q8_0 -ctv turbo3 on Qwen2.5-7B-Instruct Q4_K_M across 5 prompts (smoke through 3700-token near-capacity, both temp=0 and temp=0.9 sampling) on both Navi44 (gfx1200) and Navi48 (gfx1201) GPUs. Results byte-identical between cards and coherent in all cases. No NaN, no gibberish. Either the bug was fixed in the codebase since the doc was written or it was always specific to AMD hardware we don't run (MI300X gfx942 likely). Leaving the warning in place would mislead RDNA4 users about a config that works fine for them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

brett and others added 6 commits May 9, 2026 10:42

TheTom force-pushed the ship/pr-90-curated branch from 20daf78 to 8afc4bf Compare May 9, 2026 15:46

TheTom mentioned this pull request May 9, 2026

Bundled fixes & tests: V-norm accounting, OutlierTurboQuant.calibrate, rotation tests, ruff CI, HIP/AMD NaN docs #90

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Curated subset of #90 + K-side norm accounting fix#91

Curated subset of #90 + K-side norm accounting fix#91
TheTom wants to merge 6 commits into
mainfrom
ship/pr-90-curated

TheTom commented May 9, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

TheTom commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What's included

Brett's first commit was split — credit preserved

What's added

What's not included from #90

Consistency with docs/papers/why-mse-fails-for-kv-quantization.md

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

TheTom commented May 9, 2026 •

edited

Loading

Consistency with `docs/papers/why-mse-fails-for-kv-quantization.md`