Skip to content

Curated subset of #90 + K-side norm accounting fix#91

Open
TheTom wants to merge 6 commits into
mainfrom
ship/pr-90-curated
Open

Curated subset of #90 + K-side norm accounting fix#91
TheTom wants to merge 6 commits into
mainfrom
ship/pr-90-curated

Conversation

@TheTom
Copy link
Copy Markdown
Owner

@TheTom TheTom commented May 9, 2026

Selectively merging from @brosequist's #90 — keeping the actual fixes, deferring API surface, plus an additional K-side fix that #90 missed.

Thanks @brosequist for the bundle — review surface was much easier this way. Re-bundling further per the curation below.

What's included

commit author what
b75813b @brosequist fix: V-norm in memory_stats, SeedSequence PRNG, TurboQuantMSE.compressed_size_bits
1074625 @brosequist test: QJL regression-guard in test_turboquant_improves_over_polarquant
f23570e @brosequist test: correctness + round-trip tests for fast rotation functions
3e37572 @brosequist chore: ruff config in pyproject.toml
0ca5bcc @brosequist ci: drop lint workflow (kept ruff config only)
8afc4bf @TheTom fix: K-side norm accounting — count vector_norm AND residual_norm

Brett's first commit was split — credit preserved

The original 0fd5de9 bundle (5 changes) was cherry-picked with the streaming + serialization API deferred. What's kept:

  • ✅ V-norm fix in KVCacheCompressor.memory_stats()
  • TurboQuantMSE.compressed_size_bits() (was missing; TurboQuant already had it)
  • SeedSequence.spawn(2) replacing seed + 1000 magic offset

What's deferred (no caller yet, want to design for the production integration):

  • KVCacheCompressor.compress_token() / get_compressed_cache() streaming API
  • CompressedVector.to_bytes() / from_bytes() binary serialization
  • CompressedKVCache.save() / load() npz serialization

The split commit retains @brosequist as author.

What's added

8afc4bf is a parallel fix to the V-norm fix in b75813b. TurboQuant.CompressedVector stores two float32 norms (vector_norms = ||x||_2 and residual_norms = ||residual||_2), but TurboQuant.compressed_size_bits and KVCacheCompressor.memory_stats only counted one. V uses TurboQuantMSE (single norm — 32 is correct). K uses full TurboQuant (two norms — 64 is correct).

Numerical effect (verified live):

  • TurboQuant(d=128, b=3).compression_ratio(): 4.92× → 4.57× (true)
  • KVCacheCompressor(d=128, k=3, v=3).memory_stats(...)['compression_ratio']: ~2.46× → 2.37× (true)

No quantization-output changes — accounting only.

What's not included from #90

  • feat: add calibrate() to OutlierTurboQuantOutlierTurboQuant is a deprecated path per docs/turboquant-plus-experiments.md ("Outlier channeling doesn't work… kurtosis stays 8-50… WHT rotation gets it to 2.9"). Calibrate code is well-written, just on a dead module.
  • docs: HIP/AMD NaN warning — root-cause story ("large K norms → NaN") contradicts docs/papers/asymmetric-kv-compression.md:218, which finds extreme K norms compress better (more Gaussian after normalization). Real cause is HIP-kernel-specific. Will revisit after kernel triage.

Consistency with docs/papers/why-mse-fails-for-kv-quantization.md

The new MSE paper argues MSE is a broken proxy for K cache quantization in deployment because attention is non-linear and sparse. Brett's QJL regression-guard test (1074625) measures inner-product distortion on synthetic Gaussian pairs (d=256) — the linear-operator regime where the new paper explicitly says IP/MSE does proxy quality (alongside RaBitQ-style top-k IP search). The test is a regression-guard for IP distortion only; the production decision to drop QJL is justified separately by docs/papers/turbo4-resurrection.md's PPL ablation. No conflict.

Test plan

🤖 Generated with Claude Code

brett and others added 6 commits May 9, 2026 10:42
Subset of @brosequist's #90 commit 0fd5de9 — keeping the actual
fixes, deferring the streaming + serialization API surface until
a production caller exists.

Included:
- KVCacheCompressor.memory_stats() was omitting the float32 norm
  stored per V vector, inflating reported compression ratio. Adds
  v_bits_total += n_vectors * 32.
- TurboQuantMSE.compressed_size_bits() — was missing (TurboQuant
  already had it).
- Replaces seed + 1000 magic offset with
  np.random.SeedSequence(seed).spawn(2) for true PRNG independence
  between PolarQuant and QJL stages, and between K and V quantizers.

Deferred (not in this commit):
- compress_token() / get_compressed_cache() streaming API
- CompressedVector.to_bytes() / from_bytes() binary serialization
- CompressedKVCache.save() / load() npz serialization
…uant

The existing test ended with a print() and no assertion, silently allowing QJL
to be worse than PolarQuant. This updates the test to assert the known finding:
QJL (TurboQuant 2-bit) is actively worse than MSE-only PolarQuant at the same
bit budget. The assertion will alert if QJL is ever fixed and starts winning,
prompting re-evaluation of the production path. See turbo4-resurrection.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TestFastRotationExtended covers: round-trip invertibility (x → rotate → unrotate = x),
batch vs single-vector consistency, and energy distribution uniformity after rotation.
All three property tests were previously untested.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a [tool.ruff] section to pyproject.toml (line-length=120, E/W/F rules,
ignoring E501/E741) and a GitHub Actions workflow (.github/workflows/lint.yml)
that runs ruff check on every push and pull request. Replaces ad-hoc style
discussions with an enforced, zero-config lint gate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The lint workflow added in 46efe26 ran 'ruff check .' against the
whole repo and failed immediately because the existing codebase has
233 pre-existing ruff violations (78 F401 unused imports, 68 I001
import sorting, 40 F541 empty f-strings, 32 F841 unused vars, etc.)
across benchmarks/ and scripts/.

Adding a CI gate that the legacy code doesn't pass is unhelpful, so
remove .github/workflows/lint.yml. Keep the [tool.ruff] block in
pyproject.toml as opt-in documentation: anyone running 'ruff check'
locally still gets the configured rules, and the workflow can be
re-enabled later once the legacy violations are addressed (most are
auto-fixable via 'ruff check --fix' across 187 of the 233).
TurboQuant.CompressedVector stores TWO float32 norms per vector
(vector_norms = ||x||_2, residual_norms = ||residual||_2), but
compressed_size_bits and KVCacheCompressor.memory_stats only
counted one (32 bits instead of 64).

Pre-existing on main, parallel to the V-side undercount fixed in
the previous commit. V uses TurboQuantMSE which stores a single
norm — 32 is correct there. K uses full TurboQuant which stores
two norms.

Effect: K compressed size was understated by 32 bits per vector,
inflating reported compression ratio. With d=128 b=3 the
TurboQuant ratio drops from 4.92× → 4.57× (true value), and the
combined KV ratio at d=128 k=v=3 drops from ~2.46× → ~2.37×.

No quantization-output changes, accounting only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@TheTom TheTom force-pushed the ship/pr-90-curated branch from 20daf78 to 8afc4bf Compare May 9, 2026 15:46
brosequist pushed a commit to brosequist/turboquant_plus that referenced this pull request May 28, 2026
…JL fix

Pulls in 3 upstream commits since merge-base 1224fef:
- c46f6b9 docs(papers): block-selector sparse attention WIP log
- 0cb20bc fix(qjl): orthogonal projection + sqrt(d) scale (TheTom TheTom#93)
- 280b466 README: mark QJL as reference-only

Clean auto-merge. Only file touched by both sides was turboquant.py;
upstream added a `shrinkage` kwarg to TurboQuant.dequantize that slots
in alongside our V-norm/MSE accounting fix without conflict.

Our fork-local commits retained: V-norm in memory_stats, SeedSequence
PRNG, MSE compressed_size_bits, QJL regression test, rotation tests,
ruff config + CI drop, OutlierTurboQuant.calibrate, HIP/AMD NaN doc.

PR TheTom#91 (ship/pr-90-curated) — TheTom's curated cherry-pick of 5 of
these — remains open; once it merges to upstream/main we'll want to
rebase/reset to drop redundant commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
brosequist pushed a commit to brosequist/turboquant_plus that referenced this pull request May 28, 2026
…annel split"

This reverts 3215eb3.

Dropping per fork-curation review:
- TheTom deferred this from PR TheTom#91 because OutlierTurboQuant is on a
  deprecated path (turboquant-plus-experiments.md — outlier channeling
  loses to WHT rotation, which drives kurtosis 8-50 → 2.9).
- Our llama-cpp-turboquant implementation never wired in OutlierTurboQuant
  — production kernels (q8_0-turbo3_0, turbo4_0-turbo2_0) use the standard
  TurboQuant / TurboQuantMSE path. Zero consumers downstream.

Keeping the commit hash on record in docs/fork-notes/upstream-pr-90-status.md
for credit; the code itself has no reason to survive future upstream churn
on a dead module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
brosequist pushed a commit to brosequist/turboquant_plus that referenced this pull request May 28, 2026
Upstream renamed pyproject.toml to package refract-llm only; the [dev]
extra was reduced to pytest/coverage/build/twine. The in-repo `turboquant`
Python package (dev-only, not shipped) still depends on numpy + scipy,
so `python -m pytest tests/` fails at import in CI with ModuleNotFoundError.

Adding numpy>=1.21 and scipy>=1.7 to [dev] keeps `pip install -e ".[dev]"`
self-contained for contributors and gets the GitHub Actions matrix green.

Fork-local for now; candidate upstream PR after TheTom#91 merges, since the same
failure is hitting `TheTom/turboquant_plus` main (every run since 2026-05-09
fails the same way).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
brosequist pushed a commit to brosequist/turboquant_plus that referenced this pull request May 28, 2026
… models"

This reverts f74cb67.

Dropping per fork-curation review + live validation on 2026-05-28:
- TheTom deferred this from PR TheTom#91 because the int8-overflow root-cause
  story contradicts asymmetric-kv-compression.md:218 (large K norms
  compress *better* post-normalization, not worse).
- Validation on llama-cpp-turboquant b1-5aeb2fd (the production HIP build
  on rog-mega-pc) ran the exact failing config -ctk q8_0 -ctv turbo3 on
  Qwen2.5-7B-Instruct Q4_K_M across 5 prompts (smoke through 3700-token
  near-capacity, both temp=0 and temp=0.9 sampling) on both Navi44
  (gfx1200) and Navi48 (gfx1201) GPUs. Results byte-identical between
  cards and coherent in all cases. No NaN, no gibberish.

Either the bug was fixed in the codebase since the doc was written or it
was always specific to AMD hardware we don't run (MI300X gfx942 likely).
Leaving the warning in place would mislead RDNA4 users about a config
that works fine for them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants