fix: adaptive MPS throttler + tier switch safety #354

Merged: buildingjoshbetter merged 7 commits into main from fix/throttler-v2-tier-safety on May 17, 2026

Summary

Replaces the simple RAM-check throttler (101 lines) with an adaptive 3-channel MPS throttler that prevents the 17GB memory balloon during tier switch re-embedding.

Root cause: The old throttler used psutil.virtual_memory(), which does not see MPS allocations. It also started at batch=16 (too aggressive) and had no thermal monitoring. Result: MPS memory ballooned to 17GB on a 24GB machine, causing overheating and lag.

Fix: 9 files changed, +774 lines:

  • MPS cap (PYTORCH_MPS_HIGH_WATERMARK_RATIO) set at model server startup before torch import
  • Config-flip-at-finalize — tier only changes after 100% rebuild completion
  • 2.5-hour hard timeout — prevents indefinite rebuilds
  • 3-channel sensor stack — MPS memory level, growth rate, thermal pressure
  • State machine — PROBING/STABLE/BACKOFF with asymmetric intervals (fast decrease, slow increase)
  • Adaptive batch sizing — starts at 1, ramps via triple-sample verification
  • Conditional MPS flush — only on WARNING/BACKOFF, not every batch
  • Server-side sustained workload detection — throttler activates only during re-embedding
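The config-flip-at-finalize rule can be illustrated with a minimal sketch. The PR names `_finalize_rebuild` and `config["tier"]`; the function signature, the `done`/`total` counters, and the tier strings below are hypothetical and chosen only for illustration.

```python
def finalize_rebuild(config: dict, target_tier: str, done: int, total: int) -> bool:
    """Write the new tier only when every item has been re-embedded.

    Hypothetical signature: the real _finalize_rebuild may track
    completion differently.
    """
    if total <= 0 or done < total:
        # Rebuild incomplete (or nothing to rebuild): leave the old tier alone.
        return False
    config["tier"] = target_tier
    return True


config = {"tier": "edge"}
finalize_rebuild(config, "base", done=99, total=100)   # tier stays "edge"
finalize_rebuild(config, "base", done=100, total=100)  # tier becomes "base"
```

With this ordering, a rebuild that is interrupted or times out leaves the configured tier pointing at the embeddings that actually exist on disk.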

Machine profiles:

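| RAM | MPS cap | Start batch | Max batch | Batch step |
|---|---|---|---|---|
| 8GB | 4.0GB | 1 | 4 | +1 |
| 16GB | 8.0GB | 1 | 8 | +1 |
| 24GB | 12.0GB | 1 | 12 | +2 |
| 32GB+ | 17.6GB | 1 | 16 | +2 |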
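
As a rough illustration of how the profile table might be looked up, here is a sketch. The values are copied from the table above. The `MachineProfile` and `select_profile` names, and the rule that in-between RAM sizes fall to the next lower profile, are assumptions rather than the PR's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineProfile:
    min_ram_gb: int     # smallest RAM size this profile applies to
    mps_cap_gb: float   # MPS memory cap
    start_batch: int    # initial batch size
    max_batch: int      # ceiling for the ramp
    step: int           # batch-size increment per successful ramp


# Ordered largest-first so the first match wins.
PROFILES = (
    MachineProfile(32, 17.6, 1, 16, 2),
    MachineProfile(24, 12.0, 1, 12, 2),
    MachineProfile(16, 8.0, 1, 8, 1),
    MachineProfile(8, 4.0, 1, 4, 1),
)


def select_profile(total_ram_gb: float) -> MachineProfile:
    """Pick the largest profile whose minimum RAM the machine meets."""
    for profile in PROFILES:
        if total_ram_gb >= profile.min_ram_gb:
            return profile
    # Assumption: machines below 8GB use the most conservative profile.
    return PROFILES[-1]
```

Note that the caps are consistent with the watermark ratios in the commit notes: 0.50 of RAM below 32GB and 0.55 of 32GB for the top profile.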

Test plan

  • 637 tests pass (611 existing + 12 sensor + 15 state machine = 638, minus 1 pre-existing failure in test_spawn_gate)
  • Ruff lint clean
  • Architecture diagram compliance verified
  • CONTEXT.md lessons 1-8 compliance verified
  • Reviewed by Gemini 2.5 Pro + Qwen3 235B (pre and post implementation)
  • Phase 09: Live Edge→Base→Edge test with Activity Monitor (manual)

- Set PYTORCH_MPS_HIGH_WATERMARK_RATIO at model_server startup before
  torch import (0.50 for <32GB, 0.55 for 32GB+)
- Remove premature config["tier"] write in truememory_configure; tier
  change now only happens in _finalize_rebuild after 100% completion
- Add explicit socket.timeout to _request_with_autostart except clause
- Add 2.5-hour hard timeout to RebuildWorker batch loop
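
A hard timeout on a batch loop can be sketched as a monotonic deadline checked before each batch. This is not the RebuildWorker code; `run_batches`, `RebuildTimeout`, and the injectable `clock` parameter are hypothetical names used for illustration.

```python
import time
from typing import Callable, Iterable

HARD_TIMEOUT_S = 2.5 * 60 * 60  # 2.5 hours, as stated in the PR


class RebuildTimeout(RuntimeError):
    """Raised when the rebuild exceeds its hard deadline."""


def run_batches(
    batches: Iterable,
    process: Callable,
    timeout_s: float = HARD_TIMEOUT_S,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Process batches until done or until the deadline passes."""
    deadline = clock() + timeout_s
    done = 0
    for batch in batches:
        # Check before starting a batch so a late batch is never begun.
        if clock() >= deadline:
            raise RebuildTimeout(f"rebuild exceeded {timeout_s:.0f}s after {done} batches")
        process(batch)
        done += 1
    return done
```

Using a monotonic clock rather than wall-clock time keeps the deadline stable across system clock adjustments.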

New file truememory/tier_switch/sensors.py with three monitoring
channels: read_mps_memory (torch.mps.driver_allocated_memory),
GrowthRateTracker (slope detection), read_thermal_pressure (pmset).
Each returns status dict with ok/warning/critical classification.
12 tests in tests/test_sensors.py.
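
To make the slope-detection idea concrete, here is a minimal sketch of a growth-rate channel that fits a least-squares line through recent memory samples. The class name matches the PR, but the window size, the MB/s thresholds, and the keys of the returned dict are assumptions, not the values in sensors.py.

```python
from collections import deque


class GrowthRateTracker:
    """Tracks memory growth as the least-squares slope over a sliding window."""

    def __init__(self, window: int = 5, warning_mb_s: float = 50.0,
                 critical_mb_s: float = 200.0) -> None:
        # Thresholds are illustrative guesses, not the PR's values.
        self._samples = deque(maxlen=window)
        self._warning = warning_mb_s
        self._critical = critical_mb_s

    def add(self, t_s: float, used_mb: float) -> None:
        self._samples.append((t_s, used_mb))

    def slope_mb_per_s(self) -> float:
        n = len(self._samples)
        if n < 2:
            return 0.0
        mean_t = sum(t for t, _ in self._samples) / n
        mean_m = sum(m for _, m in self._samples) / n
        var_t = sum((t - mean_t) ** 2 for t, _ in self._samples)
        if var_t == 0:
            return 0.0  # all samples share a timestamp; slope is undefined
        cov = sum((t - mean_t) * (m - mean_m) for t, m in self._samples)
        return cov / var_t

    def status(self) -> dict:
        slope = self.slope_mb_per_s()
        if slope >= self._critical:
            level = "critical"
        elif slope >= self._warning:
            level = "warning"
        else:
            level = "ok"
        return {"status": level, "slope_mb_per_s": slope}
```

A fitted slope is less sensitive to a single noisy reading than the difference between the last two samples, which is one plausible reason to prefer it for a growth channel.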
…rvals

ThrottlerStateMachine: fast decrease (every 5 batches), slow increase
(every 120s + triple-sample + 3 good streaks). OR logic across channels
for WARNING/CRITICAL. 15 tests covering all state transitions.
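
The asymmetric behaviour described above can be sketched roughly as follows. The state names and the 5-batch, 120s, and 3-streak constants come from the PR. Everything else is an assumption: the class name `ThrottlerSketch`, the meaning given to each state, the size of each decrease, and the choice to evaluate both directions at the same 5-batch checkpoint. Triple-sample verification is not modelled.

```python
from enum import Enum
from typing import Iterable


class State(Enum):
    PROBING = "probing"   # assumed: still ramping toward max_batch
    STABLE = "stable"     # assumed: holding at max_batch
    BACKOFF = "backoff"   # assumed: a channel reported pressure


_SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def worst(statuses: Iterable[str]) -> str:
    """OR logic: one unhealthy channel flags the whole reading."""
    return max(statuses, key=_SEVERITY.__getitem__)


class ThrottlerSketch:
    CHECK_EVERY_BATCHES = 5     # fast path: decreases can fire at every check
    INCREASE_EVERY_S = 120.0    # slow path: minimum gap between increases
    GOOD_STREAKS_NEEDED = 3     # consecutive healthy checks before ramping

    def __init__(self, max_batch: int, step: int, now: float = 0.0) -> None:
        self.state = State.PROBING
        self.batch_size = 1
        self.max_batch = max_batch
        self.step = step
        self._since_check = 0
        self._good_streak = 0
        self._last_increase = now

    def on_batch(self, statuses: Iterable[str], now: float) -> int:
        """Record one finished batch and return the next batch size."""
        self._since_check += 1
        if self._since_check < self.CHECK_EVERY_BATCHES:
            return self.batch_size
        self._since_check = 0
        level = worst(statuses)
        if level != "ok":
            self._good_streak = 0
            self.state = State.BACKOFF
            # Assumed policy: halve on critical, step down on warning.
            cut = self.batch_size // 2 if level == "critical" else self.batch_size - self.step
            self.batch_size = max(1, cut)
            return self.batch_size
        self._good_streak += 1
        waited = now - self._last_increase >= self.INCREASE_EVERY_S
        if (self.batch_size < self.max_batch and waited
                and self._good_streak >= self.GOOD_STREAKS_NEEDED):
            self.batch_size = min(self.max_batch, self.batch_size + self.step)
            self._last_increase = now
            self._good_streak = 0
        self.state = State.STABLE if self.batch_size >= self.max_batch else State.PROBING
        return self.batch_size
```

The asymmetry is the point of the design: a decrease needs only one bad reading at the next checkpoint, while an increase needs both elapsed time and a run of healthy readings.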
…hine, asymmetric intervals

Replace simple RAM-check throttler with adaptive 3-channel version:
- Starts at batch=1, ramps via state machine (not batch=16+)
- MPS memory level via torch.mps.driver_allocated_memory
- Growth rate tracking with slope detection
- Thermal pressure via pmset
- Machine profiles: 8/16/24/32GB with per-machine caps
- should_flush_cache() for conditional MPS flush
- on_oom() for OOM-to-BACKOFF integration
- Backward compatible interface (worker.py unchanged)
…ditional MPS flush

Add throttler integration to embed handler: detect sustained workloads
(>10 requests in 30s), activate DynamicThrottler, conditionally flush
MPS cache on WARNING/BACKOFF. Deactivate when workload ends.
Server never rejects requests — only monitors and flushes.
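
The ">10 requests in 30s" rule maps naturally onto a sliding window of timestamps. The sketch below is illustrative only: `SustainedWorkloadDetector` and its method names are hypothetical, and the server's real detection logic may differ.

```python
from collections import deque


class SustainedWorkloadDetector:
    """Flags a sustained workload when more than `threshold` requests
    arrive within the trailing `window_s` seconds."""

    def __init__(self, threshold: int = 10, window_s: float = 30.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._times = deque()

    def _evict(self, now: float) -> None:
        # Drop timestamps that have fallen out of the window.
        while self._times and now - self._times[0] > self.window_s:
            self._times.popleft()

    def record(self, now: float) -> bool:
        """Record one request and report whether the workload is sustained."""
        self._times.append(now)
        return self.active(now)

    def active(self, now: float) -> bool:
        self._evict(now)
        return len(self._times) > self.threshold
```

Calling `active(now)` without recording lets the server notice that the workload has ended and deactivate the throttler, since old timestamps age out on their own.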
…nd OOM-to-BACKOFF

- Replace unconditional flush_gpu_cache() with conditional check via
  throttler.should_flush_cache() (only flushes on WARNING/BACKOFF)
- OOM handler now calls throttler.on_oom() to properly trigger BACKOFF
  state in the state machine instead of direct batch_size manipulation
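
The worker-side call pattern might look roughly like the sketch below. Only `should_flush_cache()`, `on_oom()`, and `flush_gpu_cache()` are names from the PR. `StubThrottler`, `embed_all`, the use of `MemoryError` as a stand-in for an MPS out-of-memory error, and the reset to batch size 1 are assumptions.

```python
class StubThrottler:
    """Hypothetical stand-in exposing only the two methods named in the PR."""

    def __init__(self, batch_size: int = 4) -> None:
        self.state = "STABLE"
        self.level = "ok"
        self.batch_size = batch_size

    def should_flush_cache(self) -> bool:
        # Flush only under pressure, not after every batch.
        return self.level == "warning" or self.state == "BACKOFF"

    def on_oom(self) -> None:
        # Assumed policy: drop to the smallest batch and enter BACKOFF.
        self.state = "BACKOFF"
        self.batch_size = 1


def embed_all(items, embed, throttler, flush_gpu_cache):
    """Embed items in throttled batches, retrying a batch after an OOM."""
    out = []
    i = 0
    while i < len(items):
        batch = items[i:i + throttler.batch_size]
        try:
            out.extend(embed(batch))
        except MemoryError:  # stand-in for the real OOM exception
            if throttler.batch_size == 1:
                raise  # cannot shrink further; surface the error
            throttler.on_oom()
            flush_gpu_cache()
            continue  # retry from the same position with a smaller batch
        i += len(batch)
        if throttler.should_flush_cache():
            flush_gpu_cache()
    return out
```

Routing the OOM through the throttler keeps batch-size decisions in one place, so the worker never edits `batch_size` directly.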

PyTorch 2.11 requires PYTORCH_MPS_LOW_WATERMARK_RATIO=0.0 when
setting a custom high watermark; otherwise MPS allocation fails
with "invalid low watermark ratio 1.4". Discovered during live test.
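
Putting the two watermark settings together, a startup helper might look like the following sketch. The environment variable names and the 0.50/0.55 ratios come from the PR. The helper names and the choice to respect values the user has already set are assumptions.

```python
import os


def mps_watermark_env(total_ram_gb: float) -> dict:
    """Return the MPS watermark settings for a machine of the given size."""
    high = "0.55" if total_ram_gb >= 32 else "0.50"
    return {
        "PYTORCH_MPS_HIGH_WATERMARK_RATIO": high,
        # The low watermark must not exceed the high one, so pin it to 0.0.
        "PYTORCH_MPS_LOW_WATERMARK_RATIO": "0.0",
    }


def apply_mps_env(total_ram_gb: float, environ=os.environ) -> None:
    """Apply the settings without overriding values already present."""
    for key, value in mps_watermark_env(total_ram_gb).items():
        environ.setdefault(key, value)  # assumption: explicit user settings win


# This must run before the first `import torch` in the process, because
# the MPS allocator reads these variables when it initialises.
```

In a real server the RAM size would come from something like `psutil.virtual_memory().total`, which is not imported here to keep the sketch dependency-free.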
@buildingjoshbetter buildingjoshbetter merged commit ee20945 into main May 17, 2026
14 checks passed