fix: adaptive MPS throttler + tier switch safety by buildingjoshbetter · Pull Request #354 · buildingjoshbetter/TrueMemory

buildingjoshbetter · 2026-05-17T04:19:10Z

Summary

Replaces the simple RAM-check throttler (101 lines) with an adaptive 3-channel MPS throttler that prevents the 17GB memory balloon during tier switch re-embedding.

Root cause: The old throttler used psutil.virtual_memory() which doesn't see MPS allocations, started at batch=16 (too aggressive), and had no thermal monitoring. Result: MPS ballooned to 17GB on a 24GB machine, causing overheating and lag.

Fix: 9 files changed, +774 lines:

- MPS cap (PYTORCH_MPS_HIGH_WATERMARK_RATIO) set at model server startup, before torch import
- Config-flip-at-finalize: tier only changes after 100% rebuild completion
- 2.5-hour hard timeout: prevents indefinite rebuilds
- 3-channel sensor stack: MPS memory level, growth rate, thermal pressure
- State machine: PROBING/STABLE/BACKOFF with asymmetric intervals (fast decrease, slow increase)
- Adaptive batch sizing: starts at 1, ramps via triple-sample verification
- Conditional MPS flush: only on WARNING/BACKOFF, not every batch
- Server-side sustained workload detection: throttler activates only during re-embedding

Machine profiles:

| RAM | MPS cap | Start batch | Max batch | Step |
|---|---|---|---|---|
| 8GB | 4.0GB | 1 | 4 | +1 |
| 16GB | 8.0GB | 1 | 8 | +1 |
| 24GB | 12.0GB | 1 | 12 | +2 |
| 32GB+ | 17.6GB | 1 | 16 | +2 |
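The profile table above could be encoded as a small lookup. This is a minimal sketch, not the code from the PR: `MachineProfile`, `PROFILES`, and `select_profile` are hypothetical names, and rounding a machine down to the nearest listed tier (for example 20GB to the 16GB row) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineProfile:
    """One row of the machine profile table (field names are illustrative)."""
    min_ram_gb: int
    mps_cap_gb: float
    start_batch: int
    max_batch: int
    step: int


# Values copied from the table; ordered from largest to smallest tier.
PROFILES = (
    MachineProfile(32, 17.6, 1, 16, 2),
    MachineProfile(24, 12.0, 1, 12, 2),
    MachineProfile(16, 8.0, 1, 8, 1),
    MachineProfile(8, 4.0, 1, 4, 1),
)


def select_profile(total_ram_gb: float) -> MachineProfile:
    """Return the largest tier whose RAM floor the machine meets."""
    for profile in PROFILES:
        if total_ram_gb >= profile.min_ram_gb:
            return profile
    # Assumption: machines below 8GB fall back to the smallest profile.
    return PROFILES[-1]
```

For a 24GB machine, `select_profile(24)` yields the 12.0GB cap with a maximum batch of 12 and a step of 2.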

Test plan

637 tests pass (611 existing + 12 sensor + 15 state machine - 1 pre-existing failure in test_spawn_gate)
Ruff lint clean
Architecture diagram compliance verified
CONTEXT.md lessons 1-8 compliance verified
Reviewed by Gemini 2.5 Pro + Qwen3 235B (pre and post implementation)
Phase 09: Live Edge→Base→Edge test with Activity Monitor (manual)

- Set PYTORCH_MPS_HIGH_WATERMARK_RATIO at model_server startup before torch import (0.50 for <32GB, 0.55 for 32GB+)
- Remove premature config["tier"] write in truememory_configure; tier change now only happens in _finalize_rebuild after 100% completion
- Add explicit socket.timeout to _request_with_autostart except clause
- Add 2.5-hour hard timeout to RebuildWorker batch loop
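The finalize-only tier write and the hard timeout can be pictured together as one loop. The sketch below is an illustration under assumptions, not the RebuildWorker code: `run_rebuild`, `RebuildTimeout`, and the injectable `clock` parameter are hypothetical, and the tier names come from the Edge/Base test mentioned in the test plan.

```python
import time
from typing import Callable, Dict, Iterable

REBUILD_TIMEOUT_S = 2.5 * 60 * 60  # 2.5-hour hard limit described in the PR


class RebuildTimeout(RuntimeError):
    """Raised when the rebuild exceeds its deadline (hypothetical name)."""


def run_rebuild(
    batches: Iterable[object],
    embed_batch: Callable[[object], None],
    config: Dict[str, str],
    new_tier: str,
    clock: Callable[[], float] = time.monotonic,
    timeout_s: float = REBUILD_TIMEOUT_S,
) -> None:
    deadline = clock() + timeout_s
    for batch in batches:
        # Check the deadline before each batch so a stalled rebuild cannot run forever.
        if clock() > deadline:
            raise RebuildTimeout("rebuild exceeded its hard timeout")
        embed_batch(batch)
    # The tier is written only here, after every batch has completed.
    config["tier"] = new_tier
```

If the loop raises, whether from the timeout or from an embedding error, `config["tier"]` still holds the old value, so an interrupted rebuild leaves the configuration pointing at the tier that matches the existing index.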

New file truememory/tier_switch/sensors.py with three monitoring channels: read_mps_memory (torch.mps.driver_allocated_memory), GrowthRateTracker (slope detection), read_thermal_pressure (pmset). Each returns status dict with ok/warning/critical classification. 12 tests in tests/test_sensors.py.
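As an illustration of the growth-rate channel, slope detection over recent memory samples might look like the sketch below. It is not the contents of sensors.py: the window size, both thresholds, the dict keys, and the use of a least-squares fit are all assumptions, and only the class name and the ok/warning/critical classification come from the commit message.

```python
from collections import deque
from typing import Deque, Dict, Tuple, Union


class GrowthRateTracker:
    """Fits a line through recent (time, memory) samples and classifies the slope."""

    def __init__(self, window: int = 5, warn_gb_per_s: float = 0.05,
                 crit_gb_per_s: float = 0.2) -> None:
        # Threshold values are placeholders, not taken from the PR.
        self.samples: Deque[Tuple[float, float]] = deque(maxlen=window)
        self.warn = warn_gb_per_s
        self.crit = crit_gb_per_s

    def add(self, t_s: float, mem_gb: float) -> None:
        self.samples.append((t_s, mem_gb))

    def slope(self) -> float:
        """Least-squares slope in GB per second; 0.0 with too few samples."""
        n = len(self.samples)
        if n < 2:
            return 0.0
        mean_t = sum(t for t, _ in self.samples) / n
        mean_m = sum(m for _, m in self.samples) / n
        num = sum((t - mean_t) * (m - mean_m) for t, m in self.samples)
        den = sum((t - mean_t) ** 2 for t, _ in self.samples)
        return num / den if den else 0.0

    def status(self) -> Dict[str, Union[str, float]]:
        rate = self.slope()
        if rate >= self.crit:
            level = "critical"
        elif rate >= self.warn:
            level = "warning"
        else:
            level = "ok"
        return {"status": level, "slope_gb_per_s": rate}
```

Fitting a slope over a window, rather than comparing only the last two samples, makes the channel less sensitive to a single noisy reading.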

…rvals

ThrottlerStateMachine: fast decrease (every 5 batches), slow increase (every 120s + triple-sample + 3 good streaks). OR logic across channels for WARNING/CRITICAL. 15 tests covering all state transitions.
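The asymmetry described above (react quickly to pressure, grow slowly) can be sketched as follows. This is a simplified illustration, not the ThrottlerStateMachine from the PR: the class name, the halving on warning, the reset to 1 on critical, and the exact transition rules are assumptions, and the triple-sample verification step is omitted.

```python
from enum import Enum
from typing import Sequence


class State(Enum):
    PROBING = "probing"
    STABLE = "stable"
    BACKOFF = "backoff"


SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def combine(statuses: Sequence[str]) -> str:
    """OR logic across channels: the worst channel decides."""
    return max(statuses, key=SEVERITY.__getitem__)


class ThrottlerSketch:
    CHECK_EVERY_BATCHES = 5    # fast path: sensors evaluated every 5 batches
    INCREASE_EVERY_S = 120.0   # slow path: at most one increase per 120s
    GOOD_STREAK_NEEDED = 3     # and only after 3 consecutive clean checks

    def __init__(self, max_batch: int, step: int, now: float = 0.0) -> None:
        self.batch_size = 1
        self.max_batch = max_batch
        self.step = step
        self.state = State.PROBING
        self.good_streak = 0
        self.batches_since_check = 0
        self.last_increase = now

    def on_batch(self, statuses: Sequence[str], now: float) -> int:
        self.batches_since_check += 1
        if self.batches_since_check < self.CHECK_EVERY_BATCHES:
            return self.batch_size
        self.batches_since_check = 0
        worst = combine(statuses)
        if worst != "ok":
            # Decrease immediately; no waiting period applies on the way down.
            self.good_streak = 0
            self.state = State.BACKOFF
            self.batch_size = 1 if worst == "critical" else max(1, self.batch_size // 2)
            return self.batch_size
        self.good_streak += 1
        waited = now - self.last_increase >= self.INCREASE_EVERY_S
        if self.good_streak >= self.GOOD_STREAK_NEEDED and waited:
            if self.batch_size < self.max_batch:
                self.batch_size = min(self.max_batch, self.batch_size + self.step)
                self.state = State.PROBING
            else:
                self.state = State.STABLE
            self.last_increase = now
            self.good_streak = 0
        return self.batch_size
```

With these constants, a single bad check undoes growth within 5 batches, while each upward step needs at least 15 clean batches and 120 seconds.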

…hine, asymmetric intervals

Replace simple RAM-check throttler with adaptive 3-channel version:
- Starts at batch=1, ramps via state machine (not batch=16+)
- MPS memory level via torch.mps.driver_allocated_memory
- Growth rate tracking with slope detection
- Thermal pressure via pmset
- Machine profiles: 8/16/24/32GB with per-machine caps
- should_flush_cache() for conditional MPS flush
- on_oom() for OOM-to-BACKOFF integration
- Backward compatible interface (worker.py unchanged)

…ditional MPS flush

Add throttler integration to embed handler: detect sustained workloads (>10 requests in 30s), activate DynamicThrottler, conditionally flush MPS cache on WARNING/BACKOFF. Deactivate when workload ends. Server never rejects requests; it only monitors and flushes.
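The ">10 requests in 30s" rule can be read as a sliding-window counter. The sketch below is one possible shape for it, not the server code: `SustainedWorkloadDetector` and its methods are hypothetical names, and timestamps are passed in explicitly so the logic can be exercised without a real clock.

```python
from collections import deque
from typing import Deque


class SustainedWorkloadDetector:
    """Reports a sustained workload when more than `threshold` requests
    arrived within the last `window_s` seconds."""

    def __init__(self, threshold: int = 10, window_s: float = 30.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._times: Deque[float] = deque()

    def _evict(self, now: float) -> None:
        # Drop timestamps that have aged out of the window.
        while self._times and now - self._times[0] > self.window_s:
            self._times.popleft()

    def record(self, now: float) -> bool:
        """Register one request and return whether the workload is sustained."""
        self._times.append(now)
        return self.is_sustained(now)

    def is_sustained(self, now: float) -> bool:
        self._evict(now)
        return len(self._times) > self.threshold
```

An embed handler could call `record()` on every request and activate the throttler while it returns `True`. Once requests stop, `is_sustained()` turns `False` as the window empties, which corresponds to the "deactivate when workload ends" behavior.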

…nd OOM-to-BACKOFF

- Replace unconditional flush_gpu_cache() with conditional check via throttler.should_flush_cache() (only flushes on WARNING/BACKOFF)
- OOM handler now calls throttler.on_oom() to properly trigger BACKOFF state in the state machine instead of direct batch_size manipulation
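A worker loop using these two hooks might look roughly like the sketch below. Only the names `should_flush_cache()`, `on_oom()`, and `flush_gpu_cache()` come from the commit message; `StubThrottler`, `embed_all`, the retry-after-OOM behavior, and the use of `MemoryError` as a stand-in for the real out-of-memory exception are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


class StubThrottler:
    """Hypothetical stand-in exposing the members named in the PR."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.state = "stable"

    def on_oom(self) -> None:
        # The throttler, not the worker, decides how far to back off.
        self.state = "backoff"
        self.batch_size = 1

    def should_flush_cache(self) -> bool:
        return self.state in ("warning", "backoff")


def embed_all(
    items: Sequence[object],
    embed: Callable[[Sequence[object]], None],
    throttler: StubThrottler,
    flush_gpu_cache: Callable[[], None],
) -> None:
    i = 0
    while i < len(items):
        size = throttler.batch_size
        chunk = items[i:i + size]
        try:
            embed(chunk)
        except MemoryError:  # stand-in for the real OOM error
            if size == 1:
                raise  # cannot shrink further; surface the failure
            throttler.on_oom()
            flush_gpu_cache()
            continue  # retry the same position with the reduced batch size
        i += len(chunk)
        # Flush only when the throttler reports pressure, not after every batch.
        if throttler.should_flush_cache():
            flush_gpu_cache()
```

Routing the OOM through `on_oom()` keeps the batch size and the state in one place, which is the point of the second bullet above: the worker no longer edits `batch_size` behind the state machine's back.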

PyTorch 2.11 requires PYTORCH_MPS_LOW_WATERMARK_RATIO=0.0 when setting a custom high watermark, otherwise MPS allocation fails with "invalid low watermark ratio 1.4". Discovered during live test.
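Putting the two watermark settings together, the startup step could be sketched as below. The environment variable names and the 0.50/0.55 ratios are from this PR; the helper names and the choice to use `setdefault` (so a value the user already exported wins) are assumptions. The "1.4" in the error message appears to be the default low watermark, which would exceed a high watermark of 0.50, hence the explicit 0.0.

```python
import os
from typing import Dict, MutableMapping


def mps_watermark_env(total_ram_gb: float) -> Dict[str, str]:
    """Return the MPS allocator settings for a machine of the given size."""
    high = 0.55 if total_ram_gb >= 32 else 0.50
    return {
        "PYTORCH_MPS_HIGH_WATERMARK_RATIO": f"{high:.2f}",
        # Must be set alongside a custom high watermark (see the note above).
        "PYTORCH_MPS_LOW_WATERMARK_RATIO": "0.0",
    }


def apply_mps_caps(total_ram_gb: float,
                   environ: MutableMapping[str, str] = os.environ) -> None:
    """Write the settings into the environment without overriding existing values."""
    for key, value in mps_watermark_env(total_ram_gb).items():
        environ.setdefault(key, value)


# Call apply_mps_caps(...) at process startup, before the first `import torch`,
# so the allocator sees the values when it initializes.
```

On a 24GB machine this sets the high watermark to 0.50, which matches the 12.0GB cap in the machine profiles.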

buildingjoshbetter added 7 commits May 16, 2026 22:46

phase 09: fix MPS low watermark ratio for PyTorch 2.11

ee20945

buildingjoshbetter merged commit ee20945 into main May 17, 2026
14 checks passed

buildingjoshbetter merged 7 commits into main from fix/throttler-v2-tier-safety
