fix: adaptive MPS throttler + tier switch safety #354
Merged
- Set PYTORCH_MPS_HIGH_WATERMARK_RATIO at model_server startup, before torch is imported (0.50 on machines with <32GB, 0.55 on 32GB+)
- Remove the premature config["tier"] write in truememory_configure; the tier now changes only in _finalize_rebuild, after the rebuild reaches 100% completion
- Add an explicit socket.timeout to the except clause in _request_with_autostart
- Add a 2.5-hour hard timeout to the RebuildWorker batch loop
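The two rebuild-safety rules above (a hard deadline on the batch loop, and a tier switch that runs only after every batch completes) fit together roughly as follows. This is a minimal sketch under assumptions: run_rebuild, process_batch and finalize are hypothetical stand-ins for the RebuildWorker loop and _finalize_rebuild, and the injectable clock exists only to make the timeout testable.

```python
import time
from typing import Callable, Iterable

# 2.5-hour hard limit from the PR, expressed in seconds.
REBUILD_TIME_LIMIT_S = 2.5 * 60 * 60


# Hypothetical stand-in for RebuildWorker's batch loop plus _finalize_rebuild.
def run_rebuild(
    batches: Iterable[object],
    process_batch: Callable[[object], None],
    finalize: Callable[[], None],
    time_limit_s: float = REBUILD_TIME_LIMIT_S,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Re-embed every batch, then finalize; on timeout, stop without finalizing.

    Returns True only if all batches completed and finalize() ran.
    """
    deadline = clock() + time_limit_s
    for batch in batches:
        if clock() >= deadline:
            # Hard timeout: abandon the rebuild and leave the current tier in place.
            return False
        process_batch(batch)
    # Reached only at 100% completion, so this is the single place the tier changes.
    finalize()
    return True
```

Because the deadline is checked between batches, a batch already in flight can run past the limit; the guarantee is only that no new batch starts and the tier is never switched on a partial rebuild.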
New file truememory/tier_switch/sensors.py with three monitoring channels: read_mps_memory (torch.mps.driver_allocated_memory), GrowthRateTracker (slope detection), and read_thermal_pressure (pmset). Each returns a status dict classified as ok/warning/critical. 12 tests in tests/test_sensors.py.
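One way the read_mps_memory channel could turn torch.mps.driver_allocated_memory() into an ok/warning/critical status dict is sketched below. The thresholds, the dict keys, and the use of torch.mps.recommended_max_memory() as the default budget are assumptions for illustration, not values taken from the PR.

```python
from typing import Optional

# Hypothetical thresholds (fraction of the memory budget); the PR does not give them.
WARNING_FRACTION = 0.60
CRITICAL_FRACTION = 0.80


def classify_fraction(fraction: float,
                      warning: float = WARNING_FRACTION,
                      critical: float = CRITICAL_FRACTION) -> str:
    """Map a used/budget fraction onto the ok/warning/critical scale."""
    if fraction >= critical:
        return "critical"
    if fraction >= warning:
        return "warning"
    return "ok"


def read_mps_memory(budget_bytes: Optional[int] = None) -> dict:
    """Status dict for MPS driver memory; 'unavailable' without torch or MPS."""
    try:
        import torch  # lazy import: the sensor module loads even without torch
    except ImportError:
        return {"status": "unavailable", "allocated_bytes": None}
    if not torch.backends.mps.is_available():
        return {"status": "unavailable", "allocated_bytes": None}
    # Unlike psutil.virtual_memory(), this counts memory the Metal driver holds for MPS.
    allocated = torch.mps.driver_allocated_memory()
    # Assumed budget: recommended_max_memory() (recent PyTorch) unless one is passed in.
    budget = budget_bytes or torch.mps.recommended_max_memory()
    fraction = allocated / budget
    return {"status": classify_fraction(fraction),
            "allocated_bytes": allocated,
            "fraction": fraction}
```

Keeping the classification in a pure function lets it be tested on machines without Apple Silicon, where the reader simply reports "unavailable".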
…rvals
ThrottlerStateMachine: fast decrease (every 5 batches), slow increase (every 120s, plus a triple sample and 3 good streaks). OR logic across channels: any single channel at WARNING or CRITICAL puts the machine in that state. 15 tests cover all state transitions.
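The asymmetric rule (shrink quickly, grow slowly, let the worst channel decide) can be sketched as a small controller. ThrottlerSketch is hypothetical: the halving and doubling steps, the reset to batch=1 on critical, and folding "triple-sample + 3 good streaks" into a single three-in-a-row requirement are all assumptions, since the PR does not show ThrottlerStateMachine's internals.

```python
import time
from typing import Callable, Iterable

# Severity order used for OR logic across channels.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


def combine(statuses: Iterable[str]) -> str:
    """Worst status wins: any channel at warning/critical sets the overall state."""
    return max(statuses, key=SEVERITY.__getitem__, default="ok")


class ThrottlerSketch:
    """Asymmetric batch-size controller: quick to shrink, slow to grow.

    Hypothetical stand-in for ThrottlerStateMachine. Shrinks at most once per
    `decrease_every` batches; grows only after `good_streak_needed` consecutive
    ok samples AND `increase_after_s` seconds since the last change.
    """

    def __init__(self, max_batch: int,
                 clock: Callable[[], float] = time.monotonic,
                 decrease_every: int = 5,
                 increase_after_s: float = 120.0,
                 good_streak_needed: int = 3) -> None:
        self.batch_size = 1  # start at batch=1, as the PR describes
        self.max_batch = max_batch
        self.clock = clock
        self.decrease_every = decrease_every
        self.increase_after_s = increase_after_s
        self.good_streak_needed = good_streak_needed
        self._batches_since_decrease = decrease_every  # first decrease is immediate
        self._last_change = clock()
        self._good_streak = 0

    def observe(self, statuses: Iterable[str]) -> int:
        """Feed one batch's channel statuses and return the next batch size."""
        state = combine(statuses)
        now = self.clock()
        self._batches_since_decrease += 1
        if state != "ok":
            self._good_streak = 0
            if self._batches_since_decrease >= self.decrease_every:
                # Critical drops straight to 1; warning halves (illustrative steps).
                if state == "critical":
                    self.batch_size = 1
                else:
                    self.batch_size = max(1, self.batch_size // 2)
                self._batches_since_decrease = 0
                self._last_change = now
            return self.batch_size
        self._good_streak += 1
        if (self._good_streak >= self.good_streak_needed
                and now - self._last_change >= self.increase_after_s):
            self.batch_size = min(self.max_batch, self.batch_size * 2)
            self._good_streak = 0
            self._last_change = now
        return self.batch_size
```

Resetting the 120 s timer on every decrease means a backoff is always followed by a full quiet period before the batch grows again, which is the conservative reading of "slow increase".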
…hine, asymmetric intervals
Replace the simple RAM-check throttler with an adaptive 3-channel version:
- Starts at batch=1 and ramps up via the state machine (instead of starting at batch=16+)
- MPS memory level via torch.mps.driver_allocated_memory
- Growth-rate tracking with slope detection
- Thermal pressure via pmset
- Machine profiles for 8/16/24/32GB with per-machine caps
- should_flush_cache() for conditional MPS flush
- on_oom() for OOM-to-BACKOFF integration
- Backward-compatible interface (worker.py unchanged)
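Selecting a machine profile can be as simple as picking the largest tier that fits the installed RAM. In this sketch, MachineProfile, select_profile and every max_batch value are hypothetical; the PR names the 8/16/24/32GB tiers but not their caps.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class MachineProfile:
    ram_gb: int
    max_batch: int  # hypothetical per-machine batch cap


# Illustrative caps only; the PR lists the tiers but not their values.
PROFILES = (
    MachineProfile(8, 4),
    MachineProfile(16, 8),
    MachineProfile(24, 16),
    MachineProfile(32, 32),
)


def select_profile(total_ram_bytes: int) -> MachineProfile:
    """Pick the largest profile whose RAM does not exceed the machine's RAM."""
    ram_gb = total_ram_bytes / GIB
    chosen = PROFILES[0]  # fall back to the smallest profile on tiny machines
    for profile in PROFILES:
        if profile.ram_gb <= ram_gb:
            chosen = profile
    return chosen
```

On a real machine the total could come from psutil.virtual_memory().total; total RAM is fine for picking a tier even though psutil cannot see live MPS allocations. The cap only bounds how far the throttler may ramp up from batch=1.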
…ditional MPS flush
Add throttler integration to the embed handler: detect sustained workloads (>10 requests in 30s), activate DynamicThrottler, and conditionally flush the MPS cache on WARNING/BACKOFF. Deactivate the throttler when the workload ends. The server never rejects requests; it only monitors and flushes.
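The ">10 requests in 30s" test is a sliding-window counter. A minimal sketch, assuming a deque of request timestamps; SustainedWorkloadDetector and its method names are hypothetical, not the handler's actual code.

```python
import time
from collections import deque
from typing import Callable, Deque


class SustainedWorkloadDetector:
    """Flags a sustained workload when more than `threshold` requests land in `window_s`.

    Hypothetical helper; defaults mirror the PR's ">10 requests in 30s".
    """

    def __init__(self, threshold: int = 10, window_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.clock = clock
        self._stamps: Deque[float] = deque()

    def record_request(self) -> bool:
        """Record one request and report whether the workload is sustained."""
        now = self.clock()
        self._stamps.append(now)
        self._evict(now)
        return len(self._stamps) > self.threshold

    def is_sustained(self) -> bool:
        """Re-check without a new request, e.g. to decide when to deactivate."""
        self._evict(self.clock())
        return len(self._stamps) > self.threshold

    def _evict(self, now: float) -> None:
        # Drop timestamps that have slid out of the window.
        while self._stamps and now - self._stamps[0] > self.window_s:
            self._stamps.popleft()
```

In this sketch, a handler would activate its throttler the first time record_request() returns True and deactivate it once is_sustained() goes back to False; requests themselves are never refused.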
…nd OOM-to-BACKOFF
- Replace the unconditional flush_gpu_cache() with a conditional check via throttler.should_flush_cache() (flushes only on WARNING/BACKOFF)
- The OOM handler now calls throttler.on_oom() to trigger the BACKOFF state in the state machine instead of manipulating batch_size directly
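The worker-side pattern (route OOMs through on_oom(), flush only when the throttler asks) might look like this. The Throttler protocol and embed_all are hypothetical, and detecting an MPS OOM by matching "out of memory" in a RuntimeError message is an assumption; adjust it if your PyTorch version raises a dedicated exception type.

```python
from typing import Callable, List, Protocol, Sequence


class Throttler(Protocol):
    """The two hooks named in this commit; exact signatures are assumed."""

    batch_size: int

    def should_flush_cache(self) -> bool: ...

    def on_oom(self) -> None: ...


def is_oom(exc: BaseException) -> bool:
    # Assumption: MPS OOM surfaces as a RuntimeError (or subclass) whose
    # message mentions "out of memory", e.g. "MPS backend out of memory".
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


# Hypothetical worker loop; not the PR's worker.py.
def embed_all(
    texts: Sequence[str],
    embed: Callable[[Sequence[str]], List[List[float]]],
    flush: Callable[[], None],
    throttler: Throttler,
) -> List[List[float]]:
    """Embed texts in throttler-sized batches, routing OOMs through on_oom()."""
    out: List[List[float]] = []
    i = 0
    while i < len(texts):
        batch = texts[i:i + throttler.batch_size]
        try:
            out.extend(embed(batch))
            i += len(batch)
        except RuntimeError as exc:
            if not is_oom(exc) or len(batch) == 1:
                raise  # not an OOM, or nothing left to shrink
            # Let the state machine enter BACKOFF instead of editing batch_size here.
            throttler.on_oom()
        if throttler.should_flush_cache():
            flush()  # conditional: only when the throttler says WARNING/BACKOFF
    return out
```

The retry relies on on_oom() actually shrinking batch_size; keeping that decision inside the throttler is what lets the state machine's rate limits apply to OOMs too.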
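PyTorch 2.11 requires PYTORCH_MPS_LOW_WATERMARK_RATIO=0.0 when a custom high watermark is set; otherwise MPS allocation fails with "invalid low watermark ratio 1.4" (1.4 is the default low watermark on unified-memory Macs, which exceeds the custom high watermark). Discovered during a live test.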
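Taken together with the startup change in the first commit, the environment setup amounts to choosing the high watermark from total RAM and pinning the low watermark to 0.0, all before torch is imported. A sketch under assumptions: high_watermark_for and configure_mps_watermarks are hypothetical names, while the ratios and variable names come from the PR.

```python
import os
from typing import MutableMapping

GIB = 1024 ** 3


# Hypothetical helper: 0.50 below 32GB, 0.55 at 32GB and above (ratios from the PR).
def high_watermark_for(total_ram_bytes: int) -> float:
    return 0.55 if total_ram_bytes >= 32 * GIB else 0.50


# Hypothetical helper; must run before the first `import torch` in the process.
def configure_mps_watermarks(total_ram_bytes: int,
                             environ: MutableMapping[str, str] = os.environ) -> None:
    ratio = high_watermark_for(total_ram_bytes)
    environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = f"{ratio:.2f}"
    # A high watermark below the default low watermark (1.4 on unified memory)
    # fails validation, so the low ratio is set explicitly as well.
    environ["PYTORCH_MPS_LOW_WATERMARK_RATIO"] = "0.0"
```

Because PyTorch reads these variables when the MPS allocator initializes, setting them after torch has already been imported in the server process may have no effect; that is why the PR moves the assignment ahead of the torch import.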
Summary
Replaces the simple RAM-check throttler (101 lines) with an adaptive 3-channel MPS throttler that prevents the 17GB memory balloon during tier switch re-embedding.
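A balloon shows up as a trend before it becomes a peak, which is what the growth-rate channel watches. The sketch below implements slope detection as an ordinary least-squares fit over a sliding window of (time, bytes) samples; the window size, the bytes-per-second thresholds, and the method names are assumptions, since the PR names GrowthRateTracker without its internals.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class GrowthRateTracker:
    """Least-squares slope of memory usage over the last `window` samples.

    Hypothetical reimplementation: the PR names the class and "slope detection"
    but not its window, units, or thresholds.
    """

    def __init__(self, window: int = 10,
                 warning_bytes_per_s: float = 50e6,
                 critical_bytes_per_s: float = 200e6) -> None:
        self._samples: Deque[Tuple[float, float]] = deque(maxlen=window)
        self.warning = warning_bytes_per_s
        self.critical = critical_bytes_per_s

    def add(self, t: float, used_bytes: float) -> None:
        self._samples.append((t, used_bytes))

    def slope(self) -> Optional[float]:
        """Bytes per second, or None until there are two distinct timestamps."""
        n = len(self._samples)
        if n < 2:
            return None
        mean_t = sum(t for t, _ in self._samples) / n
        mean_y = sum(y for _, y in self._samples) / n
        var_t = sum((t - mean_t) ** 2 for t, _ in self._samples)
        if var_t == 0:
            return None
        cov = sum((t - mean_t) * (y - mean_y) for t, y in self._samples)
        return cov / var_t

    def status(self) -> dict:
        s = self.slope()
        if s is None or s < self.warning:
            level = "ok"
        elif s < self.critical:
            level = "warning"
        else:
            level = "critical"
        return {"status": level, "slope_bytes_per_s": s}
```

A least-squares fit smooths over the sawtooth that allocator caching produces, so a single large allocation does not trip the channel while a steady climb does.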
Root cause: The old throttler used psutil.virtual_memory() (which doesn't see MPS allocations), started at batch=16 (too aggressive), and had no thermal monitoring. Result: MPS ballooned to 17GB on a 24GB machine, causing overheating and lag.
Fix: 9 files changed, +774 lines:
PYTORCH_MPS_HIGH_WATERMARK_RATIO) set at model server startup before torch import
Machine profiles:
Test plan