fix: skip redundant TheRock download when already installed (#2413)#2467
fix: skip redundant TheRock download when already installed (#2413)#2467bong-water-water-bong wants to merge 14 commits into
Conversation
…r OpenAI compat Closes lemonade-sdk#1370 — OpenCode / @ai-sdk/openai-compatible streaming crash Three fixes for reasoning model streaming: 1. Streaming proxy normalization (streaming_proxy.cpp): - Intercepts each SSE data: {...} line in forward_sse_stream() - Injects content: "" when reasoning_content is present without content - Injects role: "assistant" when null/missing on assistant deltas - Only applies to chat.completion.chunk objects (non-chat passthrough) 2. Non-streaming response (server.cpp): - Same content injection for REST chat completions response 3. thinking: false passthrough (server.cpp): - Replaced strip_handled_thinking_fields() (which erased enable_thinking/ thinking before forwarding) with normalize_thinking_fields() which renames thinking → enable_thinking and keeps it in the forwarded request. FLM/vLLM/cloud backends now see enable_thinking. - /no_think prefix retained for llama.cpp compatibility Tests: 11 unit tests covering role normalization, reasoning content normalization, carriage return, multi-choice, multi-line streams. 7/7 C++ tests pass (100%).
Closes lemonade-sdk#2371 — multi-GPU systems only got ROCm for the first GPU arch get_rocm_arch() iterates AMD GPUs (iGPU first, then dGPU) and returns only the first match. On systems with both an iGPU and dGPU with different architectures, TheRock was only installed for the iGPU. Fix: - Add get_rocm_arches() returning ALL detected AMD GPU architectures (deduplicated, iGPU-first ordering preserved) - Update install_therock_if_needed() to install TheRock for every arch - Keep get_rocm_arch() for backward compat (rocm_channel, display)
…dk#1364, lemonade-sdk#1546) lemonade-sdk#1364 — Large Prompts Timing Out - Add SSE keepalive heartbeat thread in forward_sse_stream() - Sends : keepalive\n\n every 10s during prefill while waiting for first token - Prevents client-side read timeouts on long-running prompt processing - Thread-safe via shared mutex with the libcurl write callback lemonade-sdk#1546 — Model Download Resilience - Add .completed sentinel written after all files are verified in download - is_checkpoint_path_complete() checks for .completed as authoritative marker - Prevents corrupt partially-downloaded files from appearing complete - Hardened recursive_directory_iterator with skip_permission_denied + error_code
Add a pre-load memory check in Router::load_model() that compares model_info.size (file size in GB) against get_available_memory_gb() for the target device. Logs a warning when the model may not fit. Chose warn-only (not block) because: 1. GGUF file size != load-time memory (mmap'd, paged) 2. Auto-tune ctx_size resolver will reduce context to fit 3. A hard block would frustrate users who know their setup
Completes fixes for lemonade-sdk#1364, lemonade-sdk#1546, lemonade-sdk#1804: ## lemonade-sdk#1364 — SSE heartbeat during long prefill Injects : keepalive\n\n every 10s during prefill to prevent client-side read timeouts on long prompts (15k tokens → 5 min prefill). ## #1546b — .completed sentinel for download verification Written after all files are verified in download_from_huggingface(). is_checkpoint_path_complete() checks for it, preventing corrupt partially- downloaded files from appearing complete after a crash. ## #1546a — Model-level download resume fast-path download_from_huggingface() now accepts do_not_upgrade flag. When set and .completed sentinel exists, skips the HF API call entirely. ## #1546c — Directory iterator hardening discover_extra_models() uses skip_permission_denied + error_code handling to prevent crashes on temp files from interrupted downloads. ## lemonade-sdk#1804 — Pre-load OOM guard Upgraded the pre-load memory check from warning to hard block when model size exceeds 2x available memory headroom. Prevents OOM killer crashes with a clear error message instead. ## CI — PR-Agent + Qodo dual review Added pr-agent-review.yml (DeepSeek) and qodo-merge.yml workflows.
Before fetching the HF API and rebuilding the file list, check for an existing .download_manifest.json with incomplete files. If found, resume downloading from the partial state instead of starting over. This avoids re-downloading already-completed files after a network interruption or Ctrl-C during model pull.
This reverts commit 11991bc.
…ection (lemonade-sdk#2414) The function used a std::lock_guard which caused a deadlock when build_recipes_info re-entered get_system_info_with_cache via get_rocm_arch(). Fix by switching to std::unique_lock, marking s_recipes_computed early, and unlocking during recipe computation. On failure the flag resets so the next call retries.
…ete (lemonade-sdk#2435) The draft checkpoint is optional — the model can run without MTP spec decoding. But are_required_checkpoints_complete() iterates over ALL checkpoint types, and if the draft model hasn't been downloaded yet the entire model gets marked as not-downloaded / 'unreadable'. Fix by skipping the draft type alongside npu_cache.
…crets Adds a setup script that reads API keys from ~/Documents/ and pushes them as GitHub secrets via \`gh secret set\`, plus comment headers in both PR-Agent workflows referencing how to configure secrets. Workflows already skip gracefully when secrets are missing, but the script and comments make it discoverable for new contributors. - .github/scripts/setup-repo-secrets.sh — idempotent, dry-run, --repo support - pr-agent-review.yml — header: DEEPSEEK_API_KEY required - qodo-merge.yml — header: QODO_API_KEY required
…-sdk#2413) install_therock_if_needed() didn't check if TheRock was already installed before calling install_therock(). When hot-swapping multiple ROCm backends (e.g. llamacpp:rocm and sd-cpp:rocm), each triggered a separate 3 GB TheRock download. Fix by adding is_therock_installed_for_current_arch() check at the top of install_therock_if_needed().
When offline mode is enabled, the server silently skipped the download but the CLI still printed 'Model pulled successfully'. Fix: server now includes a 'warning' field in both legacy and SSE streaming completion responses when offline, and the CLI surfaces it as 'Model pulled with warning: ...' instead of 'successfully'.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f4b2dc43be
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| double available = get_available_memory_gb( | ||
| model_info.device & DEVICE_GPU ? DEVICE_GPU : | ||
| model_info.device & DEVICE_NPU ? DEVICE_NPU : | ||
| DEVICE_CPU); | ||
| double headroom = available * SAFETY_MARGIN; | ||
| if (model_info.size > headroom * 2.0) { |
There was a problem hiding this comment.
Do not treat unknown memory as zero capacity
In environments where get_available_memory_gb() cannot determine a value, it returns 0.0; this new hard-fail path then computes zero headroom and rejects every model with model_info.size > 0. For example, CPU recipes on a generic x86 host without the Apple/AMD GPU probes used by that helper will now throw before reaching the backend even though the model could load. Please skip the rejection when memory is unknown, or fall back to generic system RAM for CPU/NPU loads.
Useful? React with 👍 / 👎.
| while (!has_first_token.load() && heartbeat_running.load()) { | ||
| std::this_thread::sleep_for(KEEPALIVE_INTERVAL); | ||
| if (has_first_token.load() || !heartbeat_running.load()) break; |
There was a problem hiding this comment.
Make the heartbeat wait interruptible
For short streaming responses, the backend can send the first token and finish while this heartbeat thread is still inside the 10-second sleep_for; forward_sse_stream() then joins the thread before sending the final flush/[DONE], so common sub-10-second streams are delayed by up to 10 seconds. Use an interruptible wait such as a condition variable, or otherwise wake the thread when has_first_token/heartbeat_running changes.
Useful? React with 👍 / 👎.
| // indistinguishable from a complete download by other checks alone. | ||
| // The sentinel is written after all files are verified in | ||
| // download_from_huggingface(). | ||
| return safe_exists(marker_dir / ".completed"); |
There was a problem hiding this comment.
Accept legacy completed snapshot directories
Existing directory checkpoints downloaded by previous Lemonade versions do not have the new .completed sentinel, so this now marks otherwise intact RyzenAI/Kokoro/Moonshine/vLLM snapshots as not downloaded. On auto-load, those users will unnecessarily hit Hugging Face again, and cached models can fail to load in offline/no-network use despite having no .download_manifest.json or .partial files. Please migrate/write the sentinel after verifying legacy directories, or keep the previous no-partials fallback for pre-sentinel caches.
Useful? React with 👍 / 👎.
| if (entry.is_directory() && safe_exists(entry.path() / ".completed")) { | ||
| LOG(INFO, "ModelManager") << "Model already downloaded (found .completed sentinel), skipping API call" | ||
| << std::endl; | ||
| return; |
There was a problem hiding this comment.
Verify required variants before .completed fast-return
When do_not_upgrade is true, this returns as soon as any snapshot in the main repo has .completed, without checking the current recipe's required files. If a user previously downloaded the base GGUF from a repo and later loads a recipe from the same repo that needs an extra checkpoint such as mmproj, is_model_downloaded() is false, but this fast path skips the missing file download and the caller marks the model downloaded. Please validate this ModelInfo's resolved checkpoint paths before returning.
Useful? React with 👍 / 👎.
| if (is_therock_installed_for_current_arch(backend_versions)) { | ||
| LOG(DEBUG, "BackendManager") << "TheRock already installed, skipping" << std::endl; | ||
| return; |
There was a problem hiding this comment.
Check all TheRock architectures before skipping
On a multi-AMD-GPU machine, this early return only checks the primary architecture from get_rocm_arch(), but the new code below is intended to install TheRock for every architecture returned by get_rocm_arches(). If the primary/iGPU runtime is already installed and a secondary dGPU architecture is missing, installation exits here and the dGPU still lacks its TheRock libraries. Please skip only when every detected architecture has the expected version.
Useful? React with 👍 / 👎.
| s_recipes_computed = true; | ||
| lock.unlock(); |
There was a problem hiding this comment.
Do not expose half-built recipe cache
Under concurrent startup/status requests, one thread can set s_recipes_computed here and unlock while build_recipes_info() is still running; a second caller then takes the early s_hardware_computed && s_recipes_computed return and observes s_cached_system_info before recipes has been assigned. This can transiently hide recipe/backend availability from /system-info consumers. Keep a separate in-progress state or leave s_recipes_computed false until the assignment completes.
Useful? React with 👍 / 👎.
| echo "⚠️ $label key file not found at: $path" | ||
| return 1 |
There was a problem hiding this comment.
Send missing-key warnings to stderr
When a key file is missing or empty, read_key prints the warning to stdout and returns nonzero, but the caller captures stdout with DEEPSEEK_KEY="$(read_key ...)" || true. That makes the warning text a non-empty secret value, so set_secret uploads it instead of skipping the unset key. Emit diagnostics on stderr or explicitly clear the captured value on failure.
Useful? React with 👍 / 👎.
| pr_agent_job: | ||
| name: PR-Agent (DeepSeek) | ||
| runs-on: ubuntu-latest | ||
| if: ${{ github.event.sender.type != 'Bot' && secrets.DEEPSEEK_API_KEY != '' }} |
There was a problem hiding this comment.
Move secret checks out of job conditions
GitHub's Actions docs state that "Secrets cannot be directly referenced in if: conditionals" (https://docs.github.com/en/actions/how-tos/write-workflows/choose-what-workflows-do/use-secrets), but this job-level condition uses secrets.DEEPSEEK_API_KEY; the same pattern is present in qodo-merge.yml. These newly added review jobs will not be evaluated as intended before any steps run. Use a preliminary check step/job and condition on a non-secret output instead.
Useful? React with 👍 / 👎.
| download_from_manifest(manifest, headers, progress_callback); | ||
| if (fs::exists(manifest_path)) { | ||
| fs::remove(manifest_path); | ||
| } | ||
| return; |
There was a problem hiding this comment.
Mark resumed snapshots complete
When the new manifest-resume path succeeds, it removes the manifest and returns before the normal completion path writes .completed or advances refs. With the new downloaded-state check requiring that sentinel for directory snapshots, a successfully resumed RyzenAI/Kokoro/Moonshine/vLLM download is immediately considered incomplete on the next cache rebuild and can be re-downloaded or fail offline. Continue through the same post-download finalization used by fresh downloads.
Useful? React with 👍 / 👎.
| (void)checkpoint; | ||
|
|
||
| if (type == "npu_cache") continue; | ||
| if (type == "npu_cache" || type == "draft") continue; |
There was a problem hiding this comment.
Keep draft checkpoints required for MTP models
The built-in MTP recipes have a draft checkpoint and the llama.cpp loader still enables --spec-type draft-mtp for models labeled mtp; if the draft file is missing after a partial cache, manual deletion, or cache-first import, this skip lets is_model_downloaded() return true so auto-load will not repair the missing draft. Require the draft checkpoint for recipes that advertise MTP, or avoid enabling MTP defaults when the draft is absent.
Useful? React with 👍 / 👎.
superm1
left a comment
There was a problem hiding this comment.
SO MUCH unrelated work. Please tear out anything but what you're trying to fix.
Description
Fixes #2413 — TheRock was downloaded twice when hot-swapping multiple ROCm backends.
Root cause
`install_therock_if_needed()` checked `will_install_therock()` to determine whether TheRock should be installed, but did not check `is_therock_installed_for_current_arch()` before calling `install_therock()`. When `lemonade config set` triggered hot-swaps for both `llamacpp:rocm` and `sd-cpp:rocm`, each backend independently triggered a full 3 GB TheRock download.
Fix
Added `is_therock_installed_for_current_arch()` check at the top of `install_therock_if_needed()`, so the function exits early when TheRock is already installed for the current GPU architecture.