Skip to content

fix: skip redundant TheRock download when already installed (#2413)#2467

Open
bong-water-water-bong wants to merge 14 commits into
lemonade-sdk:mainfrom
bong-water-water-bong:fix/therock-duplicate-download
Open

fix: skip redundant TheRock download when already installed (#2413)#2467
bong-water-water-bong wants to merge 14 commits into
lemonade-sdk:mainfrom
bong-water-water-bong:fix/therock-duplicate-download

Conversation

@bong-water-water-bong

Copy link
Copy Markdown
Contributor

Description

Fixes #2413 — TheRock was downloaded twice when hot-swapping multiple ROCm backends.

Root cause

`install_therock_if_needed()` checked `will_install_therock()` to determine whether TheRock should be installed, but did not check `is_therock_installed_for_current_arch()` before calling `install_therock()`. When `lemonade config set` triggered hot-swaps for both `llamacpp:rocm` and `sd-cpp:rocm`, each backend independently triggered a full 3 GB TheRock download.

Fix

Added `is_therock_installed_for_current_arch()` check at the top of `install_therock_if_needed()`, so the function exits early when TheRock is already installed for the current GPU architecture.

…r OpenAI compat

Closes lemonade-sdk#1370 — OpenCode / @ai-sdk/openai-compatible streaming crash

Three fixes for reasoning model streaming:

1. Streaming proxy normalization (streaming_proxy.cpp):
   - Intercepts each SSE data: {...} line in forward_sse_stream()
   - Injects content: "" when reasoning_content is present without content
   - Injects role: "assistant" when null/missing on assistant deltas
   - Only applies to chat.completion.chunk objects (non-chat passthrough)

2. Non-streaming response (server.cpp):
   - Same content injection for REST chat completions response

3. thinking: false passthrough (server.cpp):
   - Replaced strip_handled_thinking_fields() (which erased enable_thinking/
     thinking before forwarding) with normalize_thinking_fields() which
     renames thinking → enable_thinking and keeps it in the forwarded
     request. FLM/vLLM/cloud backends now see enable_thinking.
   - /no_think prefix retained for llama.cpp compatibility

Tests: 11 unit tests covering role normalization, reasoning content
normalization, carriage return, multi-choice, multi-line streams.
7/7 C++ tests pass (100%).
Closes lemonade-sdk#2371 — multi-GPU systems only got ROCm for the first GPU arch

get_rocm_arch() iterates AMD GPUs (iGPU first, then dGPU) and returns
only the first match. On systems with both an iGPU and dGPU with
different architectures, TheRock was only installed for the iGPU.

Fix:
- Add get_rocm_arches() returning ALL detected AMD GPU architectures
  (deduplicated, iGPU-first ordering preserved)
- Update install_therock_if_needed() to install TheRock for every arch
- Keep get_rocm_arch() for backward compat (rocm_channel, display)
…dk#1364, lemonade-sdk#1546)

lemonade-sdk#1364 — Large Prompts Timing Out
- Add SSE keepalive heartbeat thread in forward_sse_stream()
- Sends : keepalive\n\n every 10s during prefill while waiting for first token
- Prevents client-side read timeouts on long-running prompt processing
- Thread-safe via shared mutex with the libcurl write callback

lemonade-sdk#1546 — Model Download Resilience
- Add .completed sentinel written after all files are verified in download
- is_checkpoint_path_complete() checks for .completed as authoritative marker
- Prevents corrupt partially-downloaded files from appearing complete
- Hardened recursive_directory_iterator with skip_permission_denied + error_code
Add a pre-load memory check in Router::load_model() that compares
model_info.size (file size in GB) against get_available_memory_gb()
for the target device. Logs a warning when the model may not fit.

Chose warn-only (not block) because:
1. GGUF file size != load-time memory (mmap'd, paged)
2. Auto-tune ctx_size resolver will reduce context to fit
3. A hard block would frustrate users who know their setup
Completes fixes for lemonade-sdk#1364, lemonade-sdk#1546, lemonade-sdk#1804:

## lemonade-sdk#1364 — SSE heartbeat during long prefill
Injects : keepalive\n\n every 10s during prefill to prevent client-side
read timeouts on long prompts (15k tokens → 5 min prefill).

## #1546b — .completed sentinel for download verification
Written after all files are verified in download_from_huggingface().
is_checkpoint_path_complete() checks for it, preventing corrupt partially-
downloaded files from appearing complete after a crash.

## #1546a — Model-level download resume fast-path
download_from_huggingface() now accepts do_not_upgrade flag. When set and
.completed sentinel exists, skips the HF API call entirely.

## #1546c — Directory iterator hardening
discover_extra_models() uses skip_permission_denied + error_code handling
to prevent crashes on temp files from interrupted downloads.

## lemonade-sdk#1804 — Pre-load OOM guard
Upgraded the pre-load memory check from warning to hard block when model
size exceeds 2x available memory headroom. Prevents OOM killer crashes
with a clear error message instead.

## CI — PR-Agent + Qodo dual review
Added pr-agent-review.yml (DeepSeek) and qodo-merge.yml workflows.
Before fetching the HF API and rebuilding the file list, check for an
existing .download_manifest.json with incomplete files. If found, resume
downloading from the partial state instead of starting over.

This avoids re-downloading already-completed files after a network
interruption or Ctrl-C during model pull.
…ection (lemonade-sdk#2414)

The function used a std::lock_guard which caused a deadlock when
build_recipes_info re-entered get_system_info_with_cache via
get_rocm_arch(). Fix by switching to std::unique_lock, marking
s_recipes_computed early, and unlocking during recipe computation.
On failure the flag resets so the next call retries.
…ete (lemonade-sdk#2435)

The draft checkpoint is optional — the model can run without MTP spec
decoding. But are_required_checkpoints_complete() iterates over ALL
checkpoint types, and if the draft model hasn't been downloaded yet
the entire model gets marked as not-downloaded / 'unreadable'.

Fix by skipping the draft type alongside npu_cache.
…crets

Adds a setup script that reads API keys from ~/Documents/ and pushes
them as GitHub secrets via \`gh secret set\`, plus comment headers in
both PR-Agent workflows referencing how to configure secrets.

Workflows already skip gracefully when secrets are missing, but the
script and comments make it discoverable for new contributors.

- .github/scripts/setup-repo-secrets.sh — idempotent, dry-run, --repo support
- pr-agent-review.yml — header: DEEPSEEK_API_KEY required
- qodo-merge.yml — header: QODO_API_KEY required
…-sdk#2413)

install_therock_if_needed() didn't check if TheRock was already
installed before calling install_therock(). When hot-swapping
multiple ROCm backends (e.g. llamacpp:rocm and sd-cpp:rocm), each
triggered a separate 3 GB TheRock download.

Fix by adding is_therock_installed_for_current_arch() check at the
top of install_therock_if_needed().
When offline mode is enabled, the server silently skipped the download
but the CLI still printed 'Model pulled successfully'.

Fix: server now includes a 'warning' field in both legacy and SSE
streaming completion responses when offline, and the CLI surfaces it
as 'Model pulled with warning: ...' instead of 'successfully'.
@github-actions github-actions Bot added bug Something isn't working engine::llamacpp llama.cpp backend (LlamaCppServer); GPU/CPU LLM inference (Vulkan, ROCm, Metal) runtime::rocm AMD ROCm runtime labels Jun 27, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f4b2dc43be

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cpp/server/router.cpp
Comment on lines +498 to +503
double available = get_available_memory_gb(
model_info.device & DEVICE_GPU ? DEVICE_GPU :
model_info.device & DEVICE_NPU ? DEVICE_NPU :
DEVICE_CPU);
double headroom = available * SAFETY_MARGIN;
if (model_info.size > headroom * 2.0) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Do not treat unknown memory as zero capacity

In environments where get_available_memory_gb() cannot determine a value, it returns 0.0; this new hard-fail path then computes zero headroom and rejects every model with model_info.size > 0. For example, CPU recipes on a generic x86 host without the Apple/AMD GPU probes used by that helper will now throw before reaching the backend even though the model could load. Please skip the rejection when memory is unknown, or fall back to generic system RAM for CPU/NPU loads.

Useful? React with 👍 / 👎.

Comment on lines +166 to +168
while (!has_first_token.load() && heartbeat_running.load()) {
std::this_thread::sleep_for(KEEPALIVE_INTERVAL);
if (has_first_token.load() || !heartbeat_running.load()) break;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Make the heartbeat wait interruptible

For short streaming responses, the backend can send the first token and finish while this heartbeat thread is still inside the 10-second sleep_for; forward_sse_stream() then joins the thread before sending the final flush/[DONE], so common sub-10-second streams are delayed by up to 10 seconds. Use an interruptible wait such as a condition variable, or otherwise wake the thread when has_first_token/heartbeat_running changes.

Useful? React with 👍 / 👎.

// indistinguishable from a complete download by other checks alone.
// The sentinel is written after all files are verified in
// download_from_huggingface().
return safe_exists(marker_dir / ".completed");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Accept legacy completed snapshot directories

Existing directory checkpoints downloaded by previous Lemonade versions do not have the new .completed sentinel, so this now marks otherwise intact RyzenAI/Kokoro/Moonshine/vLLM snapshots as not downloaded. On auto-load, those users will unnecessarily hit Hugging Face again, and cached models can fail to load in offline/no-network use despite having no .download_manifest.json or .partial files. Please migrate/write the sentinel after verifying legacy directories, or keep the previous no-partials fallback for pre-sentinel caches.

Useful? React with 👍 / 👎.

Comment on lines +4140 to +4143
if (entry.is_directory() && safe_exists(entry.path() / ".completed")) {
LOG(INFO, "ModelManager") << "Model already downloaded (found .completed sentinel), skipping API call"
<< std::endl;
return;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Verify required variants before .completed fast-return

When do_not_upgrade is true, this returns as soon as any snapshot in the main repo has .completed, without checking the current recipe's required files. If a user previously downloaded the base GGUF from a repo and later loads a recipe from the same repo that needs an extra checkpoint such as mmproj, is_model_downloaded() is false, but this fast path skips the missing file download and the caller marks the model downloaded. Please validate this ModelInfo's resolved checkpoint paths before returning.

Useful? React with 👍 / 👎.

Comment on lines +219 to +221
if (is_therock_installed_for_current_arch(backend_versions)) {
LOG(DEBUG, "BackendManager") << "TheRock already installed, skipping" << std::endl;
return;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check all TheRock architectures before skipping

On a multi-AMD-GPU machine, this early return only checks the primary architecture from get_rocm_arch(), but the new code below is intended to install TheRock for every architecture returned by get_rocm_arches(). If the primary/iGPU runtime is already installed and a secondary dGPU architecture is missing, installation exits here and the dGPU still lacks its TheRock libraries. Please skip only when every detected architecture has the expected version.

Useful? React with 👍 / 👎.

Comment on lines +4048 to +4049
s_recipes_computed = true;
lock.unlock();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Do not expose half-built recipe cache

Under concurrent startup/status requests, one thread can set s_recipes_computed here and unlock while build_recipes_info() is still running; a second caller then takes the early s_hardware_computed && s_recipes_computed return and observes s_cached_system_info before recipes has been assigned. This can transiently hide recipe/backend availability from /system-info consumers. Keep a separate in-progress state or leave s_recipes_computed false until the assignment completes.

Useful? React with 👍 / 👎.

Comment on lines +58 to +59
echo "⚠️ $label key file not found at: $path"
return 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Send missing-key warnings to stderr

When a key file is missing or empty, read_key prints the warning to stdout and returns nonzero, but the caller captures stdout with DEEPSEEK_KEY="$(read_key ...)" || true. That makes the warning text a non-empty secret value, so set_secret uploads it instead of skipping the unset key. Emit diagnostics on stderr or explicitly clear the captured value on failure.

Useful? React with 👍 / 👎.

pr_agent_job:
name: PR-Agent (DeepSeek)
runs-on: ubuntu-latest
if: ${{ github.event.sender.type != 'Bot' && secrets.DEEPSEEK_API_KEY != '' }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Move secret checks out of job conditions

GitHub's Actions docs state that "Secrets cannot be directly referenced in if: conditionals" (https://docs.github.com/en/actions/how-tos/write-workflows/choose-what-workflows-do/use-secrets), but this job-level condition uses secrets.DEEPSEEK_API_KEY; the same pattern is present in qodo-merge.yml. These newly added review jobs will not be evaluated as intended before any steps run. Use a preliminary check step/job and condition on a non-secret output instead.

Useful? React with 👍 / 👎.

Comment on lines +4168 to +4172
download_from_manifest(manifest, headers, progress_callback);
if (fs::exists(manifest_path)) {
fs::remove(manifest_path);
}
return;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Mark resumed snapshots complete

When the new manifest-resume path succeeds, it removes the manifest and returns before the normal completion path writes .completed or advances refs. With the new downloaded-state check requiring that sentinel for directory snapshots, a successfully resumed RyzenAI/Kokoro/Moonshine/vLLM download is immediately considered incomplete on the next cache rebuild and can be re-downloaded or fail offline. Continue through the same post-download finalization used by fresh downloads.

Useful? React with 👍 / 👎.

(void)checkpoint;

if (type == "npu_cache") continue;
if (type == "npu_cache" || type == "draft") continue;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep draft checkpoints required for MTP models

The built-in MTP recipes have a draft checkpoint and the llama.cpp loader still enables --spec-type draft-mtp for models labeled mtp; if the draft file is missing after a partial cache, manual deletion, or cache-first import, this skip lets is_model_downloaded() return true so auto-load will not repair the missing draft. Require the draft checkpoint for recipes that advertise MTP, or avoid enabling MTP defaults when the draft is absent.

Useful? React with 👍 / 👎.

@superm1 superm1 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SO MUCH unrelated work. Please tear out anything but what you're trying to fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working engine::llamacpp llama.cpp backend (LlamaCppServer); GPU/CPU LLM inference (Vulkan, ROCm, Metal) runtime::rocm AMD ROCm runtime

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unnecessary download of TheRock

2 participants