Skip to content

fix(desktop): macOS/Linux pre-CEF stale-lock reap + installer pre-install kill (#4395)#4405

Open
M3gA-Mind wants to merge 1 commit into
tinyhumansai:mainfrom
M3gA-Mind:fix/GH-4395-crossplatform-single-instance-guard
Open

fix(desktop): macOS/Linux pre-CEF stale-lock reap + installer pre-install kill (#4395)#4405
M3gA-Mind wants to merge 1 commit into
tinyhumansai:mainfrom
M3gA-Mind:fix/GH-4395-crossplatform-single-instance-guard

Conversation

@M3gA-Mind

@M3gA-Mind M3gA-Mind commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Summary

Problem

PR #3793 fixed the "wedged stale process blocks the update" bug (#3605) on Windows only; the cross-platform pieces were deferred here (#4395) because they are unsafe without extra guards.

On Windows a Win32 named mutex early in run() forces any concurrent peer to exit(0) before the reap, so a survivor holding the CEF cache lock is provably wedged. macOS/Linux have no equivalent pre-CEF guard (tauri_plugin_single_instance registers later, after CEF init), so a process holding the CEF SingletonLock past the cache-wait budget may be a healthy running primary — a live app holds its lock for its whole lifetime. Reaping it purely on "held N seconds" could SIGKILL a healthy instance and lose user data (exactly why the earlier macOS SIGKILL was reverted, flagged by @oxoxDev + Codex P1).

Solution

  • cef_stale_reap (macOS/Linux), new module. The updater writes a relaunch marker immediately before app.restart(); the freshly launched process consumes it (once, deleted, age-bounded to 5 min). Only a fresh marker + a live CEF-lock holder is treated as stale — precisely the post-update wedge, never a normal launch of a healthy app. Extra guards: never reap self; skip when the lock's <hostname>-<pid> host ≠ ours (NFS/networked-home pid-collision safety); SIGTERM, grace, then re-read the lock to confirm the same pid still owns it and is alive before SIGKILL.
  • Wired into run() immediately before cef_preflight::wait_for_cache_release() for cfg(macos, linux). Complementary to process_recovery::reap_stale_openhuman_processes, which deliberately skips when a live CEF-lock holder exists.
  • Installer (Part 2): bundle.windows.nsis.installerHooksnsis-hooks.nsh with NSIS_HOOK_PREINSTALL running taskkill /F /T on openhuman-core.exe and the main binary before files are copied.
  • Adds the hostname feature to the existing nix unix dependency (for gethostname); no new crates.
  • Explicitly out of scope / untouched: the Windows runtime reap hardening (sibling Windows: scoped-safe startup reap of wedged stale process (follow-up to #3605) #3900).

Submission Checklist

If a section does not apply to this change, mark the item as N/A with a one-line reason. Do not delete items.

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy — unit tests for the pure decision matrix (incl. the "no reap without a fresh marker even with a live holder" safety property), marker freshness/path, and mac/linux glue (lock-holder parse, pid re-validation, write→consume round-trip, safe-noop orchestrator run).
  • Diff coverage ≥ 80%confirmed by CI: the Rust Tauri Coverage (cargo-llvm-cov) job passed (green) on this PR, which builds + tests app/src-tauri and enforces the ≥ 80% changed-line coverage gate. (The reap_pid SIGKILL branch itself is a thin OS-kill wrapper; the decision matrix that drives it is fully unit-tested.)
  • Coverage matrix updated — N/A: no matrix feature row for this platform-lifecycle guard (behaviour is startup-recovery internal, not a user-facing feature ID).
  • All affected feature IDs from the matrix are listed in the PR description under ## RelatedN/A: no feature IDs affected.
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • Manual smoke checklist updated if this touches release-cut surfaces — N/A: no smoke-doc change; verification is the real-host matrix noted under Impact.
  • Linked issue closed via Closes #NNN in the ## Related section

Impact

  • Platform: desktop macOS/Linux (runtime reap, cfg-gated) + Windows (NSIS installer hook). No mobile/web/CLI behaviour change.
  • Security/safety: the whole design centers on not killing a healthy instance; kills are gated on a post-update marker + host match + pid re-validation, SIGTERM-first.
  • Verification: the cfg(...)-gated runtime paths and the NSIS hook must be verified on real macOS, Linux, and Windows(installer) hosts — they cannot be exercised cross-platform on a single dev machine.

Related


AI Authored PR Metadata (required for Codex/Linear PRs)

Keep this section for AI-authored PRs. For human-only PRs, mark each field N/A.

Linear Issue

Commit & Branch

  • Branch: fix/GH-4395-crossplatform-single-instance-guard
  • Commit SHA: 3450adc

Validation Run

  • pnpm --filter openhuman-app format:checkN/A: frontend (app/src) untouched; the diff is app/src-tauri only (the Frontend Checks lane is correctly skipped for this PR).
  • pnpm typecheckN/A: frontend untouched (same reason as above).
  • Focused tests — cef_stale_reap unit tests compiled and ran green in CI: the passing Rust Tauri Coverage (cargo-llvm-cov) job builds and executes the app/src-tauri test suite.
  • Rust fmt/check (if changed) — N/A: root core crate (src/) unchanged; the Rust Quality (fmt, clippy) lane passed on this PR.
  • Tauri fmt/check (if changed) — cargo check for app/src-tauri compiles clean, verified green in CI by the passing Rust Tauri Coverage job that builds this crate.

Validation Blocked

  • command: real-host runtime verification of the cfg-gated reap (macOS/Linux) and the NSIS NSIS_HOOK_PREINSTALL (Windows installer).
  • status: the app/src-tauri build, unit tests, and diff-coverage gate are now verified green in CI (Rust Tauri Coverage); what remains is the platform-gated runtime behaviour, which cannot be exercised cross-platform on a single dev host.
  • impact: cfg(windows)/cfg(macos,linux) runtime paths and the NSIS hook are platform-gated and require real macOS/Linux/Windows(installer) hosts to observe end-to-end.

Behavior Changes

  • Intended behavior change: after an update on macOS/Linux, a wedged prior instance holding the CEF cache lock is reaped so the new app starts, instead of the new app silently timing out and exiting.
  • User-visible effect: updates apply cleanly instead of "app won't open until I kill the old process manually"; no change on a normal (non-update) launch.

Parity Contract

  • Legacy behavior preserved: normal launches and healthy running instances are never touched (no marker → no reap); cef_preflight::wait_for_cache_release remains the fallback.
  • Guard/fallback/dispatch parity checks: mirrors the Windows pre-CEF reap intent, but gated on a staleness signal + host match + pid re-validation instead of the Win32 mutex; Windows reap (Windows: scoped-safe startup reap of wedged stale process (follow-up to #3605) #3900) untouched.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): N/A
  • Canonical PR: this PR
  • Resolution: N/A

Summary by CodeRabbit

  • New Features

    • Improved app updates and relaunch behavior on macOS and Linux, helping avoid stuck previous instances after updating.
    • Windows installer now closes running app processes before installation to reduce file-lock issues.
  • Bug Fixes

    • Better handling of browser-related startup locks, making launches more reliable after an update.
    • Added support for detecting the current machine name on Unix-like systems to improve update recovery.

…tall kill (tinyhumansai#4395)

Follow-up to tinyhumansai#3605; the Windows-only runtime reap landed in tinyhumansai#3793. macOS and
Linux had no pre-CEF single-instance guard, so a process holding the CEF
SingletonLock past the cache-wait budget could be a healthy running primary,
not a wedged one — the reason the earlier macOS SIGKILL escalation was reverted.

Part 1: add `cef_stale_reap` (macOS/Linux), a pre-CEF reap of a wedged prior
instance gated on a post-update relaunch marker (the staleness signal from
tinyhumansai#4395). The updater writes the marker before `app.restart()`; the relaunched
process consumes it (once, deleted, age-bounded). Only a fresh marker + a live
lock holder triggers a reap. Guards: skip self, skip cross-host holders, and
SIGTERM → grace → re-validate the same pid still owns the lock → SIGKILL
(closes the PID-reuse window, honors kill_pid_force's contract). Never reaps a
healthy instance. Wired into run() before cef_preflight::wait_for_cache_release.

Part 2: add an NSIS NSIS_HOOK_PREINSTALL that terminates openhuman-core.exe and
the main binary tree before any file copy, so a stale prior-version process
can't survive an update.

Adds the nix `hostname` feature for gethostname. Pure decision logic is unit
tested on any host; the cfg-gated runtime/installer paths need verification on
real macOS, Linux, and Windows(installer) hosts.
@M3gA-Mind M3gA-Mind requested a review from a team July 2, 2026 10:50
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a macOS/Linux-only pre-CEF stale lock reap mechanism (marker-based staleness detection, ownership revalidation, SIGTERM/SIGKILL escalation) wired into update install and startup flow, enables the nix hostname feature, and adds a Windows NSIS pre-install hook to kill prior OpenHuman processes, wired via tauri.conf.json.

Changes

macOS/Linux stale CEF lock reap

Layer / File(s) Summary
Reap decision model
app/src-tauri/src/cef_stale_reap.rs
Defines constants, ReapDecision, and decide_reap gating logic on self/host/marker recency, plus marker freshness/path helpers.
Marker write/consume
app/src-tauri/src/cef_stale_reap.rs
Resolves CEF cache path, writes a PID marker post-update, and consumes (reads then deletes) it once per launch.
Reap orchestrator + nix feature
app/src-tauri/src/cef_stale_reap.rs, app/src-tauri/Cargo.toml
Implements the orchestrator that reads the SingletonLock, applies decide_reap, and escalates SIGTERM→revalidate→SIGKILL; enables the hostname feature on nix.
Startup/update wiring
app/src-tauri/src/lib.rs
Declares the module, writes the relaunch marker after update install, and invokes the reap step before waiting for cache release at startup.
Unit tests
app/src-tauri/src/cef_stale_reap.rs
Covers decision logic, marker semantics, lock holder parsing, and orchestrator no-op behavior.

Windows installer pre-install process kill

Layer / File(s) Summary
NSIS pre-install kill hook and config wiring
app/src-tauri/nsis-hooks.nsh, app/src-tauri/tauri.conf.json
Adds NSIS_HOOK_PREINSTALL macro that taskkills OpenHuman processes before install, registered via bundle.windows.nsis.installerHooks.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant AppStartup as run()
  participant Reap as cef_stale_reap
  participant LockFile as SingletonLock
  participant Marker as relaunch marker
  participant PriorProc as prior process

  AppStartup->>Reap: reap_stale_cef_lock_holder()
  Reap->>LockFile: read symlink for (host, pid)
  Reap->>Marker: consume_marker()
  Reap->>Reap: decide_reap(holder, marker, self, host)
  alt should reap
    Reap->>PriorProc: SIGTERM
    Reap->>Reap: wait TERM_GRACE
    Reap->>LockFile: holder_still_owns_lock()
    Reap->>PriorProc: SIGKILL
  else skip
    Reap-->>AppStartup: defer to cef_preflight
  end
  AppStartup->>AppStartup: wait_for_cache_release()
Loading

Suggested labels: rust-core, bug

Poem

A lock held stale, a ghost of before,
I sniff out the marker behind the door,
SIGTERM whispered, a grace-time pause,
Then taskkill stomps with rabbit-paws.
Hop, hop — the update installs clean at last! 🐇🔑

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the macOS/Linux stale-lock reap and installer pre-install kill changes.
Linked Issues check ✅ Passed The PR implements the guarded macOS/Linux reap, marker-based staleness signal, PID revalidation, and installer pre-install kill required by #4395.
Out of Scope Changes check ✅ Passed The nix hostname feature addition supports the new hostname-based lock-holder checks and is in scope.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot added bug rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. labels Jul 2, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3450adc113

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread app/src-tauri/src/lib.rs
Comment on lines +2741 to +2742
#[cfg(any(target_os = "macos", target_os = "linux"))]
cef_stale_reap::reap_stale_cef_lock_holder();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Don't reap before the normal relaunch grace period

On macOS/Linux post-update relaunches where app.restart() starts the new process while the old CEF runtime is still tearing down, a fresh marker plus a live SingletonLock holder is an expected transient state—the existing preflight comment above calls this the common sequential relaunch race and waits for it to clear. Running the marker-gated reaper before that wait means an otherwise healthy prior instance gets SIGTERM/SIGKILL as soon as the new process reaches startup, rather than only when it is actually wedged, reintroducing forced termination during ordinary updates.

Useful? React with 👍 / 👎.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
app/src-tauri/nsis-hooks.nsh (1)

19-29: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Consider a brief wait after force-kill to avoid handle-release races.

taskkill /F returns once termination is requested, not necessarily once the OS has fully released open handles/DLL locks. Given the whole point of this hook is to prevent a "stale process still holding files" scenario, a short Sleep (e.g. 500ms–1s) after the two taskkill calls would reduce the residual race window before file copy begins.

💡 Optional hardening
   nsExec::Exec 'taskkill /F /IM ${MAINBINARYNAME}.exe /T'
   Pop $0
+  ; Give the OS a moment to fully release file handles after force-kill.
+  Sleep 500
 !macroend
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src-tauri/nsis-hooks.nsh` around lines 19 - 29, The NSIS preinstall hook
in NSIS_HOOK_PREINSTALL should add a short delay after force-killing OpenHuman
processes to avoid handle-release races before file replacement starts. Update
the macro so that after the two nsExec::Exec taskkill calls (for
openhuman-core.exe and ${MAINBINARYNAME}.exe), it pauses briefly with Sleep to
give the OS time to fully release locks and handles. Keep the change localized
to NSIS_HOOK_PREINSTALL and preserve the existing process-kill sequence.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/src-tauri/src/cef_stale_reap.rs`:
- Around line 151-152: The stale-reap flow currently trusts marker freshness
without tying it to the current lock owner, so update the logic around
consume_marker(), SingletonLock handling, and the marker write/read path to
require the parsed marker PID to match the live lock-holder PID before any
reaping occurs. In the code that builds the marker body and in the later reap
checks, propagate and compare the marker’s PID against the PID associated with
the active SingletonLock, and only proceed when both freshness and PID identity
match.
- Around line 101-105: The stale-lock reap logic in cef_stale_reap.rs should
fail closed when host identity cannot be confirmed and should revalidate the
full lock owner instead of only the PID. Update the host check around the
local_host/host comparison to skip reaping whenever the local hostname is
unavailable, and make the SIGKILL revalidation and any lock-owner checks compare
both host and pid as a pair. Apply the same host+pid validation consistently in
the related reap decision paths (including the helpers around the other marked
sections) so a reused PID on another host cannot be mistaken for the same lock
owner.

---

Nitpick comments:
In `@app/src-tauri/nsis-hooks.nsh`:
- Around line 19-29: The NSIS preinstall hook in NSIS_HOOK_PREINSTALL should add
a short delay after force-killing OpenHuman processes to avoid handle-release
races before file replacement starts. Update the macro so that after the two
nsExec::Exec taskkill calls (for openhuman-core.exe and ${MAINBINARYNAME}.exe),
it pauses briefly with Sleep to give the OS time to fully release locks and
handles. Keep the change localized to NSIS_HOOK_PREINSTALL and preserve the
existing process-kill sequence.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e50de52b-a491-44d2-b315-2a93baf73034

📥 Commits

Reviewing files that changed from the base of the PR and between f979bfa and 3450adc.

📒 Files selected for processing (5)
  • app/src-tauri/Cargo.toml
  • app/src-tauri/nsis-hooks.nsh
  • app/src-tauri/src/cef_stale_reap.rs
  • app/src-tauri/src/lib.rs
  • app/src-tauri/tauri.conf.json

Comment on lines +101 to +105
if let Some(local) = local_host {
if local != host {
return ReapDecision::Skip("CEF lock holder is on a different host");
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Fail closed on host identity and revalidate the full lock owner.

The safety model depends on skipping cross-host holders, but local_host == None currently permits a reap, and SIGKILL revalidation checks only PID. On shared/NFS cache paths, <host>-<pid> can change to another host with the same numeric PID; compare (host, pid) and skip if local hostname cannot be resolved.

Suggested direction
-    if let Some(local) = local_host {
-        if local != host {
-            return ReapDecision::Skip("CEF lock holder is on a different host");
-        }
+    let Some(local) = local_host else {
+        return ReapDecision::Skip("local hostname unresolved; cannot prove same-host lock holder");
+    };
+    if local != host {
+        return ReapDecision::Skip("CEF lock holder is on a different host");
     }
-    fn holder_still_owns_lock(lock_path: &Path, expected_pid: i32) -> bool {
+    fn holder_still_owns_lock(lock_path: &Path, expected_host: &str, expected_pid: i32) -> bool {
@@
-            Some((_, pid)) => pid == expected_pid && cef_preflight::is_pid_alive(pid),
+            Some((host, pid)) => {
+                host == expected_host && pid == expected_pid && cef_preflight::is_pid_alive(pid)
+            }
             None => false,
         }
     }

Also applies to: 112-112, 259-260, 482-489

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src-tauri/src/cef_stale_reap.rs` around lines 101 - 105, The stale-lock
reap logic in cef_stale_reap.rs should fail closed when host identity cannot be
confirmed and should revalidate the full lock owner instead of only the PID.
Update the host check around the local_host/host comparison to skip reaping
whenever the local hostname is unavailable, and make the SIGKILL revalidation
and any lock-owner checks compare both host and pid as a pair. Apply the same
host+pid validation consistently in the related reap decision paths (including
the helpers around the other marked sections) so a reused PID on another host
cannot be mistaken for the same lock owner.

Comment on lines +151 to +152
let body = format!("pid={}\n", std::process::id());
match std::fs::write(&marker, body) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Bind the marker to the lock-holder PID before reaping.

The marker body records the old process PID, but consume_marker() only returns freshness. A leaked fresh marker could authorize killing whichever same-host process currently owns SingletonLock; require the parsed marker PID to match the live lock-holder PID.

Suggested direction
-        let marker_fresh = consume_marker(&marker_path);
+        let marker_pid = consume_marker(&marker_path);

         let holder = live_lock_holder(&lock_path);
@@
-        match decide_reap(holder, self_pid, local_host.as_deref(), marker_fresh) {
+        match decide_reap(holder, self_pid, local_host.as_deref(), marker_pid) {
-fn consume_marker(marker_path: &Path) -> bool {
-    let fresh = match std::fs::metadata(marker_path) {
+fn consume_marker(marker_path: &Path) -> Option<i32> {
+    let marker_pid = match std::fs::metadata(marker_path) {
         Ok(meta) => {
@@
-            fresh
+            if fresh {
+                std::fs::read_to_string(marker_path)
+                    .ok()
+                    .and_then(|body| body.trim().strip_prefix("pid=")?.parse::<i32>().ok())
+                    .filter(|pid| *pid > 0)
+            } else {
+                None
+            }
         }
-        Err(_) => false,
+        Err(_) => None,
     };
@@
-    fresh
+    marker_pid
 }

Also applies to: 199-205, 277-289

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src-tauri/src/cef_stale_reap.rs` around lines 151 - 152, The stale-reap
flow currently trusts marker freshness without tying it to the current lock
owner, so update the logic around consume_marker(), SingletonLock handling, and
the marker write/read path to require the parsed marker PID to match the live
lock-holder PID before any reaping occurs. In the code that builds the marker
body and in the later reap checks, propagate and compare the marker’s PID
against the PID associated with the active SingletonLock, and only proceed when
both freshness and PID identity match.

@M3gA-Mind

Copy link
Copy Markdown
Collaborator Author

Fixed the PR Submission Checklist lane (the only red check). The parser (scripts/check-pr-checklist.mjs) hard-fails on any - [ ] box, and 6 were left unchecked pending CI. Those results have since landed green, so I ticked them honestly against real CI evidence — no faked/unrun items:

  • Diff coverage ≥ 80% → confirmed by the passing Rust Tauri Coverage (cargo-llvm-cov) job (builds + tests app/src-tauri and enforces the changed-line gate).
  • Focused tests / Tauri fmt/checkcef_stale_reap tests + app/src-tauri build verified green by that same CI job.
  • format:check, typecheck, Rust fmt/check (core) → marked N/A with reasons: this PR touches only app/src-tauri, so frontend (app/src) and root core (src/) are untouched (Frontend Checks correctly skipped; Rust Quality passed).

Verified locally: node scripts/check-pr-checklist.mjs on the new body reports 12/12 items satisfied (0 unchecked), exit 0. Also refreshed the Validation Blocked note so it reflects that only platform-gated runtime verification remains. No code changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure.

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

macOS/Linux pre-CEF single-instance guard + reap, and installer pre-install process kill (follow-up to #3605)

1 participant