fix(desktop): macOS/Linux pre-CEF stale-lock reap + installer pre-install kill (#4395)#4405
Conversation
…tall kill (tinyhumansai#4395) Follow-up to tinyhumansai#3605; the Windows-only runtime reap landed in tinyhumansai#3793. macOS and Linux had no pre-CEF single-instance guard, so a process holding the CEF SingletonLock past the cache-wait budget could be a healthy running primary, not a wedged one — the reason the earlier macOS SIGKILL escalation was reverted. Part 1: add `cef_stale_reap` (macOS/Linux), a pre-CEF reap of a wedged prior instance gated on a post-update relaunch marker (the staleness signal from tinyhumansai#4395). The updater writes the marker before `app.restart()`; the relaunched process consumes it (once, deleted, age-bounded). Only a fresh marker + a live lock holder triggers a reap. Guards: skip self, skip cross-host holders, and SIGTERM → grace → re-validate the same pid still owns the lock → SIGKILL (closes the PID-reuse window, honors kill_pid_force's contract). Never reaps a healthy instance. Wired into run() before cef_preflight::wait_for_cache_release. Part 2: add an NSIS NSIS_HOOK_PREINSTALL that terminates openhuman-core.exe and the main binary tree before any file copy, so a stale prior-version process can't survive an update. Adds the nix `hostname` feature for gethostname. Pure decision logic is unit tested on any host; the cfg-gated runtime/installer paths need verification on real macOS, Linux, and Windows(installer) hosts.
📝 WalkthroughWalkthroughAdds a macOS/Linux-only pre-CEF stale lock reap mechanism (marker-based staleness detection, ownership revalidation, SIGTERM/SIGKILL escalation) wired into update install and startup flow, enables the nix hostname feature, and adds a Windows NSIS pre-install hook to kill prior OpenHuman processes, wired via tauri.conf.json. ChangesmacOS/Linux stale CEF lock reap
Windows installer pre-install process kill
Estimated code review effort: 4 (Complex) | ~60 minutes Sequence Diagram(s)sequenceDiagram
participant AppStartup as run()
participant Reap as cef_stale_reap
participant LockFile as SingletonLock
participant Marker as relaunch marker
participant PriorProc as prior process
AppStartup->>Reap: reap_stale_cef_lock_holder()
Reap->>LockFile: read symlink for (host, pid)
Reap->>Marker: consume_marker()
Reap->>Reap: decide_reap(holder, marker, self, host)
alt should reap
Reap->>PriorProc: SIGTERM
Reap->>Reap: wait TERM_GRACE
Reap->>LockFile: holder_still_owns_lock()
Reap->>PriorProc: SIGKILL
else skip
Reap-->>AppStartup: defer to cef_preflight
end
AppStartup->>AppStartup: wait_for_cache_release()
Suggested labels: Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3450adc113
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| #[cfg(any(target_os = "macos", target_os = "linux"))] | ||
| cef_stale_reap::reap_stale_cef_lock_holder(); |
There was a problem hiding this comment.
Don't reap before the normal relaunch grace period
On macOS/Linux post-update relaunches where app.restart() starts the new process while the old CEF runtime is still tearing down, a fresh marker plus a live SingletonLock holder is an expected transient state—the existing preflight comment above calls this the common sequential relaunch race and waits for it to clear. Running the marker-gated reaper before that wait means an otherwise healthy prior instance gets SIGTERM/SIGKILL as soon as the new process reaches startup, rather than only when it is actually wedged, reintroducing forced termination during ordinary updates.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
app/src-tauri/nsis-hooks.nsh (1)
19-29: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winConsider a brief wait after force-kill to avoid handle-release races.
taskkill /Freturns once termination is requested, not necessarily once the OS has fully released open handles/DLL locks. Given the whole point of this hook is to prevent a "stale process still holding files" scenario, a shortSleep(e.g. 500ms–1s) after the two taskkill calls would reduce the residual race window before file copy begins.💡 Optional hardening
nsExec::Exec 'taskkill /F /IM ${MAINBINARYNAME}.exe /T' Pop $0 + ; Give the OS a moment to fully release file handles after force-kill. + Sleep 500 !macroend🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src-tauri/nsis-hooks.nsh` around lines 19 - 29, The NSIS preinstall hook in NSIS_HOOK_PREINSTALL should add a short delay after force-killing OpenHuman processes to avoid handle-release races before file replacement starts. Update the macro so that after the two nsExec::Exec taskkill calls (for openhuman-core.exe and ${MAINBINARYNAME}.exe), it pauses briefly with Sleep to give the OS time to fully release locks and handles. Keep the change localized to NSIS_HOOK_PREINSTALL and preserve the existing process-kill sequence.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/src-tauri/src/cef_stale_reap.rs`:
- Around line 151-152: The stale-reap flow currently trusts marker freshness
without tying it to the current lock owner, so update the logic around
consume_marker(), SingletonLock handling, and the marker write/read path to
require the parsed marker PID to match the live lock-holder PID before any
reaping occurs. In the code that builds the marker body and in the later reap
checks, propagate and compare the marker’s PID against the PID associated with
the active SingletonLock, and only proceed when both freshness and PID identity
match.
- Around line 101-105: The stale-lock reap logic in cef_stale_reap.rs should
fail closed when host identity cannot be confirmed and should revalidate the
full lock owner instead of only the PID. Update the host check around the
local_host/host comparison to skip reaping whenever the local hostname is
unavailable, and make the SIGKILL revalidation and any lock-owner checks compare
both host and pid as a pair. Apply the same host+pid validation consistently in
the related reap decision paths (including the helpers around the other marked
sections) so a reused PID on another host cannot be mistaken for the same lock
owner.
---
Nitpick comments:
In `@app/src-tauri/nsis-hooks.nsh`:
- Around line 19-29: The NSIS preinstall hook in NSIS_HOOK_PREINSTALL should add
a short delay after force-killing OpenHuman processes to avoid handle-release
races before file replacement starts. Update the macro so that after the two
nsExec::Exec taskkill calls (for openhuman-core.exe and ${MAINBINARYNAME}.exe),
it pauses briefly with Sleep to give the OS time to fully release locks and
handles. Keep the change localized to NSIS_HOOK_PREINSTALL and preserve the
existing process-kill sequence.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e50de52b-a491-44d2-b315-2a93baf73034
📒 Files selected for processing (5)
app/src-tauri/Cargo.tomlapp/src-tauri/nsis-hooks.nshapp/src-tauri/src/cef_stale_reap.rsapp/src-tauri/src/lib.rsapp/src-tauri/tauri.conf.json
| if let Some(local) = local_host { | ||
| if local != host { | ||
| return ReapDecision::Skip("CEF lock holder is on a different host"); | ||
| } | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Fail closed on host identity and revalidate the full lock owner.
The safety model depends on skipping cross-host holders, but local_host == None currently permits a reap, and SIGKILL revalidation checks only PID. On shared/NFS cache paths, <host>-<pid> can change to another host with the same numeric PID; compare (host, pid) and skip if local hostname cannot be resolved.
Suggested direction
- if let Some(local) = local_host {
- if local != host {
- return ReapDecision::Skip("CEF lock holder is on a different host");
- }
+ let Some(local) = local_host else {
+ return ReapDecision::Skip("local hostname unresolved; cannot prove same-host lock holder");
+ };
+ if local != host {
+ return ReapDecision::Skip("CEF lock holder is on a different host");
}- fn holder_still_owns_lock(lock_path: &Path, expected_pid: i32) -> bool {
+ fn holder_still_owns_lock(lock_path: &Path, expected_host: &str, expected_pid: i32) -> bool {
@@
- Some((_, pid)) => pid == expected_pid && cef_preflight::is_pid_alive(pid),
+ Some((host, pid)) => {
+ host == expected_host && pid == expected_pid && cef_preflight::is_pid_alive(pid)
+ }
None => false,
}
}Also applies to: 112-112, 259-260, 482-489
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/src-tauri/src/cef_stale_reap.rs` around lines 101 - 105, The stale-lock
reap logic in cef_stale_reap.rs should fail closed when host identity cannot be
confirmed and should revalidate the full lock owner instead of only the PID.
Update the host check around the local_host/host comparison to skip reaping
whenever the local hostname is unavailable, and make the SIGKILL revalidation
and any lock-owner checks compare both host and pid as a pair. Apply the same
host+pid validation consistently in the related reap decision paths (including
the helpers around the other marked sections) so a reused PID on another host
cannot be mistaken for the same lock owner.
| let body = format!("pid={}\n", std::process::id()); | ||
| match std::fs::write(&marker, body) { |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Bind the marker to the lock-holder PID before reaping.
The marker body records the old process PID, but consume_marker() only returns freshness. A leaked fresh marker could authorize killing whichever same-host process currently owns SingletonLock; require the parsed marker PID to match the live lock-holder PID.
Suggested direction
- let marker_fresh = consume_marker(&marker_path);
+ let marker_pid = consume_marker(&marker_path);
let holder = live_lock_holder(&lock_path);
@@
- match decide_reap(holder, self_pid, local_host.as_deref(), marker_fresh) {
+ match decide_reap(holder, self_pid, local_host.as_deref(), marker_pid) {-fn consume_marker(marker_path: &Path) -> bool {
- let fresh = match std::fs::metadata(marker_path) {
+fn consume_marker(marker_path: &Path) -> Option<i32> {
+ let marker_pid = match std::fs::metadata(marker_path) {
Ok(meta) => {
@@
- fresh
+ if fresh {
+ std::fs::read_to_string(marker_path)
+ .ok()
+ .and_then(|body| body.trim().strip_prefix("pid=")?.parse::<i32>().ok())
+ .filter(|pid| *pid > 0)
+ } else {
+ None
+ }
}
- Err(_) => false,
+ Err(_) => None,
};
@@
- fresh
+ marker_pid
}Also applies to: 199-205, 277-289
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/src-tauri/src/cef_stale_reap.rs` around lines 151 - 152, The stale-reap
flow currently trusts marker freshness without tying it to the current lock
owner, so update the logic around consume_marker(), SingletonLock handling, and
the marker write/read path to require the parsed marker PID to match the live
lock-holder PID before any reaping occurs. In the code that builds the marker
body and in the later reap checks, propagate and compare the marker’s PID
against the PID associated with the active SingletonLock, and only proceed when
both freshness and PID identity match.
|
Fixed the PR Submission Checklist lane (the only red check). The parser (
Verified locally: |
Summary
cef_stale_reap) that unblocks an updated app when a wedged prior instance keeps holding the CEFSingletonLock— the Background processes from old app versions break updates — critical #3605 symptom, previously fixed on Windows only (fix(tauri): proactively reap wedged stale process on Windows startup (#3605) #3793).kill_pid_force's contract).NSIS_HOOK_PREINSTALLthat terminatesopenhuman-core.exeand the main binary tree before any file copy, so a stale prior-version process can't survive the install/update.Problem
PR #3793 fixed the "wedged stale process blocks the update" bug (#3605) on Windows only; the cross-platform pieces were deferred here (#4395) because they are unsafe without extra guards.
On Windows a Win32 named mutex early in
run()forces any concurrent peer toexit(0)before the reap, so a survivor holding the CEF cache lock is provably wedged. macOS/Linux have no equivalent pre-CEF guard (tauri_plugin_single_instanceregisters later, after CEF init), so a process holding the CEFSingletonLockpast the cache-wait budget may be a healthy running primary — a live app holds its lock for its whole lifetime. Reaping it purely on "held N seconds" could SIGKILL a healthy instance and lose user data (exactly why the earlier macOS SIGKILL was reverted, flagged by @oxoxDev + Codex P1).Solution
cef_stale_reap(macOS/Linux), new module. The updater writes a relaunch marker immediately beforeapp.restart(); the freshly launched process consumes it (once, deleted, age-bounded to 5 min). Only a fresh marker + a live CEF-lock holder is treated as stale — precisely the post-update wedge, never a normal launch of a healthy app. Extra guards: never reap self; skip when the lock's<hostname>-<pid>host ≠ ours (NFS/networked-home pid-collision safety); SIGTERM, grace, then re-read the lock to confirm the same pid still owns it and is alive before SIGKILL.run()immediately beforecef_preflight::wait_for_cache_release()forcfg(macos, linux). Complementary toprocess_recovery::reap_stale_openhuman_processes, which deliberately skips when a live CEF-lock holder exists.bundle.windows.nsis.installerHooks→nsis-hooks.nshwithNSIS_HOOK_PREINSTALLrunningtaskkill /F /Tonopenhuman-core.exeand the main binary before files are copied.hostnamefeature to the existingnixunix dependency (forgethostname); no new crates.Submission Checklist
Rust Tauri Coverage (cargo-llvm-cov)job passed (green) on this PR, which builds + testsapp/src-tauriand enforces the ≥ 80% changed-line coverage gate. (Thereap_pidSIGKILL branch itself is a thin OS-kill wrapper; the decision matrix that drives it is fully unit-tested.)N/A: no matrix feature row for this platform-lifecycle guard (behaviour is startup-recovery internal, not a user-facing feature ID).## Related—N/A: no feature IDs affected.N/A: no smoke-doc change; verification is the real-host matrix noted under Impact.Closes #NNNin the## RelatedsectionImpact
cfg-gated) + Windows (NSIS installer hook). No mobile/web/CLI behaviour change.cfg(...)-gated runtime paths and the NSIS hook must be verified on real macOS, Linux, and Windows(installer) hosts — they cannot be exercised cross-platform on a single dev machine.Related
AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/GH-4395-crossplatform-single-instance-guardValidation Run
pnpm --filter openhuman-app format:check—N/A: frontend (app/src) untouched; the diff isapp/src-taurionly (theFrontend Checkslane is correctly skipped for this PR).pnpm typecheck—N/A: frontend untouched (same reason as above).cef_stale_reapunit tests compiled and ran green in CI: the passingRust Tauri Coverage (cargo-llvm-cov)job builds and executes theapp/src-tauritest suite.N/A: root core crate (src/) unchanged; theRust Quality (fmt, clippy)lane passed on this PR.cargo checkforapp/src-tauricompiles clean, verified green in CI by the passingRust Tauri Coveragejob that builds this crate.Validation Blocked
command:real-host runtime verification of thecfg-gated reap (macOS/Linux) and the NSISNSIS_HOOK_PREINSTALL(Windows installer).status:theapp/src-tauribuild, unit tests, and diff-coverage gate are now verified green in CI (Rust Tauri Coverage); what remains is the platform-gated runtime behaviour, which cannot be exercised cross-platform on a single dev host.impact:cfg(windows)/cfg(macos,linux)runtime paths and the NSIS hook are platform-gated and require real macOS/Linux/Windows(installer) hosts to observe end-to-end.Behavior Changes
Parity Contract
cef_preflight::wait_for_cache_releaseremains the fallback.Duplicate / Superseded PR Handling
Summary by CodeRabbit
New Features
Bug Fixes