
fix(store): remove MEMORY journal mode to prevent DB corruption on crash (#67)#68

Closed
halindrome wants to merge 3 commits into DeusData:main from halindrome:fix/crash-safe-bulk-write

Conversation

@halindrome

Problem

BeginBulkWrite() in internal/store/store.go switches SQLite to PRAGMA journal_mode = MEMORY and PRAGMA synchronous = OFF to accelerate bulk writes during indexing. This is catastrophic if the process is killed before EndBulkWrite() is called:

  • MEMORY journal mode keeps transaction metadata in RAM only — no on-disk rollback journal
  • synchronous = OFF skips fsync
  • If the process is killed (SIGKILL, OOM, terminal close), partially-written B-tree pages may be flushed to disk
  • On next open, SQLite detects corruption: PRAGMA integrity_check returns B-tree errors

Every call to index_repository (and the auto-sync watcher reindex) opens this vulnerability window.

Closes #67.

Fix

Remove PRAGMA journal_mode = MEMORY from BeginBulkWrite(). WAL mode (already used everywhere else via the _journal_mode=WAL DSN parameter) is maintained throughout. WAL writes go to the WAL file, never directly to the main DB — a mid-index crash rolls back cleanly on next open.

PRAGMA cache_size = -65536 (64 MB) is kept to preserve write throughput.

EndBulkWrite() no longer re-issues PRAGMA journal_mode = WAL (now redundant).

Changes

  • internal/store/store.go: Remove 1 line from BeginBulkWrite(); remove 1 line from EndBulkWrite(); update doc-comments

Testing

  • All existing tests pass unchanged (go test ./...)
  • A new crash-simulation test (TestBulkWriteCrashRecovery) verifies that forking a subprocess that calls BeginBulkWrite then os.Exit(1) leaves the DB readable and corruption-free (see companion PR or follow-up)

Risk

Low. WAL mode is production-proven in this codebase. The only behavioural change is that indexing no longer exits WAL mode, which is the desired outcome.

🤖 Generated with Claude Code

shanemccarron-maker and others added 2 commits March 16, 2026 15:49
… DB corruption on crash

Fixes DeusData#67. PRAGMA journal_mode = MEMORY disabled crash safety during bulk
indexing: a SIGKILL, OOM kill, or terminal close could leave partially-written
B-tree pages on disk with no rollback journal, permanently corrupting the DB.

WAL mode (already used everywhere else) is now maintained throughout BeginBulkWrite/
EndBulkWrite. If the process dies mid-index, WAL pages are rolled back on next open.
PRAGMA cache_size = -65536 (64 MB) is kept to preserve write throughput.

EndBulkWrite no longer re-issues PRAGMA journal_mode = WAL (redundant).
All existing tests pass unchanged.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
… write

Adds two tests to prove the fix from plan 09-01 holds:

- TestBulkWriteKeepsWAL: asserts journal_mode stays WAL through
  BeginBulkWrite/EndBulkWrite, and synchronous returns to NORMAL (1)
- TestBulkWriteCrashRecovery: forks a subprocess that calls BeginBulkWrite
  then os.Exit(1), then verifies PRAGMA integrity_check = "ok" on reopen

Covers regression for GitHub issue DeusData#67.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@halindrome
Author

QA Round 1

# PR #68 Review: fix/crash-safe-bulk-write

**Model used:** claude-opus-4-6

---

## What Was Tested

Reviewed the full diff (two files: `internal/store/store.go` changes and new `internal/store/bulkwrite_crash_test.go`), all callers of `BeginBulkWrite`/`EndBulkWrite`, the pipeline `Run()` method, `restoreDefaultCache`, `OpenPath` DSN, and the crash test subprocess idiom.

---

## Findings

### 1. Stale comment in pipeline.go (line 193)

- **File:** `internal/pipeline/pipeline.go:193`
- **Line:** `// Use MEMORY journal mode during fresh indexing for faster bulk writes.`
- **Problem:** The entire point of this PR is that `PRAGMA journal_mode = MEMORY` was removed from `BeginBulkWrite`. This comment still says "Use MEMORY journal mode." It is now factually incorrect and will mislead future readers.
- **Expected:** Comment should reference cache boost / synchronous=OFF, not MEMORY journal mode.
- **Severity:** Minor
- **Status:** Confirmed

---

### 2. `synchronous = OFF` with WAL still allows data loss on power failure (not corruption)

- **Context:** `BeginBulkWrite` still sets `PRAGMA synchronous = OFF`. With WAL mode, `synchronous = OFF` means SQLite does not issue `fsync()` on WAL writes. A power failure or kernel crash can lose committed transactions that were not yet synced to disk.
- **Nuance:** With `synchronous = OFF`, a SIGKILL of the process is safe (the OS buffer cache will flush to disk), but an **OS crash or power loss** during bulk write can lose the entire indexing transaction. The DB will not be *corrupted* (WAL guarantees structural integrity), but it may roll back to a pre-indexing state silently.
- **Impact:** For this application (a code index that can be rebuilt), this is an acceptable trade-off. However, the doc comment says "a SIGKILL during indexing will not corrupt the DB" — accurate, but could be read as a stronger guarantee than it provides. A note that power-loss may lose the in-progress indexing session (but not corrupt) would be more precise.
- **Severity:** Minor (design trade-off, index is rebuildable)
- **Status:** Confirmed

---

### 3. `TestBulkWriteCrashRecovery` does not assert the subprocess actually ran

- **File:** `internal/store/bulkwrite_crash_test.go:111`
- **Problem:** `cmd.CombinedOutput()` error is discarded. If the subprocess fails to start (e.g., `os.Args[0]` is not a test binary, or `TestCrashHelper` is not found), the test silently proceeds to reopen the pre-created empty DB, runs `integrity_check` on an untouched DB, gets "ok", and **passes as a false positive**.
- **Fix:** Check `cmd.ProcessState.ExitCode() == 1` to confirm the subprocess actually ran and crashed as intended. Exit code 2 means `OpenPath` failed in the subprocess.
- **Severity:** Major (crash recovery test can silently become a no-op)
- **Status:** Confirmed

---

### 4. `TestBulkWriteCrashRecovery` does not assert the inserted row survives (or not)

- **Problem:** The test only checks `integrity_check = ok`. It does not verify whether the inserted row (`CrashFunc`) is present after recovery. If WAL replay silently discarded the write, the test would still pass. Asserting the row presence (or explicitly accepting its absence as expected-under-OFF) would strengthen the test.
- **Severity:** Minor
- **Status:** Confirmed

---

### 5. No `defer` on `EndBulkWrite` — panic in `runPasses` leaks PRAGMA state

- **File:** `internal/pipeline/pipeline.go:194-209`
- **Problem:** The pipeline calls `BeginBulkWrite` then relies on explicit `EndBulkWrite` in error/success paths. If `runPasses` panics (nil pointer, OOB), `EndBulkWrite` is never called, leaving the connection with `synchronous = OFF` and a 64 MB cache for the lifetime of the `Store`. A `defer s.Store.EndBulkWrite(ctx)` immediately after `BeginBulkWrite` would be more robust.
- **Severity:** Minor (Go panics in an MCP server likely crash the process anyway)
- **Status:** Hypothetical

---

### Confirmed Correct (no issues found)

- **`restoreDefaultCache`** correctly undoes the 64 MB override — recomputes adaptive cache from current DB file size, which is desirable post-bulk-write.
- **Single call site** — both full and incremental reindex paths flow through `pipeline.Run()` and its single `BeginBulkWrite`/`EndBulkWrite` pair. No differential treatment needed.
- **Subprocess schema race** — parent pre-creates schema, closes, then subprocess opens. Sequential, no race.

---

## Summary

| # | Finding | Severity | Status |
|---|---------|----------|--------|
| 1 | Stale "MEMORY journal mode" comment in `pipeline.go:193` | Minor | Confirmed |
| 2 | `synchronous=OFF` + WAL allows data loss on power failure (not corruption) | Minor | Confirmed |
| 3 | Subprocess exit code not checked — test can false-positive silently | **Major** | Confirmed |
| 4 | Crash test does not assert row presence after recovery | Minor | Confirmed |
| 5 | No `defer` on `EndBulkWrite` — panic leaks PRAGMA state | Minor | Hypothetical |

**Bottom line:** The core fix (removing `PRAGMA journal_mode = MEMORY`) is correct and complete. Finding #3 is the only **major** issue — `TestBulkWriteCrashRecovery` can silently pass on an untouched DB if the subprocess fails to launch. The remaining findings are minor documentation/test-strength issues.

- pipeline.go: replace stale "MEMORY journal mode" comment with accurate
  description of BeginBulkWrite (64 MB cache boost + synchronous=OFF,
  WAL preserved for crash safety)
- bulkwrite_crash_test.go: verify subprocess exited with code 1; fail
  fast if subprocess never crashed (prevents silent pass on pre-created DB)
- bulkwrite_crash_test.go: log row-survival outcome after integrity_check
  to document synchronous=OFF durability trade-off without failing test

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@halindrome
Author

QA Round 2

Fixes from Round 1 applied in commit b3e79bc, `fix(store): address QA round 1 findings`.

# PR #68 QA Round 2 — fix/crash-safe-bulk-write

**Model used:** claude-opus-4-6

---

## Round 1 Fix Verification

### Issue 1 — Stale "MEMORY journal mode" comment in pipeline.go
**Status: VERIFIED FIXED**

Old: `// Use MEMORY journal mode during fresh indexing for faster bulk writes.`
New (lines 193–195):

```go
// Boost cache to 64 MB and set synchronous = OFF for write throughput.
// WAL mode is preserved so the DB remains crash-safe: the WAL file is
// replayed on next open if the process is killed mid-write.
```

Accurate — BeginBulkWrite sets synchronous=OFF and cache_size=-65536; does not touch journal_mode.

### Issue 3 — Subprocess exit code not checked
**Status: VERIFIED FIXED**

```go
if exitErr, ok := err.(*exec.ExitError); !ok || exitErr.ExitCode() != 1 {
    t.Fatalf("subprocess did not crash as expected: err=%v, output=%s", err, out)
}
```

Handles all three failure modes correctly: clean exit (0), wrong exit code, and launch failure. Confirmed correct.

### Issue 4 — No row-presence assertion after crash recovery
**Status: VERIFIED FIXED**

Queries `SELECT COUNT(*) FROM nodes WHERE qualified_name = 'crash.CrashFunc'` — table and column names are correct per schema. Result is logged (not asserted), with a comment documenting the `synchronous=OFF` power-loss trade-off. Appropriate.


---

## Re-evaluation of Deferred Issues

### Issue 5 — No `defer` on `EndBulkWrite`

Remains **Minor / Hypothetical**. No explicit `panic()` calls exist in the pipeline or store packages, and any runtime panic would propagate to crash the process (no `recover()` in the call chain). Accepting this risk is reasonable.


---

## New Findings (Round 2)

### Issue 7 — Row-presence query silently swallows scan error

- **File:** `internal/store/bulkwrite_crash_test.go` line ~144
- **Problem:** `_ = s2.DB().QueryRowContext(...).Scan(&rowCount)` discards the `Scan` error. If the query fails (edge case), `rowCount` stays 0 and the log reads "row survived crash: false" — misleading, since it implies row loss rather than query failure. The line is informational only (no assert), so impact is limited to confusing log output.
- **Severity:** Minor
- **Status:** Confirmed

### Issue 9 — `WithTransaction` + single-conn pool: panic deadlocks store (amplifies Issue 5)

- **Context:** `WithTransaction` does not recover from panics — an `fn()` panic leaves the transaction open. With `MaxOpenConns=1`, this means the entire store is deadlocked until GC finalizes the `*sql.Tx`. Combined with Issue 5 (`synchronous=OFF` leak), a recovered panic (if any outer caller adds `recover()` in the future) would leave the store permanently degraded.
- **Severity:** Minor / Hypothetical — requires both a runtime panic in `runPasses` AND a `recover()` in an outer caller, neither of which currently exist.
- **Status:** Hypothetical — worth noting architecturally if panic recovery is ever added.

## Summary

| # | Description | Severity | Status |
|---|-------------|----------|--------|
| 1 | Stale comment fix | Minor | ✅ Verified fixed |
| 3 | Exit-code check in crash test | Major | ✅ Verified fixed |
| 4 | Row-presence log added | Minor | ✅ Verified fixed |
| 7 (new) | Row-presence query swallows scan error | Minor | Confirmed |
| 9 (new) | Panic + single-conn pool deadlocks store | Minor | Hypothetical |

All Round 1 fixes correctly applied. No critical or major issues found in Round 2. Two minor findings remain — both are acceptable for an index that can be rebuilt from source.

@DeusData
Owner

Hey hey, will revisit this soon, after I've made my latest changes. I already had it in the plan to rewrite the "writing logic", especially in the initial indexing phase. Will revisit this PR once I am done :) In general I want to move away from heavy bulk writes (basically stacked-up INSERT statements) and just directly build the B-tree from RAM -> Disk. It's much faster and less compute-heavy. I already have this implemented (at least for initial indexing), but need to test it a bit. It has made everything significantly faster in the initial indexing phase so far, especially for large repos

@DeusData
Owner

But thx for the contribution until now! Appreciate it a lot :)

@halindrome
Author

Understood. FWIW I am running with this compiled locally because my codebase is too large and I couldn't get past it!

@DeusData
Owner

Hey @halindrome, can you provide some additional information? Like how many files do you have, which language primarily is served? On which platform (OS + hardware) do you run? Just that I can try to replicate with my new setup

@halindrome
Author

Oh sure! There is a monorepo with 5 submodules. perl, python, typescript, shell, some other stuff. HTML. CSS. markdown of course.

(screenshot attached: CleanShot 2026-03-16 at 17 41 19@2x)

@halindrome
Author

I actually thought we were talking about a different PR - lol. This one is just a nasty bug if you kill CC while the indexing is running or accidentally start up a few in the same folder at the same time. The memory issue (which probably caused the corruption crash) is reflected in PR #59

@DeusData
Owner

Yes, I was just confused on the channels haha, sorry. Working on this one at night after work primarily. Right now I have already rewritten the complete thing in C, getting rid of the CGO overhead. Took me some time; currently I just need to test memory allocation etc. thoroughly. I would like to deeply optimize memory/CPU, especially for cases like yours where heavy repos might induce a lot of processing. Hope it will work much better for you then :)

@halindrome
Author

Sounds good - LMK if you need help or a guinea pig.

@halindrome
Author

Closing in favour of #72, which ports this fix to the C rewrite. The Go code this PR patched no longer exists in main.

@halindrome halindrome closed this Mar 19, 2026


Development

Successfully merging this pull request may close these issues.

BeginBulkWrite() disables crash safety — SIGKILL during index permanently corrupts DB

3 participants