fix(store): remove MEMORY journal mode to prevent DB corruption on crash (#67) #68
halindrome wants to merge 3 commits into DeusData:main from fix/crash-safe-bulk-write
Conversation
… DB corruption on crash Fixes DeusData#67. PRAGMA journal_mode = MEMORY disabled crash safety during bulk indexing: a SIGKILL, OOM kill, or terminal close could leave partially-written B-tree pages on disk with no rollback journal, permanently corrupting the DB. WAL mode (already used everywhere else) is now maintained throughout BeginBulkWrite/EndBulkWrite. If the process dies mid-index, WAL pages are rolled back on next open. PRAGMA cache_size = -65536 (64 MB) is kept to preserve write throughput. EndBulkWrite no longer re-issues PRAGMA journal_mode = WAL (redundant). All existing tests pass unchanged. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
… write Adds two tests to prove the fix from plan 09-01 holds: - TestBulkWriteKeepsWAL: asserts journal_mode stays WAL through BeginBulkWrite/EndBulkWrite, and synchronous returns to NORMAL (1) - TestBulkWriteCrashRecovery: forks a subprocess that calls BeginBulkWrite then os.Exit(1), then verifies PRAGMA integrity_check = "ok" on reopen Covers regression for GitHub issue DeusData#67. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
QA Round 1
# PR #68 Review: fix/crash-safe-bulk-write
**Model used:** claude-opus-4-6
---
## What Was Tested
Reviewed the full diff (two files: `internal/store/store.go` changes and new `internal/store/bulkwrite_crash_test.go`), all callers of `BeginBulkWrite`/`EndBulkWrite`, the pipeline `Run()` method, `restoreDefaultCache`, `OpenPath` DSN, and the crash test subprocess idiom.
---
## Findings
### 1. Stale comment in pipeline.go (line 193)
- **File:** `internal/pipeline/pipeline.go:193`
- **Line:** `// Use MEMORY journal mode during fresh indexing for faster bulk writes.`
- **Problem:** The entire point of this PR is that `PRAGMA journal_mode = MEMORY` was removed from `BeginBulkWrite`. This comment still says "Use MEMORY journal mode." It is now factually incorrect and will mislead future readers.
- **Expected:** Comment should reference cache boost / synchronous=OFF, not MEMORY journal mode.
- **Severity:** Minor
- **Status:** Confirmed
---
### 2. `synchronous = OFF` with WAL still allows data loss on power failure (not corruption)
- **Context:** `BeginBulkWrite` still sets `PRAGMA synchronous = OFF`. With WAL mode, `synchronous = OFF` means SQLite does not issue `fsync()` on WAL writes. A power failure or kernel crash can lose committed transactions that were not yet synced to disk.
- **Nuance:** With `synchronous = OFF`, a SIGKILL of the process is safe (the OS buffer cache will flush to disk), but an **OS crash or power loss** during bulk write can lose the entire indexing transaction. The DB will not be *corrupted* (WAL guarantees structural integrity), but it may roll back to a pre-indexing state silently.
- **Impact:** For this application (a code index that can be rebuilt), this is an acceptable trade-off. However, the doc comment says "a SIGKILL during indexing will not corrupt the DB" — accurate, but could be read as a stronger guarantee than it provides. A note that power-loss may lose the in-progress indexing session (but not corrupt) would be more precise.
- **Severity:** Minor (design trade-off, index is rebuildable)
- **Status:** Confirmed
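The PRAGMA combination described above can be demonstrated standalone. The project is Go; this sketch uses Python's stdlib `sqlite3` module purely to illustrate SQLite's behavior, and the file path is hypothetical:

```python
import os
import sqlite3
import tempfile

# Open a throwaway file-backed DB and apply the same PRAGMAs that
# BeginBulkWrite leaves active after this PR.
path = os.path.join(tempfile.mkdtemp(), "index.db")
conn = sqlite3.connect(path)

conn.execute("PRAGMA journal_mode = WAL")   # crash-safe: writes go to the -wal file
conn.execute("PRAGMA synchronous = OFF")    # no fsync: power loss may drop commits,
                                            # but structure stays intact under WAL
conn.execute("PRAGMA cache_size = -65536")  # 64 MB page cache (negative = KiB)

mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
sync = conn.execute("PRAGMA synchronous").fetchone()[0]
print(mode, sync)  # wal 0
conn.close()
```

With this combination, a killed process is recoverable (the OS still holds the written pages and WAL replay/rollback restores consistency), while an OS crash or power cut can silently lose the unsynced WAL frames — exactly the distinction drawn in the nuance note above.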
---
### 3. `TestBulkWriteCrashRecovery` does not assert the subprocess actually ran
- **File:** `internal/store/bulkwrite_crash_test.go:111`
- **Problem:** `cmd.CombinedOutput()` error is discarded. If the subprocess fails to start (e.g., `os.Args[0]` is not a test binary, or `TestCrashHelper` is not found), the test silently proceeds to reopen the pre-created empty DB, runs `integrity_check` on an untouched DB, gets "ok", and **passes as a false positive**.
- **Fix:** Check `cmd.ProcessState.ExitCode() == 1` to confirm the subprocess actually ran and crashed as intended. Exit code 2 means `OpenPath` failed in the subprocess.
- **Severity:** Major (crash recovery test can silently become a no-op)
- **Status:** Confirmed
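The crash-and-recover idiom, including the proposed exit-code assertion, can be sketched like this. The real test is Go (`os.Args[0]` re-exec); this Python analogue uses only the stdlib, and the table/row names are illustrative:

```python
import os
import sqlite3
import subprocess
import sys
import tempfile

# Parent pre-creates the schema in WAL mode, then closes.
db = os.path.join(tempfile.mkdtemp(), "crash.db")
parent = sqlite3.connect(db)
parent.execute("PRAGMA journal_mode = WAL")
parent.execute("CREATE TABLE funcs (name TEXT)")
parent.close()

# Child begins a bulk write and dies mid-transaction, mimicking a SIGKILL
# during indexing. os._exit(1) skips all cleanup, including COMMIT.
child_src = f"""
import os, sqlite3
conn = sqlite3.connect({db!r})
conn.execute("PRAGMA synchronous = OFF")
conn.execute("INSERT INTO funcs VALUES ('CrashFunc')")
os._exit(1)
"""
proc = subprocess.run([sys.executable, "-c", child_src])

# Finding 3: without this assertion, a child that failed to launch would
# leave the DB untouched and integrity_check would pass vacuously.
assert proc.returncode == 1, "subprocess never crashed as intended"

conn = sqlite3.connect(db)
ok = conn.execute("PRAGMA integrity_check").fetchone()[0]
print(ok)  # ok
```

The key point is that the exit-code check distinguishes "crashed as planned" from "never ran at all" — the two cases produce identical `integrity_check` output.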
---
### 4. `TestBulkWriteCrashRecovery` does not assert the inserted row survives (or not)
- **Problem:** The test only checks `integrity_check = ok`. It does not verify whether the inserted row (`CrashFunc`) is present after recovery. If WAL replay silently discarded the write, the test would still pass. Asserting the row presence (or explicitly accepting its absence as expected-under-OFF) would strengthen the test.
- **Severity:** Minor
- **Status:** Confirmed
---
### 5. No `defer` on `EndBulkWrite` — panic in `runPasses` leaks PRAGMA state
- **File:** `internal/pipeline/pipeline.go:194-209`
- **Problem:** The pipeline calls `BeginBulkWrite` then relies on explicit `EndBulkWrite` in error/success paths. If `runPasses` panics (nil pointer, OOB), `EndBulkWrite` is never called, leaving the connection with `synchronous = OFF` and a 64 MB cache for the lifetime of the `Store`. A `defer s.Store.EndBulkWrite(ctx)` immediately after `BeginBulkWrite` would be more robust.
- **Severity:** Minor (Go panics in an MCP server likely crash the process anyway)
- **Status:** Hypothetical
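The suggested `defer` pattern has a direct `try/finally` analogue. This is a hedged sketch, not the project's code — the function names mirror the Go API but the bodies are illustrative:

```python
import sqlite3

def begin_bulk_write(conn):
    # Illustrative stand-in for the Go BeginBulkWrite: relax durability
    # and boost the cache for the duration of the bulk write.
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute("PRAGMA cache_size = -65536")

def end_bulk_write(conn):
    # Restore durable settings; cache value here is just a modest default.
    conn.execute("PRAGMA synchronous = NORMAL")
    conn.execute("PRAGMA cache_size = -2000")

def run_bulk(conn, body):
    begin_bulk_write(conn)
    try:
        body(conn)
    finally:
        # Runs even if body raises — the equivalent of
        # `defer s.Store.EndBulkWrite(ctx)` right after BeginBulkWrite.
        end_bulk_write(conn)

def failing_body(conn):
    raise RuntimeError("simulated panic mid-index")

conn = sqlite3.connect(":memory:")
try:
    run_bulk(conn, failing_body)
except RuntimeError:
    pass

sync = conn.execute("PRAGMA synchronous").fetchone()[0]
print(sync)  # 1 — NORMAL was restored despite the failure
```

Without the `finally` (or Go `defer`), the connection would stay at `synchronous = OFF` with the 64 MB cache for the lifetime of the store, which is exactly the leak described above.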
---
### Confirmed Correct (no issues found)
- **`restoreDefaultCache`** correctly undoes the 64 MB override — recomputes adaptive cache from current DB file size, which is desirable post-bulk-write.
- **Single call site** — both full and incremental reindex paths flow through the same `pipeline.Run()` → `BeginBulkWrite`/`EndBulkWrite` pair. No differential treatment needed.
- **Subprocess schema race** — parent pre-creates schema, closes, then subprocess opens. Sequential, no race.
---
## Summary
| # | Finding | Severity | Status |
|---|---------|----------|--------|
| 1 | Stale "MEMORY journal mode" comment in `pipeline.go:193` | Minor | Confirmed |
| 2 | `synchronous=OFF` + WAL allows data loss on power failure (not corruption) | Minor | Confirmed |
| 3 | Subprocess exit code not checked — test can false-positive silently | **Major** | Confirmed |
| 4 | Crash test does not assert row presence after recovery | Minor | Confirmed |
| 5 | No `defer` on `EndBulkWrite` — panic leaks PRAGMA state | Minor | Hypothetical |
**Bottom line:** The core fix (removing `PRAGMA journal_mode = MEMORY`) is correct and complete. Finding #3 is the only **major** issue — `TestBulkWriteCrashRecovery` can silently pass on an untouched DB if the subprocess fails to launch. The remaining findings are minor documentation/test-strength issues.
- pipeline.go: replace stale "MEMORY journal mode" comment with accurate description of BeginBulkWrite (64 MB cache boost + synchronous=OFF, WAL preserved for crash safety) - bulkwrite_crash_test.go: verify subprocess exited with code 1; fail fast if subprocess never crashed (prevents silent pass on pre-created DB) - bulkwrite_crash_test.go: log row-survival outcome after integrity_check to document synchronous=OFF durability trade-off without failing test Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
QA Round 2
Fixes from Round 1 applied in the commit above.
# PR #68 QA Round 2 — fix/crash-safe-bulk-write
**Model used:** claude-opus-4-6
---
## Round 1 Fix Verification
### Issue 1 — Stale "MEMORY journal mode" comment in pipeline.go
**Status: VERIFIED FIXED**
Old: `// Use MEMORY journal mode during fresh indexing for faster bulk writes.`
New (lines 193–195): `// Boost cache to 64 MB and set synchronous = OFF for write throughput.`
### Issue 3 — Exit-code check in crash test
**Status: VERIFIED FIXED**
Handles all three failure modes correctly: clean exit (0), wrong exit code, and launch failure. Confirmed correct.
### Issue 4 — No row-presence assertion after crash recovery
**Status: VERIFIED FIXED**
Queries the row after recovery and logs the survival outcome.
## Re-evaluation of Deferred Issues
### Issue 5 — No `defer` on `EndBulkWrite`
## Round 2 Summary
| # | Description | Severity | Status |
|---|---|---|---|
| 1 | Stale comment fix | Minor | ✅ Verified fixed |
| 3 | Exit-code check in crash test | Major | ✅ Verified fixed |
| 4 | Row-presence log added | Minor | ✅ Verified fixed |
| 7 (new) | Row-presence query swallows scan error | Minor | Confirmed |
| 9 (new) | Panic + single-conn pool deadlocks store | Minor | Hypothetical |
All Round 1 fixes correctly applied. No critical or major issues found in Round 2. Two minor findings remain — both are acceptable for an index that can be rebuilt from source.
Hey hey, will revisit this soon, after I make my latest changes. I already had it in the plan to rewrite the "writing logic", especially in the initial indexing phase. Will revisit this PR once I am done :) In general I want to move away from heavy bulk writes (basically stacked-up INSERT statements) and just directly build the B-tree from RAM -> disk. It's much faster and less compute-heavy. I already have this implemented (at least for initial indexing), but need to test it a bit. It has made everything so far significantly faster in the initial indexing phase, especially for large repos.
But thx for the contribution until now! Appreciate it a lot :)
Understood. FWIW I am running with this compiled locally because my codebase is too large and I couldn't get past it!
Hey @halindrome, can you provide some additional information? Like how many files do you have, and which language is primarily used? On which platform (OS + hardware) do you run? Just so I can try to replicate with my new setup.
I actually thought we were talking about a different PR - lol. This one is just a nasty bug if you kill CC while the indexing is running, or accidentally start up a few in the same folder at the same time. The memory issue (which probably caused the corruption crash) is reflected in PR #59.
Yes, I was just confused on the channels haha, sorry. Working on this one at night after work, primarily. Right now I have already rewritten the complete thing in C, getting rid of the CGO overhead. Took me some time; currently I just need to test memory allocation etc. thoroughly. I would like to deeply optimize memory/CPU, especially for cases like yours where heavy repos might induce a lot of processing. Hope it will work much better for you then :)
Sounds good - LMK if you need help or a guinea pig.
Closing in favour of #72, which ports this fix to the C rewrite. The Go code this PR patched no longer exists in main. |

Problem
`BeginBulkWrite()` in `internal/store/store.go` switches SQLite to `PRAGMA journal_mode = MEMORY` and `PRAGMA synchronous = OFF` to accelerate bulk writes during indexing. This is catastrophic if the process is killed before `EndBulkWrite()` is called:
- the rollback journal lives only in RAM, so partially written B-tree pages cannot be undone
- `synchronous = OFF` skips fsync
- `PRAGMA integrity_check` returns B-tree errors

Every call to `index_repository` (and the auto-sync watcher reindex) opens this vulnerability window.

Closes #67.
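A minimal illustration of the hazard (Python stdlib `sqlite3`, throwaway file path; the real code is Go): switching to MEMORY journaling means the rollback journal never touches disk, so nothing survives to undo a half-written transaction if the process dies.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "index.db")
conn = sqlite3.connect(path)

# The pragma returns its new value; 'memory' means the rollback journal
# exists only in RAM. A SIGKILL mid-write leaves partially written B-tree
# pages on disk with no journal file to roll them back.
mode = conn.execute("PRAGMA journal_mode = MEMORY").fetchone()[0]
print(mode)  # memory
conn.close()
```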
Fix
Remove `PRAGMA journal_mode = MEMORY` from `BeginBulkWrite()`. WAL mode (already used everywhere else via the `_journal_mode=WAL` DSN parameter) is maintained throughout. WAL writes go to the WAL file, never directly to the main DB — a mid-index crash rolls back cleanly on next open. `PRAGMA cache_size = -65536` (64 MB) is kept to preserve write throughput. `EndBulkWrite()` no longer re-issues `PRAGMA journal_mode = WAL` (now redundant).
Changes
- `internal/store/store.go`: Remove 1 line from `BeginBulkWrite()`; remove 1 line from `EndBulkWrite()`; update doc-comments
- All existing tests pass unchanged (`go test ./...`)
- A new crash-recovery test (`TestBulkWriteCrashRecovery`) verifies that forking a subprocess that calls `BeginBulkWrite` then `os.Exit(1)` leaves the DB readable and corruption-free (see companion PR or follow-up)
Low. WAL mode is production-proven in this codebase. The only behavioural change is that indexing no longer exits WAL mode, which is the desired outcome.
🤖 Generated with Claude Code