
fix(store): remove MEMORY journal mode to prevent DB corruption on crash (#67)#68

Closed
halindrome wants to merge 3 commits into DeusData:main from halindrome:fix/crash-safe-bulk-write

Conversation

@halindrome

Problem

BeginBulkWrite() in internal/store/store.go switches SQLite to PRAGMA journal_mode = MEMORY and PRAGMA synchronous = OFF to accelerate bulk writes during indexing. This is catastrophic if the process is killed before EndBulkWrite() is called:

  • MEMORY journal mode keeps transaction metadata in RAM only — no on-disk rollback journal
  • synchronous = OFF skips fsync
  • If the process is killed (SIGKILL, OOM, terminal close), partially-written B-tree pages may be flushed to disk
  • On next open, SQLite detects corruption: PRAGMA integrity_check returns B-tree errors

Every call to index_repository (and the auto-sync watcher reindex) opens this vulnerability window.

Closes #67.

Fix

Remove PRAGMA journal_mode = MEMORY from BeginBulkWrite(). WAL mode (already used everywhere else via the _journal_mode=WAL DSN parameter) is maintained throughout. WAL writes go to the WAL file, never directly to the main DB — a mid-index crash rolls back cleanly on next open.

PRAGMA cache_size = -65536 (64 MB) is kept to preserve write throughput.

EndBulkWrite() no longer re-issues PRAGMA journal_mode = WAL (now redundant).

Changes

  • internal/store/store.go: Remove 1 line from BeginBulkWrite(); remove 1 line from EndBulkWrite(); update doc-comments

Testing

  • All existing tests pass unchanged (go test ./...)
  • A new crash-simulation test (TestBulkWriteCrashRecovery) verifies that forking a subprocess that calls BeginBulkWrite then os.Exit(1) leaves the DB readable and corruption-free (see companion PR or follow-up)

Risk

Low. WAL mode is production-proven in this codebase. The only behavioural change is that indexing no longer exits WAL mode, which is the desired outcome.

🤖 Generated with Claude Code

shanemccarron-maker and others added 2 commits March 16, 2026 15:49
… DB corruption on crash

Fixes DeusData#67. PRAGMA journal_mode = MEMORY disabled crash safety during bulk
indexing: a SIGKILL, OOM kill, or terminal close could leave partially-written
B-tree pages on disk with no rollback journal, permanently corrupting the DB.

WAL mode (already used everywhere else) is now maintained throughout BeginBulkWrite/
EndBulkWrite. If the process dies mid-index, WAL pages are rolled back on next open.
PRAGMA cache_size = -65536 (64 MB) is kept to preserve write throughput.

EndBulkWrite no longer re-issues PRAGMA journal_mode = WAL (redundant).
All existing tests pass unchanged.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
… write

Adds two tests to prove the fix from plan 09-01 holds:

- TestBulkWriteKeepsWAL: asserts journal_mode stays WAL through
  BeginBulkWrite/EndBulkWrite, and synchronous returns to NORMAL (1)
- TestBulkWriteCrashRecovery: forks a subprocess that calls BeginBulkWrite
  then os.Exit(1), then verifies PRAGMA integrity_check = "ok" on reopen

Covers regression for GitHub issue DeusData#67.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@halindrome
Author

QA Round 1

# PR #68 Review: fix/crash-safe-bulk-write

**Model used:** claude-opus-4-6

---

## What Was Tested

Reviewed the full diff (two files: `internal/store/store.go` changes and new `internal/store/bulkwrite_crash_test.go`), all callers of `BeginBulkWrite`/`EndBulkWrite`, the pipeline `Run()` method, `restoreDefaultCache`, `OpenPath` DSN, and the crash test subprocess idiom.

---

## Findings

### 1. Stale comment in pipeline.go (line 193)

- **File:** `internal/pipeline/pipeline.go:193`
- **Line:** `// Use MEMORY journal mode during fresh indexing for faster bulk writes.`
- **Problem:** The entire point of this PR is that `PRAGMA journal_mode = MEMORY` was removed from `BeginBulkWrite`. This comment still says "Use MEMORY journal mode." It is now factually incorrect and will mislead future readers.
- **Expected:** Comment should reference cache boost / synchronous=OFF, not MEMORY journal mode.
- **Severity:** Minor
- **Status:** Confirmed

---

### 2. `synchronous = OFF` with WAL still allows data loss on power failure (not corruption)

- **Context:** `BeginBulkWrite` still sets `PRAGMA synchronous = OFF`. With WAL mode, `synchronous = OFF` means SQLite does not issue `fsync()` on WAL writes. A power failure or kernel crash can lose committed transactions that were not yet synced to disk.
- **Nuance:** With `synchronous = OFF`, a SIGKILL of the process is safe (the OS buffer cache will flush to disk), but an **OS crash or power loss** during bulk write can lose the entire indexing transaction. The DB will not be *corrupted* (WAL guarantees structural integrity), but it may roll back to a pre-indexing state silently.
- **Impact:** For this application (a code index that can be rebuilt), this is an acceptable trade-off. However, the doc comment says "a SIGKILL during indexing will not corrupt the DB" — accurate, but could be read as a stronger guarantee than it provides. A note that power-loss may lose the in-progress indexing session (but not corrupt) would be more precise.
- **Severity:** Minor (design trade-off, index is rebuildable)
- **Status:** Confirmed

---

### 3. `TestBulkWriteCrashRecovery` does not assert the subprocess actually ran

- **File:** `internal/store/bulkwrite_crash_test.go:111`
- **Problem:** `cmd.CombinedOutput()` error is discarded. If the subprocess fails to start (e.g., `os.Args[0]` is not a test binary, or `TestCrashHelper` is not found), the test silently proceeds to reopen the pre-created empty DB, runs `integrity_check` on an untouched DB, gets "ok", and **passes as a false positive**.
- **Fix:** Check `cmd.ProcessState.ExitCode() == 1` to confirm the subprocess actually ran and crashed as intended. Exit code 2 means `OpenPath` failed in the subprocess.
- **Severity:** Major (crash recovery test can silently become a no-op)
- **Status:** Confirmed

---

### 4. `TestBulkWriteCrashRecovery` does not assert the inserted row survives (or not)

- **Problem:** The test only checks `integrity_check = ok`. It does not verify whether the inserted row (`CrashFunc`) is present after recovery. If WAL replay silently discarded the write, the test would still pass. Asserting the row presence (or explicitly accepting its absence as expected-under-OFF) would strengthen the test.
- **Severity:** Minor
- **Status:** Confirmed

---

### 5. No `defer` on `EndBulkWrite` — panic in `runPasses` leaks PRAGMA state

- **File:** `internal/pipeline/pipeline.go:194-209`
- **Problem:** The pipeline calls `BeginBulkWrite` then relies on explicit `EndBulkWrite` in error/success paths. If `runPasses` panics (nil pointer, OOB), `EndBulkWrite` is never called, leaving the connection with `synchronous = OFF` and a 64 MB cache for the lifetime of the `Store`. A `defer s.Store.EndBulkWrite(ctx)` immediately after `BeginBulkWrite` would be more robust.
- **Severity:** Minor (Go panics in an MCP server likely crash the process anyway)
- **Status:** Hypothetical

---

### Confirmed Correct (no issues found)

- **`restoreDefaultCache`** correctly undoes the 64 MB override — recomputes adaptive cache from current DB file size, which is desirable post-bulk-write.
- **Single call site** — both full and incremental reindex paths flow through `pipeline.Run()` and its single `BeginBulkWrite`/`EndBulkWrite` pair. No differential treatment needed.
- **Subprocess schema race** — parent pre-creates schema, closes, then subprocess opens. Sequential, no race.

---

## Summary

| # | Finding | Severity | Status |
|---|---------|----------|--------|
| 1 | Stale "MEMORY journal mode" comment in `pipeline.go:193` | Minor | Confirmed |
| 2 | `synchronous=OFF` + WAL allows data loss on power failure (not corruption) | Minor | Confirmed |
| 3 | Subprocess exit code not checked — test can false-positive silently | **Major** | Confirmed |
| 4 | Crash test does not assert row presence after recovery | Minor | Confirmed |
| 5 | No `defer` on `EndBulkWrite` — panic leaks PRAGMA state | Minor | Hypothetical |

**Bottom line:** The core fix (removing `PRAGMA journal_mode = MEMORY`) is correct and complete. Finding #3 is the only **major** issue — `TestBulkWriteCrashRecovery` can silently pass on an untouched DB if the subprocess fails to launch. The remaining findings are minor documentation/test-strength issues.

- pipeline.go: replace stale "MEMORY journal mode" comment with accurate
  description of BeginBulkWrite (64 MB cache boost + synchronous=OFF,
  WAL preserved for crash safety)
- bulkwrite_crash_test.go: verify subprocess exited with code 1; fail
  fast if subprocess never crashed (prevents silent pass on pre-created DB)
- bulkwrite_crash_test.go: log row-survival outcome after integrity_check
  to document synchronous=OFF durability trade-off without failing test

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@halindrome
Author

QA Round 2

Fixes from Round 1 applied in commit b3e79bc, `fix(store): address QA round 1 findings`.

# PR #68 QA Round 2 — fix/crash-safe-bulk-write

**Model used:** claude-opus-4-6

---

## Round 1 Fix Verification

### Issue 1 — Stale "MEMORY journal mode" comment in pipeline.go
**Status: VERIFIED FIXED**

Old: `// Use MEMORY journal mode during fresh indexing for faster bulk writes.`
New (lines 193–195):

```go
// Boost cache to 64 MB and set synchronous = OFF for write throughput.
// WAL mode is preserved so the DB remains crash-safe: the WAL file is
// replayed on next open if the process is killed mid-write.
```

Accurate — BeginBulkWrite sets synchronous=OFF and cache_size=-65536; does not touch journal_mode.

### Issue 3 — Subprocess exit code not checked
**Status: VERIFIED FIXED**

```go
if exitErr, ok := err.(*exec.ExitError); !ok || exitErr.ExitCode() != 1 {
    t.Fatalf("subprocess did not crash as expected: err=%v, output=%s", err, out)
}
```

Handles all three failure modes correctly: clean exit (0), wrong exit code, and launch failure. Confirmed correct.

### Issue 4 — No row-presence assertion after crash recovery
**Status: VERIFIED FIXED**

Queries `SELECT COUNT(*) FROM nodes WHERE qualified_name = 'crash.CrashFunc'` — table and column names are correct per schema. Result is logged (not asserted), with a comment documenting the `synchronous=OFF` power-loss trade-off. Appropriate.


---

## Re-evaluation of Deferred Issues

### Issue 5 — No `defer` on `EndBulkWrite`

Remains **Minor / Hypothetical**. No explicit `panic()` calls exist in the pipeline or store packages, and any runtime panic would propagate to crash the process (no `recover()` in the call chain). Accepting this risk is reasonable.


---

## New Findings (Round 2)

### Issue 7 — Row-presence query silently swallows scan error

- **File:** `internal/store/bulkwrite_crash_test.go` line ~144
- **Problem:** `_ = s2.DB().QueryRowContext(...).Scan(&rowCount)` discards the `Scan` error. If the query fails (edge case), `rowCount` stays 0 and the log reads "row survived crash: false" — misleading, since it implies row loss rather than query failure. The line is informational only (no assert), so impact is limited to confusing log output.
- **Severity:** Minor
- **Status:** Confirmed

### Issue 9 — `WithTransaction` + single-conn pool: panic deadlocks store (amplifies Issue 5)

- **Context:** `WithTransaction` does not recover from panics — an `fn()` panic leaves the transaction open. With `MaxOpenConns=1`, this means the entire store is deadlocked until GC finalizes the `*sql.Tx`. Combined with Issue 5 (`synchronous=OFF` leak), a recovered panic (if any outer caller adds `recover()` in the future) would leave the store permanently degraded.
- **Severity:** Minor / Hypothetical — requires both a runtime panic in `runPasses` AND a `recover()` in an outer caller, neither of which currently exist.
- **Status:** Hypothetical — worth noting architecturally if panic recovery is ever added.

## Summary

| # | Description | Severity | Status |
|---|-------------|----------|--------|
| 1 | Stale comment fix | Minor | ✅ Verified fixed |
| 3 | Exit-code check in crash test | Major | ✅ Verified fixed |
| 4 | Row-presence log added | Minor | ✅ Verified fixed |
| 7 (new) | Row-presence query swallows scan error | Minor | Confirmed |
| 9 (new) | Panic + single-conn pool deadlocks store | Minor | Hypothetical |

All Round 1 fixes correctly applied. No critical or major issues found in Round 2. Two minor findings remain — both are acceptable for an index that can be rebuilt from source.

@DeusData
Owner

Hey hey, will revisit this soon, after I've made my latest changes. I already had it in the plan to rewrite the "writing logic", especially in the initial indexing phase. Will revisit this PR once I am done :) In general I want to move away from heavy bulk writes (basically stacked-up INSERT statements) and just directly build the B-tree from RAM -> Disk. It's much faster and less compute-heavy. I already have this implemented (at least for initial indexing), but need to test it a bit. It has made everything significantly faster in the initial indexing phase so far, especially for large repos

@DeusData
Owner

But thx for the contribution until now! Appreciate it a lot :)

@halindrome
Author

Understood. FWIW I am running with this compiled locally because my codebase is too large and I couldn't get past it!

@DeusData
Owner

Hey @halindrome, can you provide some additional information? Like how many files do you have, which language primarily is served? On which platform (OS + hardware) do you run? Just that I can try to replicate with my new setup

@halindrome
Author

Oh sure! There is a monorepo with 5 submodules. perl, python, typescript, shell, some other stuff. HTML. CSS. markdown of course.

(screenshot attached: CleanShot 2026-03-16 at 17 41 19@2x)

@halindrome
Author

I actually thought we were talking about a different PR - lol. This one is just a nasty bug if you kill CC while the indexing is running or accidentally start up a few in the same folder at the same time. The memory issue (which probably caused the corruption crash) is reflected in PR #59

@DeusData
Owner

Yes, I was just confused on the channels haha, sorry. Working on this one at night after work primarily. Right now I have already rewritten the complete thing in C, getting rid of the CGO overhead. Took me some time; currently I just need to test memory allocation etc. thoroughly. I would like to deeply optimize memory/CPU, especially for cases like yours where heavy repos might induce a lot of processing. Hope it will work much better for you then :)

@halindrome
Author

Sounds good - LMK if you need help or a guinea pig.

@halindrome
Author

Closing in favour of #72, which ports this fix to the C rewrite. The Go code this PR patched no longer exists in main.

@halindrome halindrome closed this Mar 19, 2026


Development

Successfully merging this pull request may close these issues.

BeginBulkWrite() disables crash safety — SIGKILL during index permanently corrupts DB

3 participants