Area
src/everos
What happened?
The on-disk LanceDB index under ~/.everos/.index/lancedb can grow without bound
until it fills the entire disk, while the Markdown source of truth stays tiny.
On a self-hosted single-user instance I have hit this twice:
- atomic_fact.lance bloated to 318 GB (8,974 data fragments, 1,759 versions)
while the live data was only a few MB and the Markdown truth was ~26 MB.
- It recurred and reached 583 GB, filling the volume to 0 bytes free (the shell
itself started returning ENOSPC).
The root problem is that the cascade maintenance worker swallows every
compaction/prune failure and keeps running, so stale LanceDB versions accumulate
forever with no cap, no metric, and no user-visible signal:
- memory/cascade/worker.py: _run_optimize_once wraps optimize()/cleanup in a
broad except Exception: # never crash the daemon, logs a
cascade_lancedb_optimize_failed warning, and continues.
- There is no max-version / max-size cap on the index, no index-size health metric,
and no prune/maintenance CLI to recover.
Once compaction is broken, prune never runs → versions pile up → disk fills →
death spiral: a full disk means compaction can't even write its temp scratch,
so it can never prune itself back down.
Two failure modes that break compaction
- FD exhaustion (EMFILE / os error 24). LanceDB maintenance needs ~290 FDs
(per docs/cascade_runbook.md), but a daemon launched under macOS's default soft
limit of 256 (launchctl limit maxfiles) hits EMFILE on every cleanup cycle.
Logs showed thousands of os error 24 / "Too many open files". Raising the
launcher's NumberOfFiles soft limit to 8192 fixed this mode.
- lance list-encoding corruption (persists even after the FD fix).
optimize() dies with a lance 7.0.0 error like
Max offset of 648640 exceeds length of values 466149 on an atomic_fact
list<...> column (list.rs). Because optimize() runs compaction before
cleanup, it never reaches the cleanup step → reclaims nothing → unbounded growth.
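The FD failure mode could also be caught by a startup preflight, in addition to fixing the launcher. The following is a minimal sketch, not EverOS code: ensure_fd_limit, fd_limit_ok and the HEADROOM margin are hypothetical names, and the 290 figure is the one quoted above from docs/cascade_runbook.md. It uses only the standard-library resource module (Unix only).

```python
import resource

# Assumption: ~290 FDs is the maintenance requirement quoted from
# docs/cascade_runbook.md; HEADROOM is a hypothetical safety margin.
REQUIRED_FDS = 290
HEADROOM = 64


def ensure_fd_limit(required: int = REQUIRED_FDS + HEADROOM) -> int:
    """Try to raise the soft RLIMIT_NOFILE to `required`; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY and soft < required:
        # Never ask for more than the hard limit allows.
        target = required if hard == resource.RLIM_INFINITY else min(required, hard)
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            pass  # the OS refused; the caller sees the unchanged limit below
        soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def fd_limit_ok(soft: int, required: int = REQUIRED_FDS) -> bool:
    """True if a soft limit is high enough for index maintenance."""
    return soft == resource.RLIM_INFINITY or soft >= required


if __name__ == "__main__":
    limit = ensure_fd_limit()
    if not fd_limit_ok(limit):
        print(f"warning: soft FD limit {limit} is below {REQUIRED_FDS}; "
              "index maintenance is likely to fail with EMFILE")
```

A daemon that runs such a check at startup would at least report the low limit once, rather than failing on every cleanup cycle.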
Steps to reproduce
- Run the EverOS daemon continuously and keep adding memories so the cascade worker
compacts/prunes on its normal schedule.
- Cause compaction to fail — easiest is to launch the daemon under a low FD soft
limit (macOS default 256), or let the lance list-encoding error above occur on
atomic_fact.
- Watch du -sh ~/.everos/.index/lancedb climb into the tens/hundreds of GB while
~/.everos/evermem/**.md stays a few tens of MB.
- grep cascade_lancedb_optimize_failed in the logs: failures are logged as
warnings only; the daemon keeps serving and never surfaces the bloat.
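To put a number on the bloat while reproducing, the two directory sizes can be compared in one small script. This is a rough sketch with hypothetical helper names (tree_size, bloat_ratio). It sums apparent file sizes, so the figures can differ somewhat from what du reports.

```python
import os
from pathlib import Path


def tree_size(root: Path, suffix: str = "") -> int:
    """Sum apparent file sizes (bytes) under root, optionally only names ending in suffix."""
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if suffix and not name.endswith(suffix):
                continue
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished mid-walk (e.g. removed by the worker)
    return total


def bloat_ratio(index_dir: Path, markdown_dir: Path) -> float:
    """Index bytes divided by Markdown bytes (Markdown floor of 1 byte avoids div-by-zero)."""
    return tree_size(index_dir) / max(tree_size(markdown_dir, ".md"), 1)


if __name__ == "__main__":
    home = Path.home() / ".everos"
    ratio = bloat_ratio(home / ".index" / "lancedb", home / "evermem")
    print(f"index is {ratio:,.0f}x the size of the Markdown source")
```

With the figures reported above (hundreds of GB of index against tens of MB of Markdown), this ratio would be on the order of 10,000 or more.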
Environment
- OS: macOS (Darwin), single-user self-host, LaunchAgent
- EverOS: 1.0.0 and 1.1.0 (reproduced on both)
- lance / lancedb: 7.0.0
- Markdown truth ~34 MB; index bloated to 318 GB then 583 GB
Workaround
- When compaction is broken but the disk still has room: stop the daemon and call
lance cleanup_old_versions(older_than=timedelta(0), delete_unverified=True)
directly on each *.lance dir — this bypasses the broken compaction step that
Table.optimize() runs first, and reclaims the stale versions (row counts
unchanged).
- On a full disk, that direct cleanup is impossible (no scratch space). Recovery:
stop daemon → rm -rf ~/.everos/.index/{lancedb,sqlite} (the .index is 100%
rebuildable; the Markdown at ~/.everos/evermem is the truth) → restart → re-embed
from Markdown (slow).
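The first workaround can be scripted roughly as follows. This is a sketch under assumptions: the pylance package (import lance) is installed, the tables sit directly under ~/.everos/.index/lancedb as *.lance directories, and the daemon is stopped. find_lance_dirs and cleanup_all are hypothetical helper names. The only Lance calls used are lance.dataset() and cleanup_old_versions() with the arguments given above.

```python
from datetime import timedelta
from pathlib import Path
from typing import Callable, Dict, List


def find_lance_dirs(index_root: Path) -> List[Path]:
    """Return the *.lance table directories directly under index_root."""
    return sorted(p for p in index_root.glob("*.lance") if p.is_dir())


def _open_with_lance(path: str):
    import lance  # imported lazily so the helpers above work without pylance
    return lance.dataset(path)


def cleanup_all(index_root: Path,
                open_dataset: Callable[[str], object] = _open_with_lance) -> Dict[str, object]:
    """Run cleanup_old_versions on every table; map table name -> stats or exception."""
    results: Dict[str, object] = {}
    for table_dir in find_lance_dirs(index_root):
        try:
            ds = open_dataset(str(table_dir))
            # Calls cleanup directly, skipping the compaction that optimize() runs first.
            results[table_dir.name] = ds.cleanup_old_versions(
                older_than=timedelta(0), delete_unverified=True
            )
        except Exception as exc:  # keep going so one bad table does not block the rest
            results[table_dir.name] = exc
    return results


if __name__ == "__main__":
    root = Path.home() / ".everos" / ".index" / "lancedb"
    for name, outcome in cleanup_all(root).items():
        print(name, "->", outcome)
```

delete_unverified=True is only safe with the daemon stopped, since it removes files that an in-progress write could still be using.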
Suggested fixes
- Add a hard cap on index version count / size, or a watchdog that prunes when the
index greatly exceeds the Markdown footprint.
- Surface a health signal / metric when optimize() fails repeatedly instead of only
logging a swallowed warning (e.g. expose index size + last-successful-compaction in
/health or a status command).
- Ship a first-class everos index prune / maintenance CLI that calls
cleanup_old_versions directly (works even when optimize() compaction is broken).
- Raise the FD soft limit in the bundled launchers/docs so EMFILE can't silently
break maintenance out of the box.
- Fix or work around the underlying lance list-encoding compaction bug
(pin/upgrade lance, or rewrite the affected list<...> column).
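As an illustration of the health-signal and direct-cleanup suggestions, a maintenance cycle could run compaction and cleanup as independent steps and count consecutive failures. This is a sketch, not the actual worker code: MaintenanceHealth, run_maintenance_once and the threshold of 3 are hypothetical, and compact and cleanup stand in for whatever callables the worker already has.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MaintenanceHealth:
    consecutive_failures: int = 0
    last_success_ts: Optional[float] = None
    last_error: Optional[str] = None

    def degraded(self, threshold: int = 3) -> bool:
        """True once maintenance has failed `threshold` cycles in a row."""
        return self.consecutive_failures >= threshold


def run_maintenance_once(compact: Callable[[], None],
                         cleanup: Callable[[], None],
                         health: MaintenanceHealth,
                         now: Callable[[], float] = time.time) -> MaintenanceHealth:
    """Run both steps independently so a compaction error cannot skip cleanup."""
    errors: List[str] = []
    for name, step in (("compact", compact), ("cleanup", cleanup)):
        try:
            step()
        except Exception as exc:  # the daemon survives, but the failure is recorded
            errors.append(f"{name}: {exc}")
    if errors:
        health.consecutive_failures += 1
        health.last_error = "; ".join(errors)
    else:
        health.consecutive_failures = 0
        health.last_success_ts = now()
        health.last_error = None
    return health
```

A /health endpoint or status command could then report degraded(), last_success_ts and last_error. Under this arrangement, stale versions would still be pruned while the list-encoding error keeps breaking compaction.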