What problem are you trying to solve?
Hi! First, thanks for building this; the knowledge-graph idea is what pulled
me in. Quick context on how I got here, since this is my first open-source
contribution and I want to do it right.
I was looking for a way to turn my own pile of documents (meeting notes, PDFs)
into something explorable, found Understand Anything, and tried /understand-knowledge.
That's where I hit the wall: it expects an already-authored Karpathy-pattern
markdown wiki, so my raw documents couldn't go in directly. To unblock myself I
built a small standalone helper that converts documents into the wiki format
(index.md + [[wikilinks]] + entity/concept pages) and ran /understand-knowledge on
the result. It worked: I tested it on two of my own meeting transcripts and it
correctly built and accumulated a cross-linked graph across both. That's what
made me want to propose it upstream instead of keeping it as a private hack.
Digging into the code, the gap is small and concrete: in
skills/understand-knowledge/parse-knowledge-base.py, the raw/ loop (~L432-448)
builds each source node from only name + filePath + size; the summary is literally
"Raw source ({ext}, {size} KB)". The SKILL.md Notes confirm it: "Source nodes
from raw/ are lightweight (filename + size only) - we don't parse PDFs or binary
files." So any text inside those documents never reaches the article-analyzer or
the graph. Closing that makes the knowledge path usable on the documents people
actually have, not just hand-authored wikis.
Proposed solution (optional)
Add an optional deterministic text-extraction step for raw/ documents, reusing the
existing scan → analyze → merge flow with no downstream changes.
- New helper in the skill dir, e.g. extract-source-text.py (Python, to match the
existing parse-knowledge-base.py / merge-knowledge-graph.py convention).
- For raw/ files with supported extensions (start with .pdf, .docx), extract plain
text → markdown.
- Feed that text into the source node in parse-knowledge-base.py: populate a real
summary/content field instead of the filename+size string, so Phase 3's
article-analyzer can surface entities, claims, and relationships from the
document body the same way it does for wiki articles.
- Graceful degradation: if extraction libs aren't present, or a file is a scanned
image / unsupported binary, fall back to today's filename+size behavior. Output
for existing wikis stays byte-identical.
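A minimal sketch of the graceful-degradation behavior in the bullets above. Everything named here is hypothetical and for illustration only: `build_source_node`, the `extract` callable, `MAX_SUMMARY_CHARS`, and the node's dict layout are not taken from parse-knowledge-base.py, and the exact size formatting of the fallback string is my assumption.

```python
from pathlib import Path
from typing import Callable, Optional

# Hypothetical cap so one large document cannot bloat a source node.
MAX_SUMMARY_CHARS = 2000


def fallback_summary(path: Path) -> str:
    """Lightweight summary in the "Raw source (ext, size KB)" shape.

    The one-decimal size formatting is an assumption, not the real format.
    """
    ext = path.suffix.lstrip(".") or "unknown"
    size_kb = path.stat().st_size / 1024
    return f"Raw source ({ext}, {size_kb:.1f} KB)"


def build_source_node(
    path: Path,
    extract: Optional[Callable[[Path], Optional[str]]] = None,
) -> dict:
    """Build a source node, using extracted text only when it is available."""
    text = None
    if extract is not None:
        try:
            text = extract(path)
        except Exception:
            # Missing library, scanned image, corrupt file: degrade quietly.
            text = None

    node = {"name": path.name, "filePath": str(path), "size": path.stat().st_size}
    if text and text.strip():
        node["summary"] = text.strip()[:MAX_SUMMARY_CHARS]
    else:
        # No usable text: keep the filename+size behavior unchanged.
        node["summary"] = fallback_summary(path)
    return node
```

With no extractor, or with one that fails or returns nothing, the node carries the same filename+size summary as before, which is what would keep existing wikis unaffected.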
Dependency note (the main thing I'd want your call on): package.json keeps a tight
onlyBuiltDependencies list and leans on native tree-sitter bindings, so I'd avoid
adding native build surface. My default would be pure-Python libs (pypdf for PDF,
python-docx for DOCX), imported lazily so they're only needed when raw/ actually
contains documents. If you'd rather not add Python deps, an alternative is shelling
out to pandoc/markitdown when available.
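To make the lazy-import idea concrete, here is a rough sketch of what the pure-Python route could look like. The function names and the `EXTRACTORS` table are hypothetical; the only library calls used are `pypdf.PdfReader(...).pages` with `page.extract_text()` and `docx.Document(...).paragraphs` from python-docx. The pandoc/markitdown alternative is not shown.

```python
from pathlib import Path
from typing import Optional


def _extract_pdf(path: Path) -> str:
    # Lazy import: pypdf is only needed once a PDF actually shows up in raw/.
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    # Image-only (scanned) pages typically yield no text.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def _extract_docx(path: Path) -> str:
    # Lazy import: the "docx" module is provided by the python-docx package.
    import docx

    document = docx.Document(str(path))
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


# Hypothetical dispatch table; the first PR would cover only these two.
EXTRACTORS = {".pdf": _extract_pdf, ".docx": _extract_docx}


def extract_source_text(path: Path) -> Optional[str]:
    """Return plain text for a supported document, or None to signal fallback."""
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        return None  # unsupported extension
    try:
        text = extractor(path)
    except Exception:
        # ImportError when the library is absent, or a parse error on a
        # corrupt file; both mean "use the filename+size summary".
        return None
    return text.strip() or None
```

A `None` result would mean the caller keeps today's filename+size summary, so neither library becomes a hard requirement for wikis that have no documents in raw/.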
Scope for a first PR: just PDF + DOCX → richer source nodes. NOT in this PR:
OCR/scanned PDFs, images, audio, every format, or auto-generating wiki articles
from documents (that feels like the authoring layer's job and a much bigger change).
Small slice, behind the existing flow, with tests in the skill package's suite.
Alternatives you've considered
- The standalone converter I built for myself: it works, but it's a separate tool the
user must run first; folding extraction into the raw/ step is cleaner and keeps
everything inside /understand-knowledge.
- Pre-converting documents to markdown by hand (or via Obsidian Web Clipper) before
running the tool: this works but is exactly the manual friction this would remove.
Which part of the project?
skill (understand-knowledge + parse-knowledge-base.py)