feat: parse text from raw/ documents (PDF/DOCX) so /understand-knowledge can use their content

### What problem are you trying to solve?

Hi! 👋 First, thanks for building this — the knowledge-graph idea is what pulled
me in. Quick context on how I got here, since this is my first open-source
contribution and I want to do it right.

I was looking for a way to turn my own pile of documents (meeting notes, PDFs)
into something explorable, found Understand Anything, and tried /understand-knowledge.
That's where I hit the wall: it expects an already-authored Karpathy-pattern
markdown wiki, so my raw documents couldn't go in directly. To unblock myself I
built a small standalone helper that converts documents into the wiki format
(index.md + [[wikilinks]] + entity/concept pages) and ran /understand-knowledge on
the result. It worked — I tested it on two of my own meeting transcripts and it
correctly built and *accumulated* a cross-linked graph across both. That's what
made me want to propose it upstream instead of keeping it as a private hack.

Digging into the code, the gap is small and concrete: in
`skills/understand-knowledge/parse-knowledge-base.py`, the raw/ loop (~L432–448)
builds each source node from only name + filePath + size — the summary is literally
`"Raw source ({ext}, {size} KB)"`. The SKILL.md Notes confirm it: "Source nodes
from raw/ are lightweight (filename + size only) — we don't parse PDFs or binary
files." So any text inside those documents never reaches the article-analyzer or
the graph. Closing that makes the knowledge path usable on the documents people
actually have, not just hand-authored wikis.

### Proposed solution (optional)

Add an optional deterministic text-extraction step for raw/ documents, reusing the
existing scan → analyze → merge flow with no downstream changes.

1. New helper in the skill dir, e.g. `extract-source-text.py` (Python, to match the
   existing parse-knowledge-base.py / merge-knowledge-graph.py convention).
2. For raw/ files with supported extensions (start with .pdf, .docx), extract plain
   text → markdown.
3. Feed that text into the source node in parse-knowledge-base.py — populate a real
   summary/content field instead of the filename+size string, so Phase 3's
   article-analyzer can surface entities, claims, and relationships from the
   document body the same way it does for wiki articles.
4. Graceful degradation: if extraction libs aren't present, or a file is a scanned
   image / unsupported binary, fall back to today's filename+size behavior. Output
   for existing wikis stays byte-identical.
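
To make steps 2 to 4 concrete, here is a minimal sketch of what the helper could look like. It is not the existing code: the function names (`extract_text`, `fallback_summary`, `build_source_summary`), the `max_chars` cutoff, and the exact formatting of the fallback string are my assumptions for illustration. The only library calls used are `pypdf.PdfReader` / `page.extract_text()` and `docx.Document` / `paragraphs`.

```python
from pathlib import Path
from typing import Optional


def fallback_summary(path: Path) -> str:
    """Today's lightweight summary. The ext/size formatting here is assumed."""
    ext = path.suffix.lstrip(".").lower()
    size_kb = path.stat().st_size / 1024
    return f"Raw source ({ext}, {size_kb:.0f} KB)"


def extract_text(path: Path) -> Optional[str]:
    """Return plain text for a supported document, or None if unavailable.

    None covers every degradation case: unsupported extension, extraction
    library not installed, unreadable file, or a scanned PDF with no text layer.
    """
    ext = path.suffix.lower()
    try:
        if ext == ".pdf":
            from pypdf import PdfReader  # lazy: only needed when a PDF shows up

            reader = PdfReader(str(path))
            text = "\n\n".join((page.extract_text() or "") for page in reader.pages)
        elif ext == ".docx":
            from docx import Document  # lazy: provided by python-docx

            doc = Document(str(path))
            text = "\n\n".join(p.text for p in doc.paragraphs if p.text.strip())
        else:
            return None
    except ImportError:
        return None  # library missing: caller falls back to filename + size
    except Exception:
        return None  # corrupt or unreadable file: same fallback
    text = text.strip()
    return text or None


def build_source_summary(path: Path, max_chars: int = 500) -> str:
    """Hypothetical hook for the raw/ loop: real text if we have it, else today's string."""
    text = extract_text(path)
    if text is None:
        return fallback_summary(path)
    # Collapse whitespace so the summary stays on one line.
    return " ".join(text.split())[:max_chars]
```

Because every failure path returns `None`, the raw/ loop would only ever see either extracted text or the same string it produces today, which is what keeps existing output unchanged.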

Dependency note (the main thing I'd want your call on): package.json keeps a tight
onlyBuiltDependencies list and leans on native tree-sitter bindings, so I'd avoid
adding native build surface. My default would be pure-Python libs (pypdf for PDF,
python-docx for DOCX), imported lazily so they're only needed when raw/ actually
contains documents. If you'd rather not add Python deps, an alternative is shelling
out to pandoc/markitdown when available.
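
For comparison, a rough sketch of the shell-out alternative. One caveat I'm aware of: pandoc reads DOCX but does not accept PDF as an input format, so in this sketch PDFs only go through markitdown. The function names and the `which` parameter (there so the lookup can be stubbed in tests) are hypothetical; the commands are the plain `pandoc FILE -t markdown` and `markitdown FILE` invocations, both writing to stdout.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, List, Optional


def pick_command(
    path: Path, which: Callable[[str], Optional[str]] = shutil.which
) -> Optional[List[str]]:
    """Choose a converter CLI for this file, or None if nothing suitable is installed."""
    ext = path.suffix.lower()
    if ext == ".docx":
        pandoc = which("pandoc")
        if pandoc:
            return [pandoc, str(path), "-t", "markdown"]
    if ext in {".pdf", ".docx"}:
        markitdown = which("markitdown")
        if markitdown:
            return [markitdown, str(path)]
    return None


def convert_with_cli(path: Path, timeout: int = 60) -> Optional[str]:
    """Run the chosen CLI and return markdown text, or None on any failure."""
    cmd = pick_command(path)
    if cmd is None:
        return None
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            encoding="utf-8",
            errors="replace",
            timeout=timeout,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None  # tool crashed, timed out, or could not be started
    return result.stdout.strip() or None
```

This avoids new Python dependencies, at the cost of behavior that depends on what happens to be on the user's PATH.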

Scope for a first PR: just PDF + DOCX → richer source nodes. NOT in this PR:
OCR/scanned PDFs, images, audio, other document formats, or auto-generating wiki
articles from documents (that feels like the authoring layer's job and a much bigger
change). It stays a small slice behind the existing flow, with tests in the skill
package's suite.

### Alternatives you've considered

- The standalone converter I built for myself — works, but it's a separate tool the
  user must run first; folding extraction into the raw/ step is cleaner and keeps
  everything inside /understand-knowledge.
- Pre-converting documents to markdown by hand (or via Obsidian Web Clipper) before
  running the tool — works but is exactly the manual friction this would remove.

### Which part of the project?

skill (understand-knowledge + parse-knowledge-base.py)

feat: parse text from raw/ documents (PDF/DOCX) so /understand-knowledge can use their content #437
