
feat: parse text from raw/ documents (PDF/DOCX) so /understand-knowledge can use their content #437

@Adarsh1011

Description

What problem are you trying to solve?

Hi! 👋 First, thanks for building this; the knowledge-graph idea is what pulled
me in. Quick context on how I got here, since this is my first open-source
contribution and I want to do it right.

I was looking for a way to turn my own pile of documents (meeting notes, PDFs)
into something explorable, found Understand Anything, and tried /understand-knowledge.
That's where I hit the wall: it expects an already-authored Karpathy-pattern
markdown wiki, so my raw documents couldn't go in directly. To unblock myself I
built a small standalone helper that converts documents into the wiki format
(index.md + [[wikilinks]] + entity/concept pages) and ran /understand-knowledge on
the result. It worked: I tested it on two of my own meeting transcripts and it
correctly built and accumulated a cross-linked graph across both. That's what
made me want to propose it upstream instead of keeping it as a private hack.

Digging into the code, the gap is small and concrete: in
skills/understand-knowledge/parse-knowledge-base.py, the raw/ loop (~L432–448)
builds each source node from only name + filePath + size; the summary is literally
"Raw source ({ext}, {size} KB)". The SKILL.md Notes confirm it: "Source nodes
from raw/ are lightweight (filename + size only) - we don't parse PDFs or binary
files." So any text inside those documents never reaches the article-analyzer or
the graph. Closing that gap makes the knowledge path usable on the documents people
actually have, not just hand-authored wikis.

Proposed solution (optional)

Add an optional deterministic text-extraction step for raw/ documents, reusing the
existing scan → analyze → merge flow with no downstream changes.

  1. New helper in the skill dir, e.g. extract-source-text.py (Python, to match the
    existing parse-knowledge-base.py / merge-knowledge-graph.py convention).
  2. For raw/ files with supported extensions (start with .pdf, .docx), extract plain
    text → markdown.
  3. Feed that text into the source node in parse-knowledge-base.py: populate a real
    summary/content field instead of the filename+size string, so Phase 3's
    article-analyzer can surface entities, claims, and relationships from the
    document body the same way it does for wiki articles.
  4. Graceful degradation: if extraction libs aren't present, or a file is a scanned
    image / unsupported binary, fall back to today's filename+size behavior. Output
    for existing wikis stays byte-identical.
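
A minimal sketch of how steps 1, 2, and 4 could fit together, assuming pypdf and python-docx as the extraction libraries. The function names (`extract_pdf_text`, `extract_docx_text`, `extract_source_text`) and the "return `None` to mean fall back" convention are hypothetical and not taken from the repository; the caller in parse-knowledge-base.py would keep today's filename+size summary whenever it gets `None`.

```python
from pathlib import Path
from typing import Optional


def extract_pdf_text(path: Path) -> str:
    # Lazy import: pypdf is only required when a PDF is actually present.
    from pypdf import PdfReader

    reader = PdfReader(str(path))
    # extract_text() can return an empty string for image-only pages.
    return "\n\n".join((page.extract_text() or "") for page in reader.pages)


def extract_docx_text(path: Path) -> str:
    # Lazy import: python-docx is installed as "python-docx", imported as "docx".
    from docx import Document

    doc = Document(str(path))
    return "\n\n".join(p.text for p in doc.paragraphs if p.text.strip())


# Hypothetical registry; the first PR would only cover these two extensions.
EXTRACTORS = {".pdf": extract_pdf_text, ".docx": extract_docx_text}


def extract_source_text(path: Path) -> Optional[str]:
    """Return extracted text, or None to signal the filename+size fallback."""
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        return None  # unsupported extension: behavior stays as it is today
    try:
        text = extractor(path).strip()
    except Exception:
        # Covers a missing library (ImportError), unreadable or corrupt files,
        # and encrypted PDFs. Any failure degrades to the existing behavior.
        return None
    # Scanned PDFs usually yield no text; treat that as "nothing extracted".
    return text or None
```

Because every failure path returns `None`, a knowledge base with no PDF or DOCX files in raw/ never imports either library, which is what would keep output for existing wikis unchanged.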

Dependency note (the main thing I'd want your call on): package.json keeps a tight
onlyBuiltDependencies list and leans on native tree-sitter bindings, so I'd avoid
adding native build surface. My default would be pure-Python libs (pypdf for PDF,
python-docx for DOCX), imported lazily so they're only needed when raw/ actually
contains documents. If you'd rather not add Python deps, an alternative is shelling
out to pandoc/markitdown when available.
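
For comparison, the shell-out alternative could look roughly like the sketch below. It assumes a `pandoc` binary on PATH; the helper name `pandoc_to_markdown` and the 60-second timeout are illustrative choices, not existing code. One caveat: pandoc reads DOCX but does not accept PDF as an input format, so this route alone would cover only half of the proposed scope.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional


def pandoc_to_markdown(path: Path, timeout_s: int = 60) -> Optional[str]:
    """Convert a document to markdown via pandoc, or return None if that fails."""
    pandoc = shutil.which("pandoc")
    if pandoc is None:
        return None  # pandoc not installed: caller keeps the lightweight node
    try:
        result = subprocess.run(
            [pandoc, str(path), "--to", "markdown"],
            capture_output=True,
            encoding="utf-8",
            errors="replace",
            timeout=timeout_s,
            check=True,  # non-zero exit raises CalledProcessError
        )
    except (subprocess.SubprocessError, OSError):
        # Conversion error, timeout, or the binary could not be executed.
        return None
    return result.stdout.strip() or None
```

The trade-off is that this adds no Python dependency but makes results depend on what happens to be installed on the user's machine.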

Scope for a first PR: just PDF + DOCX → richer source nodes. NOT in this PR:
OCR/scanned PDFs, images, audio, every format, or auto-generating wiki articles
from documents (that feels like the authoring layer's job and a much bigger change).
Small slice, behind the existing flow, with tests in the skill package's suite.

Alternatives you've considered

  • The standalone converter I built for myself: it works, but it's a separate tool the
    user must run first; folding extraction into the raw/ step is cleaner and keeps
    everything inside /understand-knowledge.
  • Pre-converting documents to markdown by hand (or via Obsidian Web Clipper) before
    running the tool. That works, but it is exactly the manual friction this change would remove.

Which part of the project?

skill (understand-knowledge + parse-knowledge-base.py)

Labels: enhancement (New feature or request)
