feat(rag): index Microsoft Word (.docx) documents #1866

Open
kovtcharov wants to merge 23 commits into main from feat/rag-docx-indexing

Conversation

@kovtcharov kovtcharov commented Jun 25, 2026

Collaborator

Why this matters

Word documents couldn't be indexed for RAG — the UI rejected .docx with a "not supported, save as PDF first" message and the SDK shipped no extractor. Anyone with a handbook, contract, or report in .docx had to convert to PDF before GAIA could answer questions about it. After this change .docx indexes directly, the same as PDF / PPTX / XLSX, with no conversion step.

Extraction walks the document body in order and captures paragraph text, table cells (including tables nested inside a cell and rows/cells wrapped in repeating-section content controls), and — critically for form/template documents — the text inside content controls (w:sdt), hyperlinks, and textboxes, which Word stores outside the direct runs that Paragraph.text exposes. So filled-in form values get indexed, not just the labels. Tabs and line breaks are preserved as whitespace (so Column1/Column2 don't glue into an unsearchable token) and the VML mc:Fallback twin of a textbox is skipped so shape text isn't double-counted. Corrupt / non-.docx files and a missing python-docx install fail loudly with actionable, file-named errors (no silent skip). Allow-lists and rejection messaging across the UI backend and React frontend are flipped so .docx flows end-to-end; legacy binary .doc/.ppt/.xls remain intentionally rejected.
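For reviewers less familiar with WordprocessingML, the text-collection idea described above can be sketched roughly as follows. This is an illustration built on the standard library, not the code in src/gaia/rag/sdk.py; the function name paragraph_text and the SAMPLE fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"

# Elements that stand for whitespace rather than literal text.
WHITESPACE = {f"{{{W}}}tab": "\t", f"{{{W}}}br": "\n", f"{{{W}}}cr": "\n"}


def paragraph_text(node: ET.Element) -> str:
    """Join every w:t descendant in document order (hypothetical helper)."""
    parts = []

    def walk(el: ET.Element) -> None:
        if el.tag == f"{{{MC}}}Fallback":
            return  # VML twin of a textbox; the mc:Choice copy is kept instead
        if el.tag == f"{{{W}}}t":
            parts.append(el.text or "")
        elif el.tag in WHITESPACE:
            parts.append(WHITESPACE[el.tag])
        for child in el:
            walk(child)

    walk(node)
    return "".join(parts)


# Made-up paragraph: a tab between two runs, a content control, and a textbox
# that Word stores twice (mc:Choice and mc:Fallback).
SAMPLE = f"""
<w:p xmlns:w="{W}" xmlns:mc="{MC}">
  <w:r><w:t>Column1</w:t><w:tab/><w:t>Column2</w:t></w:r>
  <w:sdt><w:sdtContent><w:r><w:t> Jane</w:t></w:r></w:sdtContent></w:sdt>
  <mc:AlternateContent>
    <mc:Choice><w:r><w:t> box</w:t></w:r></mc:Choice>
    <mc:Fallback><w:r><w:t> box</w:t></w:r></mc:Fallback>
  </mc:AlternateContent>
</w:p>
""".strip()

text = paragraph_text(ET.fromstring(SAMPLE))
print(repr(text))
```

In this sketch the content-control value and the textbox text each appear once, and the tab keeps Column1 and Column2 as separate tokens.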

Closes #1072

Heads-up: bundled agent migration (not in the title)

This branch also carries the RoutingAgent / DocumentQAAgent migration to standalone hub wheels (gaia-agent-routing / gaia-agent-docqa, following the #1102 pattern), which arrived via the base branch this work was cut from. It is intentional and the docs/CI are updated in lockstep, but it is a breaking import-path change: from gaia.agents.routing.agent import RoutingAgent and from gaia.agents.docqa.agent import DocumentQAAgent no longer resolve, because both packages move under hub/agents/python/{routing,docqa}/. Flagging it here for release notes so the import break isn't a surprise; no deprecation shim is included given the migration's scope.
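For downstream callers, one way to absorb the move is a small loader that resolves the new module path and fails with an install hint. This is a sketch only: the module path gaia_agent_routing.agent comes from the commit notes in this PR, while the helper name load_routing_agent and the pip command in the message are assumptions.

```python
import importlib


def load_routing_agent():
    """Resolve RoutingAgent from the hub wheel (hypothetical helper)."""
    try:
        module = importlib.import_module("gaia_agent_routing.agent")
    except ImportError as e:
        # The install command is an assumption; adjust to how the wheel is distributed.
        raise ImportError(
            "RoutingAgent now ships in the gaia-agent-routing wheel. Install it "
            "(for example `pip install gaia-agent-routing`) and import from "
            "gaia_agent_routing.agent instead of gaia.agents.routing.agent."
        ) from e
    return module.RoutingAgent
```

Chaining with `from e` keeps the original traceback, matching the fail-loud approach used elsewhere in this PR.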

Test plan

  • pytest tests/unit/rag/test_docx_extraction.py — paragraphs, table cells, document order, content controls (inline + block), nested tables, repeating-section (sdt-wrapped) rows, hyperlinks, textbox single-capture, tab/break whitespace + intra-word integrity, corrupt/missing-file errors, dispatcher routing (15 tests)
  • pytest tests/unit/rag/ tests/unit/chat/ui/test_server.py tests/integration/test_files_router.py — 226 passed, 2 skipped (allow-list + legacy-office rejection tests updated)
  • End-to-end local run: generated a real .docx (heading + paragraphs + table), confirmed full RAGSDK.index_document() indexes it (planted fact + table cell retrievable) and corrupt/missing .docx is rejected with an actionable error naming the file
  • python util/lint.py --all — Black, isort, Pylint, Flake8 green
  • Frontend UnsupportedFeature.test.tsx updated to assert .docx is now supported (vitest)
  • Reviewer: index a .docx via gaia chat --index <file>.docx and ask a question about its contents

Ovtcharov and others added 15 commits June 4, 2026 14:56
DocumentQAAgent and RoutingAgent were the last two agents left in the
core source tree under src/gaia/agents/. They now ship as standalone
gaia-agent-docqa / gaia-agent-routing wheels under hub/agents/python/,
completing the "strip src/gaia/agents/ to framework only" goal for #1102
(only base/, tools/, registry.py, builder/ — plus the chat family and
email — remain in core).

docqa is a building-block RAG agent: it registers via the gaia.agent
entry point as a hidden agent (mirroring fileio), default model
Qwen3.5-35B-A3B-GGUF. routing is infrastructure — a meta-agent loaded by
class path from the OpenAI API server, not a registry agent — so it ships
without a gaia.agent entry point; gaia.api.agent_registry now resolves it
at gaia_agent_routing.agent.RoutingAgent and fails loudly with an install
hint when the wheel is absent.

Self-review follow-up to the docqa/routing migration: the gaia-agent-code
CLI imported RoutingAgent from the old in-tree path
(gaia.agents.routing.agent), which the migration broke. Repoint it at
gaia_agent_routing.agent and declare gaia-agent-routing as a dependency of
gaia-agent-code, since the `gaia-code` query path routes through
RoutingAgent for language/project-type detection. No reverse dependency
(routing → code) — routing resolves CodeAgent through the registry at
runtime, avoiding a cycle.

Also clears the now-dead RoutingAgent allowance in the agent-conventions
checker (it only applied while routing lived under src/gaia/agents/).

# Conflicts:
#	hub/agents/python/docqa/tests/test_docqa_agent.py

# Conflicts:
#	.github/workflows/test_gaia_cli.yml
#	setup.py

Merging main surfaced three stale references the migration missed:

- test_default_max_steps imported the now-migrated gaia.agents.docqa;
  repoint it at the core BuilderAgentConfig, which exercises the same
  field(default_factory=default_max_steps) inheritance.
- test_agent_pypi_publish asserted every published wheel declares a
  gaia.agent entry point, but routing is infrastructure loaded by
  class-path and intentionally ships without one. Exempt it explicitly.
- Routing module path + source links in the docs still pointed at
  src/gaia/agents/routing; repoint to the gaia_agent_routing wheel.

Also preserve the original traceback on the gaia-code ImportError
re-raise (raise ... from e) now that the block is being edited.

gaia-agent-code now depends on gaia-agent-routing>=0.1.0, which isn't
published to PyPI. The Test Code Agent workflow installed code straight
from the hub dir, so uv tried to resolve routing from the registry and
failed. Install the local routing package first so the dep resolves
locally. End users are unaffected — both wheels publish together on tag.

The API streaming tests target the 'gaia-code' model, which routes
through RoutingAgent. Pre-migration routing lived in core, so it
resolved automatically; now it ships as the gaia-agent-routing wheel
that the API Tests job didn't install — so 3 streaming tests hit the
(correct) missing-wheel error instead of a real agent. Install the
local routing+code hub packages, and re-run API tests when either
hub package changes.

CLAUDE.md still pointed DocumentQAAgent/RoutingAgent at the old
src/gaia/agents/{docqa,routing} locations and listed docqa in the source
tree — stale after the hub migration and misleading since CLAUDE.md loads
as context on every session. Point both at their hub wheels and drop the
docqa tree entry.

errors.py FRAMEWORK_PATHS carried a dead 'gaia/agents/routing' entry; the
wheel's frames are already filtered by 'site-packages/'. Remove it and
update the test that asserted its presence.

Word documents previously could not be indexed for RAG — the UI rejected
.docx with a "not supported, save as PDF" message and the SDK had no
extractor. Users with handbooks, contracts, and reports in .docx had to
convert to PDF first. Now .docx indexes directly like PDF/PPTX/XLSX.

Extraction walks the document body in order, capturing paragraph text,
table cells (including tables nested in a cell), and — importantly for
form/template docs — text inside content controls (w:sdt) and hyperlinks,
which Word stores outside the direct runs that Paragraph.text exposes.
Corrupt/non-.docx files and a missing python-docx install fail loudly with
actionable errors. Allow-lists and rejection messaging across the UI
backend and frontend are updated so .docx flows end-to-end.

Closes #1072

@github-actions (bot) added the labels documentation, dependencies, devops, rag, tests, performance, agents on Jun 25, 2026
@github-actions

Contributor

Verdict: Approve with suggestions

Word (.docx) documents now index directly for RAG instead of being rejected with a "save as PDF first" message — the extractor walks the document body in order and, importantly, pulls text out of content controls and hyperlinks that Word hides from the simple paragraph API, so filled-in form values get indexed rather than just the labels. The allow-lists and rejection messaging are flipped consistently across the SDK, UI backend, and React frontend, with a zip-bomb guard mirroring the existing .pptx path. Test coverage is genuinely thorough (paragraphs, tables, nested tables, inline + block content controls, hyperlinks, corrupt-file errors, dispatcher routing).

Two things to be aware of, neither blocking:

  • The PR bundles a second, larger change the title/description don't mention: RoutingAgent and DocumentQAAgent are migrated out of the core package into standalone gaia-agent-routing / gaia-agent-docqa hub wheels (Agent Hub: Restructure — move production agents to hub/agents/ #1102). This changes their public import paths — anyone doing from gaia.agents.routing.agent import RoutingAgent will break. The migration itself is clean and the docs are updated in lockstep, but it should be called out in the description / release notes so the breaking import change isn't a surprise.
  • A couple of minor nits below.

No correctness or security issues found.

🔍 Technical details

🟡 Important

Bundled agent migration is a breaking SDK change not surfaced in the PR description: src/gaia/agents/routing/ and src/gaia/agents/docqa/ move to hub/agents/python/{routing,docqa}/, changing the documented import path (docs/sdk/agents/routing.mdx previously showed from gaia.agents.routing.agent import RoutingAgent). This follows the established #1102 pattern (code/jira/docker/blender/sd already migrated), the docs are all updated, and there's no dangling in-core reference (a grep over src/gaia/ comes back clean), so it's intentional and well-executed, not a code defect. The only gap is visibility: the description is "feat(rag): index .docx" and says nothing about the migration or the import-path break. Add a line to the description / changelog so release notes capture it. No deprecation shim is needed given the migration's scope.

🟢 Minor

  • # TODO(#1072) lives inside the docstring (src/gaia/rag/sdk.py:1621) — it renders as literal docstring text with a stray #. Move it below the closing """ as a normal comment, or drop the #:
            Known omissions: header/footer text (separate XML parts, usually
            repeated boilerplate) and embedded images (TODO #1072: VLM
            extraction for images embedded in .docx files).
    
  • gaia-code install hint references a doc path (src/gaia/api/agent_registry.py:1029) — the error string points users to docs/spec/agent-hub-restructure.mdx; worth a quick confirm that page exists on the rendered site, otherwise the actionable error sends users to a 404.

Strengths

  • The content-control / hyperlink handling (_paragraph_text joining every w:t descendant + recursive _emit for block-level w:sdt and nested tables) is the right call — it's exactly the text that form/template .docx files hide from Paragraph.text, and the tests prove it (test_inline_content_control_captured, test_nested_table_in_cell_captured).
  • Fail-loud error handling done well: corrupt/non-.docx and missing python-docx raise actionable, file-named ValueErrors with raise ... from e, and the zip-bomb guard reuses the proven .pptx pattern.
  • Migration hygiene is strong — INFRA_ONLY_AGENT_IDS exemption for routing's missing gaia.agent entry point, the pytest.importorskip guards for framework-only envs, the loud install hint in agent_registry.py, and synchronized CI workflows all show the breaking move was threaded through carefully.
  • Allow-list changes are consistent across all four surfaces (ui/utils.py, ui/routers/files.py, UnsupportedFeature.tsx, frontend tests), so .docx flows end-to-end with no contradicting "unsupported" message left behind.
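To make the fail-loud and zip-bomb points above concrete, here is a minimal sketch of the general pattern. It is not the guard shipped in this PR: the function name, the size cap, and the message wording are all assumptions for illustration.

```python
import io
import zipfile

# Hypothetical cap; the real limit used by the SDK is not stated in this thread.
MAX_UNCOMPRESSED_BYTES = 200 * 1024 * 1024


def check_docx_archive(source, name, limit=MAX_UNCOMPRESSED_BYTES):
    """Return the declared uncompressed size, or raise a file-named ValueError."""
    try:
        with zipfile.ZipFile(source) as zf:
            total = sum(info.file_size for info in zf.infolist())
    except (zipfile.BadZipFile, OSError) as e:
        # Corrupt, non-zip, missing, or unreadable input all get the same message.
        raise ValueError(
            f"Cannot index '{name}': not a readable .docx (zip) file. "
            "Re-save it from Word as .docx and try again."
        ) from e
    if total > limit:
        raise ValueError(
            f"Cannot index '{name}': expands to {total} bytes, "
            f"over the {limit}-byte limit."
        )
    return total


def make_archive(payload: bytes) -> io.BytesIO:
    """Build an in-memory zip with a single member, for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", payload)
    return buf
```

Note that the sizes summed here are the ones declared in the archive's directory; a stricter guard would also cap bytes actually read while decompressing.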

Adversarial review of the XML-walk extractor surfaced three cases that
silently degraded exactly the form/template/report documents the feature
targets:

- Textboxes/shapes (mc:AlternateContent) were emitted twice — once from the
  DrawingML mc:Choice and once from the VML mc:Fallback twin — and glued
  onto the host paragraph. Skip mc:Fallback so shape text is captured once.
- Tabs and line/page breaks (w:tab/w:br/w:cr) were dropped, gluing adjacent
  words into unsearchable tokens (e.g. "Column1Column2"). Translate them to
  whitespace while leaving intra-word run splits untouched.
- Rows/cells wrapped in repeating-section content controls (w:sdt around
  w:tr/w:tc) were skipped by the direct-child findall. Descend through the
  wrappers.
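
The third case can be illustrated with a small standalone sketch (hypothetical helper names, standard library only; not the extractor's actual code). A direct-child lookup misses rows wrapped in w:sdt, while descending through w:sdt / w:sdtContent finds them:

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(name: str) -> str:
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{name}"


def iter_through_sdt(parent, wanted):
    """Yield `wanted` children, descending through content-control wrappers."""
    for child in parent:
        if child.tag == q(wanted):
            yield child
        elif child.tag in (q("sdt"), q("sdtContent")):
            yield from iter_through_sdt(child, wanted)


def table_rows(tbl):
    """Return cell text per row, including sdt-wrapped rows and cells."""
    return [
        ["".join(t.text or "" for t in tc.iter(q("t")))
         for tc in iter_through_sdt(tr, "tc")]
        for tr in iter_through_sdt(tbl, "tr")
    ]


# Made-up table: one plain row and one row inside a repeating-section control.
SAMPLE = f"""
<w:tbl xmlns:w="{W}">
  <w:tr><w:tc><w:p><w:r><w:t>Name</w:t></w:r></w:p></w:tc></w:tr>
  <w:sdt><w:sdtContent>
    <w:tr><w:tc><w:p><w:r><w:t>Alice</w:t></w:r></w:p></w:tc></w:tr>
  </w:sdtContent></w:sdt>
</w:tbl>
""".strip()

tbl = ET.fromstring(SAMPLE)
direct_rows = tbl.findall(q("tr"))  # direct children only
rows = table_rows(tbl)
print(len(direct_rows), rows)
```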

Also wrap missing-file / directory / permission OSErrors in the same
actionable message as the corrupt-file path instead of a raw traceback.

Adds regression tests for each case (textbox single-capture, tab/break
whitespace, intra-word integrity, sdt-wrapped rows).

@kovtcharov requested a review from itomek June 25, 2026 23:48
@kovtcharov

Collaborator Author

Thanks for the review — addressed:

  • Bundled routing/docqa migration (Important): surfaced in the PR description under "Heads-up: bundled agent migration," including the breaking import-path change for release notes. The migration rides in from the base branch this was cut from; per the review it's clean/intentional, so it stays — now it's documented rather than silent.
  • # TODO(#1072) rendering inside the docstring (Minor): moved into the "Known omissions" sentence so there's no stray # in the rendered docstring (fa01b9d3).
  • gaia-code install-hint doc path (Minor): verified — docs/spec/agent-hub-restructure.mdx exists, so the actionable error doesn't 404. No change needed.

Also merged latest main (only conflict was CLAUDE.md, resolved to main's authoritative version since this PR doesn't touch it). Full suite green after the merge: 226 passed, 2 skipped; lint clean.

Resolve conflicts: adopt the standalone gaia-agent-chat wheel import path
(#1102) in agent-split tests, keep the chat/tools path trigger in the docqa
workflow, and unify the test-summary needs list.
@github-actions

Contributor

🟡 setup.py has duplicate dict keys and duplicate list entries introduced by this push — the new lines for agent-docqa and agent-routing were inserted above entries that already existed for both, so each appears twice.

In Python, duplicate keys in a dict literal are silently allowed (last value wins), meaning the first of each pair is dead code. The "agents" meta-extra list also ends up installing gaia-agent-docqa and gaia-agent-routing twice (pip deduplicates at resolution time, so no install breakage, but the list is wrong).

Please remove the four stale lines at lines 274–275 and 292–293 of setup.py (the ones that now appear below agent-email in the dict, and below gaia-agent-email in the "agents" list).

🔍 Technical details

setup.py:271–275, extras_require dict:

        "agent-docqa": ["gaia-agent-docqa"],   # ← new (line 271)
        "agent-routing": ["gaia-agent-routing"], # ← new (line 272)
        "agent-email": ["gaia-agent-email"],
        "agent-docqa": ["gaia-agent-docqa"],   # ← stale duplicate (line 274)
        "agent-routing": ["gaia-agent-routing"], # ← stale duplicate (line 275)

setup.py:289–293, "agents" list:

            "gaia-agent-docqa",   # ← new (line 289)
            "gaia-agent-routing", # ← new (line 290)
            "gaia-agent-email",
            "gaia-agent-docqa",   # ← stale duplicate (line 292)
            "gaia-agent-routing", # ← stale duplicate (line 293)

Lines 274–275 and 292–293 should be deleted.
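
A merge artifact like this can also be caught mechanically. The following is a sketch of one possible check using the standard ast module; the function name and the SAMPLE snippet are hypothetical and only mirror the shape of the extras_require excerpt above.

```python
import ast
from collections import Counter


def duplicate_dict_keys(source: str) -> list:
    """Return constant keys that appear more than once in any dict literal."""
    dupes = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Dict):
            # `**mapping` entries show up as None keys and are skipped here.
            keys = [k.value for k in node.keys if isinstance(k, ast.Constant)]
            dupes.extend(k for k, n in Counter(keys).items() if n > 1)
    return dupes


SAMPLE = '''
extras = {
    "agent-docqa": ["gaia-agent-docqa"],
    "agent-routing": ["gaia-agent-routing"],
    "agent-email": ["gaia-agent-email"],
    "agent-docqa": ["gaia-agent-docqa"],
    "agent-routing": ["gaia-agent-routing"],
}
'''

found = duplicate_dict_keys(SAMPLE)
print(found)
```

Running a check of this kind over setup.py in CI would flag the repeated keys regardless of which copy a merge leaves behind.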

Black/Flake8 E303 in test_agents_split.py after the main merge.
@github-actions

Contributor

🟡 setup.py has duplicate keys/entries introduced by this push — looks like a merge artifact that needs a cleanup pass before this lands.

The extras_require dict now has "agent-docqa" and "agent-routing" defined twice. Python silently uses the last value (same value here, so no functional breakage), but Python 3.12+ emits SyntaxWarning: 'agent-docqa' is a duplicate key in a dict literal, which will show up in CI. The "all" extras list has the same packages listed twice too.

🔍 Technical details

setup.py around line 268 (after the patch):

        "agent-browser": ["gaia-agent-browser"],
        "agent-docqa": ["gaia-agent-docqa"],    # ← newly added
        "agent-routing": ["gaia-agent-routing"], # ← newly added
        "agent-email": ["gaia-agent-email"],
        "agent-docqa": ["gaia-agent-docqa"],    # ← already existed — duplicate key
        "agent-routing": ["gaia-agent-routing"], # ← already existed — duplicate key

And in the "all" list around line 286:

        "gaia-agent-browser",
        "gaia-agent-docqa",    # ← newly added
        "gaia-agent-routing",  # ← newly added
        "gaia-agent-email",
        "gaia-agent-docqa",    # ← already existed — duplicate entry
        "gaia-agent-routing",  # ← already existed — duplicate entry

Fix: remove the pre-existing "agent-docqa" / "agent-routing" dict keys and the duplicate list entries below "agent-email", keeping only the newly positioned ones above it.

The intro note listed Word/Excel as supported but the CLI examples still
showed only PDF/PPTX, so a user couldn't tell .docx was indexable.
@github-actions

Contributor

🟡 setup.py has duplicate dictionary keys for agent-docqa and agent-routing: the entries added by this push land just before identical pre-existing entries. Python silently keeps the last definition; Python 3.12+ promotes this to a SyntaxWarning that can break strict CI runs. Both duplicate occurrences (in extras_require and in the "agents" list) need one copy removed.

🔍 Technical details

setup.py:271-275 — after the patch, the dict literal reads:

"agent-docqa": ["gaia-agent-docqa"],   # ← newly inserted (line 271)
"agent-routing": ["gaia-agent-routing"],  # ← newly inserted (line 272)
"agent-email": ["gaia-agent-email"],
"agent-docqa": ["gaia-agent-docqa"],   # ← pre-existing (line 274) — duplicate key
"agent-routing": ["gaia-agent-routing"],  # ← pre-existing (line 275) — duplicate key

setup.py:289-293 — same pattern in the "agents" flat list (harmless duplicate list entries, but still wrong).

Fix: remove the newly-added lines (271-272 in the patched file) — the pre-existing entries already register both agents correctly.

Merge re-inserted both keys above the pre-existing entries, producing
duplicate dict keys (Python 3.12 SyntaxWarning) and double list entries
in the agents meta-extra.
@kovtcharov-amd

Collaborator

Addressed the setup.py duplicate-key feedback in 8cd92098: removed the stale agent-docqa/agent-routing entries that the main merge re-inserted below agent-email (both in extras_require and the agents meta-extra). Verified no SyntaxWarning: duplicate key under python -W error::SyntaxWarning.

The remaining red on RAG Integration Tests / Test Lemonade Embeddings API is the known transient AMD-Vulkan "llama-server failed to start" fault on the self-hosted runner (tracked by #1876), not this change: the server comes up, and it is the embedder load that intermittently returns 500. RAG Unit Tests and the other 52 checks are green.


Labels

agents, dependencies, devops, documentation, performance, rag, tests

Development

Successfully merging this pull request may close these issues.

[Feature] Support Microsoft Office (docx, pptx, xls) indexing

3 participants