Skip to content

feat(document): add multi-format course material upload#741

Open
jackefn wants to merge 1 commit into
THU-MAIC:mainfrom
jackefn:feat/document-multiformat-upload
Open

feat(document): add multi-format course material upload#741
jackefn wants to merge 1 commit into
THU-MAIC:mainfrom
jackefn:feat/document-multiformat-upload

Conversation

@jackefn

@jackefn jackefn commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Summary

Implement MAIC ETL Milestone 2 by moving the upload flow from PDF-only upload to single-file course material upload.

This PR adds a generic document extraction API, supports PDF/DOCX/PPTX/TXT/Markdown uploads from the homepage, routes files through the document extractor boundary, and keeps the existing generation compatibility shape stable.

Related Issues

Related to #621, #611, #41, #140

Changes

  • Add /api/extract-document for generic course material extraction.
  • Add shared MIME helpers and upload accept-list for PDF, DOCX, PPTX, TXT, and Markdown.
  • Add a local plain-text extractor for TXT and Markdown so these formats do not depend on MinerU.
  • Extend the self-hosted MinerU document extractor to support PDF, DOCX, and PPTX.
  • Update the homepage upload UI from "Upload PDF" to "Upload course material".
  • Preserve single-file upload behavior and existing generation compatibility fields.
  • Select providers by explicit PDF parser choice or by extractor capability match.
  • Return clear diagnostics for unsupported file types and unsupported provider/format combinations.
  • Update generation preview copy to show the uploaded file type dynamically.
  • Add document extractor tests for registry behavior, PDF compatibility, and TXT/Markdown extraction.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • CI/CD or build changes

Verification

Steps to reproduce / test

  1. Start OpenMAIC locally and upload a PDF, TXT, or Markdown file from the homepage course material upload control.
  2. Configure MinerU and upload a DOCX or PPTX file to confirm provider-backed extraction.
  3. Upload an unsupported file type or force an unsupported provider/format combination and confirm a clear error is returned.

What you personally verified

  • TXT and Markdown extraction runs locally through plain-text and no longer calls MinerU.
  • PDF extraction still works through the existing PDF-compatible path.
  • DOCX/PPTX are routed to MinerU by capability match when MinerU is configured.
  • Unsupported formats and unsupported provider/format combinations return clear diagnostics.
  • The generation preview displays the uploaded file type dynamically instead of hardcoding PDF.

Evidence

  • CI passes (pnpm check && pnpm lint && npx tsc --noEmit)
  • Manually tested locally
  • Screenshots / recordings attached (if UI changes)

Local verification:

pnpm vitest run tests/document/extractor-registry.test.ts tests/document/pdf-compat.test.ts tests/document/text-extractor.test.ts
Test Files  3 passed (3)
Tests       13 passed (13)

pnpm exec tsc --noEmit
passed

pnpm check:i18n-keys
passed

pnpm check
passed

pnpm lint
0 errors, existing warnings only

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have added/updated documentation as needed
  • My changes do not introduce new warnings

@wyuc wyuc left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this — the extractor-boundary design (registry + capability matching + ParsedPdfContent compat shape) is clean, and the PDF/TXT/Markdown paths look solid. One blocking issue on the DOCX/PPTX path, plus a few minor items.

Blocking

DOCX/PPTX returns an opaque 500 instead of clear diagnostics

The headline new formats break (or fail unhelpfully) on the two most common setups, which contradicts the "clear diagnostics for unsupported provider/format combinations" goal:

  • No MinerU configured (stock deployment). For a DOCX/PPTX upload the client sends no providerId/baseUrl (app/generation-preview/page.tsx, the documentMimeType === 'application/pdf' guards), so the route capability-matches mineru — the only office-capable provider. mineru is unmanaged → resolvePDFBaseUrl('mineru', undefined) returns undefinedparseWithMinerUDocument throws "MinerU base URL is required…", which the outer try/catch maps to apiError('PARSE_FAILED', 500, …). Users get a 500 with a raw internal message, not an actionable 4xx.

  • MinerU configured client-side (not server-managed). The same PDF-only guards in generation-preview drop the user's configured baseUrl/apiKey for non-PDF files, so even a correctly-configured client cannot upload DOCX/PPTX — the request reaches the route without a base URL and fails identically.

Root cause is twofold: (1) generation-preview only forwards provider config when the file is a PDF, and (2) selectDocumentExtractorProvider matches on static capabilities only, with no check that the matched provider is actually configured/runnable.

Suggested fix:

  • Forward providerId/apiKey/baseUrl for the document path regardless of MIME type (let the route decide), so client-configured MinerU works for office formats.
  • After provider selection, if the chosen provider is MinerU-backed and has neither a server config (isServerConfiguredProvider('pdf', id)) nor a client base URL, return a 400/422 with an actionable message (e.g. "DOCX/PPTX requires a configured MinerU extractor") rather than falling through to a 500. Ideally capability matching also factors in configured-availability.

Minor

  • getGenerationStepText produces wrong labels (app/generation-preview/types.ts). The type label is derived from pdfFileName.split('.').pop().toUpperCase() with no validation, so report.v2 renders as "Analyzing V2 file" and an extension-less name renders the whole filename. Prefer the MIME-based branch, or only accept the extension when it's one of the known set.

  • Text charset is effectively ignored (lib/document/extractors/text.ts + lib/document/mime.ts). normalizeDocumentMimeType strips ;charset=… before the extractor runs, so textDecoderForMimeType always falls back to UTF-8 and a non-UTF-8 TXT/MD upload is silently mojibaked. Either preserve the charset for text types, sniff a BOM, or document UTF-8-only.

  • Dead i18n keys. getGenerationStepText always overrides the pdf-analysis step, so generation.analyzingPdf / analyzingPdfDesc are now unreachable in every locale. Remove or repoint.

Tests

Registry and text-extractor tests are good. Gaps worth closing:

  • No test for the unconfigured-MinerU path — the registry test asserts mineru is selected for office formats but nothing asserts the route/extract behavior when it can't run (where the 500 hides).
  • No test for extract-document route error mapping (unsupported MIME → 400, unknown provider → 400, provider/format mismatch → 400).

Verdict

Mergeable once the DOCX/PPTX path returns an actionable 4xx (and forwards client MinerU config for office formats). A route-level error-mapping test would lock it in. The rest is low-severity polish.

@jackefn jackefn force-pushed the feat/document-multiformat-upload branch from 7c287fb to fa52e8c Compare June 16, 2026 12:05

@wyuc wyuc left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick turnaround — the 422 path, client-config forwarding, label whitelist, BOM sniffing, dead-key cleanup, and the new route error-mapping test all look good. Two regressions slipped in with the fix, though:

Blocking (new regressions)

1. Non-PDF uploads from the main UI now get rejected with 400

Removing the documentMimeType === 'application/pdf' guard means generation-preview now always forwards providerId. The homepage stores settings.pdfProviderId (default unpdf) for every upload, so a TXT/Markdown/DOCX/PPTX upload arrives with providerId=unpdf. In app/api/extract-document/route.ts the !provider.supportedMimeTypes.includes(mimeType) check then 400s the request ("Document extractor 'unpdf' does not support MIME type 'text/plain'") before selectDocumentExtractorProvider can pick plain-text/mineru.

Net effect: the new course-material formats fail on the default path unless the user manually switches the extractor dropdown to a compatible one — which defeats the feature. (TXT/MD should "just work" locally; they no longer do.)

Fix: treat preferredProviderId as a hint, not a hard constraint — if it doesn't support the file's MIME type, fall back to MIME/capability-based selection instead of returning 400. Alternatively, only forward providerId from the client when it actually applies to the file (PDF, or a provider that supports the MIME). Note you still want to forward apiKey/baseUrl unconditionally so client-configured MinerU keeps working for Office formats (that was the original fix).

2. The new 422 guard wrongly blocks mineru-cloud

isMinerUBackedProvider returns true for both mineru and mineru-cloud, but mineru-cloud resolves its endpoint as config.baseUrl || MINERU_CLOUD_DEFAULT_BASE (lib/pdf/mineru-cloud.ts) — it needs only an API key, no base URL. So a user who selects mineru-cloud for a PDF (key set, no base URL — the normal case) now hits the 422 "requires a configured MinerU … base URL", a path that worked before.

Fix: narrow the guard to provider.id === 'mineru' (self-hosted, which genuinely requires a base URL), or check the effective endpoint after resolution (including the cloud default) rather than the raw client base URL.

Otherwise

Round-1 items are all addressed well. A test covering (a) a non-PDF upload with a default unpdf providerId and (b) mineru-cloud PDF with key-only would lock both regressions down. Close these two and it's good to go.

@jackefn jackefn force-pushed the feat/document-multiformat-upload branch from fa52e8c to f8e8631 Compare June 16, 2026 14:17
@yanpgwang

Copy link
Copy Markdown
Collaborator

Re-reviewed the latest push — both round-2 regressions are fixed and now locked by tests:

  1. Incompatible providerId treated as a hint ✅ — the route drops a non-matching preferred provider and falls back to selectDocumentExtractorProvider, so a default unpdf + TXT/MD upload resolves to plain-text instead of 400. Covered by "treats an incompatible preferred provider as a hint and falls back by MIME type".
  2. 422 guard no longer blocks mineru-cloud ✅ — narrowed to provider.id === 'mineru', so a key-only mineru-cloud PDF passes. Covered by "allows MinerU Cloud PDF extraction with an API key and no base URL".

Local checks: vitest run tests/document19/19 passed; tsc --noEmit clean; prettier --check clean on the changed files; new i18n keys present in all 8 locales.

One thing I'd suggest fixing in this PR (not as follow-up)

Office support is added for self-hosted MinerU but not for MinerU Cloud — this PR creates a self-host/cloud capability split. Today mineru-cloud is left PDF-only (lib/document/extractors/pdf.ts:41 only grants the Office MIME types to id === 'mineru'). So a user configured with only MinerU Cloud (key, no base URL) who uploads a DOCX gets capability-matched to self-hosted mineru → the new 422 → a message telling them to configure a MinerU server / base URL, which they don't need and don't have. The headline "multi-format upload" feature silently doesn't apply to the Cloud path.

I'd rather we close this in the PR than ship the inconsistency and fix it later — adding Office to one backend but not the other is the kind of gap that's easy to land now and confusing to debug afterwards.

I verified the Cloud API genuinely supports Office, so this is a real, achievable fix (not an API limitation). Live call to mineru.net/api/v4/file-urls/batch with .docx/.pptx/.doc/.ppt all returned code:0 with OSS upload URLs that preserve the extension — and that extension is how MinerU infers the document type. Concretely:

  • lib/document/extractors/pdf.ts:41 — grant the Office MIME types to mineru-cloud as well.
  • lib/pdf/mineru-cloud.ts sanitizeFileName — relax the .pdf-only allowlist to also accept .docx/.pptx; and thread the real filename/extension through to the upload instead of forcing document.pdf. This is the critical part: if the Cloud receives a .pdf-named DOCX it will try to parse it as a PDF and fail.
  • route the mineru-cloud non-PDF path to the Cloud parser (the current extract() sends every non-PDF to the self-hosted parseWithMinerUDocument).

If you'd prefer to keep this PR scoped to self-hosted MinerU, that's a reasonable call too — but then the UI/422 copy should explicitly say Office requires self-hosted MinerU (Cloud = PDF only), so Cloud users aren't sent to configure a base URL they don't need.

Minor (non-blocking)

The charset branch in text.ts is dead: the route normalizes the MIME type (stripping ;charset=…) before calling extract, so textDecoderForMimeType's mimeType.match(/charset=/) (lib/document/extractors/text.ts:15) never matches. BOM sniffing is the only effective handling, so a charset=gbk file with no BOM still falls back to UTF-8 and mojibakes. Either drop the dead regex (and document UTF-8 + BOM only) or preserve the charset through to the extractor.

Verdict

Round-2 regressions are resolved and tested. The self-host/cloud Office split is the one thing I'd like addressed before merge (or documented away if Cloud is intentionally out of scope); the charset note is low-severity polish.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants