feat(document): add multi-format course material upload#741
Conversation
wyuc
left a comment
There was a problem hiding this comment.
Thanks for this — the extractor-boundary design (registry + capability matching + ParsedPdfContent compat shape) is clean, and the PDF/TXT/Markdown paths look solid. One blocking issue on the DOCX/PPTX path, plus a few minor items.
Blocking
DOCX/PPTX returns an opaque 500 instead of clear diagnostics
The headline new formats break (or fail unhelpfully) on the two most common setups, which contradicts the "clear diagnostics for unsupported provider/format combinations" goal:
-
No MinerU configured (stock deployment). For a DOCX/PPTX upload the client sends no
providerId/baseUrl(app/generation-preview/page.tsx, thedocumentMimeType === 'application/pdf'guards), so the route capability-matchesmineru— the only office-capable provider.mineruis unmanaged →resolvePDFBaseUrl('mineru', undefined)returnsundefined→parseWithMinerUDocumentthrows "MinerU base URL is required…", which the outer try/catch maps toapiError('PARSE_FAILED', 500, …). Users get a 500 with a raw internal message, not an actionable 4xx. -
MinerU configured client-side (not server-managed). The same PDF-only guards in
generation-previewdrop the user's configuredbaseUrl/apiKeyfor non-PDF files, so even a correctly-configured client cannot upload DOCX/PPTX — the request reaches the route without a base URL and fails identically.
Root cause is twofold: (1) generation-preview only forwards provider config when the file is a PDF, and (2) selectDocumentExtractorProvider matches on static capabilities only, with no check that the matched provider is actually configured/runnable.
Suggested fix:
- Forward
providerId/apiKey/baseUrlfor the document path regardless of MIME type (let the route decide), so client-configured MinerU works for office formats. - After provider selection, if the chosen provider is MinerU-backed and has neither a server config (
isServerConfiguredProvider('pdf', id)) nor a client base URL, return a400/422with an actionable message (e.g. "DOCX/PPTX requires a configured MinerU extractor") rather than falling through to a 500. Ideally capability matching also factors in configured-availability.
Minor
-
getGenerationStepTextproduces wrong labels (app/generation-preview/types.ts). The type label is derived frompdfFileName.split('.').pop().toUpperCase()with no validation, soreport.v2renders as "Analyzing V2 file" and an extension-less name renders the whole filename. Prefer the MIME-based branch, or only accept the extension when it's one of the known set. -
Text charset is effectively ignored (
lib/document/extractors/text.ts+lib/document/mime.ts).normalizeDocumentMimeTypestrips;charset=…before the extractor runs, sotextDecoderForMimeTypealways falls back to UTF-8 and a non-UTF-8 TXT/MD upload is silently mojibaked. Either preserve the charset for text types, sniff a BOM, or document UTF-8-only. -
Dead i18n keys.
getGenerationStepTextalways overrides thepdf-analysisstep, sogeneration.analyzingPdf/analyzingPdfDescare now unreachable in every locale. Remove or repoint.
Tests
Registry and text-extractor tests are good. Gaps worth closing:
- No test for the unconfigured-MinerU path — the registry test asserts
mineruis selected for office formats but nothing asserts the route/extract behavior when it can't run (where the 500 hides). - No test for
extract-documentroute error mapping (unsupported MIME → 400, unknown provider → 400, provider/format mismatch → 400).
Verdict
Mergeable once the DOCX/PPTX path returns an actionable 4xx (and forwards client MinerU config for office formats). A route-level error-mapping test would lock it in. The rest is low-severity polish.
7c287fb to
fa52e8c
Compare
wyuc
left a comment
There was a problem hiding this comment.
Thanks for the quick turnaround — the 422 path, client-config forwarding, label whitelist, BOM sniffing, dead-key cleanup, and the new route error-mapping test all look good. Two regressions slipped in with the fix, though:
Blocking (new regressions)
1. Non-PDF uploads from the main UI now get rejected with 400
Removing the documentMimeType === 'application/pdf' guard means generation-preview now always forwards providerId. The homepage stores settings.pdfProviderId (default unpdf) for every upload, so a TXT/Markdown/DOCX/PPTX upload arrives with providerId=unpdf. In app/api/extract-document/route.ts the !provider.supportedMimeTypes.includes(mimeType) check then 400s the request ("Document extractor 'unpdf' does not support MIME type 'text/plain'") before selectDocumentExtractorProvider can pick plain-text/mineru.
Net effect: the new course-material formats fail on the default path unless the user manually switches the extractor dropdown to a compatible one — which defeats the feature. (TXT/MD should "just work" locally; they no longer do.)
Fix: treat preferredProviderId as a hint, not a hard constraint — if it doesn't support the file's MIME type, fall back to MIME/capability-based selection instead of returning 400. Alternatively, only forward providerId from the client when it actually applies to the file (PDF, or a provider that supports the MIME). Note you still want to forward apiKey/baseUrl unconditionally so client-configured MinerU keeps working for Office formats (that was the original fix).
2. The new 422 guard wrongly blocks mineru-cloud
isMinerUBackedProvider returns true for both mineru and mineru-cloud, but mineru-cloud resolves its endpoint as config.baseUrl || MINERU_CLOUD_DEFAULT_BASE (lib/pdf/mineru-cloud.ts) — it needs only an API key, no base URL. So a user who selects mineru-cloud for a PDF (key set, no base URL — the normal case) now hits the 422 "requires a configured MinerU … base URL", a path that worked before.
Fix: narrow the guard to provider.id === 'mineru' (self-hosted, which genuinely requires a base URL), or check the effective endpoint after resolution (including the cloud default) rather than the raw client base URL.
Otherwise
Round-1 items are all addressed well. A test covering (a) a non-PDF upload with a default unpdf providerId and (b) mineru-cloud PDF with key-only would lock both regressions down. Close these two and it's good to go.
fa52e8c to
f8e8631
Compare
|
Re-reviewed the latest push — both round-2 regressions are fixed and now locked by tests:
Local checks: One thing I'd suggest fixing in this PR (not as follow-up)Office support is added for self-hosted MinerU but not for MinerU Cloud — this PR creates a self-host/cloud capability split. Today I'd rather we close this in the PR than ship the inconsistency and fix it later — adding Office to one backend but not the other is the kind of gap that's easy to land now and confusing to debug afterwards. I verified the Cloud API genuinely supports Office, so this is a real, achievable fix (not an API limitation). Live call to
If you'd prefer to keep this PR scoped to self-hosted MinerU, that's a reasonable call too — but then the UI/422 copy should explicitly say Office requires self-hosted MinerU (Cloud = PDF only), so Cloud users aren't sent to configure a base URL they don't need. Minor (non-blocking)The charset branch in VerdictRound-2 regressions are resolved and tested. The self-host/cloud Office split is the one thing I'd like addressed before merge (or documented away if Cloud is intentionally out of scope); the charset note is low-severity polish. |
Summary
Implement MAIC ETL Milestone 2 by moving the upload flow from PDF-only upload to single-file course material upload.
This PR adds a generic document extraction API, supports PDF/DOCX/PPTX/TXT/Markdown uploads from the homepage, routes files through the document extractor boundary, and keeps the existing generation compatibility shape stable.
Related Issues
Related to #621, #611, #41, #140
Changes
/api/extract-documentfor generic course material extraction.plain-textextractor for TXT and Markdown so these formats do not depend on MinerU.Type of Change
Verification
Steps to reproduce / test
What you personally verified
plain-textand no longer calls MinerU.Evidence
pnpm check && pnpm lint && npx tsc --noEmit)Local verification:
Checklist