docs(document): describe multi-format course material upload #742

Open
jackefn wants to merge 1 commit into THU-MAIC:main from jackefn:Docs/document-multiformat-upload

jackefn commented Jun 15, 2026
Contributor

Summary

Document the MAIC ETL multi-format course material upload flow.

This PR updates the MinerU/document extraction documentation to describe /api/extract-document, supported file types, provider selection behavior, server-side MinerU configuration, the compatibility response shape, and the current Milestone 2 scope/non-goals.

Related Issues

Related to #621, #611, #41, #140

Changes

  • Update the README MinerU section to describe course material upload and server-side configuration.
  • Replace the old PDF-only parser README with document extraction and multi-format upload documentation.
  • Document supported file types: PDF, DOCX, PPTX, TXT, and Markdown.
  • Document provider selection by explicit PDF parser choice or capability match.
  • Document unsupported provider/format diagnostics.
  • Document MinerU setup via .env.local and server-providers.yml.
  • Clarify the compatibility response shape used by the existing generation flow.
  • Clarify current Milestone 2 scope and non-goals.
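
The provider selection and diagnostics described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from the implementation PR: the type names, the provider ids (`local-text`, `mineru`), the capability table, and the helper functions are all hypothetical, and the assumption that MinerU handles PDF, DOCX, and PPTX while TXT/Markdown are extracted locally is inferred from this description.

```typescript
// Hypothetical sketch: none of these names come from the actual codebase.
type Format = "pdf" | "docx" | "pptx" | "txt" | "md";

interface Provider {
  id: string;
  formats: ReadonlySet<Format>;
}

// Assumed capability table: local extraction for TXT/Markdown,
// MinerU-backed extraction for PDF/DOCX/PPTX.
const PROVIDERS: Provider[] = [
  { id: "local-text", formats: new Set<Format>(["txt", "md"]) },
  { id: "mineru", formats: new Set<Format>(["pdf", "docx", "pptx"]) },
];

// Map file extensions to the supported formats listed in this PR.
const EXTENSIONS = new Map<string, Format>([
  ["pdf", "pdf"],
  ["docx", "docx"],
  ["pptx", "pptx"],
  ["txt", "txt"],
  ["md", "md"],
  ["markdown", "md"],
]);

function formatFromFilename(name: string): Format | undefined {
  const dot = name.lastIndexOf(".");
  if (dot < 0) return undefined;
  return EXTENSIONS.get(name.slice(dot + 1).toLowerCase());
}

type Selection =
  | { ok: true; provider: Provider }
  | { ok: false; reason: string };

// An explicit provider choice wins; otherwise fall back to the first
// provider whose capabilities include the requested format.
function selectProvider(format: Format, explicitId?: string): Selection {
  if (explicitId !== undefined) {
    const chosen = PROVIDERS.find((p) => p.id === explicitId);
    if (!chosen) {
      return { ok: false, reason: `unknown provider "${explicitId}"` };
    }
    if (!chosen.formats.has(format)) {
      return {
        ok: false,
        reason: `provider "${explicitId}" does not support ${format}`,
      };
    }
    return { ok: true, provider: chosen };
  }
  const match = PROVIDERS.find((p) => p.formats.has(format));
  return match
    ? { ok: true, provider: match }
    : { ok: false, reason: `no provider supports ${format}` };
}
```

With this table, `selectProvider("md")` resolves to the local provider, while `selectProvider("docx", "local-text")` returns a diagnostic instead of silently falling back, which is the kind of unsupported provider/format message the docs are meant to describe.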

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • CI/CD or build changes

Verification

Steps to reproduce / test

  1. Read the updated README MinerU section.
  2. Read lib/pdf/README.md and confirm it describes the current document extraction and multi-format upload behavior.
  3. Confirm the docs-only branch contains only documentation changes.

What you personally verified

  • The docs PR changes only README.md and lib/pdf/README.md.
  • The updated docs describe the split between local TXT/Markdown extraction and MinerU-backed DOCX/PPTX extraction.
  • The updated docs preserve /api/parse-pdf as a compatibility route while recommending /api/extract-document for new upload UI code.

Evidence

  • CI passes (pnpm check && pnpm lint && npx tsc --noEmit)
  • Manually tested locally
  • Screenshots / recordings attached (if UI changes)

Local verification:

/tmp/openmaic-multiformat-upload/node_modules/.bin/prettier README.md lib/pdf/README.md --check
All matched files use Prettier code style!

git diff --name-only
README.md
lib/pdf/README.md

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have added/updated documentation as needed
  • My changes do not introduce new warnings

@yanpgwang

Collaborator

Suggestion on PR structure (not the content): I'd fold this into #741 rather than keep it as a separate docs PR.

This PR only touches docs — README.md (+2/-2) and a large rewrite of lib/pdf/README.md (+108/-299) — and that README documents exactly the document/extractor subsystem that #741 introduces (lib/document/extractors/*, lib/document/mime.ts, lib/pdf/pdf-providers.ts). So it's the companion docs for #741, split off along the feat/ vs docs/ branch boundary.

Two reasons to merge them into one PR:

  • Our own convention points this way. The PR template asks for "I have added/updated documentation as needed," and CONTRIBUTING says "one concern per PR" — a feature and its own usage docs are the same concern. The common GitHub convention is also to land code and its docs together.
  • Right now the docs describe behavior that isn't merged and may still change. #741 is still under review, and the MinerU Cloud / Office scope is still open (the self-host vs cloud capability split I raised). If #741 changes that behavior, this README has to change too. Keeping the two in lockstep across separate PRs is exactly the overhead we'd avoid by merging them.

Standalone docs PRs (revising existing docs, README polish) are totally fine — it's specifically new-feature usage docs that should ride with the implementation. Suggest closing this and moving the README changes into #741.
