What's new
This release focuses on modular installation extras, CLI robustness improvements, and a sweep of bug fixes across summarization, logging, and setup.
✨ Features
- CLI: Accept kebab-case flags (`--foo-bar` → `--foo_bar`) automatically ([e9bfb80])
- CLI: Warn on every `sys.argv` mutation via `ArgvState` ([4fe38f2]); accept `--yt_*` as shorthand for `--youtube_doc_*` ([f800805])
- YouTube: Auto-detect original-language subtitle track (`-orig`), falling back to `en`/`en-US` ([0753b00])
- Prompts: Skip per-bullet citations when only one source; mention it once at the top instead ([c0097b7])
  - Exception: for YouTube/timecoded sources, use per-bullet timecodes (e.g. `[02:17:33]`) ([c0779a1])
🐛 Fixes
- Summarize: Strip `*DEEP BREATH*`-style LLM intro artifacts from all top-level bullets, not just the first ([dd09942], [c837143])
- Summarize: Fix model name in output summary ([e253d45])
- Logger: Actually remove the default DEBUG stderr handler instead of stacking a second sink on top of it ([2f3b295])
- Env: Match `--debug`/`--verbose` by exact argv token, not substring, preventing false positives from argument values ([e183729])
- Loaders: Better check for empty documents ([5ebbbd2], [586bc5c])
- YouTube: Add troubleshooting instructions on failed extraction ([397d133]); fix default language handling ([29207e8])
- Audio: Fix `WDOC_WHISPER_API_KEY` handling when `OPENAI_API_KEY` is unset ([23b6b1f])
- Setup: Guard `openparse-download` behind an `import openparse` probe ([f4445b4]); scope yt-dlp pre-release upgrade to `[youtube]` users only ([9242566])
♻️ Refactors
- Setup: Split `install_requires` into modular extras `[youtube]`, `[audio]`, `[anki]`, `[office]`, `[logseq]`, `[full]` ([7bb4744])
- CLI: Centralize all `sys.argv` mutations in `ArgvState` helper class ([1098157], [a22f56d])
- Logger: Move handler setup out of import side-effects into `setup_cli_logging()`, called only from `__main__.py` ([203ab6f])
🧪 Tests
- Cover `ArgvState` helpers with unit tests ([6bc897a], [a22f56d])
- Move API-key precheck from `test_wdoc.py` to `run_all_tests.sh` for faster fail ([dbf4410])
- Skip `test_parse_docx` on HTTP 429 instead of failing ([a02d684])
- Improve venv management in `run_all_tests.sh` ([b6a0dd8])
📚 Docs
- Clarify `uvx wdoc[full]` usage throughout README and `examples.md` ([0f72eaf], [7653e9a])
- Add/fix `[anki]` extra in Anki parse example ([0f72eaf])
- Improve installation instructions recommending `uvx` ([d88c461])
- Clarify how to use a cloned repository ([b8b4b5e])
Commit details since the last release
- [3e5834a] by @thiswillbeyourgithub, 2 hours ago:
bump version 5.0.1 -> 5.1.0
bumpver.toml
docs/source/conf.py
setup.py
wdoc/wdoc.py
- [dd09942] by @thiswillbeyourgithub, 25 minutes ago:
fix(summarize): clean LLM intro artifacts on all top-level bullets
Extract the 'deep breath' / "i'll summarize" cleanup into
_strip_llm_intro_artifacts and run it on every top-level line, not just
the first one. Previously a source reference on line 1 would leave a
later deep-breath bullet untouched.
Co-Authored-By: Claude Opus 4.7 [email protected]
wdoc/utils/tasks/summarize.py
- [c2c2ae5] by @thiswillbeyourgithub, 84 minutes ago:
add done todo
Signed-off-by: thiswillbeyourgithub [email protected]
README.md
- [b6a0dd8] by @thiswillbeyourgithub, 89 minutes ago:
test: improved run_all_test.sh venv management
Signed-off-by: thiswillbeyourgithub [email protected]
tests/run_all_tests.sh
- [a02d684] by @thiswillbeyourgithub, 2 hours ago:
test(parsing): skip test_parse_docx on HTTP 429 instead of failing
The test downloads a sample DOCX from freetestdata.com, which sometimes
returns 429 (rate limited). That is not a wdoc bug, so skip rather than
fail in that case.
tests/test_parsing.py
- [c0779a1] by @thiswillbeyourgithub, 2 hours ago:
feat(prompts): use timecodes as per-bullet source for YouTube single-source
Extends the single-source citation exception: when the unique source is a
YouTube video (or other timecoded media), don't drop citations entirely.
Mention the video source once at the top, then use each bullet's timecode
(e.g. [02:17:33]) as its precise per-bullet pointer. Applied to both the
summary prompt (Sam) and the combine prompt (Carl).
Co-Authored-By: Claude Opus 4.7 [email protected]
wdoc/utils/prompts.py
- [c0097b7] by @thiswillbeyourgithub, 2 hours ago:
feat(prompts): skip per-bullet citations when only one source
Avoids wasting tokens by repeating the same page/WDOC_ID citation on every
bullet point when all information shares a single unique source. In that
case the citation is mentioned once at the top instead. Applies to both
the summary prompt (Sam) and the combine prompt (Carl).
Co-Authored-By: Claude Opus 4.7 [email protected]
wdoc/utils/prompts.py
- [c837143] by @thiswillbeyourgithub, 2 hours ago:
fix(summarize): strip "DEEP BREATH -" style prefixes from first line
Permissive on asterisks, "breath"/"breaths", and the separator character
so variants like "- DEEP BREATH - ", "DEEP BREATHS: ", "DEEP BREATH, "
are handled while preserving the bullet marker.
Co-Authored-By: Claude Opus 4.7 [email protected]
wdoc/utils/tasks/summarize.py
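A minimal sketch of the kind of permissive prefix stripping this commit describes. The pattern and the name `strip_deep_breath` are hypothetical and are not taken from `summarize.py`; the sketch only assumes what the message states: optional asterisks, "breath" or "breaths", a flexible separator, and a preserved bullet marker.

```python
import re

# Hypothetical pattern, not the one used in wdoc.
_DEEP_BREATH = re.compile(
    r"^(?P<bullet>\s*[-*+]\s+)?"       # bullet marker to preserve, if any
    r"\**\s*deep\s+breaths?\b\s*\**"   # the phrase, optionally wrapped in asterisks
    r"\s*[-:,.]?\s*",                  # optional separator character
    flags=re.IGNORECASE,
)


def strip_deep_breath(line: str) -> str:
    """Remove a leading 'DEEP BREATH' artifact, keeping any bullet marker."""
    return _DEEP_BREATH.sub(lambda m: m.group("bullet") or "", line, count=1)


print(strip_deep_breath("- DEEP BREATH - The talk covers X"))
print(strip_deep_breath("DEEP BREATHS: Summary"))
```

The word boundary after `breaths?` is there so that ordinary text such as "Deep breathing helps" is left untouched.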
- [e253d45] by @thiswillbeyourgithub, 2 hours ago:
fix: model name in output summary
Signed-off-by: thiswillbeyourgithub [email protected]
wdoc/utils/tasks/summarize.py
- [c67f564] by @thiswillbeyourgithub, 2 hours ago:
docs(setup): note nltk punkt_tab download is likely redundant
unstructured already lazily downloads punkt_tab on first tokenize call,
so the eager post-install download is probably duplicate work. Keep it
as a safety net (and to front-load the network hit at install time
instead of on the first office-document parse), but document it.
Co-Authored-By: Claude Opus 4.7 [email protected]
setup.py
- [f4445b4] by @thiswillbeyourgithub, 2 hours ago:
fix(setup): only run openparse-download when openparse is installed
Guard the post-install weight download with an `import openparse` probe
so a stripped-down install (no openparse[ml] in install_requires) does
not call a missing console-script and emit a confusing error.
Co-Authored-By: Claude Opus 4.7 [email protected]
setup.py
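The guard described in this commit could be sketched roughly as follows. The helper names `module_available` and `maybe_download_weights` are hypothetical, and the actual hook in `setup.py` may be structured differently; only the `openparse` module and the `openparse-download` console script come from the commit message.

```python
import importlib
import shutil
import subprocess


def module_available(name: str) -> bool:
    """Probe for a module by trying to import it."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


def maybe_download_weights() -> bool:
    """Run `openparse-download` only when openparse is importable."""
    if not module_available("openparse"):
        return False  # stripped-down install: nothing to download
    exe = shutil.which("openparse-download")
    if exe is None:
        return False  # package present but console script not on PATH
    subprocess.run([exe], check=False)
    return True
```

Checking `shutil.which` as well is an extra precaution added in this sketch, not something the commit message mentions.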
- [9242566] by @thiswillbeyourgithub, 2 hours ago:
fix(setup): scope yt-dlp pre-release upgrade to [youtube] users
yt-dlp lives in the optional [youtube] extra, but the post-install hook
was force-installing it for everyone (with `--user`, which is wrong
inside a venv and quietly drops the install outside the env). Probe for
`yt_dlp` first and only run the `pip install -U --pre yt-dlp` if it's
already there. This keeps yt-dlp truly optional while still letting
[youtube] users track YouTube extractor fixes that land in pre-releases.
Co-Authored-By: Claude Opus 4.7 [email protected]
setup.py
- [eb9eba4] by @thiswillbeyourgithub, 2 hours ago:
refactor(setup): declare audioop-lts via the [audio] extra
Move the audioop-lts 3.13+ install out of the imperative post-install
hook and into the `[audio]` extra with a `python_version>='3.13'`
environment marker. audioop-lts is only needed because pydub needs it,
and pydub already lives in `[audio]`, so the conditional belongs there.
This also makes the dependency visible to non-`python setup.py install`
installers (`pip install wdoc[audio]`, uv, pipx, etc.) which never ran
the post-install hook in the first place.
Co-Authored-By: Claude Opus 4.7 [email protected]
setup.py
- [bafb379] by @thiswillbeyourgithub, 2 hours ago:
chore(setup): drop python-magic git install from post-install
The git install existed to get the FIFO/pipe fix from upstream PR for
issue #261, used via `magic.from_buffer` on stdin bytes. That code path
is commented out in batch_file_loader.py, and the two remaining call
sites (`magic.from_file` in batch_file_loader.py and pdf.py) work fine
with the released 0.4.27 wheel on PyPI. Both call sites are already
wrapped in try/except, so python-magic stays optional at runtime.
Co-Authored-By: Claude Opus 4.7 [email protected]
setup.py
- [203ab6f] by @thiswillbeyourgithub, 2 hours ago:
refactor(logger): move handler setup out of import side effects
When wdoc was imported as a library (e.g. as an open-webui tool),
wdoc/utils/logger.py mutated the global loguru logger at import time:
removing the default stderr sink and adding its own stdout/stderr/file
sinks. That clobbered the host application's loguru configuration.
Wrap the handler installation in a setup_cli_logging() function that
is called explicitly from wdoc/__main__.py. Library users get whatever
loguru handlers the host already configured (since loguru is a
singleton, wdoc's records will flow through them automatically); CLI
users get the customized colorized stdout/stderr plus the rotated
file log.
Co-Authored-By: Claude Opus 4.7 [email protected]
wdoc/__init__.py
wdoc/__main__.py
wdoc/utils/logger.py
wdoc/wdoc.py
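The pattern in this commit (no handler setup at import time, an explicit setup function for the CLI) can be illustrated with the standard library. Note the substitution: wdoc uses loguru, while this sketch uses stdlib `logging` so that it runs without extra dependencies. The logger name `wdoc_sketch` and the format string are assumptions.

```python
import logging
import sys

# Importing this module creates the logger but attaches no handlers,
# so a host application's logging configuration is left alone.
logger = logging.getLogger("wdoc_sketch")


def setup_cli_logging(level: int = logging.INFO) -> None:
    """Attach CLI handlers; meant to be called from the CLI entry point only."""
    if logger.handlers:
        return  # already configured, avoid stacking duplicate handlers
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
```

A CLI entry point would call `setup_cli_logging()` once before doing any work, while library users would simply import the module and keep their own handlers.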
- [b8b4b5e] by @thiswillbeyourgithub, 2 hours ago:
doc: clarify how to use a cloned repository
Signed-off-by: thiswillbeyourgithub [email protected]
README.md
- [dbf4410] by @thiswillbeyourgithub, 2 hours ago:
test(env): move API-key precheck from test_wdoc.py to run_all_tests.sh
Fails fast at the shell level before spinning up the venv and pytest,
rather than only when test_wdoc.py is imported.
Co-Authored-By: Claude Opus 4.7 [email protected]
tests/run_all_tests.sh
tests/test_wdoc.py
- [e183729] by @thiswillbeyourgithub, 2 hours ago:
fix(env): match --debug/--verbose by argv token, not substring
check_kwargs() joined sys.argv into a single string and looked for
" debug" / " verbose" / " -d" / " -v" as substrings, so any argument
value containing the word (e.g. --query "what does debug mean") would
silently flip WDOC_DEBUG/WDOC_VERBOSE to true. Now we iterate argv
tokens and match exact forms (--flag, --flag=value, -x, -x=value).
Co-Authored-By: Claude Opus 4.7 [email protected]
wdoc/utils/env.py
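The difference between the old substring check and exact token matching can be sketched as below. The function name `flag_in_argv` is hypothetical and is not the code in `env.py`; it only follows the forms listed in the commit message (`--flag`, `--flag=value`, `-x`, `-x=value`).

```python
from typing import Sequence


def flag_in_argv(argv: Sequence[str], long_name: str, short_name: str) -> bool:
    """True if argv contains --long_name or -short_name as its own token."""
    accepted = (f"--{long_name}", f"-{short_name}")
    for token in argv:
        key = token.split("=", 1)[0]  # "--debug=true" -> "--debug"
        if key in accepted:
            return True
    return False


argv = ["wdoc", "--query", "what does debug mean"]
# Substring matching on the joined command line gives a false positive:
print(" debug" in " ".join(argv))         # True
# Token matching does not:
print(flag_in_argv(argv, "debug", "d"))   # False
```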
- [2f3b295] by @thiswillbeyourgithub, 2 hours ago:
fix(logger): actually remove default DEBUG stderr handler
The block that detects loguru's default DEBUG-level stderr handler had
logger.remove() commented out, so it added an ERROR-level stderr sink
alongside the default one instead of replacing it. Result: DEBUG/INFO
log records still poured out of stderr regardless of log_level.
Co-Authored-By: Claude Opus 4.7 [email protected]
wdoc/utils/logger.py
- [0f72eaf] by @thiswillbeyourgithub, 2 hours ago:
docs: clarify uvx extras usage and fix missing [anki] in anki parse example
Add prominent notes recommending uvx wdoc[full] in the README quick
explainer and at the top of examples.md so users don't trip on missing
optional deps. Also fix the anki parse one-liner in the FAQ to include
the [anki] extra, and drop the redundant [full,audio] (audio is already
in full).
Co-Authored-By: Claude Opus 4.7 [email protected]
README.md
wdoc/docs/examples.md
- [7653e9a] by @thiswillbeyourgithub, 3 hours ago:
doc: added uvx before each wdoc invocation
Signed-off-by: thiswillbeyourgithub [email protected]
README.md
wdoc/docs/examples.md
- [0753b00] by @thiswillbeyourgithub, 3 hours ago:
feat(youtube): auto-detect original-language subtitle track
When youtube_language is unset, inspect the video's available subtitle
keys and pick the first one ending in -orig (youtube's original-language
track). Falls back to en/en-US/en-UK if no -orig track exists. Avoids
hard-coding english as the implicit default for non-english videos.
Co-Authored-By: Claude Opus 4.7 [email protected]
wdoc/docs/help.md
wdoc/utils/loaders/youtube.py
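The selection rule in this commit can be sketched as a small pure function. The name `pick_subtitle_language` and the idea of passing the subtitle keys as a plain list are assumptions made for illustration; the fallback order is the one given in the commit message.

```python
from typing import Iterable, Optional

FALLBACK_LANGUAGES = ("en", "en-US", "en-UK")  # order from the commit message


def pick_subtitle_language(available: Iterable[str]) -> Optional[str]:
    """Prefer the first '-orig' track, then the English fallbacks, else None."""
    keys = list(available)
    for key in keys:
        if key.endswith("-orig"):
            return key
    for lang in FALLBACK_LANGUAGES:
        if lang in keys:
            return lang
    return None


print(pick_subtitle_language(["en", "fr-orig", "de"]))  # fr-orig
print(pick_subtitle_language(["de", "en-US"]))          # en-US
```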
- [4fe38f2] by @thiswillbeyourgithub, 3 hours ago:
feat: warn on every sys.argv mutation in ArgvState
Each rewrite, rename, removal, append, and promotion now emits a
logger.warning so users can see when their CLI input is being
auto-corrected. The kebab->snake normalizer only warns when a key
actually contained a dash.
Co-Authored-By: Claude Opus 4.7 [email protected]
wdoc/__main__.py
- [e9bfb80] by @thiswillbeyourgithub, 3 hours ago:
feat: accept kebab-case CLI flags by normalizing -- to _
Rewrite --foo-bar tokens to --foo_bar in sys.argv before parsing so
modern --kebab-case usage works without breaking the existing
--snake_case API.
Co-Authored-By: Claude Opus 4.7 [email protected]
wdoc/__main__.py
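A minimal sketch of such a normalization, assuming only the key part of a `--key=value` token should be rewritten. The function name `normalize_kebab_flags` is hypothetical, and the real implementation rewrites `sys.argv` in place rather than returning a new list.

```python
from typing import List


def normalize_kebab_flags(argv: List[str]) -> List[str]:
    """Rewrite --foo-bar (and --foo-bar=value) to --foo_bar, leaving values alone."""
    out = []
    for token in argv:
        # Skip positionals, short options, and the bare "--" separator.
        if token.startswith("--") and len(token) > 2:
            key, sep, value = token.partition("=")
            token = "--" + key[2:].replace("-", "_") + sep + value
        out.append(token)
    return out


print(normalize_kebab_flags(["wdoc", "--foo-bar", "x-y", "--a-b=c-d"]))
# ['wdoc', '--foo_bar', 'x-y', '--a_b=c-d']
```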
- [fe2d9c0] by @thiswillbeyourgithub, 3 hours ago:
refactor: move py_ankiconnect to anki extras with requests fallback
Dropped py_ankiconnect from the core install_requires and into the [anki]
extra. The Anki GUI open path is now extracted into open_anki_gui() in
misc.py, which falls back to a plain requests.post against the
AnkiConnect endpoint when py_ankiconnect is not installed (honoring the
PY_ANKICONNECT_DEFAULT_HOST and PY_ANKICONNECT_DEFAULT_PORT env vars).
Co-Authored-By: Claude Opus 4.7 [email protected]
setup.py
wdoc/utils/misc.py
wdoc/wdoc.py
- [8db7697] by @thiswillbeyourgithub, 3 hours ago:
fix: rtd was failing because of ModuleNotFoundError: No module named 'pkg_resources'
Signed-off-by: thiswillbeyourgithub [email protected]
docs/requirements.txt
- [a22f56d] by @thiswillbeyourgithub, 3 hours ago:
refactor: route remaining sys.argv mutations and checks through ArgvState
Adds append_positional plus read-only helpers (is_empty, kwarg_equals,
has_flag, has_arg, argv_contains) so cli_launcher no longer pokes at
sys.argv/args/kwargs directly for the recurring patterns. Covers the
new helpers with unit tests.
Co-Authored-By: Claude Opus 4.7 [email protected]
tests/test_argv_state.py
wdoc/__main__.py
- [7bb4744] by @thiswillbeyourgithub, 3 hours ago:
refactor: split install_requires into modular extras
Move heavy/loader-specific dependencies (unstructured[all-docs], yt-dlp,
deepgram-sdk, torchaudio, ankipandas, etc.) out of the mandatory
install_requires into dedicated extras: [youtube], [audio], [anki],
[office], [logseq]. The base install keeps the RAG engine plus PDF and
URL loaders (the most common cases). [full] aggregates all loader
extras via self-reference (requires pip >= 21.2).
A trimmed base unstructured package stays in core because the PDF
loader imports clean_extra_whitespace from it.
README installation section updated with the new extras.
Co-Authored-By: Claude Opus 4.7 [email protected]
README.md
setup.py
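The self-referencing `[full]` extra mentioned in this commit can be sketched as below. Which package belongs to which extra is an illustrative guess based on the names listed in the message, not a copy of `setup.py`; the `logseq` list is left empty because the message does not say what it contains.

```python
# Hypothetical extras table; package placement is assumed for illustration.
LOADER_EXTRAS = {
    "youtube": ["yt-dlp"],
    "audio": ["deepgram-sdk", "torchaudio"],
    "anki": ["ankipandas"],
    "office": ["unstructured[all-docs]"],
    "logseq": [],  # contents not given in the commit message
}

extras_require = dict(LOADER_EXTRAS)
# Self-reference: "full" asks the installer for this same package with every
# loader extra enabled, so the lists above are not duplicated.
extras_require["full"] = ["wdoc[" + ",".join(LOADER_EXTRAS) + "]"]

print(extras_require["full"])
# ['wdoc[youtube,audio,anki,office,logseq]']
```

A dictionary like this would be passed as `extras_require=` to `setup()`; as the commit notes, the self-reference needs pip >= 21.2.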
- [6bc897a] by @thiswillbeyourgithub, 3 hours ago:
test: cover ArgvState CLI rewrite helpers
Adds tests/test_argv_state.py exercising every helper invoked from
cli_launcher plus the documented edge cases (absent source, target
collision, non-force overwrite, space-separated --key value form), locking
in the three-way sync between args, kwargs, and sys.argv.
Co-Authored-By: Claude Opus 4.7 [email protected]
tests/test_argv_state.py
- [d88c461] by @thiswillbeyourgithub, 3 hours ago:
doc: better installation instructions recommending uvx
Signed-off-by: thiswillbeyourgithub [email protected]
README.md
- [b0821de] by @thiswillbeyourgithub, 4 hours ago:
doc: mention that I coded this by hand
Signed-off-by: thiswillbeyourgithub [email protected]
README.md
- [53488d6] by @thiswillbeyourgithub, 4 hours ago:
bump langchain and litellm versions
Signed-off-by: thiswillbeyourgithub [email protected]
setup.py
- [1098157] by @thiswillbeyourgithub, 4 hours ago:
refactor: centralize CLI argv rewrites via ArgvState helper
The cli_launcher had grown a dozen ad-hoc fixups that each had to mutate
sys.argv, args, and kwargs in lockstep. Move that bookkeeping into a single
ArgvState class exposing rename_kwarg, rename_kwarg_prefix, rename_positional,
remove_positional, set_kwarg(force=), and promote_positional_to_kwarg. Drops
the needs_reparse round-trip, deduplicates sys.argv on forced sets, and fixes
the stale log line in the ddg_max_result rename.
Co-Authored-By: Claude Opus 4.7 [email protected]
wdoc/__main__.py
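To illustrate the lockstep bookkeeping this commit centralizes, here is a heavily simplified sketch. `ArgvStateSketch` is hypothetical: it tracks only `kwargs` and an argv list passed in by the caller (the real class also tracks positional `args` and mutates `sys.argv`), and its collision behavior is an assumption.

```python
import logging
from typing import Dict, List

log = logging.getLogger("argv_sketch")


class ArgvStateSketch:
    """Keeps a kwargs dict and an argv list in sync (simplified illustration)."""

    def __init__(self, argv: List[str], kwargs: Dict[str, str]) -> None:
        self.argv = argv
        self.kwargs = kwargs

    def rename_kwarg(self, old: str, new: str) -> bool:
        """Rename --old to --new in both structures; return True if changed."""
        if old not in self.kwargs:
            return False  # absent source: nothing to do
        if new in self.kwargs:
            return False  # target collision: keep the existing value (assumed)
        self.kwargs[new] = self.kwargs.pop(old)
        for i, token in enumerate(self.argv):
            key, sep, value = token.partition("=")
            if key == f"--{old}":  # covers "--old" and "--old=value"
                self.argv[i] = f"--{new}{sep}{value}"
        log.warning("renamed --%s to --%s", old, new)
        return True


state = ArgvStateSketch(["wdoc", "--old_flag=fr"], {"old_flag": "fr"})
state.rename_kwarg("old_flag", "new_flag")
print(state.argv, state.kwargs)
```

Putting the rename in one method means a caller cannot update `kwargs` and forget the matching argv token, which is the class of bug the refactor is meant to prevent.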
- [9bdad51] by @thiswillbeyourgithub, 4 hours ago:
doc: minor yt language precision
Signed-off-by: thiswillbeyourgithub [email protected]
wdoc/docs/help.md
- [f800805] by @thiswillbeyourgithub, 4 hours ago:
new: parse --yt_ args as --youtube_doc
Signed-off-by: thiswillbeyourgithub [email protected]
wdoc/__main__.py
wdoc/docs/help.md
- [29207e8] by @thiswillbeyourgithub, 4 hours ago:
minor: better handling of default yt language argument
Signed-off-by: thiswillbeyourgithub [email protected]
wdoc/utils/loaders/youtube.py
- [aa0e4f1] by @thiswillbeyourgithub, 4 hours ago:
bump yt-dlp and youtube-transcript-api
Signed-off-by: thiswillbeyourgithub [email protected]
setup.py
- [23b6b1f] by @thiswillbeyourgithub, 4 hours ago:
fix: handling of WDOC_WHISPER_API_KEY if OPENAI_API_KEY is unset
Signed-off-by: thiswillbeyourgithub [email protected]
wdoc/utils/loaders/shared_audio.py
- [397d133] by @thiswillbeyourgithub, 5 hours ago:
fix: add troubleshooting instructions for failed youtube document extraction
Signed-off-by: thiswillbeyourgithub [email protected]
wdoc/utils/loaders/youtube.py
- [5ebbbd2] by @thiswillbeyourgithub, 5 hours ago:
fix: better check for empty documents
Signed-off-by: thiswillbeyourgithub [email protected]
wdoc/utils/loaders/__init__.py
- [586bc5c] by @thiswillbeyourgithub, 6 hours ago:
fix: check docs is not empty
Signed-off-by: thiswillbeyourgithub [email protected]
wdoc/utils/misc.py