Release 5.1.0

@thiswillbeyourgithub released this 15 May 14:05
· 12 commits to main since this release

What's new

This release focuses on modular installation extras, CLI robustness improvements, and a sweep of bug fixes across summarization, logging, and setup.

✨ Features

  • CLI: Accept kebab-case flags (--foo-bar → --foo_bar) automatically ([e9bfb80])
  • CLI: Warn on every sys.argv mutation via ArgvState ([4fe38f2]); accept --yt_* as shorthand for --youtube_doc_* ([f800805])
  • YouTube: Auto-detect original-language subtitle track (-orig), falling back to en/en-US ([0753b00])
  • Prompts: Skip per-bullet citations when only one source; mention it once at the top instead ([c0097b7])
    • Exception: for YouTube/timecoded sources, use per-bullet timecodes (e.g. [02:17:33]) ([c0779a1])

🐛 Fixes

  • Summarize: Strip *DEEP BREATH*-style LLM intro artifacts from all top-level bullets, not just the first ([dd09942], [c837143])
  • Summarize: Fix model name in output summary ([e253d45])
  • Logger: Actually remove the default DEBUG stderr handler instead of stacking a second sink on top of it ([2f3b295])
  • Env: Match --debug/--verbose by exact argv token, not substring, preventing false positives from argument values ([e183729])
  • Loaders: Better check for empty documents ([5ebbbd2], [586bc5c])
  • YouTube: Add troubleshooting instructions on failed extraction ([397d133]); fix default language handling ([29207e8])
  • Audio: Fix WDOC_WHISPER_API_KEY handling when OPENAI_API_KEY is unset ([23b6b1f])
  • Setup: Guard openparse-download behind an import openparse probe ([f4445b4]); scope yt-dlp pre-release upgrade to [youtube] users only ([9242566])

♻️ Refactors

  • Setup: Split install_requires into modular extras [youtube], [audio], [anki], [office], [logseq], [full] ([7bb4744])
    • Move audioop-lts into [audio] extra with python_version>='3.13' marker ([eb9eba4])
    • Move py_ankiconnect into [anki] extra with requests fallback ([fe2d9c0])
    • Drop python-magic git install from post-install hook ([bafb379])
  • CLI: Centralize all sys.argv mutations in ArgvState helper class ([1098157], [a22f56d])
  • Logger: Move handler setup out of import side-effects into setup_cli_logging(), called only from __main__.py ([203ab6f])

🧪 Tests

  • Cover ArgvState helpers with unit tests ([6bc897a], [a22f56d])
  • Move API-key precheck from test_wdoc.py to run_all_tests.sh for faster fail ([dbf4410])
  • Skip test_parse_docx on HTTP 429 instead of failing ([a02d684])
  • Improve venv management in run_all_tests.sh ([b6a0dd8])

📚 Docs

  • Clarify uvx wdoc[full] usage throughout README and examples.md ([0f72eaf], [7653e9a])
  • Add/fix [anki] extra in Anki parse example ([0f72eaf])
  • Improve installation instructions recommending uvx ([d88c461])
  • Clarify how to use a cloned repository ([b8b4b5e])

Commits details since the last release

bumpver.toml
docs/source/conf.py
setup.py
wdoc/wdoc.py

  • [dd09942] by @thiswillbeyourgithub, 25 minutes ago:
    fix(summarize): clean LLM intro artifacts on all top-level bullets
    Extract the 'deep breath' / "i'll summarize" cleanup into
    _strip_llm_intro_artifacts and run it on every top-level line, not just
    the first one. Previously a source reference on line 1 would leave a
    later deep-breath bullet untouched.

Co-Authored-By: Claude Opus 4.7 [email protected]

wdoc/utils/tasks/summarize.py

README.md

tests/run_all_tests.sh

  • [a02d684] by @thiswillbeyourgithub, 2 hours ago:
    test(parsing): skip test_parse_docx on HTTP 429 instead of failing
    The test downloads a sample DOCX from freetestdata.com, which sometimes
    returns 429 (rate limited). That is not a wdoc bug, so skip rather than
    fail in that case.

tests/test_parsing.py

  • [c0779a1] by @thiswillbeyourgithub, 2 hours ago:
    feat(prompts): use timecodes as per-bullet source for YouTube single-source
    Extends the single-source citation exception: when the unique source is a
    YouTube video (or other timecoded media), don't drop citations entirely.
    Mention the video source once at the top, then use each bullet's timecode
    (e.g. [02:17:33]) as its precise per-bullet pointer. Applied to both the
    summary prompt (Sam) and the combine prompt (Carl).

Co-Authored-By: Claude Opus 4.7 [email protected]

wdoc/utils/prompts.py

  • [c0097b7] by @thiswillbeyourgithub, 2 hours ago:
    feat(prompts): skip per-bullet citations when only one source
    Avoids wasting tokens by repeating the same page/WDOC_ID citation on every
    bullet point when all information shares a single unique source. In that
    case the citation is mentioned once at the top instead. Applies to both
    the summary prompt (Sam) and the combine prompt (Carl).

Co-Authored-By: Claude Opus 4.7 [email protected]

wdoc/utils/prompts.py

  • [c837143] by @thiswillbeyourgithub, 2 hours ago:
    fix(summarize): strip "DEEP BREATH -" style prefixes from first line
    Permissive on asterisks, "breath"/"breaths", and the separator character
    so variants like "- DEEP BREATH - ", "DEEP BREATHS: ", "DEEP BREATH, "
    are handled while preserving the bullet marker.

Co-Authored-By: Claude Opus 4.7 [email protected]

wdoc/utils/tasks/summarize.py
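
The prefix cleanup described in [c837143], and its extension to every top-level line in [dd09942], could look roughly like the sketch below. The pattern and the names strip_deep_breath and clean_summary are illustrative assumptions, not the code in summarize.py.

```python
import re

# Hypothetical pattern: optional bullet marker (kept), then the artifact with
# optional asterisks, then an optional separator such as "-", ":" or ",".
_DEEP_BREATH_RE = re.compile(
    r"^(?P<bullet>\s*[-*•]\s+)?"
    r"\**\s*deep\s+breaths?\b\s*\**"
    r"\s*[-:,.;]*\s*",
    re.IGNORECASE,
)


def strip_deep_breath(line):
    """Remove a leading 'DEEP BREATH' artifact while keeping the bullet marker."""
    return _DEEP_BREATH_RE.sub(lambda m: m.group("bullet") or "", line, count=1)


def clean_summary(text):
    """Apply the cleanup to every top-level (unindented) line, not just the first."""
    cleaned = []
    for line in text.splitlines():
        is_top_level = line == line.lstrip()
        cleaned.append(strip_deep_breath(line) if is_top_level else line)
    return "\n".join(cleaned)
```

Being this permissive has a cost: a legitimate top-level bullet that starts with the words "deep breath" would be trimmed as well.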

wdoc/utils/tasks/summarize.py

  • [c67f564] by @thiswillbeyourgithub, 2 hours ago:
    docs(setup): note nltk punkt_tab download is likely redundant
    unstructured already lazily downloads punkt_tab on first tokenize call,
    so the eager post-install download is probably duplicate work. Keep it
    as a safety net (and to front-load the network hit at install time
    instead of on the first office-document parse), but document it.

Co-Authored-By: Claude Opus 4.7 [email protected]

setup.py

  • [f4445b4] by @thiswillbeyourgithub, 2 hours ago:
    fix(setup): only run openparse-download when openparse is installed
    Guard the post-install weight download with an import openparse probe
    so a stripped-down install (no openparse[ml] in install_requires) does
    not call a missing console-script and emit a confusing error.

Co-Authored-By: Claude Opus 4.7 [email protected]

setup.py
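
A minimal sketch of the probe-then-run guard used here (and, in the same spirit, for the yt-dlp upgrade in [9242566]). It uses importlib.util.find_spec as the probe, which is one possible variant; the helper name run_if_importable and its injectable run parameter are assumptions made for illustration, not the code in setup.py.

```python
import importlib.util
import subprocess


def run_if_importable(module_name, command, run=subprocess.run):
    """Run `command` only when the top-level module `module_name` is installed.

    Returns True if the command was launched, False if it was skipped.
    """
    if importlib.util.find_spec(module_name) is None:
        # Module is absent: skip quietly instead of calling a missing script.
        return False
    run(command, check=False)
    return True


# Example call (not executed here):
# run_if_importable("openparse", ["openparse-download"])
```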

  • [9242566] by @thiswillbeyourgithub, 2 hours ago:
    fix(setup): scope yt-dlp pre-release upgrade to [youtube] users
    yt-dlp lives in the optional [youtube] extra, but the post-install hook
    was force-installing it for everyone (with --user, which is wrong
    inside a venv and quietly drops the install outside the env). Probe for
    yt_dlp first and only run the pip install -U --pre yt-dlp if it's
    already there. This keeps yt-dlp truly optional while still letting
    [youtube] users track YouTube extractor fixes that land in pre-releases.

Co-Authored-By: Claude Opus 4.7 [email protected]

setup.py

  • [eb9eba4] by @thiswillbeyourgithub, 2 hours ago:
    refactor(setup): declare audioop-lts via the [audio] extra
    Move the audioop-lts 3.13+ install out of the imperative post-install
    hook and into the [audio] extra with a python_version>='3.13'
    environment marker. audioop-lts is only needed because pydub needs it,
    and pydub already lives in [audio], so the conditional belongs there.
    This also makes the dependency visible to installers that never ran
    the post-install hook in the first place, i.e. anything other than
    python setup.py install (pip install wdoc[audio], uv, pipx, etc.).

Co-Authored-By: Claude Opus 4.7 [email protected]

setup.py

  • [bafb379] by @thiswillbeyourgithub, 2 hours ago:
    chore(setup): drop python-magic git install from post-install
    The git install existed to get the FIFO/pipe fix from upstream PR for
    issue #261, used via magic.from_buffer on stdin bytes. That code path
    is commented out in batch_file_loader.py, and the two remaining call
    sites (magic.from_file in batch_file_loader.py and pdf.py) work fine
    with the released 0.4.27 wheel on PyPI. Both call sites are already
    wrapped in try/except, so python-magic stays optional at runtime.

Co-Authored-By: Claude Opus 4.7 [email protected]

setup.py

  • [203ab6f] by @thiswillbeyourgithub, 2 hours ago:
    refactor(logger): move handler setup out of import side effects
    When wdoc was imported as a library (e.g. as an open-webui tool),
    wdoc/utils/logger.py mutated the global loguru logger at import time:
    removing the default stderr sink and adding its own stdout/stderr/file
    sinks. That clobbered the host application's loguru configuration.

    Wrap the handler installation in a setup_cli_logging() function that
    is called explicitly from wdoc/__main__.py. Library users get whatever
    loguru handlers the host already configured (since loguru is a
    singleton, wdoc's records will flow through them automatically); CLI
    users get the customized colorized stdout/stderr plus the rotated
    file log.

Co-Authored-By: Claude Opus 4.7 [email protected]

wdoc/__init__.py
wdoc/__main__.py
wdoc/utils/logger.py
wdoc/wdoc.py

README.md

  • [dbf4410] by @thiswillbeyourgithub, 2 hours ago:
    test(env): move API-key precheck from test_wdoc.py to run_all_tests.sh
    Fails fast at the shell level before spinning up the venv and pytest,
    rather than only when test_wdoc.py is imported.

Co-Authored-By: Claude Opus 4.7 [email protected]

tests/run_all_tests.sh
tests/test_wdoc.py

  • [e183729] by @thiswillbeyourgithub, 2 hours ago:
    fix(env): match --debug/--verbose by argv token, not substring
    check_kwargs() joined sys.argv into a single string and looked for
    " debug" / " verbose" / " -d" / " -v" as substrings, so any argument
    value containing the word (e.g. --query "what does debug mean") would
    silently flip WDOC_DEBUG/WDOC_VERBOSE to true. Now we iterate argv
    tokens and match exact forms (--flag, --flag=value, -x, -x=value).

Co-Authored-By: Claude Opus 4.7 [email protected]

wdoc/utils/env.py
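
The exact-token matching can be illustrated as follows. This is a sketch of the idea only; flag_present is a hypothetical helper, not the actual check_kwargs() code.

```python
def flag_present(argv, long_name, short_name):
    """True if the flag appears as its own argv token.

    Accepted forms: --flag, --flag=value, -x, -x=value.
    Values of other arguments are never inspected as substrings.
    """
    forms = ("--" + long_name, "-" + short_name)
    for token in argv:
        head = token.split("=", 1)[0]  # drop an optional "=value" suffix
        if head in forms:
            return True
    return False
```

With this approach a query such as "what does debug mean" is a single token that matches none of the accepted forms, so it no longer enables debug mode.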

  • [2f3b295] by @thiswillbeyourgithub, 2 hours ago:
    fix(logger): actually remove default DEBUG stderr handler
    The block that detects loguru's default DEBUG-level stderr handler had
    logger.remove() commented out, so it added an ERROR-level stderr sink
    alongside the default one instead of replacing it. Result: DEBUG/INFO
    log records still poured out of stderr regardless of log_level.

Co-Authored-By: Claude Opus 4.7 [email protected]

wdoc/utils/logger.py

  • [0f72eaf] by @thiswillbeyourgithub, 2 hours ago:
    docs: clarify uvx extras usage and fix missing [anki] in anki parse example
    Add prominent notes recommending uvx wdoc[full] in the README quick
    explainer and at the top of examples.md so users don't trip on missing
    optional deps. Also fix the anki parse one-liner in the FAQ to include
    the [anki] extra, and drop the redundant [full,audio] (audio is already
    in full).

Co-Authored-By: Claude Opus 4.7 [email protected]

README.md
wdoc/docs/examples.md

README.md
wdoc/docs/examples.md

  • [0753b00] by @thiswillbeyourgithub, 3 hours ago:
    feat(youtube): auto-detect original-language subtitle track
    When youtube_language is unset, inspect the video's available subtitle
    keys and pick the first one ending in -orig (youtube's original-language
    track). Falls back to en/en-US/en-UK if no -orig track exists. Avoids
    hard-coding english as the implicit default for non-english videos.

Co-Authored-By: Claude Opus 4.7 [email protected]

wdoc/docs/help.md
wdoc/utils/loaders/youtube.py
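
The selection order described above can be sketched like this. The function name, its parameters, and the choice to return None when nothing matches are assumptions for illustration; the real loader may behave differently in that last case.

```python
def pick_subtitle_language(available, requested=None):
    """Choose a subtitle key from the keys a video exposes.

    Order: explicit user choice, first key ending in "-orig",
    then the English fallbacks en / en-US / en-UK.
    """
    if requested:
        return requested
    for key in available:
        if key.endswith("-orig"):
            return key
    for fallback in ("en", "en-US", "en-UK"):
        if fallback in available:
            return fallback
    return None  # assumption: the caller decides what to do when nothing fits
```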

  • [4fe38f2] by @thiswillbeyourgithub, 3 hours ago:
    feat: warn on every sys.argv mutation in ArgvState
    Each rewrite, rename, removal, append, and promotion now emits a
    logger.warning so users can see when their CLI input is being
    auto-corrected. The kebab->snake normalizer only warns when a key
    actually contained a dash.

Co-Authored-By: Claude Opus 4.7 [email protected]

wdoc/__main__.py

  • [e9bfb80] by @thiswillbeyourgithub, 3 hours ago:
    feat: accept kebab-case CLI flags by normalizing -- to _
    Rewrite --foo-bar tokens to --foo_bar in sys.argv before parsing so
    modern --kebab-case usage works without breaking the existing
    --snake_case API.

Co-Authored-By: Claude Opus 4.7 [email protected]

wdoc/__main__.py
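
A rough sketch of the kebab-to-snake rewrite, under the assumption that only the flag name (the part before any "=") is normalized. The regular expression and the function name are hypothetical, and the sketch returns a new list rather than mutating sys.argv.

```python
import re

# Hypothetical pattern: a long flag name, optionally followed by "=value".
_FLAG_RE = re.compile(r"^--([A-Za-z0-9][A-Za-z0-9_-]*)(=.*)?$", re.DOTALL)


def normalize_kebab_flags(argv):
    """Return a copy of argv with --foo-bar rewritten to --foo_bar.

    Dashes inside values (after "=" or in a following token) are untouched.
    """
    out = []
    for token in argv:
        match = _FLAG_RE.match(token)
        if match and "-" in match.group(1):
            name = match.group(1).replace("-", "_")
            out.append("--" + name + (match.group(2) or ""))
        else:
            out.append(token)
    return out
```

One limitation of this sketch: a value token that itself starts with "--" would be treated as a flag and rewritten.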

  • [fe2d9c0] by @thiswillbeyourgithub, 3 hours ago:
    refactor: move py_ankiconnect to anki extras with requests fallback
    Dropped py_ankiconnect from the core install_requires and into the [anki]
    extra. The Anki GUI open path is now extracted into open_anki_gui() in
    misc.py, which falls back to a plain requests.post against the
    AnkiConnect endpoint when py_ankiconnect is not installed (honoring the
    PY_ANKICONNECT_DEFAULT_HOST and PY_ANKICONNECT_DEFAULT_PORT env vars).

Co-Authored-By: Claude Opus 4.7 [email protected]

setup.py
wdoc/utils/misc.py
wdoc/wdoc.py

docs/requirements.txt

  • [a22f56d] by @thiswillbeyourgithub, 3 hours ago:
    refactor: route remaining sys.argv mutations and checks through ArgvState
    Adds append_positional plus read-only helpers (is_empty, kwarg_equals,
    has_flag, has_arg, argv_contains) so cli_launcher no longer pokes at
    sys.argv/args/kwargs directly for the recurring patterns. Covers the
    new helpers with unit tests.

Co-Authored-By: Claude Opus 4.7 [email protected]

tests/test_argv_state.py
wdoc/__main__.py

  • [7bb4744] by @thiswillbeyourgithub, 3 hours ago:
    refactor: split install_requires into modular extras
    Move heavy/loader-specific dependencies (unstructured[all-docs], yt-dlp,
    deepgram-sdk, torchaudio, ankipandas, etc.) out of the mandatory
    install_requires into dedicated extras: [youtube], [audio], [anki],
    [office], [logseq]. The base install keeps the RAG engine plus PDF and
    URL loaders (the most common cases). [full] aggregates all loader
    extras via self-reference (requires pip >= 21.2).

    A trimmed base unstructured package stays in core because the PDF
    loader imports clean_extra_whitespace from it.

    README installation section updated with the new extras.

Co-Authored-By: Claude Opus 4.7 [email protected]

README.md
setup.py
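
For readers unfamiliar with self-referencing extras, the layout might look something like the fragment below. The assignment of packages to extras is a guess based on the names mentioned in these notes (the notes do not state which extra each package lands in, nor what [logseq] contains), so treat every list as illustrative rather than as the contents of setup.py.

```python
# Illustrative only: package-to-extra mapping is assumed, not taken from setup.py.
extras_require = {
    "youtube": ["yt-dlp"],
    "audio": [
        "pydub",
        # Environment marker: only installed on Python 3.13 and later.
        "audioop-lts; python_version>='3.13'",
        "deepgram-sdk",
        "torchaudio",
    ],
    "anki": ["py_ankiconnect", "ankipandas"],
    "office": ["unstructured[all-docs]"],
    "logseq": [],  # contents not listed in the release notes
}

# [full] aggregates the loader extras by referring back to the package itself
# (the notes say this requires pip >= 21.2).
loader_extras = ",".join(extras_require)
extras_require["full"] = ["wdoc[" + loader_extras + "]"]
```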

  • [6bc897a] by @thiswillbeyourgithub, 3 hours ago:
    test: cover ArgvState CLI rewrite helpers
    Adds tests/test_argv_state.py exercising every helper invoked from
    cli_launcher plus the documented edge cases (absent source, target
    collision, non-force overwrite, space-separated --key value form), locking
    in the three-way sync between args, kwargs, and sys.argv.

Co-Authored-By: Claude Opus 4.7 [email protected]

tests/test_argv_state.py

README.md

README.md

setup.py

  • [1098157] by @thiswillbeyourgithub, 4 hours ago:
    refactor: centralize CLI argv rewrites via ArgvState helper
    The cli_launcher had grown a dozen ad-hoc fixups that each had to mutate
    sys.argv, args, and kwargs in lockstep. Move that bookkeeping into a single
    ArgvState class exposing rename_kwarg, rename_kwarg_prefix, rename_positional,
    remove_positional, set_kwarg(force=), and promote_positional_to_kwarg. Drops
    the needs_reparse round-trip, deduplicates sys.argv on forced sets, and fixes
    the stale log line in the ddg_max_result rename.

Co-Authored-By: Claude Opus 4.7 [email protected]

wdoc/__main__.py
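
To make the "lockstep" idea concrete, here is a much reduced sketch of a helper that keeps a kwargs dict and an argv list in sync and warns on each rewrite (as added in [4fe38f2]). The class name ArgvStateSketch, its constructor, and the return values are assumptions; the real ArgvState operates on sys.argv and exposes more helpers than shown.

```python
import logging

logger = logging.getLogger("argv_sketch")


class ArgvStateSketch:
    """Keeps a kwargs dict and an argv list consistent (illustrative only)."""

    def __init__(self, argv, kwargs):
        self.argv = list(argv)
        self.kwargs = dict(kwargs)

    def rename_kwarg(self, old, new):
        """Rename --old to --new in both views; no-op if --old is absent."""
        if old not in self.kwargs:
            return False
        self.kwargs[new] = self.kwargs.pop(old)
        rewritten = []
        for token in self.argv:
            if token == "--" + old:
                rewritten.append("--" + new)
            elif token.startswith("--" + old + "="):
                # Keep the "=value" suffix, swap only the flag name.
                rewritten.append("--" + new + token[len(old) + 2:])
            else:
                rewritten.append(token)
        self.argv = rewritten
        logger.warning("CLI argument --%s was rewritten to --%s", old, new)
        return True
```

Routing every rewrite through one method means the warning and the argv/kwargs bookkeeping cannot drift apart, which is the motivation the commit message gives for the refactor.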

wdoc/docs/help.md

wdoc/__main__.py
wdoc/docs/help.md

wdoc/utils/loaders/youtube.py

setup.py

wdoc/utils/loaders/shared_audio.py

wdoc/utils/loaders/youtube.py

wdoc/utils/loaders/__init__.py

wdoc/utils/misc.py