109 changes: 109 additions & 0 deletions .github/workflows/docs.yml
@@ -0,0 +1,109 @@
+name: CI Docs
+
+on:
+  push:
+    branches:
+      - main
+  pull_request:
+
+jobs:
+  build-docs:
+    name: "Build Docs"
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          fetch-depth: 0 # Full history for accurate page timestamps
+
+      - uses: actions/setup-python@v6
+        with:
+          python-version: "3.12"
+
+      - name: Install package and dependencies
+        run: |
+          python -m pip install uv
+          uv sync --all-extras
+
+      - name: Set up Quarto
+        uses: quarto-dev/quarto-actions/setup@v2
+
+      - name: Build docs
+        run: uv run great-docs build
+
+      - name: Save docs artifact
+        uses: actions/upload-artifact@v7
+        with:
+          name: docs-html
+          path: great-docs/_site
+          include-hidden-files: true
+
+      - name: Upload build timings
+        uses: actions/upload-artifact@v7
+        with:
+          name: build-timings
+          path: great-docs/_site/build-timings.json
+
+  publish-docs:
+    name: "Publish Docs"
+    runs-on: ubuntu-latest
+    needs: "build-docs"
+    if: github.ref == 'refs/heads/main'
+    permissions:
+      pages: write
+      id-token: write
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    steps:
+      - uses: actions/download-artifact@v7
+        with:
+          name: docs-html
+          path: great-docs/_site
+
+      - name: Upload Pages artifact
+        uses: actions/upload-pages-artifact@v5
+        with:
+          path: great-docs/_site
+          include-hidden-files: true
+
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v5
+
+  preview-docs:
+    name: "Preview Docs"
+    runs-on: ubuntu-latest
+    needs: "build-docs"
+    if: github.event_name == 'pull_request'
+    permissions:
+      deployments: write
+      pull-requests: write
+    steps:
+      - uses: actions/download-artifact@v7
+        with:
+          name: docs-html
+          path: great-docs/_site
+
+      # Start deployment
+      - name: Configure pull release name
+        if: ${{ github.event_name == 'pull_request' }}
+        run: |
+          echo "RELEASE_NAME=pr-${{ github.event.number }}" >> $GITHUB_ENV
+
+      - name: Configure branch release name
+        if: ${{ github.event_name != 'pull_request' }}
+        run: |
+          # use branch name, but replace slashes. E.g. feat/a -> feat-a
+          echo "RELEASE_NAME=${GITHUB_REF_NAME//\//-}" >> $GITHUB_ENV
+
+      # Deploy
+      - name: Create Github Deployment
+        uses: bobheadxi/deployments@v1
+        id: deployment
+        if: ${{ !github.event.pull_request.head.repo.fork }}
+        with:
+          step: start
+          token: ${{ secrets.GITHUB_TOKEN }}
+          env: ${{ env.RELEASE_NAME }}
+          ref: ${{ github.head_ref }}
+          logs: "https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}"
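For reviewers tracing how the preview job names its deployment environment, the two `RELEASE_NAME` steps can be restated in Python. This is only an illustrative sketch: the function `release_name` and its parameters are hypothetical and appear nowhere in the workflow; the body simply mirrors the shell in those two steps.

```python
from typing import Optional


def release_name(event_name: str, pr_number: Optional[int], ref_name: str) -> str:
    """Hypothetical helper mirroring the two RELEASE_NAME steps in preview-docs."""
    if event_name == "pull_request":
        # Mirrors: echo "RELEASE_NAME=pr-${{ github.event.number }}"
        return f"pr-{pr_number}"
    # Mirrors: ${GITHUB_REF_NAME//\//-}, which replaces every slash with a hyphen
    return ref_name.replace("/", "-")


print(release_name("pull_request", 42, "42/merge"))  # pr-42
print(release_name("push", None, "feat/a"))          # feat-a
```

Note that the `preview-docs` job itself is gated on `github.event_name == 'pull_request'`, so with the workflow as written only the first branch appears reachable; the branch-name step would matter only if the job-level condition were relaxed.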
1 change: 0 additions & 1 deletion .github/workflows/pytest.yml
@@ -34,4 +34,3 @@ jobs:
       - name: Test with pytest
         run: |
           uv run --python ${{ matrix.python-version }} --all-extras pytest
-
4 changes: 2 additions & 2 deletions .gitignore
@@ -127,5 +127,5 @@ dmypy.json
 .vscode/settings.json
 tmp.py

-# Added by Sam Edwardes
-site
+# Great Docs build output (ephemeral)
+great-docs/
46 changes: 35 additions & 11 deletions CLAUDE.md
@@ -36,16 +36,40 @@ just format
 just lint
 ```

+### Pre-commit hooks
+
+Git pre-commit hooks are managed with [prek](https://github.com/j178/prek), a
+drop-in replacement for `pre-commit`. Configuration lives in `prek.toml`. prek is
+part of the `dev` dependency group. Install the hooks once, then run them on
+demand as needed:
+
+```bash
+# Install the Git hooks (one time after cloning)
+uv run prek install
+
+# Run all hooks against every file
+uv run prek run --all-files
+```
+
 ### Documentation

+The docs site is built with [Great Docs](https://posit-dev.github.io/great-docs/)
+(a Quarto-based generator). Great Docs requires Python 3.11+ and the Quarto CLI, and
+is run via `uvx` so it stays separate from the package's own dependencies. Content lives
+in `great-docs.yml`, the `user_guide/` directory, and the package docstrings. The
+landing page is generated automatically from `README.md`.
+
 ```bash
-# Preview docs locally
+# Preview docs locally (http://localhost:3000)
 just preview-docs

-# Publish docs to GitHub Pages
-just publish-docs
+# Build the static site into great-docs/_site
+just build-docs
 ```

+Docs are published to GitHub Pages automatically by the `CI Docs` workflow
+(`.github/workflows/docs.yml`) on every push to `main` — there is no manual publish step.
+
 ### Building and Publishing

 ```bash
@@ -64,22 +88,22 @@ just publish
 ### Core Components

 - **`spacypdfreader.spacypdfreader.pdf_reader()`**: Main entry point function that converts a PDF to a spaCy `Doc` object
-  - Takes a PDF path and a spaCy `Language` object
-  - Returns a `Doc` object with custom extensions
-  - Supports multiprocessing via `n_processes` parameter
-  - Supports page range extraction via `page_range` parameter
+  - Takes a PDF path and a spaCy `Language` object
+  - Returns a `Doc` object with custom extensions
+  - Supports multiprocessing via `n_processes` parameter
+  - Supports page range extraction via `page_range` parameter

 ### Parser System

 The library uses a pluggable parser architecture in `spacypdfreader/parsers/`:

 - **pdfminer** (`parsers/pdfminer.py`): Default parser, fast but lower accuracy
-  - Uses `pdfminer.high_level.extract_text()`
-  - Zero-indexed internally but converts from 1-indexed API
+  - Uses `pdfminer.high_level.extract_text()`
+  - Zero-indexed internally but converts from 1-indexed API

 - **pytesseract** (`parsers/pytesseract.py`): OCR-based parser, slower but higher accuracy
-  - Converts PDF pages to images first
-  - Requires optional dependencies: `pip install 'spacypdfreader[pytesseract]'`
+  - Converts PDF pages to images first
+  - Requires optional dependencies: `pip install 'spacypdfreader[pytesseract]'`

 Each parser implements a `parser(pdf_path: str, page_number: int, **kwargs)` function that returns text for a single page.

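The parser contract described in the CLAUDE.md hunk above (a `parser(pdf_path, page_number, **kwargs)` function that returns the text of a single page, with a 1-indexed page number on the public side) could be sketched roughly as follows. This is a hypothetical illustration, not code from the repository: `FAKE_PAGES` stands in for a real PDF backend, and the 1-indexed to zero-indexed conversion is an assumption modeled on the pdfminer note.

```python
# Hypothetical stand-in for a PDF backend: page texts stored zero-indexed.
FAKE_PAGES = {
    "example.pdf": ["First page text.", "Second page text."],
}


def parser(pdf_path: str, page_number: int, **kwargs) -> str:
    """Return the text of one page; page_number is 1-indexed (assumed API convention)."""
    if page_number < 1:
        raise ValueError("page_number is 1-indexed and must be >= 1")
    pages = FAKE_PAGES[pdf_path]
    # Convert the 1-indexed page number to a zero-indexed lookup.
    return pages[page_number - 1]


print(parser("example.pdf", 2))  # Second page text.
```

The `great-docs.yml` added in this PR says built-in parsers are passed through the `pdf_parser` argument of `pdf_reader`, "or bring your own"; a function shaped like this one would presumably be the kind of callable that argument expects.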
2 changes: 1 addition & 1 deletion README.md
@@ -112,4 +112,4 @@ nlp = spacy.load("en_core_web_sm")
 doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)
 ```

-Note that the `nlp.add_pipe` is not used by spaCyPDFreader.
+Note that the `nlp.add_pipe` is not used by spaCyPDFreader.
5 changes: 0 additions & 5 deletions docs/api/spacypdfreader.parsers.md

This file was deleted.

3 changes: 0 additions & 3 deletions docs/api/spacypdfreader.spacypdfreader.md

This file was deleted.

55 changes: 0 additions & 55 deletions docs/contributing.md

This file was deleted.

5 changes: 0 additions & 5 deletions docs/hooks.py

This file was deleted.

3 changes: 0 additions & 3 deletions docs/index.md

This file was deleted.

77 changes: 77 additions & 0 deletions great-docs.yml
@@ -0,0 +1,77 @@
+# Great Docs Configuration
+# See https://posit-dev.github.io/great-docs/user-guide/configuration.html
+
+# Display Name
+# ------------
+display_name: spacypdfreader
+
+# Docstring Parser
+# ----------------
+# The docstring format used in your package (numpy, google, or sphinx)
+parser: google
+
+# Dynamic Introspection
+# ---------------------
+# Use runtime introspection for more accurate documentation (default: true)
+dynamic: true
+
+# Logo & Favicon
+# --------------
+# show_title displays the package name next to the logo in the navbar; the
+# name links to the homepage.
+logo:
+  light: assets/user-astronaut-solid.svg
+  show_title: true
+
+# GitHub Integration
+# ------------------
+# GitHub link style: "widget" (shows stars count) or "icon" (simple icon)
+github_style: widget
+
+# Site URL
+# --------
+# Canonical address of the deployed documentation site.
+site_url: "https://samedwardes.github.io/spacypdfreader/"
+
+# Author Information
+# ------------------
+authors:
+  - name: Sam Edwardes
+    role: Maintainer
+    email: edwardes.s@gmail.com
+    github: SamEdwardes
+    homepage: https://www.linkedin.com/in/samedwardes
+
+# User Guide
+# ----------
+# Section ordering for the User Guide .qmd files in user_guide/.
+user_guide:
+  - section: "Guides"
+    contents:
+      - parsers.qmd
+      - multiprocessing.qmd
+      - spacy-extensions.qmd
+  - section: "Project"
+    contents:
+      - changelog.qmd
+      - contributing.qmd
+
+# Jupyter Kernel
+# --------------
+jupyter: python3
+
+# API Reference Structure
+# -----------------------
+reference:
+  - title: Functions
+    desc: The main entry point for converting a PDF into a spaCy Doc.
+    contents:
+      - spacypdfreader.pdf_reader
+
+  - title: Parsers
+    desc: >-
+      Built-in PDF-to-text parsers. Pass one of these to the `pdf_parser`
+      argument of `pdf_reader`, or bring your own.
+    contents:
+      - parsers.pdfminer.parser
+      - parsers.pytesseract.parser
10 changes: 5 additions & 5 deletions justfile
@@ -47,15 +47,15 @@ test-matrix:
 test-gha:
     gh workflow run pytest.yml --ref $(git branch --show-current)

+# great-docs requires Python 3.11+. The pytesseract extras are included so the
+# API reference can introspect every parser module.
 [group('docs')]
 preview-docs:
-    uv run mkdocs serve
+    uv run --python 3.12 --all-extras great-docs preview

 [group('docs')]
-publish-docs:
-    rm -rf site
-    uv run mkdocs build
-    uv run mkdocs gh-deploy
+build-docs:
+    uv run --python 3.12 --all-extras great-docs build

 [group('docs')]
 test-docs: