pdf-xtract

Overview

pdf-xtract is a Go library for PDF processing that provides high-fidelity extraction of text, content, and metadata.

Originally forked from ledongthuc/pdf, this library has been extensively refactored to meet enterprise-grade observability, performance, and compliance requirements.

Key features:

Efficient parsing and extraction of plain text, structured content, and document metadata.
Robust logging and tracing instrumentation for production debugging.
Compatibility with PDF v1.4 to v2.0 standards.

Installation

You can install the library using Go modules:

go get -u github.com/sassoftware/pdf-xtract

Getting Started

Import the library in your Go code:

import "github.com/sassoftware/pdf-xtract"

Logging & Observability

import "github.com/sassoftware/pdf-xtract/logger"

The refactored library includes a structured logging layer and a lightweight tracer interface to ensure production-grade observability.

High-level structured logs are emitted at major functional boundaries.
Error logs include contextual information (file, object, and parsing state).
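As a rough illustration of what a logger callback could look like, the sketch below formats a message and its key/value pairs into one line and filters by level. It mirrors the shape of the cfg.Logger function shown under Usage, but LogLevel, its constants, formatLine, and stderrLogger are local stand-ins invented for this example; the actual values of the library's logger.LogLevel are not documented here.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// LogLevel is a local stand-in for the library's logger.LogLevel type.
type LogLevel int

// The level names below are assumptions made for this sketch.
const (
	LevelDebug LogLevel = iota
	LevelInfo
	LevelWarn
	LevelError
)

func (l LogLevel) String() string {
	switch l {
	case LevelDebug:
		return "DEBUG"
	case LevelInfo:
		return "INFO"
	case LevelWarn:
		return "WARN"
	case LevelError:
		return "ERROR"
	}
	return "UNKNOWN"
}

// formatLine renders a message and its key/value pairs as "LEVEL msg k=v ...".
// A trailing key without a value is rendered as key=<missing>.
func formatLine(level LogLevel, msg string, keyvals ...interface{}) string {
	var b strings.Builder
	fmt.Fprintf(&b, "%s %s", level, msg)
	for i := 0; i < len(keyvals); i += 2 {
		if i+1 < len(keyvals) {
			fmt.Fprintf(&b, " %v=%v", keyvals[i], keyvals[i+1])
		} else {
			fmt.Fprintf(&b, " %v=<missing>", keyvals[i])
		}
	}
	return b.String()
}

// stderrLogger returns a callback with the same shape as cfg.Logger that
// writes to stderr and drops anything below minLevel.
func stderrLogger(minLevel LogLevel) func(LogLevel, string, ...interface{}) {
	return func(level LogLevel, msg string, keyvals ...interface{}) {
		if level < minLevel {
			return
		}
		fmt.Fprintln(os.Stderr, formatLine(level, msg, keyvals...))
	}
}

func main() {
	logFn := stderrLogger(LevelInfo)
	logFn(LevelDebug, "lexer token read") // suppressed: below LevelInfo
	logFn(LevelError, "failed to parse object", "file", "pdf_test.pdf", "object", 12)
}
```

Keeping the formatting in a separate function makes it easy to swap the destination (stderr here) for a file or an existing logging framework.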

Tracer Integration

import "github.com/sassoftware/pdf-xtract/tracer"

The library includes a lightweight Tracer subsystem that provides fine-grained observability into PDF parsing and extraction operations. It is designed to support debugging and operational monitoring in production environments where PDFs may be large, malformed, or complex.

The tracer records:

Object-level processing times (fonts, content streams, metadata objects, etc.)
Recovery attempts (e.g., corrupted xref tables, missing object references)
Execution flow, enabling reconstruction of what happened during extraction
Error points, with the ability to dump the trace when failures occur
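To make the idea of a trace concrete, here is a minimal sketch of a recorder that times each step, keeps events in order, and renders them only when something fails. It is not the library's tracer package: Event, Trace, Time, Dump, and errUnexpectedEOF are hypothetical names chosen for illustration. The only tracer call this README confirms is tracer.PrintTrace().

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
	"time"
)

// errUnexpectedEOF is a hypothetical error used only in this sketch.
var errUnexpectedEOF = errors.New("unexpected end of stream")

// Event is one recorded step of the extraction.
type Event struct {
	Kind     string // e.g. "font", "content-stream", "xref-recovery"
	Detail   string
	Duration time.Duration
	Err      error
}

// Trace collects events in order and is safe for concurrent workers.
type Trace struct {
	mu     sync.Mutex
	events []Event
}

// Time runs fn, measures how long it took, and records the outcome.
func (t *Trace) Time(kind, detail string, fn func() error) error {
	start := time.Now()
	err := fn()
	t.mu.Lock()
	t.events = append(t.events, Event{
		Kind: kind, Detail: detail, Duration: time.Since(start), Err: err,
	})
	t.mu.Unlock()
	return err
}

// Dump renders the execution flow, one event per line, marking failures.
func (t *Trace) Dump() string {
	t.mu.Lock()
	defer t.mu.Unlock()
	var b strings.Builder
	for i, e := range t.events {
		status := "ok"
		if e.Err != nil {
			status = "ERR: " + e.Err.Error()
		}
		fmt.Fprintf(&b, "%02d %-15s %-20s %s (%s)\n", i, e.Kind, e.Detail, status, e.Duration)
	}
	return b.String()
}

func main() {
	var tr Trace
	_ = tr.Time("font", "obj 7", func() error { return nil })
	err := tr.Time("content-stream", "page 3", func() error { return errUnexpectedEOF })
	if err != nil {
		fmt.Print(tr.Dump()) // dump the trace only when something failed
	}
}
```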

Running

After installing the library, you can either integrate pdf-xtract into your own Go applications or run the provided example programs to get started quickly.

git clone https://github.com/sassoftware/pdf-xtract.git
cd pdf-xtract
cd examples
go run main.go

Usage

Complete programs are available in the examples directory.
This library supports two primary extraction modes depending on the use case and PDF size:
1. Standard Extraction Mode (Batch Mode) – best for small/medium PDFs, returns complete text at once
2. Streaming Extraction Mode – best for large PDFs, returns text page-by-page without loading the entire file into memory

Standard Extraction Mode (Batch)

cfg := xtract.NewDefaultConfig()
cfg.MaxConcurrentPDFs = 1
cfg.MaxWorkersPerPDF = 4
cfg.ParsingMode = xtract.BestEffort
cfg.MaxTotalChars = 1000
cfg.MaxMemoryPerPDF = 10 << 20

cfg.Logger = func(level logger.LogLevel, msg string, keyvals ...interface{}) {
	// no-op logger
}

proc := xtract.NewProcessor(cfg)

text, truncated, err := proc.Extract(ctx, "pdf_test.pdf")
if err != nil {
	tracer.PrintTrace()
	return
}

fmt.Println("Truncated?", truncated)
fmt.Println("Extracted Text:", text)

// Metadata extraction
fmt.Println("---- PDF Metadata ----")
if err := proc.Metadata(ctx, "pdf_test.pdf", os.Stdout); err != nil {
	tracer.PrintTrace()
}

Streaming Extraction Mode

stream, truncated, err := proc.ExtractAsStream(ctx, "pdf_test.pdf")
if err != nil {
	return
}

fmt.Println("Streaming output:")
var total string

for pageText := range stream {
	fmt.Println(pageText)
	total += pageText
}

fmt.Println("Truncated?", truncated)
fmt.Println("Final concatenated length:", len(total))
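The loop above concatenates pages with +=, which copies the accumulated string on every iteration. For large documents a strings.Builder with an explicit size cap may be preferable. The sketch below assumes the stream is a channel of strings (the range loop above is consistent with that, but the actual type is not stated here); fakePages and collect are hypothetical helpers, and fakePages merely stands in for ExtractAsStream.

```go
package main

import (
	"fmt"
	"strings"
)

// fakePages is a hypothetical stand-in for the stream returned by
// ExtractAsStream: it emits one string per page, then closes the channel.
func fakePages(pages []string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for _, p := range pages {
			out <- p
		}
	}()
	return out
}

// collect drains the stream into a single string, keeping at most maxBytes
// bytes. It reports whether any text was cut off. The cap is applied to
// bytes, so a multi-byte character at the boundary may be split.
func collect(stream <-chan string, maxBytes int) (string, bool) {
	var b strings.Builder
	truncated := false
	for page := range stream {
		if truncated {
			continue // keep draining so the producer is never left blocked
		}
		if room := maxBytes - b.Len(); len(page) > room {
			b.WriteString(page[:room])
			truncated = true
			continue
		}
		b.WriteString(page)
	}
	return b.String(), truncated
}

func main() {
	pages := []string{"page one\n", "page two\n", "page three\n"}
	text, truncated := collect(fakePages(pages), 20)
	fmt.Printf("%q truncated=%v\n", text, truncated)
}
```

Draining the channel after the cap is reached is a deliberate choice in this sketch: abandoning an unbuffered channel would leave the producing goroutine blocked forever.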

Metadata Extraction

// Print metadata as JSON
err := proc.Metadata(ctx, "yourfile.pdf", os.Stdout)
if err != nil {
	fmt.Println("Failed to extract metadata:", err)
}

Performance Improvements

Recent optimizations have substantially reduced extraction time for multi-page and large PDFs. The smallest test file (21 KB, 1 page) was marginally slower after the changes:

PDF Size	Pages	Extraction Time (Before Changes)	Extraction Time (After Changes)
21 KB	1	1.18s	1.36s
260 KB	46	6.2s	1.26s
350 KB	70	8.7s	1.39s
800 KB	148	5.3s	1.43s
5 MB	9	41s	0.30s
5.9 MB	1000	5 min 12s	7.5s
6 MB	772	5 min 18s	5.36s
11 MB	602	5 min 25s	10s

Memory Usage per PDF

The library implements bounded memory allocation to prevent out-of-memory errors on large PDFs.

Note: Memory limits can be configured via MaxMemoryPerPDF. If the limit is exceeded during parsing, the operation will fail gracefully with a memory limit error.
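One common way to enforce such a bound is a shared byte budget that every allocation must reserve against. The sketch below shows that general technique, not the library's own alloc package: Budget, NewBudget, Reserve, Release, and ErrMemoryLimit are all hypothetical names. The limit of 10 << 20 bytes (10 MiB) matches the MaxMemoryPerPDF value used in the Usage example.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrMemoryLimit is a hypothetical sentinel error for this sketch.
var ErrMemoryLimit = errors.New("memory limit exceeded")

// Budget tracks bytes reserved against a fixed per-document limit.
type Budget struct {
	limit int64
	used  atomic.Int64
}

// NewBudget creates a budget with the given limit in bytes.
func NewBudget(limit int64) *Budget { return &Budget{limit: limit} }

// Reserve claims n bytes. If the claim would exceed the limit it returns an
// error wrapping ErrMemoryLimit and leaves the counter unchanged.
func (b *Budget) Reserve(n int64) error {
	for {
		cur := b.used.Load()
		if cur+n > b.limit {
			return fmt.Errorf("%w: want %d bytes, %d of %d in use",
				ErrMemoryLimit, n, cur, b.limit)
		}
		// Retry if another goroutine changed the counter in the meantime.
		if b.used.CompareAndSwap(cur, cur+n) {
			return nil
		}
	}
}

// Release returns n bytes to the budget.
func (b *Budget) Release(n int64) { b.used.Add(-n) }

func main() {
	b := NewBudget(10 << 20) // 10 MiB
	fmt.Println(b.Reserve(8 << 20))
	err := b.Reserve(4 << 20) // would bring the total to 12 MiB
	fmt.Println(errors.Is(err, ErrMemoryLimit), err)
}
```

Callers can check for the limit with errors.Is and skip or report the document instead of crashing the process.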

Contributing

Maintainers are accepting patches and contributions to this project. Please read CONTRIBUTING.md for details on how to submit them.

License

This project is licensed under the BSD 3-Clause License.
