Transform unstructured documents into validated, rich and queryable knowledge graphs.

IBM/docling-graph

Docling Graph

Docs PyPI version Python 3.10 | 3.11 | 3.12 uv Ruff License MIT Pydantic v2 Docling NetworkX Typer Rich vLLM Ollama LF AI & Data OpenSSF Best Practices

Docling-Graph turns documents into validated Pydantic objects, then builds a directed knowledge graph with explicit semantic relationships.

This transformation enables high-precision use cases in chemistry, finance, and legal domains, where AI must capture exact entity connections (compounds and reactions, instruments and dependencies, properties and measurements) rather than rely on approximate text embeddings.

This toolkit supports two extraction paths: local VLM extraction via Docling, and LLM-based extraction routed through LiteLLM for local runtimes (vLLM, Ollama) and API providers (Mistral, OpenAI, Gemini, IBM WatsonX), all orchestrated through a flexible, config-driven pipeline.

Key Capabilities

  • ✍🏻 Input Formats: Ingest PDFs, images, DoclingDocument, Markdown, URLs and more.

  • 🧠 Data Extraction: Extract structured data using VLM or LLM. Supports intelligent chunking and flexible processing modes.

  • 💎 Graph Construction: Convert validated Pydantic models into NetworkX directed graphs with semantic relationships, stable node IDs, and rich edge metadata.

  • 📦 Export: Save graphs in multiple Neo4j-compatible formats, including CSV and Cypher, for bulk import.

  • 📊 Visualization: Explore graphs with interactive HTML pages and detailed Markdown reports.
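To make the export capability concrete, here is a minimal sketch of what CSV and Cypher output for a small graph could look like. It uses only the Python standard library and does not call docling-graph's own exporters; the node IDs, column names, and the `nodes_to_csv` and `edges_to_cypher` helpers are hypothetical.

```python
import csv
import io

# Hypothetical in-memory graph: IDs, labels, and properties are made up.
NODES = [
    {"id": "org_1", "label": "Organization", "name": "Acme"},
    {"id": "person_1", "label": "Person", "name": "Ada"},
]
EDGES = [{"source": "org_1", "target": "person_1", "type": "EMPLOYS"}]


def nodes_to_csv(nodes):
    """Serialize node dicts to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "label", "name"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(nodes)
    return buffer.getvalue()


def edges_to_cypher(edges):
    """Emit one MERGE statement per edge, matching endpoints by id.

    Values are interpolated directly for readability; real import code
    should pass them as query parameters instead.
    """
    statements = []
    for edge in edges:
        statements.append(
            f"MATCH (a {{id: '{edge['source']}'}}), (b {{id: '{edge['target']}'}}) "
            f"MERGE (a)-[:{edge['type']}]->(b);"
        )
    return statements


csv_text = nodes_to_csv(NODES)
cypher = edges_to_cypher(EDGES)
print(csv_text)
print("\n".join(cypher))
```

Using `MERGE` rather than `CREATE` keeps the relationship import idempotent, so running the same statements twice does not duplicate edges.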

Latest Changes

Coming Soon

  • 🪜 Multi-Stage Extraction: Define extraction_stage in templates to control multi-pass extraction.

  • 🧩 Interactive Template Builder: Guided workflows for building Pydantic templates.

  • 🧲 Ontology-Based Templates: Match content to the best Pydantic template using semantic similarity.

  • 🔍 External OCR Engine: Pass custom OCR engine URL to convert documents before graph creation.

  • 💾 Graph Database Integration: Export data straight into Neo4j, ArangoDB, and similar databases.

Quick Start

Requirements

  • Python 3.10 or higher
  • uv package manager

Installation

# Clone the repository
git clone https://github.com/IBM/docling-graph
cd docling-graph

# Install with uv
uv sync                    # Core + LiteLLM + VLM

For detailed installation instructions, see Installation Guide.

API Key Setup (Remote Inference)

export OPENAI_API_KEY="..."        # OpenAI
export MISTRAL_API_KEY="..."       # Mistral
export GEMINI_API_KEY="..."        # Google Gemini

# IBM WatsonX
export WATSONX_API_KEY="..."       # IBM WatsonX API Key
export WATSONX_PROJECT_ID="..."    # IBM WatsonX Project ID
export WATSONX_URL="..."           # IBM WatsonX URL (optional)
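Before running a remote extraction, it can help to confirm which provider credentials are actually present. The sketch below is a small standalone check, not part of docling-graph: the variable names come from the exports above, while the provider keys in `REQUIRED_VARS` and the `configured_providers` helper are illustrative.

```python
import os

# Environment variables each provider needs, taken from the exports above.
# WATSONX_URL is optional, so it is not listed as required.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "mistral": ["MISTRAL_API_KEY"],
    "gemini": ["GEMINI_API_KEY"],
    "watsonx": ["WATSONX_API_KEY", "WATSONX_PROJECT_ID"],
}


def configured_providers(env=None):
    """Return providers whose required variables are all set and non-empty."""
    env = os.environ if env is None else env
    return [
        provider
        for provider, names in REQUIRED_VARS.items()
        if all(env.get(name) for name in names)
    ]


if __name__ == "__main__":
    print("Configured providers:", configured_providers())
```

Passing a plain dict as `env` makes the check easy to exercise without touching the real environment.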

Basic Usage

CLI

# Initialize configuration
uv run docling-graph init

# Convert document from URL
uv run docling-graph convert "https://arxiv.org/pdf/2207.02720" \
    --template "docs.examples.templates.rheology_research.ScholarlyRheologyPaper" \
    --processing-mode "many-to-one"

# Visualize results
uv run docling-graph inspect outputs

Python API - Default Behavior

from docling_graph import run_pipeline, PipelineContext
from docs.examples.templates.rheology_research import ScholarlyRheologyPaper

# Create configuration
config = {
    "source": "https://arxiv.org/pdf/2207.02720",
    "template": ScholarlyRheologyPaper,
    "backend": "llm",
    "inference": "remote",
    "processing_mode": "many-to-one",
    "provider_override": "mistral",
    "model_override": "mistral-medium-latest",
    "use_chunking": True,
}

# Run pipeline - returns data directly, no files written to disk
context: PipelineContext = run_pipeline(config)

# Access results
graph = context.knowledge_graph
models = context.extracted_models
metadata = context.graph_metadata

print(f"Extracted {len(models)} model(s)")
print(f"Graph: {graph.number_of_nodes()} nodes, {graph.number_of_edges()} edges")
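Once the pipeline returns, a quick breakdown of the graph by node and edge type is often more useful than raw totals. The helper below is a sketch that works on any object exposing NetworkX's `nodes(data=True)` and `edges(data=True)` views. The attribute name `label` is an assumption, since the attributes docling-graph attaches to nodes and edges are not described here, and `summarize_graph` is a hypothetical helper.

```python
from collections import Counter


def summarize_graph(graph, key="label"):
    """Count nodes and edges by the value of one attribute.

    `graph` is any object exposing NetworkX-style `nodes(data=True)` and
    `edges(data=True)` views. The attribute name in `key` is an assumption;
    items missing the attribute are counted under "unknown".
    """
    node_counts = Counter(
        attrs.get(key, "unknown") for _, attrs in graph.nodes(data=True)
    )
    edge_counts = Counter(
        attrs.get(key, "unknown") for _, _, attrs in graph.edges(data=True)
    )
    return dict(node_counts), dict(edge_counts)
```

With the example above this would be called as `summarize_graph(context.knowledge_graph)`. If everything lands under "unknown", inspect one node's attribute dict and pass the right attribute name as `key`.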

For debugging, use --debug with the CLI to save intermediate artifacts to disk; see Trace Data & Debugging. For more examples, see Examples.

Pydantic Templates

Templates define both the extraction schema and the resulting graph structure.

from pydantic import BaseModel, Field
from docling_graph.utils import edge

class Person(BaseModel):
    """Person entity with stable ID."""
    model_config = {
        'is_entity': True,
        'graph_id_fields': ['last_name', 'date_of_birth']
    }
    
    first_name: str = Field(description="Person's first name")
    last_name: str = Field(description="Person's last name")
    date_of_birth: str = Field(description="Date of birth (YYYY-MM-DD)")

class Organization(BaseModel):
    """Organization entity."""
    model_config = {'is_entity': True}
    
    name: str = Field(description="Organization name")
    employees: list[Person] = edge("EMPLOYS", description="List of employees")
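The purpose of `graph_id_fields` is that the same real-world entity maps to the same node every time it is extracted. How docling-graph computes its node IDs is not shown here, so the following is only an illustrative sketch of the idea, using a hash over the identity fields; `stable_node_id` is a hypothetical helper.

```python
import hashlib


def stable_node_id(label, record, id_fields):
    """Derive a deterministic node ID from a label and selected field values."""
    # Join the identity fields in a fixed order so the same entity
    # always produces the same hash input.
    identity = "|".join(str(record[field]) for field in id_fields)
    digest = hashlib.sha256(f"{label}|{identity}".encode("utf-8")).hexdigest()
    return f"{label}_{digest[:12]}"


ada = {"first_name": "Ada", "last_name": "Lovelace", "date_of_birth": "1815-12-10"}
# Same last name and birth date, different spelling of the first name.
ada_variant = {
    "first_name": "Augusta Ada",
    "last_name": "Lovelace",
    "date_of_birth": "1815-12-10",
}

id_fields = ["last_name", "date_of_birth"]
same_node = stable_node_id("Person", ada, id_fields) == stable_node_id(
    "Person", ada_variant, id_fields
)
print(same_node)  # True: first_name is not an identity field
```

Because `first_name` is left out of the identity fields, both records resolve to one node, which is the deduplication behavior a stable ID is meant to provide.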

For complete guidance, see the Schema Definition section of the documentation.

Documentation

Comprehensive documentation can be found on Docling Graph's Page.

Documentation Structure

The documentation follows the docling-graph pipeline stages:

  1. Introduction - Overview and core concepts
  2. Installation - Setup and environment configuration
  3. Schema Definition - Creating Pydantic templates
  4. Pipeline Configuration - Configuring the extraction pipeline
  5. Extraction Process - Document conversion and extraction
  6. Graph Management - Exporting and visualizing graphs
  7. CLI Reference - Command-line interface guide
  8. Python API - Programmatic usage
  9. Examples - Working code examples
  10. Advanced Topics - Performance, testing, error handling
  11. API Reference - Detailed API documentation
  12. Community - Contributing and development guide

Contributing

We welcome contributions! Please see the Community section of the documentation for the contributing and development guide.

Development Setup

# Clone and setup
git clone https://github.com/IBM/docling-graph
cd docling-graph

# Install with dev dependencies
uv sync --extra dev

# Run pre-commit checks
uv run pre-commit run --all-files

License

MIT License - see LICENSE for details.

Acknowledgments

IBM ❤️ Open Source AI

Docling Graph has been brought to you by IBM.
