AI-powered biomedical research report generator
Parallel LangGraph agents that plan → retrieve → synthesise → compile evidence-based medical reports from PubMed and the web.
| Feature | Description |
|---|---|
| Automated Report Planning | DSPy-driven planner generates structured section outlines from a single research query |
| Triple-Source Retrieval | Live PubMed search via BioPython Entrez, local PubMed FAISS index, and Tavily web search |
| Parallel Agent Architecture | LangGraph orchestrates multiple section-writing agents concurrently with tool access |
| Scratchpad Protocol | Agents follow a disciplined extract → note → synthesise workflow for traceable research |
| Hybrid Retrieval (BM25 + Dense) | Ensemble retriever with cross-encoder reranking for high-precision document retrieval |
| Live PubMed Search | On-demand querying of PubMed with automatic CSV persistence and active-source switching |
| Unified Dataset Per Run | Single CSV and FAISS index per run; all section branches share one deduplicated dataset |
| Numbered Citation System | Academic-style [N] inline citations with a clean References section at the end |
| Pre-synthesis Verification | Evidence quality gate checks source count, quantitative depth, and scratchpad length before synthesis |
MedReportAI is a multi-agent LangGraph pipeline with three main phases: report planning (DSPy), parallel section building, and final compilation with references.
Each research section runs a two-phase loop inside a parallel sub-graph:
- Phase 1 - Tool-Based Research: The section agent uses PubMed search, FAISS retrieval, and web search tools to gather evidence into a scratchpad. A verification gate checks the scratchpad before advancing.
- Phase 2 - Synthesis: The agent writes the final section from the scratchpad, using numbered [N] citations that map to the shared citation registry.
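The pre-synthesis verification gate described above can be sketched in a few lines. This is an illustrative stand-in, not the project's `core/verification.py`; the function name, thresholds, and the digit-based proxy for "quantitative depth" are all assumptions:

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]


def verify_scratchpad(scratchpad: str, source_count: int,
                      min_sources: int = 3, min_chars: int = 500) -> GateResult:
    """Illustrative evidence-quality gate: check source count, quantitative
    depth (any digits present), and scratchpad length before synthesis."""
    reasons = []
    if source_count < min_sources:
        reasons.append(f"only {source_count} sources (need {min_sources})")
    if not any(ch.isdigit() for ch in scratchpad):
        reasons.append("no quantitative findings recorded")
    if len(scratchpad) < min_chars:
        reasons.append(f"scratchpad too short ({len(scratchpad)} < {min_chars} chars)")
    return GateResult(passed=not reasons, reasons=reasons)
```

When the gate fails, the section agent loops back to Phase 1 for more research instead of advancing to synthesis.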
Every pipeline run receives a unique `run_id`. All `pubmed_scraper_tool` calls within a run write to one shared CSV file, deduplicated by PMID. A single FAISS index reflects this unified dataset and is queried by `retriever_tool` across all section branches.
A citation registry (`dict[str, int]`) maps each unique source URL to a stable number. Section agents emit `[N]` inline citations during synthesis. The final report includes a References section containing only sources whose `[N]` appears in the body text.
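The registry behaviour can be sketched in a few lines. The helper names here are hypothetical; the project's actual implementation lives in `core/tool_node.py` and `core/quality.py`:

```python
def register_source(registry: dict[str, int], url: str) -> int:
    """Return the stable citation number for a URL, assigning the next one if new."""
    if url not in registry:
        registry[url] = len(registry) + 1
    return registry[url]


def build_references(registry: dict[str, int], body: str) -> list[str]:
    """Keep only sources whose [N] marker actually appears in the body text."""
    return [
        f"[{n}] {url}"
        for url, n in sorted(registry.items(), key=lambda kv: kv[1])
        if f"[{n}]" in body
    ]
```

Because numbers are assigned on first sight and never reused, `[N]` markers stay stable across all parallel section branches.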
```
┌────────────────────────────────────────────────────────────┐
│                     LangGraph Pipeline                     │
│                                                            │
│  ┌──────────────┐    ┌─────────────────────────────┐       │
│  │ Plan Report  │───▶│  Build Sections (parallel)  │       │
│  │   (DSPy)     │    │  ┌─────────┐   ┌─────────┐  │       │
│  └──────────────┘    │  │Section 1│ … │Section N│  │       │
│                      │  │Phase 1: │   │Phase 1: │  │       │
│                      │  │ research│   │ research│  │       │
│                      │  │Phase 2: │   │Phase 2: │  │       │
│                      │  │ synth.  │   │ synth.  │  │       │
│                      │  └─────────┘   └─────────┘  │       │
│                      └──────────────┬──────────────┘       │
│                                     ▼                      │
│  ┌──────────────┐    ┌─────────────────────────────┐       │
│  │   Compile    │◀───│    Write Final Sections     │       │
│  │ Final Report │    │    (intro, conclusion)      │       │
│  │ + References │    └─────────────────────────────┘       │
│  └──────────────┘                                          │
└────────────────────────────────────────────────────────────┘
        │                    │                    │
 ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
 │ PubMed RAG  │      │    Live     │      │   Tavily    │
 │  (FAISS +   │      │   PubMed    │      │    Web      │
 │   BM25)     │      │   Search    │      │   Search    │
 └─────────────┘      └─────────────┘      └─────────────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             ▼
                  ┌──────────────────────┐
                  │ Unified CSV + FAISS  │
                  │  (one per run_id)    │
                  └──────────────────────┘
```
- Entry: `generate_plan`
- Fan-out: `initiate_section_writing` → parallel `build_section_with_tools` sub-graphs
- Gather: `gather_completed_sections`
- Fan-out: `initiate_final_section_writing` → parallel `write_final_sections`
- Compile: `compile_final_report`
- Validate: `validate_report_quality`
- Exit: `END`
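The two fan-out steps map planned sections to parallel sub-graph invocations; in LangGraph this is typically done by returning `Send` objects from a conditional edge. A minimal sketch (with a stand-in `Send` dataclass so it runs without LangGraph installed; real code would import `Send` from `langgraph.types`, and the state keys shown are assumptions):

```python
from dataclasses import dataclass


@dataclass
class Send:  # stand-in for langgraph.types.Send
    node: str
    state: dict


def initiate_section_writing(state: dict) -> list[Send]:
    """Dispatch one build_section_with_tools sub-graph per planned section."""
    return [
        Send("build_section_with_tools", {"section": s, "run_id": state["run_id"]})
        for s in state["sections"]
    ]
```

Each dispatched sub-graph runs its own two-phase research/synthesis loop, and `gather_completed_sections` joins the branches back together.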
```
MedReportAI/
├── app.py                      # LangGraph pipeline definition & entry point
├── config.py                   # Model, retriever, and path configuration
├── langgraph.json              # LangGraph Studio deployment config
├── pyproject.toml              # Project metadata & dependencies
│
├── agents/
│   └── planner.py              # Report plan generation & final section writing (DSPy)
│
├── core/
│   ├── nodes.py                # Graph node functions (fan-out, synthesis, compile)
│   ├── quality.py              # Citation validation, reference building, truncation detection
│   ├── schemas.py              # Pydantic schemas (scratchpad ops, Section model)
│   ├── signatures.py           # DSPy signatures (ReportPlanner, FinalInstructions)
│   ├── states.py               # LangGraph state definitions with reducer annotations
│   ├── tool_node.py            # Tool execution node with routing + citation registry
│   └── verification.py         # Pre-synthesis verification gate
│
├── rag/
│   ├── chain.py                # RAG chain construction
│   ├── embeddings.py           # FastEmbed wrapper for LangChain
│   ├── retrieval_builder.py    # Ensemble retriever + cross-encoder reranker
│   ├── retrieval_formatter.py  # Structured report from retriever results
│   └── source_formatter.py     # Web search result formatting & deduplication
│
├── tools/
│   ├── pubmed_search.py        # Live PubMed search with unified CSV persistence
│   ├── retrieval.py            # PubMed FAISS retriever tool
│   ├── web_search.py           # Tavily web search tool
│   ├── scratchpad.py           # Read/write/clear scratchpad operations
│   └── query_generator.py      # DSPy multi-query generator
│
├── prompts/
│   ├── planner.py              # Context persona & report structure prompts
│   ├── section_writer.py       # Two-phase section writing protocol
│   └── scraper.py              # PubMed query parsing prompt
│
├── scripts/
│   └── pubmed_scraper.py       # PubMed article scraper (BioPython Entrez + DeepSeek)
│
├── utils/
│   ├── data_processing.py      # CSV loading, semantic chunking, FAISS indexing
│   ├── formatting.py           # Rich console formatters
│   ├── helpers.py              # Environment setup, logging, file helpers
│   └── scratchpad_helpers.py   # Scratchpad read/write/clear handlers
│
└── tests/                      # Detailed test suite
```
- Python 3.12+
- A DeepSeek API key
- A Tavily API key (for web search)
- (Optional) An email for NCBI Entrez (for live PubMed search)
```bash
git clone https://github.com/Chrisolande/MedReportAI.git
cd MedReportAI

# Install with uv (recommended)
uv sync --extra dev

# Or with pip
pip install -e ".[dev]"
```

Create a `.env` file in the project root:

```bash
DEEPSEEK_API_KEY=your_deepseek_api_key_here
TAVILY_API_KEY=your_tavily_api_key_here
ENTREZ_EMAIL=your_email@example.com  # Optional: for live PubMed search
```

The sole interface is LangGraph Studio. Open the project directory in LangGraph Studio and invoke the graph with a topic:
```json
{
  "topic": "Long-term pediatric health outcomes in conflict settings"
}
```

The pipeline runs fully autonomously from topic input to final report.
```python
from app import graph

result = await graph.ainvoke(
    {"topic": "Impact of malnutrition on child neurodevelopment"},
    config={
        "configurable": {
            "context": "You are a pediatric nutrition researcher...",
            "report_organization": "Structure the report with..."
        }
    }
)
print(result["final_report"])
```

All settings are centralized in `config.py`:
| Setting | Default | Description |
|---|---|---|
| `deepseek_model` | `deepseek-chat` | LLM model |
| `deepseek_temperature` | `1.3` | Generation temperature |
| `max_tokens` | `512` | Maximum tokens per LLM call |
| `embedding_model` | `sentence-transformers/all-MiniLM-L6-v2` | Embedding model (FastEmbed) |
| `k` | `15` | Documents to retrieve |
| `sparse_weight` | `0.65` | BM25 weight in ensemble |
| `dense_weight` | `0.35` | Dense retrieval weight |
| `top_n` | `5` | Documents after reranking |
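The sparse/dense weighting can be illustrated with a minimal score-fusion sketch. This is pure-Python pseudocode for the idea, with a hypothetical `ensemble_rank` helper and min-max normalisation as an assumption; the project's actual retriever is built in `rag/retrieval_builder.py` on top of LangChain:

```python
def ensemble_rank(sparse: dict[str, float], dense: dict[str, float],
                  sparse_weight: float = 0.65, dense_weight: float = 0.35,
                  top_n: int = 5) -> list[str]:
    """Combine normalised sparse (BM25) and dense scores per document id,
    then keep the top_n, mirroring the ensemble weights above."""
    def normalise(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        hi, lo = max(scores.values()), min(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    s, d = normalise(sparse), normalise(dense)
    combined = {doc: sparse_weight * s.get(doc, 0.0) + dense_weight * d.get(doc, 0.0)
                for doc in set(s) | set(d)}
    return sorted(combined, key=combined.get, reverse=True)[:top_n]
```

In the real pipeline, a cross-encoder reranker then rescores these candidates before the final `top_n` cut.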
| Layer | Technology |
|---|---|
| Orchestration | LangGraph - stateful multi-agent pipeline |
| Prompt Framework | DSPy - structured signatures |
| LLM | DeepSeek |
| Retrieval | FAISS + BM25 ensemble with FastEmbed reranking |
| Live Search | BioPython Entrez - real-time PubMed |
| Web Search | Tavily - web search with raw content |
| Tracing | LangSmith - observability |
```bash
# Run tests
uv run pytest tests/ -v

# Lint
uvx ruff check .
uvx ruff format .

# Complexity check
uv run radon cc -s -a core/ tools/ scripts/
```

This project is licensed under the MIT License.
Built with ❤ by @Chrisolande
