COSMIC: COncept-aware Semantic Meta-chunking with Intelligent Classification

A production-ready intelligent text chunking framework for Retrieval-Augmented Generation (RAG) systems.

Developed by Manceps Research Division

Research Objectives

COSMIC addresses fundamental limitations in existing text chunking approaches for RAG systems.

Problem Statement

Current chunking methods suffer from three critical issues:

Semantic Fragmentation - Fixed-length chunkers split mid-concept, breaking coherent ideas
Context Loss - Simple overlap strategies create redundancy without preserving meaning
Domain Blindness - One-size-fits-all approaches ignore domain-specific structure

Our Approach

COSMIC introduces a 6-stage pipeline that combines:

Discourse Coherence Scoring (DCS) - Multi-signal boundary detection using topical coherence, coreference density, and discourse markers
MST-based Domain Clustering - Minimum spanning tree clustering for domain classification
Adaptive Boundary Fusion - Weighted combination of structural and semantic signals
LLM Verification - Optional verification of uncertain boundaries
Zero-Overlap Architecture - Self-contained conceptual chunks without redundant overlap

Target Metrics

Metric	Target	Description
Coherence Score	> 0.85	Semantic unity within chunks
Cross-Concept Splits	< 5%	Chunks that break conceptual boundaries
Latency	< 150ms/page	Processing speed
Fallback Rate	< 15%	Graceful degradation frequency

Installation

Prerequisites

Python 3.10+
CUDA-capable GPU (recommended) or CPU
8GB+ RAM

Install from PyPI

# Install core package
pip install cosmic-chunker

# Install with all optional dependencies (recommended)
pip install "cosmic-chunker[all]"  # quoted so shells such as zsh do not expand the brackets

# Install spaCy model for coreference resolution
python -m spacy download en_core_web_trf

Install from Source

# Clone the repository
git clone https://github.com/manceps/cosmic.git
cd cosmic

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/macOS
# or: venv\Scripts\activate  # Windows

# Install with all dependencies
pip install -e ".[all]"

# Install spaCy model for coreference resolution
python -m spacy download en_core_web_trf

Docker Installation

# Build container
docker build -t cosmic:latest .

# Run with GPU support
docker run --gpus all -v "$(pwd)":/workspace cosmic:latest

Configuration

Environment Variables

Create a .env file in the project root (see .env.example):

# LLM Provider: "openai", "ollama", or "auto"
COSMIC_LLM_PROVIDER=openai

# LLM endpoint for Stage 5 verification (OpenAI-compatible API)
COSMIC_LLM_URL=http://localhost:8000/v1
COSMIC_LLM_MODEL=default

# Ollama configuration (when using provider=ollama)
OLLAMA_HOST=http://localhost:11434
COSMIC_OLLAMA_MODEL=auto  # "auto" or specific model

# Embedding computation device
COSMIC_EMBEDDING_DEVICE=cuda  # Options: cuda, cpu, mps
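
As an illustration of how these variables could be consumed, here is a minimal sketch that reads them with the standard library and validates the two enumerated options. The `EnvSettings` class and `load_env_settings` function are hypothetical names, not part of the COSMIC API, and falling back to the example values shown above is an assumption about the defaults.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvSettings:
    """Hypothetical container for the COSMIC_* variables (not the COSMIC API)."""
    llm_provider: str
    llm_url: str
    llm_model: str
    embedding_device: str


def load_env_settings(env=None):
    """Read settings from a mapping (os.environ by default) and validate them."""
    env = os.environ if env is None else env
    provider = env.get("COSMIC_LLM_PROVIDER", "openai")
    if provider not in {"openai", "ollama", "auto"}:
        raise ValueError(f"unsupported COSMIC_LLM_PROVIDER: {provider!r}")
    device = env.get("COSMIC_EMBEDDING_DEVICE", "cuda")
    if device not in {"cuda", "cpu", "mps"}:
        raise ValueError(f"unsupported COSMIC_EMBEDDING_DEVICE: {device!r}")
    return EnvSettings(
        llm_provider=provider,
        llm_url=env.get("COSMIC_LLM_URL", "http://localhost:8000/v1"),
        llm_model=env.get("COSMIC_LLM_MODEL", "default"),
        embedding_device=device,
    )


settings = load_env_settings({"COSMIC_LLM_PROVIDER": "ollama",
                              "COSMIC_EMBEDDING_DEVICE": "cpu"})
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to test without touching the real environment.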

Using Ollama for LLM Verification

COSMIC integrates with Ollama for local LLM verification. The CLI can automatically detect, start, and stop Ollama:

# Auto-detect and use the best available model
cosmic chunk document.txt --strategy full --ollama auto

# Use a specific model
cosmic chunk document.txt --strategy full --ollama gemma3:latest

# Check Ollama status and available models
cosmic ollama status
cosmic ollama list

When using --ollama:

COSMIC checks if Ollama is installed and has models available
If the server isn't running, it starts automatically
After chunking completes, the server is stopped (if COSMIC started it)

Recommended models for verification (in order of preference):

gemma3 - Fast, good quality (3.3 GB)
qwen2.5-coder:7b - Good balance (4.7 GB)
llama3.2 - Versatile (various sizes)

Configuration Files

Default configuration: configs/default.yaml

dcs:
  alpha: 0.4    # Topical coherence weight
  beta: 0.35   # Coreference density weight
  gamma: 0.25  # Discourse signal weight

structure:
  heading_weight: 0.4
  list_weight: 0.3
  table_weight: 0.3

fusion:
  structural_weight: 0.6
  semantic_weight: 0.4
  acceptance_threshold: 0.5

chunk_constraints:
  min_tokens: 100
  max_tokens: 512
  target_tokens: 350
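
The default weights in each group above add up to 1.0, and the token limits are ordered min, target, max. The sketch below checks a configuration for those two properties before it is used. Whether COSMIC itself enforces them is an assumption, and `check_config` is a hypothetical helper that works on a plain dictionary mirroring the YAML.

```python
import math

# Values copied from the default configuration shown above.
DEFAULT_CONFIG = {
    "dcs": {"alpha": 0.4, "beta": 0.35, "gamma": 0.25},
    "structure": {"heading_weight": 0.4, "list_weight": 0.3, "table_weight": 0.3},
    "fusion": {"structural_weight": 0.6, "semantic_weight": 0.4,
               "acceptance_threshold": 0.5},
    "chunk_constraints": {"min_tokens": 100, "max_tokens": 512, "target_tokens": 350},
}

# Keys that are treated as weights within each section (assumed to sum to 1.0).
WEIGHT_GROUPS = {
    "dcs": ("alpha", "beta", "gamma"),
    "structure": ("heading_weight", "list_weight", "table_weight"),
    "fusion": ("structural_weight", "semantic_weight"),
}


def check_config(cfg):
    """Return a list of human-readable problems; an empty list means no issues."""
    problems = []
    for section, keys in WEIGHT_GROUPS.items():
        total = sum(cfg[section][key] for key in keys)
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            problems.append(f"{section} weights sum to {total:.3f}, expected 1.0")
    limits = cfg["chunk_constraints"]
    if not limits["min_tokens"] <= limits["target_tokens"] <= limits["max_tokens"]:
        problems.append("chunk_constraints must satisfy min <= target <= max")
    return problems


problems = check_config(DEFAULT_CONFIG)  # empty for the defaults above
```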

Domain taxonomy: configs/taxonomies/default.yaml

Defines domain-specific terminology and patterns for classification.

Usage

Basic Usage

from cosmic import COSMICChunker, Document

# Initialize chunker with default configuration
chunker = COSMICChunker()

# Create document from text
doc = Document.from_text("""
Your document text here. COSMIC will analyze the structure,
detect semantic boundaries, and create coherent chunks.
""")

# Chunk with automatic strategy selection
chunks = chunker.chunk_document(doc, strategy="auto")

# Access chunk data
for chunk in chunks:
    print(f"Domain: {chunk.domain}")
    print(f"Coherence: {chunk.coherence_score:.2f}")
    print(f"Text: {chunk.text[:100]}...")
    print("---")

Strategy Selection

# Full 6-stage pipeline (highest quality)
chunks = chunker.chunk_document(doc, strategy="full")

# Semantic-only (faster, DCS without structure analysis)
chunks = chunker.chunk_document(doc, strategy="semantic")

# Sliding window (basic similarity-based)
chunks = chunker.chunk_document(doc, strategy="sliding")

# Fixed-length (fastest, token-based splitting)
chunks = chunker.chunk_document(doc, strategy="fixed")

# Auto (recommended) - selects based on document structure
chunks = chunker.chunk_document(doc, strategy="auto")

Batch Processing

from cosmic import BatchProcessor, Document, COSMICConfig

# Initialize batch processor
processor = BatchProcessor(
    config=COSMICConfig(),
    max_workers=4,
)

# Process multiple documents (replace `texts` with your own strings)
texts = ["First document text.", "Second document text."]
documents = [Document.from_text(text) for text in texts]
result = processor.process(documents, strategy="auto", show_progress=True)

print(f"Processed: {result.documents_processed}")
print(f"Failed: {result.documents_failed}")
print(f"Total chunks: {result.total_chunks}")

for doc_id, chunks in result.chunks_by_document.items():
    print(f"Document {doc_id}: {len(chunks)} chunks")

Custom Configuration

from cosmic import COSMICChunker, COSMICConfig
from cosmic.core.config import DCSConfig, ChunkConstraints

# Create custom configuration
config = COSMICConfig(
    dcs=DCSConfig(
        alpha=0.5,   # Increase topical coherence weight
        beta=0.3,
        gamma=0.2,
    ),
    chunk_constraints=ChunkConstraints(
        min_tokens=50,
        max_tokens=1024,
        target_tokens=512,
    ),
)

chunker = COSMICChunker(config=config)

Loading from YAML

from cosmic import COSMICChunker, COSMICConfig

config = COSMICConfig.from_yaml("configs/custom.yaml")
chunker = COSMICChunker(config=config)

Architecture

6-Stage Pipeline

Document → Structure Analysis → Semantic Boundaries → Domain Classification
                                                              ↓
              Reference Linking ← LLM Verification ← Boundary Fusion
                      ↓
               COSMICChunks (with rich metadata)

Stage 1: Structure Analysis

Detects headings, lists, tables, and other structural elements
Computes structure score (0-1)
Selects processing pathway based on document structure
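
One way to picture a structure score is a presence-based sum over the element types, using the `structure` weights from the default configuration. This is only a sketch: the markdown-style regular expressions, the presence-based scoring rule, and the `choose_pathway` cutoff are all assumptions, not COSMIC's documented behavior.

```python
import re

# Assumed markdown-style patterns; COSMIC's real detectors are not documented here.
HEADING = re.compile(r"^#{1,6}\s+\S", re.MULTILINE)
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+\S", re.MULTILINE)
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$", re.MULTILINE)


def structure_score(text, heading_weight=0.4, list_weight=0.3, table_weight=0.3):
    """Presence-based score in [0, 1]: add a weight for each element type found."""
    score = 0.0
    if HEADING.search(text):
        score += heading_weight
    if LIST_ITEM.search(text):
        score += list_weight
    if TABLE_ROW.search(text):
        score += table_weight
    return score


def choose_pathway(score, cutoff=0.5):
    """Hypothetical routing rule: structured documents get the full pipeline."""
    return "full" if score >= cutoff else "semantic"


sample = "# Title\n\n- item one\n- item two\n\nBody text."
pathway = choose_pathway(structure_score(sample))  # headings + lists found
```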

Stage 2: Semantic Boundary Detection

Computes Discourse Coherence Score (DCS) between sentences
Identifies candidate boundaries where coherence drops
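
To make "where coherence drops" concrete, the sketch below takes a list of DCS values, one per gap between adjacent sentences, and flags gaps that are local minima below a threshold. The local-minimum rule and the 0.5 threshold are assumptions for illustration; the source does not specify how drops are detected.

```python
def candidate_boundaries(scores, threshold=0.5):
    """Return indices of gaps whose score is a local minimum below `threshold`.

    scores[i] is the coherence between sentence i and sentence i + 1.
    """
    candidates = []
    for i, score in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("inf")
        if score < threshold and score <= left and score <= right:
            candidates.append(i)
    return candidates


gaps = [0.9, 0.8, 0.3, 0.85, 0.9, 0.4, 0.45]
found = candidate_boundaries(gaps)  # gaps 2 and 5 are the dips
```

Gap 6 (0.45) is below the threshold but is not flagged because its neighbor at gap 5 is lower, which avoids proposing two boundaries for one dip.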

Stage 3: Domain Classification

Uses MST-based clustering on chunk embeddings
Matches clusters to domain taxonomy
Assigns domain labels to chunks
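
The general idea of MST-based clustering can be sketched in plain Python: build a minimum spanning tree over pairwise cosine distances, remove edges longer than a cut value, and treat the remaining connected components as clusters. The cut value of 0.3 and the function names are assumptions, and COSMIC's own implementation in `scoring/clustering.py` may differ.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mst_edges(points):
    """Prim's algorithm over the complete graph; returns (distance, i, j) edges."""
    if not points:
        return []
    # best[j] = (distance from j to the growing tree, tree node that achieves it)
    best = {j: (cosine_distance(points[0], points[j]), 0)
            for j in range(1, len(points))}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        dist, i = best.pop(j)
        edges.append((dist, i, j))
        for k in best:
            d = cosine_distance(points[j], points[k])
            if d < best[k][0]:
                best[k] = (d, j)
    return edges


def mst_clusters(points, cut=0.3):
    """Keep MST edges no longer than `cut`; label each connected component."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for dist, i, j in mst_edges(points):
        if dist <= cut:
            parent[find(i)] = find(j)
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]


embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
cluster_ids = mst_clusters(embeddings)  # two tight pairs, far apart
```

Matching each resulting cluster to a taxonomy entry is a separate step that this sketch does not cover.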

Stage 4: Boundary Fusion

Merges structural (weight: 0.6) and semantic (weight: 0.4) signals
Applies acceptance threshold filtering
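
Read literally, the fusion step is a weighted sum followed by a threshold test. The sketch below uses the default `fusion` values from the configuration. It assumes both input signals lie in [0, 1] and that a fused score equal to the threshold is accepted; neither detail is stated in the source.

```python
def fuse_boundary(structural, semantic,
                  structural_weight=0.6, semantic_weight=0.4,
                  acceptance_threshold=0.5):
    """Return (fused_score, accepted) for one candidate boundary."""
    fused = structural_weight * structural + semantic_weight * semantic
    return fused, fused >= acceptance_threshold


# A heading coincides with a weak semantic signal: accepted.
heading_case = fuse_boundary(structural=1.0, semantic=0.2)

# A strong semantic signal with no structural support: rejected.
prose_case = fuse_boundary(structural=0.0, semantic=0.9)
```

If the fusion really is this plain weighted sum, a boundary with no structural support can reach at most 0.4 under the defaults, which is below the 0.5 threshold. That is worth keeping in mind when tuning `semantic_weight` or `acceptance_threshold` for unstructured text.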

Stage 5: LLM Verification

Verifies uncertain boundaries (confidence < 0.8) via external LLM
Auto-accepts high-confidence boundaries
Supports OpenAI-compatible APIs and Ollama
Use --ollama flag for automatic Ollama integration
Skipped if no LLM endpoint configured
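
The routing described above can be sketched as a small filter: boundaries at or above the 0.8 confidence level are kept without a call, the rest are passed to a verifier, and the stage is a no-op when no verifier is available. The `verify_boundaries` name and the callable-based verifier are hypothetical, and passing every boundary through unchanged when the stage is skipped is an assumption.

```python
def verify_boundaries(boundaries, verifier=None, auto_accept=0.8):
    """Filter (position, confidence) pairs.

    verifier: callable taking a position and returning True to keep it,
              or None when no LLM endpoint is configured.
    """
    if verifier is None:
        # Stage skipped: assume all fused boundaries pass through unchanged.
        return [position for position, _ in boundaries]
    kept = []
    for position, confidence in boundaries:
        if confidence >= auto_accept or verifier(position):
            kept.append(position)
    return kept


fused = [(3, 0.95), (7, 0.60), (12, 0.75)]

# Stand-in verifier that only confirms the boundary at position 12.
with_llm = verify_boundaries(fused, verifier=lambda position: position == 12)
without_llm = verify_boundaries(fused)
```

Because `or` short-circuits, the verifier is only called for boundaries below the auto-accept level, which keeps the number of LLM requests down.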

Stage 6: Reference Linking

Detects explicit references (regex patterns)
Resolves coreferences using spaCy
Links related chunks for retrieval
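
As a rough illustration of regex-based reference detection, the sketch below finds mentions such as "Figure 3" or "Section 2.1" in a chunk of text. The pattern and the set of keywords are assumptions chosen for the example; the patterns COSMIC actually uses are not listed in this README.

```python
import re

# Assumed pattern: a keyword followed by a dotted number or a single capital letter.
REFERENCE_PATTERN = re.compile(
    r"\b(Section|Figure|Table|Equation|Appendix)\s+(\d+(?:\.\d+)*|[A-Z])\b"
)


def find_explicit_references(text):
    """Return (kind, label) pairs for every explicit reference in `text`."""
    return [(m.group(1), m.group(2)) for m in REFERENCE_PATTERN.finditer(text)]


refs = find_explicit_references(
    "As shown in Figure 3 and Section 2.1, see Appendix B."
)
```

Each detected pair could then be matched against the chunk that contains the referenced element, which is the linking step and is not shown here.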

DCS Formula

DCS = α × topical_coherence + β × coreference_density + γ × discourse_signal

Where:

α = 0.4: Topical coherence from embedding similarity
β = 0.35: Coreference density measuring entity continuity
γ = 0.25: Discourse markers indicating transitions

Lower DCS → Higher boundary confidence
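
A short worked example of the formula with the default weights. The three input signals are made-up values in [0, 1], chosen only to show the arithmetic.

```python
def dcs(topical_coherence, coreference_density, discourse_signal,
        alpha=0.4, beta=0.35, gamma=0.25):
    """Weighted sum from the formula above, using the default weights."""
    return (alpha * topical_coherence
            + beta * coreference_density
            + gamma * discourse_signal)


# Two sentences on the same topic that share entities.
same_topic = dcs(0.9, 0.8, 0.7)    # 0.36 + 0.28 + 0.175 = 0.815

# A topic change with few shared entities.
topic_shift = dcs(0.2, 0.1, 0.3)   # 0.08 + 0.035 + 0.075 = 0.19
```

The second gap has the lower DCS, so it is the stronger boundary candidate of the two.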

Fallback Chain

COSMIC implements graceful degradation:

Full COSMIC → Semantic-only → Sliding window → Fixed-length
(structure)   (DCS only)     (basic similarity) (token split)

Each fallback level reduces computational requirements at some cost in chunking quality: the full pipeline is the highest-quality strategy and fixed-length splitting is the fastest.
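
The chain can be pictured as trying each strategy in order and returning the first one that succeeds. The sketch below uses toy stand-in functions rather than COSMIC's strategies, and the `chunk_with_fallback` helper is hypothetical; which errors trigger a fallback in COSMIC is not specified here.

```python
def chunk_with_fallback(text, strategies):
    """Try each (name, function) pair in order; return the first result that succeeds."""
    errors = []
    for name, strategy in strategies:
        try:
            return name, strategy(text)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))


def unavailable(text):
    """Stand-in for a stage whose model or endpoint is missing."""
    raise RuntimeError("model not loaded")


def fixed_length(text, max_words=5):
    """Toy last-resort splitter: groups of `max_words` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


chain = [("full", unavailable), ("semantic", unavailable), ("fixed", fixed_length)]
used, chunks = chunk_with_fallback("one two three four five six seven", chain)
```

Returning the name of the strategy that was used makes it possible to track how often each fallback level is reached.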

Benchmarks

Running Benchmarks

# Run full benchmark suite
python -m benchmarks.runner

# Run with specific datasets
python -m benchmarks.runner --datasets arxiv pubmed

# Run with limited samples
python -m benchmarks.runner --limit 100

Available Baselines

Fixed-length (512 tokens) - Standard token-based splitting
LangChain Recursive - RecursiveCharacterTextSplitter
Semantic Chunking - Embedding similarity-based splitting
Percentile Semantic - Adaptive threshold semantic chunking

Metrics

Coherence Score - Average intra-chunk semantic similarity
Cross-Concept Splits - Percentage of boundaries breaking concepts
Latency - Processing time per page (ms)
Throughput - Documents per second
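
One plausible reading of "average intra-chunk semantic similarity" is the mean pairwise cosine similarity of the sentence embeddings inside a chunk. The sketch below implements that reading with plain tuples as embeddings. The pairwise definition, and returning 1.0 for a single-sentence chunk, are assumptions; the benchmark code may compute the metric differently.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def chunk_coherence(sentence_embeddings):
    """Mean cosine similarity over all sentence pairs in one chunk."""
    pairs = list(combinations(sentence_embeddings, 2))
    if not pairs:
        return 1.0  # a chunk with fewer than two sentences is trivially coherent
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


focused = chunk_coherence([(1.0, 0.0), (1.0, 0.0), (2.0, 0.0)])  # all parallel
mixed = chunk_coherence([(1.0, 0.0), (0.0, 1.0)])                # orthogonal
```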

Project Structure

cosmic/
├── src/cosmic/
│   ├── core/           # Data structures
│   │   ├── chunk.py    # COSMICChunk dataclass
│   │   ├── config.py   # Configuration system
│   │   ├── document.py # Document representation
│   │   └── enums.py    # Enumerations
│   │
│   ├── pipeline/       # 6 pipeline stages
│   │   ├── structure.py    # Stage 1
│   │   ├── semantic.py     # Stage 2
│   │   ├── domain.py       # Stage 3
│   │   ├── fusion.py       # Stage 4
│   │   ├── verification.py # Stage 5
│   │   └── reference.py    # Stage 6
│   │
│   ├── scoring/        # Scoring algorithms
│   │   ├── dcs.py      # Discourse Coherence Score
│   │   └── clustering.py # MST clustering
│   │
│   ├── models/         # ML model wrappers
│   │   ├── embeddings.py # Sentence-transformers
│   │   ├── llm.py        # LLM client
│   │   ├── ollama.py     # Ollama integration
│   │   └── coreference.py # spaCy coreference
│   │
│   ├── fallback/       # Degradation strategies
│   ├── chunker.py      # Main entry point
│   ├── cli.py          # Command-line interface
│   └── batch.py        # Batch processing
│
├── benchmarks/
│   ├── runner.py       # Benchmark orchestration
│   ├── metrics/        # Evaluation metrics
│   ├── baselines/      # Comparison methods
│   └── datasets/       # Data loaders
│
├── configs/
│   ├── default.yaml    # Default configuration
│   └── taxonomies/     # Domain taxonomies
│
└── tests/              # Unit and integration tests

Development

Running Tests

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=cosmic --cov-report=html

# Run specific test module
pytest tests/unit/test_dcs.py -v

Type Checking

mypy src/cosmic/

Code Style

# Format code
black src/ tests/

# Sort imports
isort src/ tests/

# Lint
ruff check src/ tests/

API Reference

COSMICChunker

class COSMICChunker:
    def __init__(
        self,
        config: Optional[COSMICConfig] = None,
        taxonomy_path: Optional[Path] = None,
    ) -> None: ...

    def chunk_document(
        self,
        document: Document,
        strategy: str = "auto",
    ) -> list[COSMICChunk]: ...

COSMICChunk

@dataclass(frozen=True)
class COSMICChunk:
    chunk_id: str
    text: str
    token_count: int
    char_start: int
    char_end: int
    sentence_indices: tuple[int, ...]
    domain: str
    coherence_score: float
    boundary_confidence: float
    cross_references: tuple[str, ...]
    intent: Intent
    metadata: dict

Document

class Document:
    @classmethod
    def from_text(
        cls,
        text: str,
        doc_id: Optional[str] = None,
        metadata: Optional[dict] = None,
    ) -> Document: ...

    @classmethod
    def from_file(cls, path: Path) -> Document: ...

Citation

If you use COSMIC in your research, please cite:

@article{cosmic2026,
  title={COSMIC: COncept-aware Semantic Meta-chunking with Intelligent Classification},
  author={Al Kari and {Manceps Research Division}},
  journal={arXiv preprint},
  year={2026}
}

License

Apache 2.0 License - see LICENSE file for details.

Contributing

Contributions are welcome! Please read our Contributing Guidelines before submitting pull requests.

Documentation

Full documentation: https://cosmic-chunker.readthedocs.io/

Acknowledgments

COSMIC builds upon research in:

Meta-Chunking (Yu et al., 2024)
S² Chunking (Shi et al., 2024)
Discourse Coherence Scoring (Ji et al., 2023)
