SvPhaser

Haplotype-aware structural-variant (SV) phasing and genotyping from long-read data

SvPhaser assigns haplotype-aware genotypes to pre-called structural variants (SVs) using HP-tagged long-read alignments (PacBio HiFi, ONT Q20+, etc.).

It fills a critical gap in long-read SV analysis:

SV callers (e.g. Sniffles2) discover variants
SvPhaser phases and genotypes them (1|0, 0|1, 1|1, or ./.) with explicit read-level evidence and a quantitative genotype quality (GQ) score

SvPhaser is:

Caller-agnostic — works with any SV VCF format
Deterministic — no random sampling or HMMs; reproducible results
Designed for large-scale benchmarking and biological interpretation — CSV-first output for transparent analysis

Key features

Post-hoc SV phasing from HP-tagged BAM/CRAM — no re-calling needed
Per-chromosome parallelization — efficiently scales on HPC and multi-core systems
SV-type-aware evidence detection — specialized logic for DEL / INS / INV / BND / DUP
Deterministic Δ-based decision logic — haplotype imbalance thresholds, no sampling
Strict size consistency controls — optional size-matching for DEL/INS variants
Explicit confidence scoring — Phred-scaled GQ capped at 99, with derivable binning
CSV-first design — transparent per-SV metrics for benchmarking and debugging
VCF-compliant output — rich SVP_* INFO annotations for downstream analysis
Read-level evidence tracking — counts by haplotype (HP1, HP2, untagged) with reason codes
Hybrid support counting — combines HP-tagged + untagged reads with configurable thresholds

Installation

From PyPI (recommended)

# Requires Python >= 3.9
pip install svphaser

Optional extras:

pip install "svphaser[plots]"   # plotting utilities
pip install "svphaser[bench]"   # benchmarking helpers
pip install "svphaser[dev]"     # development + linting

From source

git clone https://github.com/SFGLab/SvPhaser.git
cd SvPhaser
pip install -e .

Inputs & requirements

SvPhaser requires only two inputs:

Unphased SV VCF (.vcf / .vcf.gz)
- Produced by an SV caller (e.g. Sniffles2)
- May optionally contain RNAMES INFO for precise read support
HP-tagged BAM/CRAM
- Long-read alignments with haplotype tags (HP=1/2)
- Generated by an upstream phasing pipeline (e.g. WhatsHap)

⚠️ If the BAM/CRAM does not contain HP tags, SvPhaser cannot assign haplotypes.
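
Before a run, it can be worth confirming that the alignments actually carry HP tags. The sketch below is one way to check; `hp_tag_summary` is a hypothetical helper (not part of SvPhaser) that accepts any iterable of read objects exposing pysam-style `has_tag` / `get_tag` methods. A small stand-in class is used so the example runs without a BAM file.

```python
from collections import Counter
from itertools import islice


def hp_tag_summary(reads, limit=10_000):
    """Count HP tag values among the first `limit` reads.

    `reads` is any iterable of objects with pysam-style `has_tag` / `get_tag`
    methods. Hypothetical helper, not part of SvPhaser.
    """
    counts = Counter()
    for read in islice(reads, limit):
        if read.has_tag("HP"):
            counts[f"HP{read.get_tag('HP')}"] += 1
        else:
            counts["untagged"] += 1
    return counts


class _FakeRead:
    """Stand-in for an aligned read so the example runs without a BAM."""

    def __init__(self, hp=None):
        self._hp = hp

    def has_tag(self, tag):
        return tag == "HP" and self._hp is not None

    def get_tag(self, tag):
        if not self.has_tag(tag):
            raise KeyError(tag)
        return self._hp


demo_reads = [_FakeRead(1), _FakeRead(2), _FakeRead(1), _FakeRead(None)]
summary = hp_tag_summary(demo_reads)
print(dict(summary))

# With a real file (assumes pysam is installed):
#   import pysam
#   with pysam.AlignmentFile("sample.sorted_phased.bam", "rb") as bam:
#       print(hp_tag_summary(bam))
```

If the summary shows only untagged reads, the file most likely has not been through a haplotagging step yet.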

Quick start (CLI)

svphaser phase \
  sample_unphased.vcf.gz \
  sample.sorted_phased.bam \
  --out-dir results/ \
  --min-support 10 \
  --min-tagged-support 3 \
  --major-delta 0.60 \
  --equal-delta 0.10 \
  --support-mode hybrid \
  --dynamic-window \
  --tie-to-hom-alt \
  --gq-bins "30:High,10:Moderate" \
  --threads 32

Key parameters

| Parameter | Default | Meaning |
|---|---|---|
| `--min-support` | 10 | Minimum total supporting reads (HP1+HP2+NOHP) required to genotype an SV; SVs below this threshold are emitted as `./.` |
| `--min-tagged-support` | 3 | Minimum HP-tagged reads (HP1+HP2) needed for directional phasing (`1\|0` or `0\|1`) |
| `--major-delta` | 0.60 | Haplotype imbalance threshold (max HP count / tagged total) for a strong majority |
| `--equal-delta` | 0.10 | Tie threshold (\|HP1-HP2\| / tagged total); at or below this, both haplotypes are treated as supporting the SV (→ `1\|1` when `--tie-to-hom-alt` is set) |
| `--tie-to-hom-alt` | True | When a tie is detected and both haplotypes carry reads, emit `1\|1` (otherwise `./.`) |
| `--support-mode` | hybrid | Counting method: `hybrid` (HP-tagged reads preferred), `tagged-only`, or `all` |
| `--gq-bins` | "30:High,10:Moderate" | Confidence cutoffs for binning GQ into labels (e.g. High ≥ 30, Moderate ≥ 10) |
| `--threads` | 1 | Number of parallel workers (one per chromosome) |
| `--no-svp-info` | n/a | Disable writing `SVP_*` INFO annotations to the output VCF |
| `--size-match-required` | True | For DEL/INS: enforce size consistency between the VCF record and the read evidence |
| `--size-tol-abs` | 10 | Absolute size tolerance (bp) for DEL/INS matching |
| `--size-tol-frac` | 0.0 | Fractional size tolerance for DEL/INS matching |
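
The `--gq-bins` value is a comma-separated list of `threshold:label` pairs. As an illustration of how such a string could be parsed and applied to a GQ value, here is a minimal sketch; `parse_gq_bins` and `gq_label` are hypothetical names, and returning `None` for a GQ below every threshold is an assumption rather than documented SvPhaser behaviour.

```python
def parse_gq_bins(spec):
    """Parse "30:High,10:Moderate" into [(30, "High"), (10, "Moderate")].

    Bins are sorted by descending threshold so that the first match wins.
    """
    bins = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        threshold, label = item.split(":", 1)
        bins.append((int(threshold), label.strip()))
    return sorted(bins, key=lambda b: b[0], reverse=True)


def gq_label(gq, bins):
    """Return the label of the highest bin whose threshold is <= gq."""
    for threshold, label in bins:
        if gq >= threshold:
            return label
    return None  # assumption: no label below the lowest threshold


bins = parse_gq_bins("30:High,10:Moderate")
print(bins)
print(gq_label(45, bins), gq_label(12, bins), gq_label(3, bins))
```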

Outputs

For an input sample.vcf.gz, SvPhaser produces:

Primary: `sample_phased.csv`

A tabular summary with per-SV analysis, including:

Metadata: chrom, pos, id, end, svtype (DEL/INS/INV/BND/DUP)
Evidence counts: hp1, hp2, nohp (haplotype-tagged and untagged supporting reads)
Totals: tagged_total (HP1+HP2), support_total (HP1+HP2+NOHP)
Decision metrics:
- delta — haplotype imbalance (max/tagged_total)
- equal_delta — absolute difference (|HP1-HP2|/tagged_total)
- tag_frac — fraction of support that is HP-tagged
Final calls:
- gt — phased genotype (1|0, 0|1, 1|1, or ./.)
- gq — Phred-scaled genotype quality (0–99)
- gq_label — optional binned confidence level (e.g., "High", "Moderate")
- reason — explanation code (e.g., "MinSupport", "Tie", "LowTagged")
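
Because the CSV exposes the raw counts, the totals and decision metrics can be recomputed and checked independently. The sketch below does this with the standard library on a few made-up rows; the data are illustrative only, `derived_metrics` is a hypothetical helper, and the zero-denominator fallback is an assumption.

```python
import csv
import io

# Made-up rows that reuse a subset of the column names listed above;
# this is illustrative data, not real SvPhaser output.
SAMPLE_CSV = """\
chrom,pos,id,svtype,hp1,hp2,nohp,gt,gq
chr1,10500,sv1,DEL,12,1,3,1|0,48
chr1,88200,sv2,INS,6,5,4,1|1,22
chr2,40210,sv3,DEL,1,1,2,./.,0
"""


def derived_metrics(hp1, hp2, nohp):
    """Recompute the totals and decision metrics from raw read counts."""
    tagged_total = hp1 + hp2
    support_total = tagged_total + nohp
    return {
        "tagged_total": tagged_total,
        "support_total": support_total,
        # Ratios fall back to 0.0 when the denominator is zero (an assumption).
        "delta": max(hp1, hp2) / tagged_total if tagged_total else 0.0,
        "equal_delta": abs(hp1 - hp2) / tagged_total if tagged_total else 0.0,
        "tag_frac": tagged_total / support_total if support_total else 0.0,
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    row.update(derived_metrics(int(row["hp1"]), int(row["hp2"]), int(row["nohp"])))

# Keep confidently phased calls: a resolved genotype with GQ >= 30.
confident = [r["id"] for r in rows if r["gt"] != "./." and int(r["gq"]) >= 30]
print(confident)
```

To run the same check on real output, replace the `io.StringIO(...)` source with an open file handle on the CSV written to the output directory.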

Secondary: `sample_phased.vcf`

Interoperability output with:

FORMAT fields: GT (phased), GQ (quality)
INFO annotations (written unless --no-svp-info is given):
- SVP_HP1, SVP_HP2, SVP_NOHP — read counts
- SVP_TAGFRAC — fraction tagged
- SVP_DELTA — haplotype imbalance
- SVP_GQBIN — confidence level label

The CSV is the primary artifact for analysis; the VCF is for compatibility and downstream tools.
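
For downstream scripts that only have the VCF, the `SVP_*` annotations can be pulled out of the INFO column with plain string handling. The sketch below is one way to do it; the record is made up, the value formats are assumptions, and `parse_svp_info` is a hypothetical helper.

```python
def parse_svp_info(info_field):
    """Extract SVP_* key/value pairs from a VCF INFO column string."""
    out = {}
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        if sep and key.startswith("SVP_"):
            out[key] = value
    return out


# Made-up VCF record (tab-separated); value formats are assumptions.
line = (
    "chr1\t10500\tsv1\tN\t<DEL>\t.\tPASS\t"
    "SVTYPE=DEL;END=10820;SVP_HP1=12;SVP_HP2=1;SVP_NOHP=3;"
    "SVP_TAGFRAC=0.8125;SVP_DELTA=0.923;SVP_GQBIN=High\t"
    "GT:GQ\t1|0:48"
)

fields = line.split("\t")
svp = parse_svp_info(fields[7])                         # INFO is the 8th column
sample = dict(zip(fields[8].split(":"), fields[9].split(":")))  # FORMAT -> value

print(svp)
print(sample["GT"], sample["GQ"])
```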

Phasing decision logic (quick reference)

For each SV, SvPhaser counts reads by haplotype tag (HP=1, HP=2, or missing) and applies a deterministic decision tree:

Minimum support gate: If support_total (HP1+HP2+NOHP) < min_support → emit ./. and drop SV
Tagged support gate: If tagged_total (HP1+HP2) < min_tagged_support → emit ./.
Tie detection: If |HP1 - HP2| / tagged_total ≤ equal_delta
- If tie_to_hom_alt=True and both HP1 > 0 and HP2 > 0 → emit 1|1 (both haplotypes carry)
- Else → emit ./. (ambiguous)
Strong majority: If max(HP1, HP2) / tagged_total ≥ major_delta
- If HP1 > HP2 → emit 1|0 (ALT on haplotype 1)
- If HP2 > HP1 → emit 0|1 (ALT on haplotype 2)
Else: → emit ./. (weak or no signal)
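
The steps above can be sketched as a small pure function. This is a reading of the quick reference, not SvPhaser's actual implementation: `decide_gt` is a hypothetical name, the mapping of reason codes to branches is an assumption, and the codes "Majority" and "Weak" are placeholders invented for this example.

```python
def decide_gt(hp1, hp2, nohp, *, min_support=10, min_tagged_support=3,
              major_delta=0.60, equal_delta=0.10, tie_to_hom_alt=True):
    """Return (gt, reason) following the quick-reference decision steps."""
    tagged_total = hp1 + hp2
    support_total = tagged_total + nohp

    # 1. Minimum support gate.
    if support_total < min_support:
        return "./.", "MinSupport"
    # 2. Tagged support gate (also guards the divisions below against zero).
    if tagged_total == 0 or tagged_total < min_tagged_support:
        return "./.", "LowTagged"
    # 3. Tie detection.
    if abs(hp1 - hp2) / tagged_total <= equal_delta:
        if tie_to_hom_alt and hp1 > 0 and hp2 > 0:
            return "1|1", "Tie"
        return "./.", "Tie"
    # 4. Strong majority.
    if max(hp1, hp2) / tagged_total >= major_delta:
        return ("1|0" if hp1 > hp2 else "0|1"), "Majority"  # placeholder code
    # 5. Weak or no signal.
    return "./.", "Weak"  # placeholder code


for counts in [(12, 1, 3), (1, 12, 3), (6, 5, 4), (11, 8, 0), (1, 1, 2)]:
    print(counts, decide_gt(*counts))
```

Note that the tie check runs before the majority check, so a nearly balanced SV is resolved as a tie even when one haplotype holds slightly more than half of the tagged reads.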

Genotype Quality (GQ) is calculated from a Phred-scaled binomial tail probability:

For shallow coverage (N ≤ 200): exact binomial test
For deep coverage (N > 200): continuity-corrected normal approximation (avoids overflow)
Capped at 99 (Phred scale)
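
To make the GQ description concrete, here is a minimal sketch of a Phred-scaled binomial tail with an exact branch and a continuity-corrected normal branch. It rests on assumptions not stated above: a null hypothesis of equal haplotype support (p = 0.5), a one-sided upper tail evaluated at max(HP1, HP2), and rounding to the nearest integer. The function names are hypothetical; docs/Methodology.md remains the reference for the real formula.

```python
import math


def binom_tail_exact(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5), computed with exact integers."""
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n


def binom_tail_normal(k, n):
    """Continuity-corrected normal approximation to P(X >= k) with p = 0.5."""
    mean = n / 2
    sd = math.sqrt(n) / 2
    z = (k - 0.5 - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


def gq_sketch(n1, n2, exact_limit=200, cap=99):
    """Phred-scaled tail probability, capped at `cap` (assumed formulation)."""
    n = n1 + n2
    if n == 0:
        return 0
    k = max(n1, n2)
    p = binom_tail_exact(k, n) if n <= exact_limit else binom_tail_normal(k, n)
    if p <= 0.0:  # underflow in the approximation: treat as maximal confidence
        return cap
    return min(cap, max(0, round(-10 * math.log10(p))))


for n1, n2 in [(5, 5), (9, 1), (10, 0), (300, 0)]:
    print((n1, n2), gq_sketch(n1, n2))
```

Under these assumptions, a balanced split yields a low score and a strongly one-sided split approaches the cap, which matches the intent of a confidence score for directional phasing.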

A full, implementation-faithful description of the algorithm—including:

evidence collection
haplotype decision logic
pseudoalgorithm
workflow diagram

is provided in:

➡️ docs/Methodology.md

This document is the authoritative reference for reviewers and users seeking algorithmic clarity.

Python API

from pathlib import Path
from svphaser import phase

# Simple usage
out_vcf, out_csv = phase(
    "sample.vcf.gz",
    "sample.sorted_phased.bam",
    out_dir="results",
)

# Full control
out_vcf, out_csv = phase(
    "sample.vcf.gz",
    "sample.sorted_phased.bam",
    out_dir="results",
    min_support=10,
    min_tagged_support=3,
    major_delta=0.60,
    equal_delta=0.10,
    support_mode="hybrid",
    bp_window=100,
    dynamic_window=True,
    tie_to_hom_alt=True,
    gq_bins="30:High,10:Moderate",
    threads=8,
    size_match_required=True,
    size_tol_abs=10,
    size_tol_frac=0.0,
)

print(f"Phased VCF: {out_vcf}")
print(f"Summary CSV: {out_csv}")

Returns a tuple: (phased_vcf_path, summary_csv_path)

Alternatively, use the lower-level API directly:

from pathlib import Path

from svphaser.phasing.io import phase_vcf
from svphaser.phasing.types import WorkerOpts

opts = WorkerOpts(
    min_support=10,
    min_tagged_support=3,
    major_delta=0.60,
    equal_delta=0.10,
    tie_to_hom_alt=True,
    support_mode="hybrid",
    bp_window=100,
    dynamic_window=True,
    size_match_required=True,
    size_tol_abs=10,
    size_tol_frac=0.0,
    gq_bins=[(30, "High"), (10, "Moderate")],
)

phase_vcf(
    Path("sample.vcf.gz"),
    Path("sample.bam"),
    out_dir=Path("results"),
    worker_opts=opts,
    threads=8,
)

Repository structure

SvPhaser/
├─ src/svphaser/            # main package
│  ├─ cli.py               # CLI interface (Typer app)
│  ├─ __init__.py          # public API (phase() function)
│  ├─ logging.py           # logging configuration
│  ├─ phasing/             # core algorithms & I/O
│  │  ├─ algorithms.py     # haplotype classification, GQ calculation (pure math)
│  │  ├─ io.py            # orchestration, CSV/VCF writing (per-chromosome workers)
│  │  ├─ _workers.py      # internal: per-chromosome worker, read evidence counting
│  │  ├─ types.py         # WorkerOpts, CallTuple, type aliases
│  │  └─ __init__.py      # public API exports
│  └─ py.typed            # PEP 561 marker for type information
│
├─ tests/                   # unit & regression tests
│  ├─ test_algorithms.py   # GQ, classification logic
│  ├─ test_cli_smoke.py    # CLI smoke tests
│  ├─ test_io.py          # CSV/VCF output validation
│  ├─ test_workers.py     # BAM parsing, read counting
│  └─ data/               # minimal test fixtures
│
├─ docs/                    # documentation
│  ├─ Methodology.md       # algorithmic deep-dive (implementation-faithful)
│  └─ Presentation/        # slide decks & figures
│
├─ Benchmarking_Analysis/   # perf analysis & results
├─ pyproject.toml          # PEP 621 metadata, build config
├─ requirements.txt        # runtime dependencies (mirror of pyproject)
├─ requirements-dev.txt    # dev/test dependencies
├─ README.md              # this file
├─ CONTRIBUTING.md        # contributor guidelines
├─ CODE_OF_CONDUCT.md     # community standards
├─ LICENSE                # MIT
└─ CHANGELOG.md           # version history

Core modules

algorithms.py — Pure mathematics (no I/O)

phasing_gq(n1, n2) — Phred-scaled genotype quality (binomial tail + normal approx)
classify_haplotype(n1, n2, ...) — GT decision tree (returns ("1|0"|"0|1"|"1|1"|"./.", gq))
Threshold logic: major_delta, equal_delta, min_support, tie_to_hom_alt

_workers.py — Per-chromosome logic

Read BAM for each chromosome, count HP tags
Apply size-consistency filters (DEL/INS)
Call classify_haplotype() for each SV
Return formatted results (gt, gq, reason)

io.py — Orchestration & I/O

Parse VCF header, spawn workers (one per chromosome)
Merge per-chromosome results, apply global filters
Write phased VCF + CSV summary
Backfill optional columns (gq_label, tag_frac, etc.)
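
The spawn-and-merge pattern described for io.py can be illustrated with the standard library. This is only a sketch of the general pattern, not SvPhaser's code: all names are hypothetical, the worker is a placeholder that does no real phasing, and a thread pool is used here purely to keep the example portable (a process pool is the usual choice for CPU-bound work).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_chrom(records):
    """Group SV records by chromosome, keeping first-seen chromosome order."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["chrom"]].append(rec)
    return groups


def phase_chromosome(chrom, records):
    """Placeholder worker: a real worker would read the BAM and genotype SVs."""
    return [{**rec, "gt": "./."} for rec in records]


def run_per_chromosome(records, threads=1):
    """Submit one task per chromosome, then merge results deterministically."""
    groups = group_by_chrom(records)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = {
            chrom: pool.submit(phase_chromosome, chrom, recs)
            for chrom, recs in groups.items()
        }
        merged = []
        for chrom in groups:  # merge in input order, not completion order
            merged.extend(futures[chrom].result())
    return merged


svs = [
    {"chrom": "chr1", "pos": 10500, "id": "sv1"},
    {"chrom": "chr2", "pos": 40210, "id": "sv3"},
    {"chrom": "chr1", "pos": 88200, "id": "sv2"},
]
result = run_per_chromosome(svs, threads=2)
print([r["id"] for r in result])
```

Merging by chromosome order rather than by task completion order keeps the output identical from run to run, regardless of how many workers are used.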

Citing SvPhaser

If SvPhaser contributes to your research, please cite:

@software{svphaser2026,
  author  = {Pranjul Mishra and Sachin Gadakh},
  title   = {SvPhaser: Haplotype-aware phasing of structural variants from long-read data},
  version = {2.1.x},
  year    = {2026},
  url     = {https://github.com/SFGLab/SvPhaser},
  note    = {PyPI: https://pypi.org/project/svphaser/}
}

For maximum reproducibility, include the exact git commit hash used.

License

SvPhaser is released under the MIT License — see LICENSE.

Contact

Developed at SFG Lab (BioAI).

Pranjul Mishra — pranjul.mishra@proton.me
Sachin Gadakh — s.gadakh@cent.uw.edu.pl

Bug reports and feature requests: please open a GitHub issue.
