GenoCLAMP: consolidation of sequence motifs from genomic-scale data

Justin S Cha¹, B Franklin Pugh¹, William KM Lai¹,²*

¹Department of Molecular Biology and Genetics, Cornell University, USA
²Department of Computational Biology, Cornell University, USA

Correspondence: wkl29@cornell.edu

Abstract

Genome-wide protein–DNA mapping studies have produced thousands of enriched sequence motifs, many of which are redundant or highly similar because transcription factors (TFs) often bind cooperatively as complexes to shared sites. Consolidating these motifs into representative profiles is essential for reducing redundancy and improving the characterization of TF–DNA interactions. Here, we present Genomic CLustered Alignment of Motif Profiles (GenoCLAMP), a new algorithm that unifies motif clustering and alignment to collapse redundant motifs into high-quality merged matrix representations. Using motifs generated from a large-scale Saccharomyces cerevisiae ChIP-exo project, we demonstrate that GenoCLAMP reliably recovers known TF complexes that bind common motifs and uncovers evidence for previously uncharacterized motifs. Moreover, GenoCLAMP-consolidated motifs exhibit distinct positional and combinatorial patterns across genomic features, revealing a broader DNA-sequence “grammar” that structures TF binding across the genome. Together, these results establish GenoCLAMP as a robust and efficient framework for motif consolidation, enhancing both computational analyses and biological interpretation of large-scale TF–DNA binding datasets.

General software notes

GenoCLAMP is available in both Python and Julia. Python is preinstalled on many operating systems, so the Python version may be easier to set up; the Julia version may be the better choice when speed matters. Singularity/Apptainer definition files are provided for both versions.

Software details (Python)

Dependencies

Python >= 3.6
numpy
scipy
numba
openpyxl (for the output summary)
cairosvg (optional, for the output summary)

Getting started

To download GenoCLAMP:

git clone https://github.com/CEGRcode/GenoCLAMP.git

Support via PyPI and conda is coming soon.

Running GenoCLAMP

Example command on simulated motifs:

python genoclamp-python/run_genoclamp.py --meme simulated_motifs/SIM-1.meme --output-dest simulated_motifs/SIM-1_results

GenoCLAMP takes one or more motif files in the MEME file format. The files can be passed directly with --meme:

python genoclamp-python/run_genoclamp.py --meme motif1.meme motif2.meme ... {other_args}

Alternatively, list the MEME file paths one per line in a plain text file and pass that file with --meme-list:

python genoclamp-python/run_genoclamp.py --meme-list meme_files.txt {other_args}
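When there are many MEME files, the list file for --meme-list can be generated rather than typed by hand. The following is a minimal sketch, not part of GenoCLAMP: the helper name write_meme_list and the directory name in the usage comment are hypothetical, and it assumes the paths it writes are valid from the directory where GenoCLAMP is later run.

```python
from pathlib import Path


def write_meme_list(motif_dir, list_path):
    """Write every *.meme path found under motif_dir to list_path, one per line.

    Hypothetical helper for illustration; it only prepares the text file
    that --meme-list expects (one MEME file path per line).
    """
    paths = sorted(str(p) for p in Path(motif_dir).rglob("*.meme"))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths


# Example usage (directory and file names are placeholders):
# write_meme_list("my_motifs", "meme_files.txt")
```

The resulting meme_files.txt can then be passed to --meme-list as in the command above.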

The default output location is ./genoclamp_out. This can be changed with --output-dest or -o:

python genoclamp-python/run_genoclamp.py --output-dest new_genoclamp_out {other_args}
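For scripted runs, the same command can be assembled in Python and launched with subprocess. This is a sketch under stated assumptions: build_genoclamp_command is a hypothetical helper, it uses only the --meme and --output-dest flags shown above, and the default script path assumes the working directory is the repository root.

```python
import subprocess
import sys


def build_genoclamp_command(meme_files, output_dest=None,
                            script="genoclamp-python/run_genoclamp.py"):
    """Assemble an argv list for the GenoCLAMP Python entry point.

    Hypothetical helper; only --meme and --output-dest are used.
    """
    if not meme_files:
        raise ValueError("at least one MEME file is required")
    cmd = [sys.executable, script, "--meme", *meme_files]
    if output_dest is not None:
        # When omitted, GenoCLAMP writes to its default, ./genoclamp_out.
        cmd += ["--output-dest", output_dest]
    return cmd


# Example usage, run from the repository root:
# cmd = build_genoclamp_command(["simulated_motifs/SIM-1.meme"],
#                               "simulated_motifs/SIM-1_results")
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.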

Flags

--meme: Input MEME files
--meme-list: Plain text file with one MEME file path per line
- One of --meme or --meme-list is required
--nsites-thresh: Motifs with nsites less than the threshold will be filtered out
--evalue-thresh: Motifs with E-value greater than the threshold will be filtered out
--info-score-thresh: Motifs with information score less than the threshold will be filtered out
--periodicity1-thresh: Motifs with periodicity score for period 1 greater than the threshold will be filtered out
--periodicity2-thresh: Motifs with periodicity score for period 2 greater than the threshold will be filtered out
--periodicity3-thresh: Motifs with periodicity score for period 3 greater than the threshold will be filtered out
--pc: Alpha parameters for the Dirichlet prior (one value per nucleotide)
--min-base-overlap: Minimum number of overlapping bases required for merging clusters
--min-information-overlap: Minimum bit overlap dot product required for merging clusters
--max-information-overhang: Maximum sum of absolute bit difference allowed for merging clusters
--concentration: Concentration parameter (clustering score = BLLR * cluster_size ^ concentration)
--n-workers: Number of worker threads (default is number of CPUs)
--trim-thresh: Bases on the periphery of the consensus motif with information below the threshold will be trimmed
--get-sites: When specified, outputs BED files of the union of binding sites for each cluster (input MEME files must include site information)
--output-dest: Output directory
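Two of the flags above can be illustrated with a short sketch. The score function follows the formula given for --concentration directly. The trimming function is only an approximation of what --trim-thresh describes: it assumes per-position information is measured in bits against a uniform background (2 + Σ p·log2 p), which may differ from how GenoCLAMP computes it, and both function names are hypothetical.

```python
import math


def clustering_score(bllr, cluster_size, concentration):
    """Score as stated for --concentration: BLLR * cluster_size ^ concentration."""
    return bllr * cluster_size ** concentration


def position_information(column):
    """Information (bits) of one A/C/G/T probability column.

    Assumption: uniform background, so information = 2 + sum(p * log2(p)).
    """
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)


def trim_periphery(pfm, trim_thresh):
    """Drop leading and trailing positions whose information is below trim_thresh.

    Only the periphery is trimmed; low-information positions in the
    interior of the motif are kept.
    """
    info = [position_information(col) for col in pfm]
    start, end = 0, len(pfm)
    while start < end and info[start] < trim_thresh:
        start += 1
    while end > start and info[end - 1] < trim_thresh:
        end -= 1
    return pfm[start:end]
```

With a concentration of 0 the score reduces to the BLLR alone, while larger values weight the score toward bigger clusters.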

Output

GenoCLAMP will create folders inside the folder specified by --output-dest with names formatted as clusterX. Each folder will contain the following files:

Aligned Position Frequency Matrices (PFMs) in TRANSFAC format named clusterX_aligned-motifs.transfac
Consensus PFM in TRANSFAC format named clusterX_consensus-motif.transfac
Consensus PFM in MEME format named clusterX_consensus-motif.meme
Aligned stack of logos as an SVG named clusterX_aligned-motifs.svg
Consensus logo as an SVG named clusterX_consensus-motif.svg
Optional BED file of binding sites named clusterX_binding-sites.bed (written only when --get-sites is specified)

Example output for one of the clusters in the simulated data is in simulated_motifs/cluster404. A summary spreadsheet of all clusters, summary.xlsx, is also written to the output root directory (Python version only).
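The per-cluster layout described above makes it straightforward to gather the consensus motifs after a run. The sketch below is an illustration only: collect_consensus_motifs is a hypothetical helper that relies solely on the clusterX folder and file naming listed in this section.

```python
from pathlib import Path


def collect_consensus_motifs(output_dest="genoclamp_out"):
    """Map each cluster folder name to its consensus MEME file, if present.

    Hypothetical helper; it assumes the documented layout
    <output_dest>/clusterX/clusterX_consensus-motif.meme.
    """
    motifs = {}
    # sorted() orders names lexicographically (cluster10 before cluster2).
    for folder in sorted(Path(output_dest).glob("cluster*")):
        meme = folder / f"{folder.name}_consensus-motif.meme"
        if folder.is_dir() and meme.is_file():
            motifs[folder.name] = meme
    return motifs


# Example usage (path is a placeholder):
# for name, path in collect_consensus_motifs("simulated_motifs/SIM-1_results").items():
#     print(name, path)
```

Folders without a consensus MEME file, and non-folder entries such as summary.xlsx, are skipped.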

Software details (Julia)

The Julia version is about 30-50% faster than the Python version and has less overhead.

Dependencies

Running GenoCLAMP

Example command on simulated motifs:

julia genoclamp-julia/run_genoclamp.jl --meme simulated_motifs/SIM-1.meme --output-dest simulated_motifs/SIM-1_results

Arguments and output are the same as for the Python version, with one exception: the Julia version does not support --n-workers. To control parallelism, pass --threads directly to Julia:

julia --threads <n_workers> genoclamp-julia/run_genoclamp.jl ...
