Skip to content

ICCB-Cologne/refphase

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

431 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

refphase

An R package for reference phasing multi-region copy number segmentations. Note: For the unmaintained old version as used in Watkins et al. 2020, Nature, please see here.

This is a development branch

Citation

Pease cite the accompanying paper Refphase: Multi-sample reference phasing reveals haplotype-specific copy number heterogeneity, PLOS Computational Biology 2023.

Installation

Some dependencies of refphase have to be installed through Bioconductor, easiest done by installing GenomicRanges:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GenomicRanges")

Then you can install refphase and the remaining dependecies using devtools as follows:

devtools::install_bitbucket("schwarzlab/refphase")        # installs the latest release

If you want to install a specific branch or version you can run the following commands

devtools::install_bitbucket("schwarzlab/refphase@dev")    # installs the development version of the code
devtools::install_bitbucket("schwarzlab/[email protected]") # installs a specific release version (v0.1.1 in this example)

Usage

Quick Start

This is a basic example which demonstrate the usual workflow for refphase.

# Install using devtools
devtools::install_bitbucket("schwarzlab/refphase")   

# Import library
library(refphase)

# Define your data
patient <- "patient1"
samples <- c("sample1", "sample2", "sample3", "sample4")

# Import your data
# Here we load the data from three TSV files. We also support output from ASCAT (see below)
refphase_data_tsv <- refphase_load(data_format = "tsv", samples = samples, tsv_prefix = "path/to/data/patient1-")

# Run refphase on your data
refphase_results <- refphase(refphase_input)

# Output the segments as a tsv-file
write_segs(refphase_results$phased_segs, file = paste0("path/to/output/segmentation-", patient, ".tsv"))

# Plot the whole genome and individual chromosomes (A4 paper size, 300 dpi)
png("path/to/output/genome.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "genome")
dev.off()
png("path/to/output/chromosome%02d.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "chromosome")
dev.off()

Data Loading

refphase requires three inputs: SNP data including the number of reads, copy-numbers segments and ploidy and purity for each sample. Currently, we support the outputs from the commonly used copy-number caller ASCAT. If you want to create your copy-number segments using a different copy-number caller you will have to export your SNP and segment data to TSV files (see below).

Copy-number calling with ASCAT

ASCAT version 2

If you used ASCAT (version 2) as your copy-number calling tool, refphase excpects two lists as input: ascat_input and ascat_output. Each entry corresponds to a single sample in your data.

Here's a short example of how ascat_input and ascat_output are created:

ascat_input <- list()
ascat_output <- list()
for (sample in samples) {
  cur_ascat_input = ascat.loadData("Tumor_LogR.txt", "Tumor_BAF.txt", "Germline_LogR.txt", "Germline_BAF.txt")
  cur_ascat_input = ascat.aspcf(cur_ascat_input, ...)
  cur_ascat_output <- ascat.runAscat(cur_ascat_input, ...)
  ascat_input[[sample]] <- cur_ascat_input
  ascat_output[[sample]] <- cur_ascat_output
}

This data is then provided to the function refphase_load with the parameter data_format = "ascat2" (data_format = "ascat" also works for backwards compatibility) as follows:

refphase_input_data <- refphase_load(
    data_format = "ascat2", samples = samples,
    ascat_input = ascat_input, ascat_output = ascat_output
)

We also provide a wrapper function that provides easier access to the ASCAT input/output

refphase_input_data <- run_ascat_for_refphase(data_prefix = "path/to/dir/patient-",
                                     samples = samples,
                                     penalty = 100, gamma = 1, run_aspcf = TRUE,
                                     img_dir = "path/to/dir",
                                     save_prefix = "path/to/dir/patient-output", # optional
                                     save_tsv = TRUE, # optional
                                     save_rds = TRUE) # optional

ASCAT version 3

ASCAT version 3 implements some improvements over version 2, however there are some post-processing steps that are incompatible with Refphase. Therefore you will have to use our fork of ASCAT version 3 which can be found here: https://bitbucket.org/schwarzlab/ascat_v3_fork . Install the fork from source and run as usual. The ASCAT output can then by used with Refphase by setting data_format = "ascat3" as follows:

refphase_input_data <- refphase_load(
    data_format = "ascat3", samples = samples,
    ascat_input = ascat_input, ascat_output = ascat_output
)

TSV input

If you use a copy-number caller different from ASCAT you can load in your data as TSV files (entries are tab separated (\t)). refphase expects three files as determined by the tsv_prefix parameter: {tsv_prefix}data-snps.tsv, {tsv_prefix}segments.tsv and {tsv_prefix}purity_ploidy.tsv. Every file has to contain the data for all samples.

Example: tsv_prefix=path/to/data/patient-data-, then refphase will load the following three files: path/to/data/patient-data-data-snps.tsv, path/to/data/patient-data-segments.tsv and path/to/data/patient-data-purity_ploidy.tsv

Expected structure of the {prefix}data-snps.tsv file (showing the columns and an example first entry for each):

chrom    pos    baf    germline_baf    logR    sample_id
1    20    1    0.24    0    sample1

If the column germline_zygosity is present (with values of either "het" or "hom") it will be used. Otherwise the column germline_baf and parameter homozygous_cutoff are used to determine whether a SNP is homozygous or heterozygous.

Expected structure of the {prefix}segments.tsv file:

sample_id    chrom    start    end    cn_major    cn_minor
sample1    1    20    2866880    1    1               

Expected structure of the {prefix}purity_ploidy.tsv file:

sample    purity    ploidy
sample1    0.95    2.

Finally, we can load the data as follows

refphase_data_tsv <- refphase_load(data_format = "tsv", samples = samples,
                                   tsv_prefix = "path/to/data/patient-data-",
                                   homozygous_cutoff = 0.7)

Legacy data loading

In previous version of refphase, the data was loaded using a sample_csv object. We discourage this method but provide the refphase_load_legacy function to maintain old workflows.

Example usage:

patient <- "patient1"
sample_csv <- "
sample_id,segmentation,snps,purity,ploidy
sample1,./input/segs/patient1/sample1_segs.tsv,./input/snps/patient1/sample1_snps.tsv.gz,0.8,2.1
sample2,./input/segs/patient1/sample2_segs.tsv,./input/snps/patient1/sample2_snps.tsv.gz,0.9,4.3
sample3,./input/segs/patient1/sample3_segs.tsv,./input/snps/patient1/sample3_snps.tsv.gz,0.54,2.6
"
sample_data <- read.csv(text = sample_csv, stringsAsFactors = FALSE)

# You can also read this from a file
# sample_data <- read.csv("samples.csv", header = TRUE)

# Load all experimental data and sample metadata
refphase_input <- refphase_load_legacy(sample_data, segs_format = "refphase", snps_format = "refphase")

Execution

Once loaded you can run refphase using the refphase function with the refphase_input_data object as follows

refphase_output <- refphase(refphase_input_data, ...)

refphase takes the following parameters:

  • name: ID of the run, typically the patient ID. Used to identify the run together with an internal input hash. Also used for plotting.
  • cn_method: Method to use for re-estimating copy numbers when required (possible values = ascat, nbayes) (default = "nbayes")
  • cn_subtype: Variant of the copy number estimation to use. (possible values = "raw", "") (default = "raw")
  • phase_all_segs: Force phase all segments, regardless of whether there is allelic imbalance in the input segmentation (default = FALSE, meaning we only segment regions that have allelic imbalance in the input segmentation)
  • merging_threshold: Merge together all segment boundaries that are smaller than this threshold (measured in basepairs)
  • verbosity: Verbosity of logging output (possible values: DEBUG, INFO) (default = INFO)
  • use_checkpoint: save intermediate checkpoints; if TRUE, stores intermediate results in checkpoint_folder (default = FALSE)
  • checkpoint_folder: this gives the folder in which to store intermediate results that will be re-used in any subsequent run; if NULL uses system tmp folder; ignored if use_checkpoint = FALSE (default = NULL)

The output refphase_output consists of three entries:

  • joint_segs: the unphased joint segmentation
  • phased_segs: phased segmentation
  • phased_snps: list of GPos objects for each sample with "phasing" column (= a, b, or none)
  • sample_data: data.frame with purity, and ploidy information for each sample

Plotting

The final plots are created using the plot function. There are 5 possible plots which can be selected using the what parameter:

  • what = "genome": Overview plot containing BAF, logR, copy-numbers and summary statistics
  • what = "chromosome": Same as genome but one figure per chromosome
  • what = "summary": Creates summary statistics such as location of gains, losses and MSAIs
  • what = "copy_numbers": Plots the copy-number of the samples
  • what = "copy_numbers_integer": Plots the integer-values copy-number of the samples
  • what = "BAF": Plots the BAF tracks of the samples

Example:

png("path/to/results/dir/genome.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "genome")
dev.off()

# Note the %02d in the file name which will be replaced with the chromosome number
# This function creates one plot per chromosome
png("path/to/results/dir/chromosome%02d.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "chromosome")
dev.off()

png("path/to/results/dir/summary.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "summary")
dev.off()

png("path/to/results/dir/copy_numbers.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "copy_numbers")
dev.off()

png("path/to/results/dir/BAF.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "BAF")
dev.off()

The above example can also be run with pdf instead of png to create the figures as vector-graphics.

Complete Example Workflow

This is an example of a more complete, though very simple, workflow starting with mapped reads and ending with copy numbers segmentations that have been phased using refphase.

For a real analysis you should carefully consider each step of this workflow and tailor tools/arguments to your individual dataset. This is not a recommended workflow, but rather a minimal one to show, in the most simple way possible, how to get refphase running when starting with a .bam file.

Before starting, there are a few programs you'll need to have installed. One easy way to install the required applications is using conda:

$ conda create -n refphase-demo samtools tabix bcftools r-devtools
$ conda activate refphase-demo

Next you can install refphase itself into your new conda environment:

$ R
> devtools::install_bitbucket("schwarzlab/refphase")

This will the refphase R packages as well as all dependencies into the refphase-demo environment.

After everything is installed, the first step of the analysis is to use our normal sample to identify germline variants in the patient.

$ samtools faidx reference.fa
$ samtools index normal.bam
$ samtools index sample1.bam
$ samtools index sample2.bam
$ samtools index sample3.bam
$ bcftools mpileup -Ou -f reference.fa normal.bam | bcftools call -mv -Ob -o germline-variants.bcf

Now that we have germline variants, we want to pile up reads from all of our tumor samples at these positions.

$ bcftools query --format '%CHROM\t%POS\n' germline-variants.bcf | bgzip -c > germline-variants.pos.gz
$ bgzip --reindex germline-variants.pos.gz
$ export SAMPLES=( normal sample1 sample2 sample3 )
$ for sample in ${SAMPLES[@]}; do
     bcftools mpileup --regions-file germline-variants.pos.gz --fasta-ref reference.fa --annotate FORMAT/AD ${sample}.bam > ${sample}.mpileup.bcf
     bcftools query --format "%CHROM\t%POS\t[%AD{0}]\t[%AD{1}]\n" ${sample}.mpileup.bcf | gzip -c > ${sample}.mpileup.gz
done

The next step is to do copy number segmentation. For this example we will use ASCAT. ASCAT requires the input to be preprocessed, converting the read depths per germline SNP position to B-allele frequencies (BAF) and log depth ratio between the tumor and normal (logR). So first we do that preprocessing in R:

library(readr)
library(gtools)

chrom2integer <- function(x) {
  print(unique(x))
  as.integer(gsub("M", "25", gsub("X", "23", gsub("Y", "24", gsub("chr", "", x)))))
}

column_names <- c("chrom", "pos", "ref", "alt")
normal <- as.data.frame(read_tsv("normal.mpileup.gz", col_names = column_names, progress = FALSE))

tumor_samples <- c("sample1", "sample2", "sample3")
for (tumor_sample in tumor_samples) {
  tumor <- as.data.frame(read_tsv(paste0(tumor_sample, ".mpileup.gz"), col_names = column_names, progress = FALSE))

  # Merge tumor and normal samples to only keep the positions that are shared
  # between both samples
  merged <- merge(normal, tumor, by = c("chrom", "pos"), suffixes = c("_normal", "_tumor"))

  merged$pos <- as.integer(merged$pos)
  merged <- merged[order(chrom2integer(merged$chrom), merged$pos), ]

  total_reads_tumor <- sum(merged$alt_tumor, na.rm = TRUE) + sum(merged$ref_tumor, na.rm = TRUE)
  total_reads_normal <- sum(merged$alt_normal, na.rm = TRUE) + sum(merged$ref_normal, na.rm = TRUE)

  merged$baf_tumor <- merged$alt_tumor / (merged$alt_tumor + merged$ref_tumor)
  merged$logr_tumor <- log2(((merged$alt_tumor + merged$ref_tumor) / total_reads_tumor) / ((merged$alt_normal + merged$ref_normal) / total_reads_normal))

  merged$baf_normal <- merged$alt_normal / (merged$alt_normal + merged$ref_normal)

  # logR of normal should always be 0
  merged$logr_normal <- merged$logr_tumor
  merged$logr_normal[] <- 0

  # Remove NAs and +/-Inf
  keep_numeric <- apply(merged[, c("logr_tumor", "logr_normal", "baf_normal", "baf_tumor")], 1, function(x) all(unlist(Map(is.finite, x))))
  merged <- merged[keep_numeric, ]

  # Write the specific tsv format that we can later use with refphase
  # Set positions with germline BAF < 0.1 or BAF > 0.9 as germline homozygous,
  # and the rest as heterozygous
  write.table(data.frame(chrom = merged$chrom, pos = merged$pos, baf = merged$baf_tumor, logr = merged$logr_tumor, germline_zygosity = ifelse(abs(merged$baf_normal - 0.5) > 0.4, "hom", "het")), file = paste0(tumor_sample, "_refphase_snps.tsv"), sep = "\t", quote = FALSE, row.names = FALSE)

  # ASCAT only takes integer chromosome names
  merged$chrom <- gsub("Y", "24",
                  gsub("X", "23",
                  gsub("chr", "", merged$chrom)))

  # Write the strange input format files that ASCAT expects
  rownames(merged) <- paste0("SNP", seq_len(nrow(merged)))
  write.table(data.frame(chrs = merged$chrom, pos = merged$pos, sample = merged$baf_tumor, row.names = rownames(merged)), file = paste0(tumor_sample, "_baf_tumor.tsv"), sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA)

  # Tumor
  dat <- data.frame(chrs = merged$chrom, pos = merged$pos, sample = merged$baf_tumor, row.names = rownames(merged))
  colnames(dat)[[3]] <- tumor_sample

  write.table(dat, file = paste0(tumor_sample, "_baf_tumor.tsv"), sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA)

  dat[[tumor_sample]] <- merged$logr_tumor
  write.table(dat, file = paste0(tumor_sample, "_logr_tumor.tsv"), sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA)

  # Normal
  dat[[tumor_sample]] <- merged$baf_normal
  write.table(dat, file = paste0(tumor_sample, "_baf_normal.tsv"), sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA)

  dat[[tumor_sample]] <- merged$logr_normal
  write.table(dat, file = paste0(tumor_sample, "_logr_normal.tsv"), sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA)

}

And then process each sample with ASCAT:

library(ASCAT)

tumor_samples <- c("sample1", "sample2", "sample3")

# Do some extra work to create a table that will be needed when we run refphase in the next step. This is not needed by ASCAT itself.
refphase_sample_data <- data.frame(sample_id = tumor_samples, segmentation = paste0(tumor_samples, "_refphase_segs.tsv"), snps = paste0(tumor_samples, "_refphase_snps.tsv"), purity = NA, ploidy = NA, row.names = tumor_samples)

# ascat_input and ascat_output are required for refphase later on
ascat_input <- list()
ascat_output <- list()
for (tumor_sample in tumor_samples) {
  cur_ascat_input <- ascat.loadData(Tumor_LogR_file = paste0(tumor_sample, "_logr_tumor.tsv"),
                             Tumor_BAF_file = paste0(tumor_sample, "_baf_tumor.tsv"),
                             Germline_LogR_file = paste0(tumor_sample, "_logr_normal.tsv"),
                             Germline_BAF_file = paste0(tumor_sample, "_baf_normal.tsv"))
  ascat.plotRawData(cur_ascat_input)
  cur_ascat_input <- ascat.aspcf(cur_ascat_input)
  ascat.plotSegmentedData(cur_ascat_input)
  cur_ascat_output <- ascat.runAscat(cur_ascat_input, gamma = 1.0)

  ascat_input[[sample]] <- cur_ascat_input
  ascat_output[[sample]] <- cur_ascat_output

  # Save the segmentation in the default ASCAT format
  write.table(cur_ascat_output$segments, file = paste0(tumor_sample, "_segments.txt"), sep = "\t", quote = FALSE, row.names = FALSE)

  refphase_sample_data[tumor_sample, "purity"] <- cur_ascat_output$aberrantcellfraction[[1]]
  refphase_sample_data[tumor_sample, "ploidy"] <- cur_ascat_output$ploidy[[1]]
}

write.table(refphase_sample_data, file = "refphase-sample-data.tsv", sep = "\t", quote = FALSE, row.names = FALSE)

Now we have an initial copy number segmentation for each tumor sample, and we're ready to run refphase, plot the results, and save the phased segmentation.

library(refphase)

# Load the ascat input and output
refphase_input <- refphase_load(
    data_format = "ascat", samples = tumor_samples,
    ascat_input = ascat_input, ascat_output = ascat_output
)
# (Optional) If your data shows reference bias, which presents itself as BAFs
# in regions with a balanced copy number that are systematically shifted away
# from 0.5 (usually something like 0.47), this can try to correct for that.
refphase_input <- center_baf(refphase_input)

# (Optional) Fit SNP logr data to improve copy number re-estimation in refphase,
# when using the default ASCAT formula-based method fo re-estimating copy
# numbers
refphase_input <- fit_logr_to_ascat(refphase_input)

# Run refphase on the experimental data
results <- refphase(refphase_input)

write_segs(results$phased_segs, file = paste0(patient, "-refphase-segmentation.tsv"))

# (optional) output the SNPs, including phasing information
write_snps(results$phased_snps, file = paste0(patient, "-refphase-phased-snps.tsv.gz"))

# Ploidy might have changed, if we have updated any copy numbers
write.table(results$sample_data, file = paste0(patient, "-refphase-sample-data-updated.tsv"), sep = "\t", row.names = FALSE)

# Plot the results as PDFs (these files can be huge, over 100 MB in some cases)
pdf(paste0(patient, "-refphase-genome.pdf"), width = 10, height = 4 * nrow(results$sample_data), family = "sans")
plot(results, what = "genome")
dev.off()

pdf(paste0(patient, "-refphase-chromosomes.pdf"), width = 10, height = 2 + 4 * nrow(results$sample_data), family = "sans")
plot_all_chromosomes(results$sample_data, results$phased_snps, list(refphase = results$phased_segs, orig = refphase_input$segs), cn_event_calls = results$cn_event_calls)

dev.off()

# Plot the results as individual PNG files
png(paste0(patient, "-refphase-genome.png"), width = 600, height = 200 + 300 * nrow(results$sample_data), family = "sans")
plot(results, what = "genome")
dev.off()

png(paste0(patient, "-refphase-chromosomes-%02d.png"), width = 700, height = 300 * nrow(results$sample_data), family = "sans")
plot_all_chromosomes(results$sample_data, results$phased_snps, list(refphase = results$phased_segs, orig = refphase_input$segs), cn_event_calls = results$cn_event_calls)
dev.off()

Credits

  • Prototype implementations: Roland F Schwarz, Thomas BK Watkins, Nicholas McGranahan
  • refphase R package: Matthew R Huska, Tom L Kaufmann, Emma C Colliver, Thomas BK Watkins, Teodora Bucaciuc, Roland F Schwarz
  • Current contact: Tom L Kaufmann

About

No description, website, or topics provided.

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages