An R package for reference phasing multi-region copy number segmentations. Note: For the unmaintained old version as used in Watkins et al. 2020, Nature, please see here.
This is a development branch
Pease cite the accompanying paper Refphase: Multi-sample reference phasing reveals haplotype-specific copy number heterogeneity, PLOS Computational Biology 2023.
Some dependencies of refphase have to be installed through Bioconductor, easiest done by installing GenomicRanges:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GenomicRanges")Then you can install refphase and the remaining dependecies using devtools as follows:
devtools::install_bitbucket("schwarzlab/refphase") # installs the latest releaseIf you want to install a specific branch or version you can run the following commands
devtools::install_bitbucket("schwarzlab/refphase@dev") # installs the development version of the code
devtools::install_bitbucket("schwarzlab/[email protected]") # installs a specific release version (v0.1.1 in this example)This is a basic example which demonstrate the usual workflow for refphase.
# Install using devtools
devtools::install_bitbucket("schwarzlab/refphase")
# Import library
library(refphase)
# Define your data
patient <- "patient1"
samples <- c("sample1", "sample2", "sample3", "sample4")
# Import your data
# Here we load the data from three TSV files. We also support output from ASCAT (see below)
refphase_data_tsv <- refphase_load(data_format = "tsv", samples = samples, tsv_prefix = "path/to/data/patient1-")
# Run refphase on your data
refphase_results <- refphase(refphase_input)
# Output the segments as a tsv-file
write_segs(refphase_results$phased_segs, file = paste0("path/to/output/segmentation-", patient, ".tsv"))
# Plot the whole genome and individual chromosomes (A4 paper size, 300 dpi)
png("path/to/output/genome.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "genome")
dev.off()
png("path/to/output/chromosome%02d.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "chromosome")
dev.off()refphase requires three inputs: SNP data including the number of reads, copy-numbers segments and ploidy and purity for each sample. Currently, we support the outputs from the commonly used copy-number caller ASCAT. If you want to create your copy-number segments using a different copy-number caller you will have to export your SNP and segment data to TSV files (see below).
If you used ASCAT (version 2) as your copy-number calling tool, refphase excpects two lists as input: ascat_input and ascat_output. Each entry corresponds to a single sample in your data.
Here's a short example of how ascat_input and ascat_output are created:
ascat_input <- list()
ascat_output <- list()
for (sample in samples) {
cur_ascat_input = ascat.loadData("Tumor_LogR.txt", "Tumor_BAF.txt", "Germline_LogR.txt", "Germline_BAF.txt")
cur_ascat_input = ascat.aspcf(cur_ascat_input, ...)
cur_ascat_output <- ascat.runAscat(cur_ascat_input, ...)
ascat_input[[sample]] <- cur_ascat_input
ascat_output[[sample]] <- cur_ascat_output
}This data is then provided to the function refphase_load with the parameter data_format = "ascat2" (data_format = "ascat" also works for backwards compatibility) as follows:
refphase_input_data <- refphase_load(
data_format = "ascat2", samples = samples,
ascat_input = ascat_input, ascat_output = ascat_output
)We also provide a wrapper function that provides easier access to the ASCAT input/output
refphase_input_data <- run_ascat_for_refphase(data_prefix = "path/to/dir/patient-",
samples = samples,
penalty = 100, gamma = 1, run_aspcf = TRUE,
img_dir = "path/to/dir",
save_prefix = "path/to/dir/patient-output", # optional
save_tsv = TRUE, # optional
save_rds = TRUE) # optionalASCAT version 3 implements some improvements over version 2, however there are some post-processing steps that are incompatible with Refphase. Therefore you will have to use our fork of ASCAT version 3 which can be found here: https://bitbucket.org/schwarzlab/ascat_v3_fork .
Install the fork from source and run as usual. The ASCAT output can then by used with Refphase by setting data_format = "ascat3" as follows:
refphase_input_data <- refphase_load(
data_format = "ascat3", samples = samples,
ascat_input = ascat_input, ascat_output = ascat_output
)If you use a copy-number caller different from ASCAT you can load in your data as TSV files (entries are tab separated (\t)).
refphase expects three files as determined by the tsv_prefix parameter: {tsv_prefix}data-snps.tsv, {tsv_prefix}segments.tsv and {tsv_prefix}purity_ploidy.tsv. Every file has to contain the data for all samples.
Example: tsv_prefix=path/to/data/patient-data-, then refphase will load the following three files: path/to/data/patient-data-data-snps.tsv, path/to/data/patient-data-segments.tsv and path/to/data/patient-data-purity_ploidy.tsv
Expected structure of the {prefix}data-snps.tsv file (showing the columns and an example first entry for each):
chrom pos baf germline_baf logR sample_id
1 20 1 0.24 0 sample1
If the column germline_zygosity is present (with values of either "het" or "hom") it will be used. Otherwise the column germline_baf and parameter homozygous_cutoff are used to determine whether a SNP is homozygous or heterozygous.
Expected structure of the {prefix}segments.tsv file:
sample_id chrom start end cn_major cn_minor
sample1 1 20 2866880 1 1
Expected structure of the {prefix}purity_ploidy.tsv file:
sample purity ploidy
sample1 0.95 2.
Finally, we can load the data as follows
refphase_data_tsv <- refphase_load(data_format = "tsv", samples = samples,
tsv_prefix = "path/to/data/patient-data-",
homozygous_cutoff = 0.7)In previous version of refphase, the data was loaded using a sample_csv object. We discourage this method but provide the refphase_load_legacy function to maintain old workflows.
Example usage:
patient <- "patient1"
sample_csv <- "
sample_id,segmentation,snps,purity,ploidy
sample1,./input/segs/patient1/sample1_segs.tsv,./input/snps/patient1/sample1_snps.tsv.gz,0.8,2.1
sample2,./input/segs/patient1/sample2_segs.tsv,./input/snps/patient1/sample2_snps.tsv.gz,0.9,4.3
sample3,./input/segs/patient1/sample3_segs.tsv,./input/snps/patient1/sample3_snps.tsv.gz,0.54,2.6
"
sample_data <- read.csv(text = sample_csv, stringsAsFactors = FALSE)
# You can also read this from a file
# sample_data <- read.csv("samples.csv", header = TRUE)
# Load all experimental data and sample metadata
refphase_input <- refphase_load_legacy(sample_data, segs_format = "refphase", snps_format = "refphase")Once loaded you can run refphase using the refphase function with the refphase_input_data object as follows
refphase_output <- refphase(refphase_input_data, ...)refphase takes the following parameters:
name: ID of the run, typically the patient ID. Used to identify the run together with an internal input hash. Also used for plotting.cn_method: Method to use for re-estimating copy numbers when required (possible values = ascat, nbayes) (default = "nbayes")cn_subtype: Variant of the copy number estimation to use. (possible values = "raw", "") (default = "raw")phase_all_segs: Force phase all segments, regardless of whether there is allelic imbalance in the input segmentation (default = FALSE, meaning we only segment regions that have allelic imbalance in the input segmentation)merging_threshold: Merge together all segment boundaries that are smaller than this threshold (measured in basepairs)verbosity: Verbosity of logging output (possible values: DEBUG, INFO) (default = INFO)use_checkpoint: save intermediate checkpoints; if TRUE, stores intermediate results in checkpoint_folder (default = FALSE)checkpoint_folder: this gives the folder in which to store intermediate results that will be re-used in any subsequent run; if NULL uses system tmp folder; ignored if use_checkpoint = FALSE (default = NULL)
The output refphase_output consists of three entries:
joint_segs: the unphased joint segmentationphased_segs: phased segmentationphased_snps: list of GPos objects for each sample with "phasing" column (= a, b, or none)sample_data: data.frame with purity, and ploidy information for each sample
The final plots are created using the plot function. There are 5 possible plots which can be selected using the what parameter:
what = "genome": Overview plot containing BAF, logR, copy-numbers and summary statisticswhat = "chromosome": Same asgenomebut one figure per chromosomewhat = "summary": Creates summary statistics such as location of gains, losses and MSAIswhat = "copy_numbers": Plots the copy-number of the sampleswhat = "copy_numbers_integer": Plots the integer-values copy-number of the sampleswhat = "BAF": Plots the BAF tracks of the samples
Example:
png("path/to/results/dir/genome.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "genome")
dev.off()
# Note the %02d in the file name which will be replaced with the chromosome number
# This function creates one plot per chromosome
png("path/to/results/dir/chromosome%02d.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "chromosome")
dev.off()
png("path/to/results/dir/summary.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "summary")
dev.off()
png("path/to/results/dir/copy_numbers.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "copy_numbers")
dev.off()
png("path/to/results/dir/BAF.png", width = 21, height = 29.7, units = "cm", res = 300, family = "sans")
plot(results, what = "BAF")
dev.off()The above example can also be run with pdf instead of png to create the figures as vector-graphics.
This is an example of a more complete, though very simple, workflow starting with mapped reads and ending with copy numbers segmentations that have been phased using refphase.
For a real analysis you should carefully consider each step of this workflow and tailor tools/arguments to your individual dataset. This is not a recommended workflow, but rather a minimal one to show, in the most simple way possible, how to get refphase running when starting with a .bam file.
Before starting, there are a few programs you'll need to have installed. One easy way to install the required applications is using conda:
$ conda create -n refphase-demo samtools tabix bcftools r-devtools
$ conda activate refphase-demoNext you can install refphase itself into your new conda environment:
$ R
> devtools::install_bitbucket("schwarzlab/refphase")This will the refphase R packages as well as all dependencies into the
refphase-demo environment.
After everything is installed, the first step of the analysis is to use our normal sample to identify germline variants in the patient.
$ samtools faidx reference.fa
$ samtools index normal.bam
$ samtools index sample1.bam
$ samtools index sample2.bam
$ samtools index sample3.bam
$ bcftools mpileup -Ou -f reference.fa normal.bam | bcftools call -mv -Ob -o germline-variants.bcfNow that we have germline variants, we want to pile up reads from all of our tumor samples at these positions.
$ bcftools query --format '%CHROM\t%POS\n' germline-variants.bcf | bgzip -c > germline-variants.pos.gz
$ bgzip --reindex germline-variants.pos.gz
$ export SAMPLES=( normal sample1 sample2 sample3 )
$ for sample in ${SAMPLES[@]}; do
bcftools mpileup --regions-file germline-variants.pos.gz --fasta-ref reference.fa --annotate FORMAT/AD ${sample}.bam > ${sample}.mpileup.bcf
bcftools query --format "%CHROM\t%POS\t[%AD{0}]\t[%AD{1}]\n" ${sample}.mpileup.bcf | gzip -c > ${sample}.mpileup.gz
doneThe next step is to do copy number segmentation. For this example we will use ASCAT. ASCAT requires the input to be preprocessed, converting the read depths per germline SNP position to B-allele frequencies (BAF) and log depth ratio between the tumor and normal (logR). So first we do that preprocessing in R:
library(readr)
library(gtools)
chrom2integer <- function(x) {
print(unique(x))
as.integer(gsub("M", "25", gsub("X", "23", gsub("Y", "24", gsub("chr", "", x)))))
}
column_names <- c("chrom", "pos", "ref", "alt")
normal <- as.data.frame(read_tsv("normal.mpileup.gz", col_names = column_names, progress = FALSE))
tumor_samples <- c("sample1", "sample2", "sample3")
for (tumor_sample in tumor_samples) {
tumor <- as.data.frame(read_tsv(paste0(tumor_sample, ".mpileup.gz"), col_names = column_names, progress = FALSE))
# Merge tumor and normal samples to only keep the positions that are shared
# between both samples
merged <- merge(normal, tumor, by = c("chrom", "pos"), suffixes = c("_normal", "_tumor"))
merged$pos <- as.integer(merged$pos)
merged <- merged[order(chrom2integer(merged$chrom), merged$pos), ]
total_reads_tumor <- sum(merged$alt_tumor, na.rm = TRUE) + sum(merged$ref_tumor, na.rm = TRUE)
total_reads_normal <- sum(merged$alt_normal, na.rm = TRUE) + sum(merged$ref_normal, na.rm = TRUE)
merged$baf_tumor <- merged$alt_tumor / (merged$alt_tumor + merged$ref_tumor)
merged$logr_tumor <- log2(((merged$alt_tumor + merged$ref_tumor) / total_reads_tumor) / ((merged$alt_normal + merged$ref_normal) / total_reads_normal))
merged$baf_normal <- merged$alt_normal / (merged$alt_normal + merged$ref_normal)
# logR of normal should always be 0
merged$logr_normal <- merged$logr_tumor
merged$logr_normal[] <- 0
# Remove NAs and +/-Inf
keep_numeric <- apply(merged[, c("logr_tumor", "logr_normal", "baf_normal", "baf_tumor")], 1, function(x) all(unlist(Map(is.finite, x))))
merged <- merged[keep_numeric, ]
# Write the specific tsv format that we can later use with refphase
# Set positions with germline BAF < 0.1 or BAF > 0.9 as germline homozygous,
# and the rest as heterozygous
write.table(data.frame(chrom = merged$chrom, pos = merged$pos, baf = merged$baf_tumor, logr = merged$logr_tumor, germline_zygosity = ifelse(abs(merged$baf_normal - 0.5) > 0.4, "hom", "het")), file = paste0(tumor_sample, "_refphase_snps.tsv"), sep = "\t", quote = FALSE, row.names = FALSE)
# ASCAT only takes integer chromosome names
merged$chrom <- gsub("Y", "24",
gsub("X", "23",
gsub("chr", "", merged$chrom)))
# Write the strange input format files that ASCAT expects
rownames(merged) <- paste0("SNP", seq_len(nrow(merged)))
write.table(data.frame(chrs = merged$chrom, pos = merged$pos, sample = merged$baf_tumor, row.names = rownames(merged)), file = paste0(tumor_sample, "_baf_tumor.tsv"), sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA)
# Tumor
dat <- data.frame(chrs = merged$chrom, pos = merged$pos, sample = merged$baf_tumor, row.names = rownames(merged))
colnames(dat)[[3]] <- tumor_sample
write.table(dat, file = paste0(tumor_sample, "_baf_tumor.tsv"), sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA)
dat[[tumor_sample]] <- merged$logr_tumor
write.table(dat, file = paste0(tumor_sample, "_logr_tumor.tsv"), sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA)
# Normal
dat[[tumor_sample]] <- merged$baf_normal
write.table(dat, file = paste0(tumor_sample, "_baf_normal.tsv"), sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA)
dat[[tumor_sample]] <- merged$logr_normal
write.table(dat, file = paste0(tumor_sample, "_logr_normal.tsv"), sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA)
}And then process each sample with ASCAT:
library(ASCAT)
tumor_samples <- c("sample1", "sample2", "sample3")
# Do some extra work to create a table that will be needed when we run refphase in the next step. This is not needed by ASCAT itself.
refphase_sample_data <- data.frame(sample_id = tumor_samples, segmentation = paste0(tumor_samples, "_refphase_segs.tsv"), snps = paste0(tumor_samples, "_refphase_snps.tsv"), purity = NA, ploidy = NA, row.names = tumor_samples)
# ascat_input and ascat_output are required for refphase later on
ascat_input <- list()
ascat_output <- list()
for (tumor_sample in tumor_samples) {
cur_ascat_input <- ascat.loadData(Tumor_LogR_file = paste0(tumor_sample, "_logr_tumor.tsv"),
Tumor_BAF_file = paste0(tumor_sample, "_baf_tumor.tsv"),
Germline_LogR_file = paste0(tumor_sample, "_logr_normal.tsv"),
Germline_BAF_file = paste0(tumor_sample, "_baf_normal.tsv"))
ascat.plotRawData(cur_ascat_input)
cur_ascat_input <- ascat.aspcf(cur_ascat_input)
ascat.plotSegmentedData(cur_ascat_input)
cur_ascat_output <- ascat.runAscat(cur_ascat_input, gamma = 1.0)
ascat_input[[sample]] <- cur_ascat_input
ascat_output[[sample]] <- cur_ascat_output
# Save the segmentation in the default ASCAT format
write.table(cur_ascat_output$segments, file = paste0(tumor_sample, "_segments.txt"), sep = "\t", quote = FALSE, row.names = FALSE)
refphase_sample_data[tumor_sample, "purity"] <- cur_ascat_output$aberrantcellfraction[[1]]
refphase_sample_data[tumor_sample, "ploidy"] <- cur_ascat_output$ploidy[[1]]
}
write.table(refphase_sample_data, file = "refphase-sample-data.tsv", sep = "\t", quote = FALSE, row.names = FALSE)Now we have an initial copy number segmentation for each tumor sample, and we're ready to run refphase, plot the results, and save the phased segmentation.
library(refphase)
# Load the ascat input and output
refphase_input <- refphase_load(
data_format = "ascat", samples = tumor_samples,
ascat_input = ascat_input, ascat_output = ascat_output
)
# (Optional) If your data shows reference bias, which presents itself as BAFs
# in regions with a balanced copy number that are systematically shifted away
# from 0.5 (usually something like 0.47), this can try to correct for that.
refphase_input <- center_baf(refphase_input)
# (Optional) Fit SNP logr data to improve copy number re-estimation in refphase,
# when using the default ASCAT formula-based method fo re-estimating copy
# numbers
refphase_input <- fit_logr_to_ascat(refphase_input)
# Run refphase on the experimental data
results <- refphase(refphase_input)
write_segs(results$phased_segs, file = paste0(patient, "-refphase-segmentation.tsv"))
# (optional) output the SNPs, including phasing information
write_snps(results$phased_snps, file = paste0(patient, "-refphase-phased-snps.tsv.gz"))
# Ploidy might have changed, if we have updated any copy numbers
write.table(results$sample_data, file = paste0(patient, "-refphase-sample-data-updated.tsv"), sep = "\t", row.names = FALSE)
# Plot the results as PDFs (these files can be huge, over 100 MB in some cases)
pdf(paste0(patient, "-refphase-genome.pdf"), width = 10, height = 4 * nrow(results$sample_data), family = "sans")
plot(results, what = "genome")
dev.off()
pdf(paste0(patient, "-refphase-chromosomes.pdf"), width = 10, height = 2 + 4 * nrow(results$sample_data), family = "sans")
plot_all_chromosomes(results$sample_data, results$phased_snps, list(refphase = results$phased_segs, orig = refphase_input$segs), cn_event_calls = results$cn_event_calls)
dev.off()
# Plot the results as individual PNG files
png(paste0(patient, "-refphase-genome.png"), width = 600, height = 200 + 300 * nrow(results$sample_data), family = "sans")
plot(results, what = "genome")
dev.off()
png(paste0(patient, "-refphase-chromosomes-%02d.png"), width = 700, height = 300 * nrow(results$sample_data), family = "sans")
plot_all_chromosomes(results$sample_data, results$phased_snps, list(refphase = results$phased_segs, orig = refphase_input$segs), cn_event_calls = results$cn_event_calls)
dev.off()- Prototype implementations: Roland F Schwarz, Thomas BK Watkins, Nicholas McGranahan
- refphase R package: Matthew R Huska, Tom L Kaufmann, Emma C Colliver, Thomas BK Watkins, Teodora Bucaciuc, Roland F Schwarz
- Current contact: Tom L Kaufmann