HMM-integrated Bayesian approach for detecting CNV and LOH events from single-cell RNA-seq data
HoneyBADGER
enables detection of megabase-level copy number alterations such as deletions, amplifications, and copy-neutral loss-of-heterozygosity in single cells from single-cell RNA-seq data. HoneyBADGER
relies on allele-information and gene-expression information derived from single-cell RNA-seq data. In this tutorial, we walk you through how to prepare your dataset for HoneyBADGER
.
library(HoneyBADGER)
HoneyBADGER
provides an HMM-integrated Bayesian hierarchical allele-based model to identify and infer the probability of copy number alterations in each single cell on the basis of persistent allelic imbalance.
To run the allele-based model, you will need matrices of heterozygous SNP counts. Specifically, we will need the counts of the reference allele, alternate allele, and total coverage for each SNP in each cell. Functions within the HoneyBADGER
package getSnpMats
, getAlleleCount
, and getCellAlleleCount
will help you create these matrices given a list of indexed bams where each cell corresponds to one bam (common for single-cell datasets generated from plate-based approaches), or a single bam with multiple cell barcodes (common for droplet-based single-cell datasets).
Heterozygous SNP positions will also need to be provided. These will ideally obtained from previous WES data from the same sample. When WES data from the same sample is not available, common heterozygous SNPs can be derived from databses such as ExAC database.
In this example, we will create a list of heterozygous SNPs as GRanges from a VCF file:
# Use your own vcf file with heterozygous variants
vcfFile <- "hets.vcf.gz"
# For testing purposes, restrict to set of SNPs on region on chromosome 1
require(GenomicRanges)
testRanges <- GRanges('1', IRanges(start=1e5, width=1e6))
require(VariantAnnotation)
param <- ScanVcfParam(which=testRanges)
# Be sure to use the correct genome species/version
vcf <- readVcf(vcfFile, "hg19", param=param)
snps <- rowRanges(vcf)
# AF is the allele frequency for each alternate allele
info <- info(vcf)
maf <- info[, 'AF']
# limit to common snps by MAF (ie. > 10% in population)
maft <- 0.1
vi <- sapply(maf, function(x) any(x > maft))
snps <- snps[vi,]
# get rid of non single nucleotide changes
vi <- width(snps@elementMetadata$REF) == 1
# result should be GRanges object
snps <- snps[vi,]
This process has already been done for common heterozygous SNPs from ExAC (hg19) and can be loaded directly from HoneyBADGER
:
# available for all autosomes (Chr1 to Chr22) for hg19 only
load(system.file("ExAC", "ExAC_chr1.RData", package = "HoneyBADGER"))
print(head(snps))
## GRanges object with 6 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID REF
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
## 1:17365_C/G 1 [17365, 17365] * | <NA> C
## 1:17385_G/A 1 [17385, 17385] * | <NA> G
## 1:69270_A/G 1 [69270, 69270] * | <NA> A
## rs75062661 1 [69511, 69511] * | <NA> A
## 1:69761_A/T 1 [69761, 69761] * | <NA> A
## 1:69897_T/C 1 [69897, 69897] * | <NA> T
## ALT QUAL FILTER
## <DNAStringSetList> <numeric> <character>
## 1:17365_C/G G 826621.1 InbreedingCoeff_Filter
## 1:17385_G/A A 592354.5 InbreedingCoeff_Filter
## 1:69270_A/G G 1758695.3 PASS
## rs75062661 G 120729371.2 PASS
## 1:69761_A/T T 2041395.6 PASS
## 1:69897_T/C C 1171733.0 PASS
## -------
## seqinfo: 85 sequences from hg19 genome
Now, given this list of potential heterozygous SNPs, we can get the number of reads corresponding to each SNP site for each cell using their .bam
files. Here, we have placed all .bam
and corresponding .bai
index files in the data-raw/
folder. There is one .bam
and .bai
for each cell.
library(HoneyBADGER)
path <- "data-raw/"
files <- list.files(path = path)
# list of paths to bam files
bamFiles <- files[grepl('.bam$', files)]
bamFiles <- paste0(path, bamFiles)
# list of paths to index files
indexFiles <- files[grepl('.bai$', files)]
indexFiles <- paste0(path, indexFiles)
results <- getSnpMats(snps, bamFiles, indexFiles)
getSnpMats
creates a matrix of SNP coverage as well as reference and allele count for use in our HoneyBADGER
allele model.
r <- results$refCount
#cov <- results$cov
cov <- results$refCount + results$altCount ## alternatively
HoneyBADGER
provides an HMM-integrated Bayesian hierarchical expression-based model to identify and infer the probability of copy number alterations in each single cell on the basis of persistent deviations in gene expression from a normal expression reference. Normal references can be ideally obtained from matched normal cells or sorted samples from the same patient but can also be estimated using GTeX.
To run the expression-based model, we recommend quantification by counts transformed to log CPM. The same processing pipeline and transformation is highly recommended for the normal reference.
For 10X data, you can use the output of CellRanger
. For example, the Gene / cell matrix (filtered)
can be normalized to CPMs and log transformmed to serve as the gene expression matrix. For the allele matrix, Genome-aligned BAM
and Genome-aligned BAM index
will be used as bamFile
and indexFile
respectively. However, as all cells will be contained in the same bam, we will use a different function to get the allele counts for each cell getCellAlleleCount
. The column names of the expression matrix will be your cell barcodes cellBarcodes
.
results <- getSnpMats10X(snps, bamFile, indexFile, cellBarcodes)
r <- results$refCount
cov <- results$refCount + results$altCount
An alternative and much faster way of obtaining these allele-specific count tables is with the scAlleleCount package. Once installed you can issue the following command to obtain the tables:
results <- getFastCellAlleleCount(snps, bamFile, cellBarcodes)
If you are using the set of ExAC SNPs that comes with the HoneyBADGER
package, you may need to remove certain chromosomes from the seqlevels
depending on your hg19 build.
library(HoneyBADGER)
load(system.file("ExAC", "ExAC_chr1.RData", package = "HoneyBADGER"))
## ignore alt contigs (restrict to autosomes here)
vi <- seqlevels(snps) %in% as.character(c(1:22))
table(vi)
seqlevels(snps) <- seqlevels(snps)[vi]
The chromosome names used in your snps
parameter needs to match the chromosome names used in your aligned bams
. In this example, 10X used non-canonical chromosome names (other than UCSC RefSeq or Ensemble annotations) in order to distinguish human from mouse so we need to modify the seqlevels
to match.
# need to match snp name with seqlevel in bam
seqlevels(snps) <- paste0('hg19_', seqlevels(snps))
To obtain a set of valid cell barcodes from 10X, you can use 10X Cell Ranger’s whitelisted set of barcodes. However, most of these will not be present in your bam (because none of the cells sequenced have the barcode). So alternatively, you can restrict to the set of filtered barcodes with corresponding expression counts.
## whitelist
cellBarcodes <- readLines('cellranger-3.1.0/cellranger-cs/3.1.0/lib/python/cellranger/barcodes/737K-august-2016.txt')
## from filtered expression matrix
cellBarcodes <- readLines(gzfile('filtered_feature_bc_matrix/barcodes.tsv.gz')) ## barcodes with expression counts (filtered)
head(cellBarcodes)