Custom Sequencing Library Bioinformatics
Jun 9, 2017
Bioinformatics for custom sequencing library constructions
Sequencing has become so streamlined that we often just use standard library prep kits, made for particular sequencers, followed by proprietary bioinformatics software for demultiplexing and quantification. But, if we want to design custom library structure, perhaps for use in multiplexed droplet-based approaches, we will need to come up with our own bioinformatics pipelines. In this tutorial, I will take you through a recent experience I had analyzing reads from a custom library prep.
Consider the library structure below:
For this particular library, we’ve used custom gene specific primers for targeted sequencing of specific genes containing mutations of interest (as opposed to doing a whole exome for example). Let’s assume the data has already been demultiplexed and the paired-end reads have been properly split. Thus, what we should have in our FASTQs are a bunch of sequences containing the gene specific primers and our sample DNA fragment of interest.
For simplicity, let’s just look at one forward read.
library(ShortRead)
file1 <- "../data-raw/N706_S1_L001_R1_001.fastq.gz"
fastq1 <- readFastq(file1)
r1trim <- fastq1@sread[1]
as.character(r1trim)
GGAAGCTGCAGCCCTCACTTAACAGTGGGCGGCTCCTCCATCTAGCCATGCGAGGTCTTGGGGAGCTGGGCGCCCAGCCCCAACAGGACCATGCACGCTGGCCCCGGGGCAGCAGCCTGTCCGAGTGCA
Our first goal is to figure out which gene specific primer is being used for this read. Since we know the sequences for all our gene specific primers, we can just try to align each to our read. Because of potential sequencing errors in the primer itself and since each primer may be a different length, using a simple Needleman–Wunsch alignment algorithm will likely save more reads than just simple string matching.
## primers.valid is a DNAStringSet containing our primer sequences
## so we can assess how well each primer matches our sequence
scores <- sapply(1:length(primers.valid), function(j) {
p <- primers.valid[j]
r <- DNAStringSet(substring(as.character(r1trim), 0, width(p)))
mat <- nucleotideSubstitutionMatrix(match = 100, mismatch = -1, baseOnly = TRUE)
align <- pairwiseAlignment(p, r, substitutionMatrix = mat, gapOpening = -1, gapExtension = -1)
## divide by theoretical max
score(align)/(width(p)*100)
})
j <- which(scores==max(scores)) ## the theoretical max should be 1
p <- primers.valid[j] ## DNAString of the sequence
pr <- primers.valid.info[j,] ## information about the primer
pr
chr pos strand width start end width seq
397 chr17 9792856 + 26 9792832 9792857 26 GGAAGCTGCAGCCCTCACTTAACAGT
So let’s look more closely at the gene specific primer that this sequence matched with. Indeed, we can visually/manually confirm that this is the correct primer:
GGAAGCTGCAGCCCTCACTTAACAGTGGGCGGCTCCTCCATCTAGCCATGCGAGGTCTTGGGGAGCTGGGCGCCCAGCCCCAACAGGACCATGCACGCTGGCCCCGGGGCAGCAGCCTGTCCGAGTGCA
GGAAGCTGCAGCCCTCACTTAACAGT
Or just quickly realign to double check.
r <- DNAStringSet(substring(as.character(r1trim), 0, width(p)))
mat <- nucleotideSubstitutionMatrix(match = 100, mismatch = -1, baseOnly = TRUE)
align <- pairwiseAlignment(p, r, substitutionMatrix = mat, gapOpening = -1, gapExtension = -1)
align
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: [1] GGAAGCTGCAGCCCTCACTTAACAGT
subject: [1] GGAAGCTGCAGCCCTCACTTAACAGT
score: 2600
Yay it’s a perfect match! And we can trim off this gene specific primer to get just the target DNA sequence.
r1final <- DNAStringSet(substring(as.character(r1trim), width(p)+1))
as.character(r1final)
GGGCGGCTCCTCCATCTAGCCATGCGAGGTCTTGGGGAGCTGGGCGCCCAGCCCCAACAGGACCATGCACGCTGGCCCCGGGGCAGCAGCCTGTCCGAGTGCA
Hurray! But now we need to align it to the reference genome. We could take these trimmed reads and run an aligner, but since we know where our gene specific primer is located, we can just look at the upstream or downstream DNA sequence. Note, depending on the strand of the primer (forward vs. reverse), the start and end positions of the DNA sequence that will be sequenced will be different. And always double check for those off-by-one errors!
if(pr$strand=='+') {
start=pr$end+1
end=pr$end+width(r1final)
}
if(pr$strand=='-') {
start=pr$start-width(r1final)+1
end=pr$start
}
align.df <- data.frame(chr=pr$chr, strand=pr$strand, start=start, end=end)
library(GenomicRanges)
align.gr <- with(align.df, GRanges(chr, IRanges(as.numeric(start), as.numeric(end)), strand=strand))
library(BSgenome.Hsapiens.UCSC.hg19)
align.seq <- getSeq(BSgenome.Hsapiens.UCSC.hg19, align.gr)
as.character(align.seq)
GGGCGGCTCCTACATCTAGCCATGCGAGGTCTTGGGGAGCTGGGCGCCCAGCCCCAACAGGACCATGCACGCTGGCCCCGGGGCAGCAGCCTGTCCGAGTGCA
Visually/manually checking how well our sequenced read (top) matches the reference (bottom), we see a very good match!
GGGCGGCTCCTCCATCTAGCCATGCGAGGTCTTGGGGAGCTGGGCGCCCAGCCCCAACAGGACCATGCACGCTGGCCCCGGGGCAGCAGCCTGTCCGAGTGCA
GGGCGGCTCCTACATCTAGCCATGCGAGGTCTTGGGGAGCTGGGCGCCCAGCCCCAACAGGACCATGCACGCTGGCCCCGGGGCAGCAGCCTGTCCGAGTGCA
Now we can find mutation using simple string comparison.
ref <- strsplit(as.character(align.seq), split="")[[1]]
read <- strsplit(as.character(r1final), split="")[[1]]
muts <- do.call(rbind, lapply(1:length(ref), function(i) {
if(ref[i]!=read[i]) {
s = pr$strand
if(s=='+'){
r = ref[i]
m = read[i]
} else {
r = as.character(complement(DNAString(ref[i])))
m = as.character(complement(DNAString(read[i])))
}
info <- data.frame('chr'=pr$chr, 'pos'=start+i-1, 'strand'='+', 'ref'=r, 'mut'=m)
return(info)
}
}))
muts
chr pos strand ref mut
1 chr17 9792869 + A C
Final double check that that our position is correct.
align.df <- data.frame(chr=muts$chr, strand=muts$start, start=muts$pos, end=muts$pos)
align.gr <- with(align.df, GRanges(chr, IRanges(as.numeric(start), as.numeric(end)), strand=strand))
align.seq <- getSeq(BSgenome.Hsapiens.UCSC.hg19, align.gr)
align.seq
A DNAStringSet instance of length 1
width seq
[1] 1 A
Since this could be a sequencing error, we can rely on other information such as the read quality or just look at other reads to see if this mutation pops up again. As you can already tell, things become a little more complicated for reverse primer reads since we need to get complements. Similarly for reverse reads, we need to get reverse complement sequences, which can quickly become confusing if you don’t double check your work by comparing to references and keeping track of the directionality!
- Older
- Newer
RECENT POSTS
- Impact of normalizing spatial transcriptomics data in dimensionality reduction and clustering versus deconvolution analysis with STdeconvolve on 04 May 2023
- Aligning Spatial Transcriptomics Data With Stalign on 16 April 2023
- 3D animation of the brain in R on 08 November 2022
- Ethical Challenges in Biomedical Engineering - Data Collection, Analysis, and Interpretation on 15 October 2022
- I use R to (try to) figure out the cost of medical procedures by analyzing insurance data from the Transparency in Coverage Final Rule on 12 September 2022
- Annotating STdeconvolve Cell-Types with ASCT+B Tables on 30 August 2022
- Deconvolution vs Clustering Analysis: An exploration via simulation on 11 July 2022
- Coloring SVGs in R on 17 June 2022
- Deconvolution vs Clustering Analysis for Multi-cellular Pixel-Resolution Spatially Resolved Transcriptomics Data on 03 May 2022
- Exploring UMAP parameters in visualizing single-cell spatially resolved transcriptomics data on 19 January 2022
TAGS
RELATED POSTS
- Impact of normalizing spatial transcriptomics data in dimensionality reduction and clustering versus deconvolution analysis with STdeconvolve
- Aligning Spatial Transcriptomics Data With Stalign
- 3D animation of the brain in R
- I use R to (try to) figure out the cost of medical procedures by analyzing insurance data from the Transparency in Coverage Final Rule