Cigar Strings For Dummies
Mar 28, 2017
Smoke and CIGAR (strings)
The ‘CIGAR’ (Compact Idiosyncratic Gapped Alignment Report) string is how the SAM/BAM format represents alignments, including gapped and spliced ones. Understanding the CIGAR string will help you understand how your query sequence aligns to the reference genome. For example, the position stored in an alignment record is only the leftmost coordinate of the alignment. To get the rightmost coordinate, you have to parse the CIGAR string.
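As a minimal sketch of what "parsing the CIGAR string" can mean in practice, the Python snippet below splits a CIGAR string into (length, operator) pairs with a regular expression. The name `parse_cigar` is made up for this illustration and is not taken from any library; the accepted operator letters are assumed to be the ones defined in the SAM specification.

```python
import re

# One CIGAR element is a run length followed by a single operator letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string such as '3M7N4M' into [(3, 'M'), (7, 'N'), (4, 'M')]."""
    pairs = [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]
    # Rebuild the string to make sure no characters were silently skipped.
    if "".join(f"{n}{op}" for n, op in pairs) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return pairs

print(parse_cigar("6M"))
print(parse_cigar("3M7N4M"))
```

Once the string is in this form, computing coordinates is a matter of adding up the lengths of the right operators.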
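Let’s consider a few concrete examples. First example: The alignment shown below gives position = 2 and CIGAR = 6M. Positions here are 0-based, as stored in BAM; the text SAM format reports the same position as 3 because its POS field is 1-based.
AAGTCTAGAA (ref)
  GTCTAG   (query)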
CIGAR strings have a number of operators:
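M   Match           Next x positions are aligned between query and ref (each base may match or mismatch)
N   Alignment gap   Next x positions on ref are skipped (e.g. an intron)
D   Deletion        Next x positions on ref are missing from the query
I   Insertion       Next x positions on query are missing from the ref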
CIGAR = ‘6M’ means that 6 consecutive query bases are aligned to the reference (here they all happen to match). Starting at position=2 and covering 6 reference positions, the alignment ends at position 2 + 6 - 1 = 7 (again 0-based):
0123456789
AAGTCTAGAA (ref)
  GTCTAG   (query)
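The arithmetic above can be sketched in Python as follows. The function name `end_position` is hypothetical, and the set of operators that advance along the reference (M, D, N, plus = and X) is assumed from the SAM specification rather than from this post.

```python
import re

# Operators that advance along the reference (assumed from the SAM spec).
REF_CONSUMING = set("MDN=X")

def end_position(pos, cigar):
    """Return the 0-based, inclusive rightmost reference coordinate."""
    ref_len = sum(
        int(length)
        for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REF_CONSUMING
    )
    return pos + ref_len - 1

print(end_position(2, "6M"))  # 7
```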
Second example: The alignment shown below gives position=2 (0-based) and CIGAR=3M2I3M. The ruler is broken at the insertion so that the numbers still mark reference positions:
01234  56789
AAGTC  TAGAA (ref)
  GTCGATAG   (query)
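Here, two nucleotides (‘GA’) are inserted into the query. Starting at position=2, the CIGAR string gives 3 aligned positions, 2 inserted bases, then 3 more aligned positions. Inserted bases exist only in the query and do not advance the position on the reference, so the alignment covers 3 + 3 = 6 reference positions and ends at position 2 + 6 - 1 = 7 on the reference. The query itself is 3 + 2 + 3 = 8 bases long.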
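To make the difference between query length and reference span concrete, here is a small Python sketch that tallies both from a CIGAR string. Insertions consume query bases but not reference bases, so only the M operations move the reference coordinate in this example. The name `consumed_lengths` is made up for illustration, and the two operator sets are assumed from the SAM specification (S, soft clipping, is not discussed in this post).

```python
import re

REF_CONSUMING = set("MDN=X")    # advance along the reference
QUERY_CONSUMING = set("MIS=X")  # advance along the query

def consumed_lengths(cigar):
    """Return (reference span, query length) implied by a CIGAR string."""
    ref_len = query_len = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in REF_CONSUMING:
            ref_len += n
        if op in QUERY_CONSUMING:
            query_len += n
    return ref_len, query_len

ref_len, query_len = consumed_lengths("3M2I3M")
print(ref_len, query_len)  # 6 8
print(2 + ref_len - 1)     # 7, the 0-based end on the reference
```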
Third example: The alignment shown below gives position=2 and CIGAR=2M1D3M:
0123456789
AAGTCTAGAA (ref)
  GT TAG   (query)
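Note that there is a deletion in the query relative to the reference: the ‘C’ at reference position 4 has no counterpart in the query. Starting at position=2, the CIGAR string gives 2 aligned positions, 1 deleted reference base, then 3 more aligned positions. Unlike insertions, deletions do advance the position on the reference, so the alignment covers 2 + 1 + 3 = 6 reference positions and ends at position 7 relative to the reference.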
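The pictures in these examples can be regenerated from the position, the CIGAR string and the two sequences. The Python sketch below walks the CIGAR string and pads each row so that aligned bases share a column. `render_alignment` is a hypothetical helper written for this post; it only handles M, I, D, N, = and X, and it assumes the read has no clipping.

```python
import re

def render_alignment(ref, pos, cigar, query):
    """Return (ref_row, query_row), padded so aligned bases share a column."""
    ref_row, query_row = [], []
    r, q = pos, 0  # current 0-based offsets into ref and query
    for length, op in re.findall(r"(\d+)([MIDN=X])", cigar):
        n = int(length)
        if op in "M=X":   # aligned bases: copy from both sequences
            ref_row.append(ref[r:r + n])
            query_row.append(query[q:q + n])
            r += n
            q += n
        elif op == "I":   # bases only in the query: pad the reference row
            ref_row.append("-" * n)
            query_row.append(query[q:q + n])
            q += n
        else:             # D or N: bases only in the reference
            ref_row.append(ref[r:r + n])
            query_row.append(("-" if op == "D" else ".") * n)
            r += n
    return "".join(ref_row), "".join(query_row)

ref_row, query_row = render_alignment("AAGTCTAGAA", 2, "2M1D3M", "GTTAG")
print(ref_row)    # GTCTAG
print(query_row)  # GT-TAG
```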
Fourth example: The alignment shown below gives position=3 and CIGAR=3M7N4M:
01234567890123456
CCCTACGTCCCAGTCAC (ref)
   TAC       TCAC (query)
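This is a spliced alignment, as produced by a splicing event in RNA-seq: the 7 skipped reference positions correspond to an intron. Starting at position=3, the CIGAR string gives 3 aligned positions, 7 skipped reference positions, then 4 more aligned positions, so the alignment covers 3 + 7 + 4 = 14 reference positions and ends at position 3 + 14 - 1 = 16.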
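For spliced reads it is often more useful to know which reference intervals the read actually covers than just where it ends. The Python sketch below splits an alignment into blocks at every N operation. The function name `aligned_blocks` is hypothetical, and treating a deletion (D) as part of the surrounding block is a design choice made for this example, not something prescribed by the format.

```python
import re

def aligned_blocks(pos, cigar):
    """Return 0-based, inclusive (start, end) reference intervals, split at N."""
    blocks = []
    ref = pos           # next reference position to be consumed
    block_start = None  # start of the block currently being extended
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in "MD=X":
            if block_start is None:
                block_start = ref
            ref += n
        elif op == "N":
            # A skipped region closes the current block, if there is one.
            if block_start is not None:
                blocks.append((block_start, ref - 1))
                block_start = None
            ref += n
        # I, S, H and P do not advance along the reference.
    if block_start is not None:
        blocks.append((block_start, ref - 1))
    return blocks

print(aligned_blocks(3, "3M7N4M"))  # [(3, 5), (13, 16)]
```

The end of the last block is the same end position computed by hand above.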