Question

How to determine gap length of aligned segments for a given read?

2

Entering edit mode

6.8 years ago

cnvspam ▴ 80

Say I have a read with two supplementary alignments. How do I infer the gap length of the two aligned segments relative to the read?

Here is an example of gap lengths

In the example above, the gap length is the size of the insertion (orange thin line). I'm confused as heck as how to calculate the gap length from reads in a BAM file. Would the CIGAR string be informative, or do I have to go back to the original FASTQ?

sequencing alignment split • 3.0k views

ADD COMMENT • link updated 6.8 years ago by Istvan Albert 100k • written 6.8 years ago by cnvspam ▴ 80

0

Entering edit mode

In this kind of situation, I think unsplit alignments are easier to use, since that information is captured in the cigar string. BBMap will produce unsplit alignments supporting long deletions and medium-length insertions (you can never capture a long insertion in a short read, using split or unsplit alignments).

ADD REPLY • link 6.8 years ago by Brian Bushnell 20k

score 1 · Answer 1 · 2017-07-18

First, make sure the get the terminology right. Distances between supplementary alignments are not usually called gaps.

The gaps are usually mentioned relative to an aligned sequence. For any individual alignment, the CIGAR string has that information as a count followed by either of the letters D or I. 5 deletions would be 5D , 10 insertions would be written as 10I of the read relative to a reference.

Once you get to supplementary alignments you have to decide what you wish to call the "gap". For both reads the leftmost position will be listed in the POS column. Then for both reads you can compute the alignment length (that is where the end position is). Once you have these you can figure out whatever distance you want to compute (end to end, end to start etc).

That distance is still not a gap though - it is just a distance that you might interpret as a gap based on some other rationale. Usually, gap means missing, so you would need to state why you think anything should be there to being with.

score 0 · Answer 2 · 2017-07-19

The image above appears to suggest that the "gap" in this definition is the distance between the rightmost coordinate of one alignment to the leftmost coordinate of the other alignment.

The easiest way to compute that would be to transform your BAM file into the BED format. You can do that with bedtools

The BED file will list the start and end coordinates of each alignment. You can then find the reads with identical names and compute the distance as described above. Depending on how many such reads there are you can do it by hand or you'd probably need to write a simple script to process the data.