Question

Why some entries from a same pair in the bam file have different TLEN?

0

Entering edit mode

6.8 years ago

verne91 ▴ 20

E00247:267:HMVT3CCXX:4:2219:5172:54541  321     chr1    199910236       45      85H43M  chrUn_gl000220  136768  -235    AGAGAGCAAGAAAAAGAAAGAGAGAGAGAGAGAGAGAGAGACA     ,7,,,A,,7,A,A,A,A7FAFF,AAF7,,<AAFF7<7<F7A,<     QT:Z:AAFFFKKK   BC:Z:GCCACGCT   QX:Z:AAFFFKKKKKKKKKKK   AM:A:0  XM:A:0  TR:Z:AGAGAGA    TQ:Z:KKKKKKK    AS:f:-116       RG:Z:NA12878:LibraryNotSpecified:1:HMVT3CCXX:4  XS:f:-107.5     SA:Z:chr12,66982976,-,54M74S,2,0;       BX:Z:TTCGCCACAATTAGAG-1 XT:i:0  RX:Z:TTCGCCACAATTAGAG   OM:i:36 MI:i:1550405
E00247:267:HMVT3CCXX:4:2219:5172:54541  81      chr12   66982977        2       74S54M  chrUn_gl000220  136768  385     TGTCTCTCTCTCTCTCTCTCTCTCTTTCTTTTTCTTGCTCTCTACCGCTCTCTGTCTAGCTCGGTCTGTGTGTGTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTGTCTGTCTGTCTCTCTCTCTCTC        <,A7F<7<7FFAA<,,7FAA,FFAF7A,A,A,A,7,,A,,,7,,(,,A,,,A,7,,7,(A,,7,,,A,,,A,,,7,A,,,<A,,77,KKA,7,,,7,KF7KKKKF,KKF,KKKKKKKKKKKKKKKKKK        QT:Z:AAFFFKKK   BC:Z:GCCACGCT   QX:Z:AAFFFKKKKKKKKKKK   AM:A:0  XM:A:0  TR:Z:AGAGAGA    TQ:Z:KKKKKKK    AS:f:-106.5     RG:Z:NA12878:LibraryNotSpecified:1:HMVT3CCXX:4  XS:f:-106.5     SA:Z:chr1,199910235,+,85H43M,36,2;      BX:Z:TTCGCCACAATTAGAG-1 XT:i:0  RX:Z:TTCGCCACAATTAGAG   OM:i:2

These two reads come from NA12878 10X Genomics whole genome sequencing data.

Previous examples:

 E00247:267:HMVT3CCXX:8:1202:19624:62663 83      chr1    9996    13      52S76M  =       10004   -68 GAAGACGGCATACGCGATTCACCCTTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC        ,,,<KFAFF<KKF<,FFA,A,,AA,AAFAFAF,<FA<<KFAKKF<FAFK<FFAKF,,F7,F7FFF<KFKFA7KFFFFAKKKKA,KFFK<AKKKKFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKK        QT:Z:AAFFFKKK   BC:Z:AAGGGTGA   QX:Z:AAFFFKKKKKKKKKKK   AM:A:0  XM:A:1  TR:Z:GGGTTAG    TQ:Z:KKKKKKK    AS:f:-76        RG:Z:NA12878:LibraryNotSpecified:1:HMVT3CCXX:8 XS:f:-76.5      BX:Z:AAGTGGGTCCATCGTC-1 XT:i:0  RX:Z:AAGTGGGTCCATCGTC   OM:i:0
 E00247:267:HMVT3CCXX:8:1202:19624:62663 163     chr1    10004   0       75M76S  =       9996    68      CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCGACGATAGACCCACATAGAACGCCAGAGCGTCGTGCAGAGAAAGAGTCTAGATCTCGGTGATCGCCGTATCATTAA AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKFKKKKKKKKKKKKKKKKK,FFKKKFAKFKA,,,(,,,,7FKK<F,,,,,,,<<,A(,AA,,,(,,7<<,,,,77,<,,,<7,,77A<7,,,,,<AFF(,,A7,7,, QT:Z:AAFFFKKK   BC:Z:AAGGGTGA   QX:Z:AAFFFKKKKKKKKKKK   AM:A:0  XM:A:1  RX:Z:AAGTGGGTCCATCGTC   AS:f:-76        RG:Z:NA12878:LibraryNotSpecified:1:HMVT3CCXX:8  XS:f:-76.5      BX:Z:AAGTGGGTCCATCGTC-1 XT:i:0  OM:i:0

bam alignment genome sequencing 10X • 2.1k views

ADD COMMENT • link 6.8 years ago by verne91 ▴ 20

score 0 · Answer 1 · 2017-07-12

0

Entering edit mode

6.8 years ago

Devon Ryan 104k

They have the same TLEN, namely 68. As described in the SAM specification:

The leftmost segment has a plus sign and the rightmost has a minus sign. The sign of segments in the middle is undefined. It is set as 0 for single-segment template or when the information is unavailable.

A better question would be, "why does the TLEN not match the length of read #1"? I don't have a good answer for that.

ADD COMMENT • link 6.8 years ago by Devon Ryan 104k

0

Entering edit mode

It is an odd situation in which the first read's end (1001+57+49+13) extends past the end of the second read (1004+65) . So practically speaking it is not quite what a "proper" alignment would look like.

But taking directly the definition of TLEN based on alignments, it will go from the leftmost coordinate 1001 to the rightmost coordinate of the other read 1004 + 65 and as it happens 1001 - (1004 + 65) = -68

ADD REPLY • link 6.8 years ago by Istvan Albert 100k

0

Entering edit mode

I am very sorry that I took an wrong example. I have updated the example. I think the problem maybe have something to do with hard clipping.

ADD REPLY • link 6.8 years ago by verne91 ▴ 20

1

Entering edit mode

please don't modify the example since the existing explanations don't make sense then. You can always add a new example to the post as an edit.

The problem with your second example is that the pairs map to different chromosomes - at this point TLEN does not have a meaning.

ADD REPLY • link 6.8 years ago by Istvan Albert 100k

0

Entering edit mode

I am sorry I made another mistake. I have added the previous example. Would TLEN be set 0 if the pairs mapped to different chromosomes?

ADD REPLY • link 6.8 years ago by verne91 ▴ 20

0

Entering edit mode

The SAM specification is incomplete and does not define how hard or soft clipping should interact with TLEN (or, for that matter, how indels should affect TLEN). So, if reads are clipped, or if they contain indels, I would not trust TLEN since it's not specified in those situations.

If you want to find out how long a read is, you need to walk the cigar string.

ADD REPLY • link 6.8 years ago by Brian Bushnell 20k

0

Entering edit mode

I think the issue is that clipping does not factor into alignment and TLEN is a concept deduced from the aligned regions.

The wording is perhaps misleading - instead of "observed Template LENgth" it should be something like "apparent size" or "virtual length" a measure that reflects that we deduced this from comparing to a reference and not necessarily reality.

ADD REPLY • link 6.8 years ago by Istvan Albert 100k

1

Entering edit mode

So... BBMap tries to identify the actual insert size (which it reports as TLEN) based on alignment. If there are long deletions in reads, those are not included in the insert size. Of course if paired reads have an intron / long deletion in the unsequenced middle region there's nothing I can do about that since it's not detectable. Furthermore, it correctly compensates for paired reads that have insert size shorter than read length. So I think it does a very good job of calculating TLEN.

That said - since TLEN is ambiguous... who knows? Yesterday, I was in a code review discussing circularity detection of plasmids. TLEN was being used as a proxy for aligned read length. Specifically, it was assumed that TLEN=10000 meant that the read aligned to 10kbp of reference sequence. I objected, noting that when there is clipping, TLEN is basically undefined (or actually, it's never fully defined, but BLASR does not support long indels so the only important point was clipping). But it turns out that BLASR produces the desired results in this case - specifically, for BLASR, TLEN means the length of the aligned sequence after hard-clipping and is completely unrelated to the observed transcript length, so it contradicts the definition. Essentially, it seems to simply report (rightmost coordinate)-(leftmost coordinate)+1 as TLEN, which is obviously incorrect, but without a strict definition, it's hard to say what is correct.

I suggest using great caution when considering TLEN for any analysis since it is not fully defined. In fact, I suggest never using it, ever.

ADD REPLY • link 6.8 years ago by Brian Bushnell 20k

0

Entering edit mode

How can I calculate the insert size for each entry if I don't use TLEN? just use (rightmost coordinate)-(leftmost coordinate)+1?

ADD REPLY • link 6.8 years ago by verne91 ▴ 20

0

Entering edit mode

No, you parse the cigar string. Matches/mismatches contribute to length, deletions don't, but insertions do. And then adjust using the leftmost and rightmost coordinates of the read pair. For example, if a read has a 10000bp deletion, that should NOT contribute to the TLEN since obviously it is not part of the "template".

ADD REPLY • link 6.8 years ago by Brian Bushnell 20k