Question

Tophat Alignment Score And Parameters

1

Entering edit mode

11.1 years ago

Varun Gupta ★ 1.3k

Hello Everyone

I am using tophat2 for my rna seq data set. I mostly always use default options for tophat. I have some doubts when i see the accepted.hits.bam file

In the tophat manual it is written that

--max-insertion-length <int>     The maximum insertion length. The default is 3.

--max-deletion-length <int>     The maximum deletion length. The default is 3

Now this means that in the CIGAR string in the alignment file i should never get a CIGAR string which has something like 4D or 5I because the default is 3 and these are max insertion and deletion length

But i still get something like this 39M5I4M1D28M or 71M4D5M

Am i missing something??

Secondly we have AS:i in tophat bam files. What is the best alignment score??? When i see bam files i see scores like AS:i:-29 and also like AS:i:-6 , AS:i:0

Hope to hear soon

Regards

Varun

tophat2 • 3.9k views

ADD COMMENT • link updated 15 months ago by Ram 43k • written 11.1 years ago by Varun Gupta ★ 1.3k

0

Entering edit mode

It might be that max and min params overridden at runtime if alignment has some significant evidence. Need to ask the developers or look into the code to clear that though. Just guessing.

ADD REPLY • link 11.1 years ago by Pavel Senin ★ 1.9k

Ram · Answer 1 · 2015-06-01

I try to answer the second question. Because the tophat is built based on bowtie2. So I read the bowtie2 manual and found the explanation on http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#end-to-end-alignment-score-example.

The examples below are from my tophat output for 2*100bp strand-specific RNA-seq

For example A:

HWI-ST1283:56:C16MCACXX:3:1101:4653:2054     97   1       194063  1       100M    =       194262  294     GCTTGGACCACGAGAGGCACCTGAGTCAGGCAGTCACATACTTCCCGCTGGGGTCTACCATGTGAGGCATGGTGTGGGATCCTGGGAAGGAGACCAAGCC    CCCFFFFFHHHHHIJJJJHIJJJIJIJIJJJIJJJJJJJJJJJJJJGIJJIJJHHHHFFFFFFEEEDDDDDDCDBBBDDDDDDDDDDDDDDDBDDDDDDD    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100  YT:Z:UU NH:i:4  CC:Z:12 CP:i:25601  XS:A:-  HI:i:0

for this line of tophat output bam: the 100bp are the same with the reference, the CIGAR is "100M" and attribute tag "XM:i:0" (means 0 mismatchs), so the AS:i:0, alignment score 0, means there are no differences between the read and the reference.

For example B:

HWI-ST1283:56:C16MCACXX:3:1101:1237:2118 163   1  39451169   50    43M944N57M  =  39451176  1051 CGGGGACCTTGACCTCGTCATGAACCTCATGGATGCACACAAGGTTTTCCAGAAGGAACTGGGAAAGCGAACAGGAACCGTTCAGGTCCTGAAGCGGTCA @@@FF<DDFHBAFG+CFEG=?<EGIBEG3CBHH<GGGB?BDDDEF4B(8B@<F3.=@)=EGHB2?CCE8;38>==A,9<???CDD:>@@A344:5<>BB    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100   YT:Z:UU XS:A:+  NH:i:1

for this line of tophat output bam: the CIGAR is "43M944N57M", which means the first 43bp and last 57bp are the same with reference, although there is a 944bp gap. And the gap is a known splice junction of MACF1 gene.

For example C:

HWI-ST1283:56:C16MCACXX:3:1101:2211:2148 147  X  109054244   50  78M1I21M   = 109054160  -183   CAAATGCCTCGTCATCTAATTAGTGACGCGCATGAATGGATGAACGAGATTCCCACTGTCCCTACCTACTATCCAGCGAAAACCACAGCCAAGGGAACGG    DDDDDBDDDDDDDCDDEEDCDDDBDDDDFEFHGHHGHHJJJJJIJIIIHFIIIHEJIHGIHHD?JIJIJIJGJJJIJJJIHFIGHJJHHHHHFFFFFCCC   AS:i:-13    XN:i:0  XM:i:1  XO:i:1  XG:i:1  NM:i:2  MD:Z:26G72   YT:Z:UU XS:A:-  NH:i:1

for this line of tophat output bam: the CIGAR is "78M1I21M" (a insertion) and there is 1 mismatch (XM:i:1), so the alignment score is -13.

Below is UCSC blat result:

000000001 caaatgcctcgtcatctaattagtgacgcgcatgaatggatgaacgagat 000000050
>>>>>>>>> |||||||||||||||||||||||||| ||||||||||||||||||||||| >>>>>>>>>
109054244 caaatgcctcgtcatctaattagtgaggcgcatgaatggatgaacgagat 109054293
                                    ^

000000051 tcccactgtccctacctactatccagcgaaaaccacagccaagggaacgg 000000100
>>>>>>>>> ||||||||||||||||||||||||||||||| |||||||||||||||||| >>>>>>>>>
109054294 tcccactgtccctacctactatccagcgaaa.ccacagccaagggaacgg 109054342
                                         ^

The bowtie2 manual:

Scores: higher = more similar
An alignment score quantifies how similar the read sequence is to the reference sequence aligned to. The higher the score, the more similar they are. A score is calculated by subtracting penalties for each difference (mismatch, gap, etc) and, in local alignment mode, adding bonuses for each match. The scores can be configured with the --ma (match bonus), --mp (mismatch penalty), --np (penalty for having an N in either the read or the reference), --rdg (affine read gap penalty) and --rfg (affine reference gap penalty) options.

End-to-end alignment score example
A mismatched base at a high-quality position in the read receives a penalty of -6 by default. A length-2 read gap receives a penalty of -11 by default (-5 for the gap open, -3 for the first extension, -3 for the second extension). Thus, in end-to-end alignment mode, if the read is 50 bp long and it matches the reference exactly except for one mismatch at a high-quality position and one length-2 read gap, then the overall score is -(6 + 11) = -17. The best possible alignment score in end-to-end mode is 0, which happens when there are no differences between the read and the reference.

Local alignment score example
A mismatched base at a high-quality position in the read receives a penalty of -6 by default. A length-2 read gap receives a penalty of -11 by default (-5 for the gap open, -3 for the first extension, -3 for the second extension). A base that matches receives a bonus of +2 be default. Thus, in local alignment mode, if the read is 50 bp long and it matches the reference exactly except for one mismatch at a high-quality position and one length-2 read gap, then the overall score equals the total bonus, 2 * 49, minus the total penalty, 6 + 11, = 81. The best possible score in local mode equals the match bonus times the length of the read. This happens when there are no differences between the read and the reference.

Valid alignments meet or exceed the minimum score threshold For an alignment to be considered "valid" (i.e. "good enough") by Bowtie 2, it must have an alignment score no less than the minimum score threshold. The threshold is configurable and is expressed as a function of the read length. In end-to-end alignment mode, the default minimum score threhsold is -0.6 + -0.6 L, where L is the read length. In local alignment mdoe, the default minimum score threshold is 20 + 8.0 ln(L), where L is the read length. This can be configured with the --score-min option. For details on how to set options like --score-min that correpond to functions, see the section on setting function options.