Question

How does StringTie calculate transcript coverage?

6.3 years ago

ddzhangzz ▴ 90

I recently used StringTie to quantify RNA-Seq reads mapped to transcripts. Two transcripts of one gene have exactly the same length and number of exons (and the same exon structure), yet their coverage values are very different.

 ##transcript
t_id    chr     strand  start   end     t_name                  num_exons       length  gene_id                 gene_name       cov             FPKM
77237   chr17   -       7668402 7687538 ENST00000269305.7       11      2579    ENSG00000141510.14      TP53    31.946598       5.549151
77238   chr17   -       7668402 7687538 ENST00000620739.3       11      2579    ENSG00000141510.14      TP53    2.961419        0.514401

I am wondering how StringTie calculates coverage. If I understand the definition correctly, coverage is computed as $\frac{\sum_{i=1}^{m} \text{seq}_i \times \text{mapped-seq-length}_i}{\text{transcript\_length}}$. If that is true, I would expect the two transcripts to have similar coverage, so why are the values so different?
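To make the expectation concrete, here is a minimal sketch of the naive formula as I understand it. It is not StringTie's method, and the function name and the read counts are hypothetical. It shows why two transcripts with the same exons would get the same value if every read counted in full for both.

```python
def naive_coverage(read_groups, transcript_length):
    """Naive per-base coverage: sum(count_i * mapped_length_i) / transcript_length.

    read_groups is a list of (count, mapped_length) pairs, standing in for
    seq_i and mapped-seq-length_i in the formula (an assumed interpretation).
    """
    if transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    total_bases = sum(count * mapped_len for count, mapped_len in read_groups)
    return total_bases / transcript_length


# Hypothetical input: 1,000 reads that each align 75 bases to a 2,579-base transcript.
shared_reads = [(1000, 75)]
cov_a = naive_coverage(shared_reads, 2579)
cov_b = naive_coverage(shared_reads, 2579)  # same reads, same transcript length
print(cov_a == cov_b)
```

Under this formula the two values are equal by construction, so the difference in the table must come from how reads are assigned to each transcript.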

RNA-Seq • 3.2k views

ADD COMMENT • link updated 6.2 years ago by geo.pertea ▴ 130 • written 6.3 years ago by ddzhangzz ▴ 90

Did you find a solution anywhere else? We are struggling with the same question, and it is not explained clearly anywhere.

ADD REPLY • link 6.2 years ago by lakhujanivijay 5.8k

You may want to follow up on this GitHub issue; maybe someone is listening there:

https://github.com/gpertea/stringtie/issues/162

ADD REPLY • link 6.2 years ago by lakhujanivijay 5.8k

score 0 · Answer 1 · 2018-01-27

6.2 years ago

geo.pertea ▴ 130

Please see this answer about how coverage values are calculated by StringTie. Transcript and exon coverage values for overlapping transcripts (alternate isoforms) are calculated after distributing the read alignments among them according to the maximum flow algorithm, so it is not as simple as applying a formula.

For this particular question, without further data I presume that ENST00000269305.7 and ENST00000620739.3 are somehow distinct isoforms (so not exactly identical in their intron-exon structure, otherwise one of them would be discarded when the input file is loaded).

ADD COMMENT • link 6.2 years ago by geo.pertea ▴ 130

ENST00000269305.7 and ENST00000620739.3 are truly identical in exon structure, even though they are assigned different Ensembl IDs (probably because they have different amino acid sequences). Such cases do not seem to be rare: we found at least 5 duplicated transcripts in one gene. My question is how StringTie treats them. By comparison, Salmon removed one of these duplicated isoforms, but I would still like to know the details for StringTie.
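For anyone who wants to check their own annotation, here is a rough sketch of how transcripts with identical exon structure could be listed from GTF exon records. It assumes standard tab-separated GTF lines with a `transcript_id` attribute. The function names, the toy IDs (`tx_A`, `tx_B`, `tx_C`) and the coordinates are made up for illustration, and this is not how StringTie or Salmon detect duplicates internally.

```python
import re
from collections import defaultdict


def exon_chains(gtf_lines):
    """Map transcript_id -> (chrom, strand, sorted tuple of exon (start, end))."""
    exons = defaultdict(list)
    location = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        tid = match.group(1)
        exons[tid].append((int(fields[3]), int(fields[4])))
        location[tid] = (fields[0], fields[6])
    return {
        tid: (location[tid][0], location[tid][1], tuple(sorted(ex)))
        for tid, ex in exons.items()
    }


def duplicate_groups(gtf_lines):
    """Return lists of transcript IDs that share an identical exon chain."""
    groups = defaultdict(list)
    for tid, chain in exon_chains(gtf_lines).items():
        groups[chain].append(tid)
    return [sorted(tids) for tids in groups.values() if len(tids) > 1]


def gtf_exon(tid, start, end):
    # Build one minimal exon record (hypothetical IDs and coordinates).
    attrs = f'gene_id "gene_X"; transcript_id "{tid}";'
    return "\t".join(["chr17", "toy", "exon", str(start), str(end), ".", "-", ".", attrs])


toy = [
    gtf_exon("tx_A", 100, 200), gtf_exon("tx_A", 300, 400),
    gtf_exon("tx_B", 100, 200), gtf_exon("tx_B", 300, 400),
    gtf_exon("tx_C", 100, 200), gtf_exon("tx_C", 300, 450),
]
print(duplicate_groups(toy))  # [['tx_A', 'tx_B']]
```

Comparing only exon coordinates ignores CDS differences, which is consistent with the guess above that such pairs differ only in their protein sequence.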

ADD REPLY • link 6.2 years ago by ddzhangzz ▴ 90