Question

Stringtie duplicate transcripts

4.1 years ago

nirkoty • 0

Hi,

I am using Stringtie v2.1.1 on a single bam file. I end up with a gff file, but it looks like some transcripts are duplicated, for example:

chr1    StringTie   transcript  729898  732218  1000    -   .   gene_id "STRG.28"; transcript_id "STRG.28.1"; cov "9.134856"; FPKM "1.605420"; TPM "2.896626";
chr1    StringTie   exon    729898  729955  1000    -   .   gene_id "STRG.28"; transcript_id "STRG.28.1"; exon_number "1"; cov "5.454021";
chr1    StringTie   exon    732017  732218  1000    -   .   gene_id "STRG.28"; transcript_id "STRG.28.1"; exon_number "2"; cov "10.191729";
chr1    StringTie   transcript  729898  732218  1000    -   .   gene_id "STRG.28"; transcript_id "STRG.28.2"; cov "3.270196"; FPKM "0.574726"; TPM "1.036966";
chr1    StringTie   exon    729898  729955  1000    -   .   gene_id "STRG.28"; transcript_id "STRG.28.2"; exon_number "1"; cov "1.947918";
chr1    StringTie   exon    732013  732218  1000    -   .   gene_id "STRG.28"; transcript_id "STRG.28.2"; exon_number "2"; cov "3.642488";

In this example, I have 2 transcripts that start and end at the same positions. They also have the same exons, except that in one case the second exon starts at position 732017, while in the other it starts at position 732013.

If you consider another case,

chr1    StringTie   transcript  13483   29654   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "27.115095"; FPKM "4.765386"; TPM "8.598089";
chr1    StringTie   exon    13483   15038   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; cov "23.612379";
chr1    StringTie   exon    15796   15947   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "2"; cov "18.194462";
chr1    StringTie   exon    16607   16765   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "3"; cov "13.165168";
chr1    StringTie   exon    16858   17055   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "4"; cov "40.344353";
chr1    StringTie   exon    17233   17368   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "5"; cov "52.639740";
chr1    StringTie   exon    17606   17742   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "6"; cov "49.598957";
chr1    StringTie   exon    17915   18061   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "7"; cov "45.024239";
chr1    StringTie   exon    18268   18366   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "8"; cov "47.735268";
chr1    StringTie   exon    24738   24891   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "9"; cov "16.246500";
chr1    StringTie   exon    29534   29654   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "10"; cov "1.105868";
chr1    StringTie   transcript  13483   29654   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.2"; cov "4.613946"; FPKM "0.810885"; TPM "1.463064";
chr1    StringTie   exon    13483   15038   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.2"; exon_number "1"; cov "2.968100";
chr1    StringTie   exon    15796   15947   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.2"; exon_number "2"; cov "2.287062";
chr1    StringTie   exon    16607   16765   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.2"; exon_number "3"; cov "1.654875";
chr1    StringTie   exon    16858   17055   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.2"; exon_number "4"; cov "5.071326";
chr1    StringTie   exon    17233   17368   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.2"; exon_number "5"; cov "6.616868";
chr1    StringTie   exon    17606   17742   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.2"; exon_number "6"; cov "6.234639";
chr1    StringTie   exon    17915   18061   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.2"; exon_number "7"; cov "5.659593";
chr1    StringTie   exon    18268   18369   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.2"; exon_number "8"; cov "6.363258";
chr1    StringTie   exon    18913   24891   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.2"; exon_number "9"; cov "5.117283";
chr1    StringTie   exon    29534   29654   1000    -   .   gene_id "STRG.1"; transcript_id "STRG.1.2"; exon_number "10"; cov "0.139009";

2 transcripts, almost the same, except for exon 9 which starts at position 24738 in one case and 18913 in the other, although they end at the same position.

What should I do in this case: consider them as a single isoform and add the TPMs, keep them separate (but then what is the reason behind this?), or simply remove one of them?

This is on a human sample, assembled using hg38.

Thanks in advance for your help

Assembly Stringtie RNA-Seq • 2.4k views

ADD COMMENT • link updated 4.0 years ago by Kristoffer Vitting-Seerup ★ 4.0k • written 4.1 years ago by nirkoty • 0

Why is it different than any other case where there are two isoforms? I understand that the difference between the two is minute in these cases but they are still different.

ADD REPLY • link 4.1 years ago by Asaf 10k

Hi, as a side note to what Kristoffer has already posted, you could load the BAM and the StringTie-assembled GTF in your local IGV. Once there, check the Sashimi plots. If those alternate exon start/stop sites are real, you should see splice-junction support for both boundaries of the exon.
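The same junction check can also be roughed out without IGV. Below is a minimal sketch, assuming you have already pulled the 1-based alignment start and CIGAR string of each read out of the BAM (for example from `samtools view` output). The function name and the three reads are made up for illustration; only the two candidate intron coordinates come from the STRG.28 example in the question.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation letter.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")

def junctions_from_cigar(pos, cigar):
    """Return (intron_start, intron_end) pairs, 1-based inclusive,
    for every N (skipped region) operation in a spliced alignment."""
    junctions = []
    ref = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref, ref + length - 1))
        if op in REF_CONSUMING:
            ref += length
    return junctions

# Hypothetical reads (start, CIGAR); NOT taken from the poster's BAM.
reads = [
    (729926, "30M2061N70M"),
    (729930, "26M2061N74M"),
    (729926, "30M2057N70M"),
]

support = Counter()
for pos, cigar in reads:
    support.update(junctions_from_cigar(pos, cigar))

# Introns implied by the two STRG.28 isoforms in the question.
for intron in [(729956, 732016), (729956, 732012)]:
    print(intron, "supported by", support[intron], "read(s)")
```

If one of the two introns has little or no read support in the real data, that isoform deserves a closer look in the Sashimi plot.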

ADD REPLY • link 4.0 years ago by Amitm ★ 2.2k

score 2 · Answer 1 · 2020-04-08

Those transcripts are not duplicated. There are two transcripts because the BAM file contains evidence of alternative splicing, which changes the exons you have highlighted and gives rise to two distinct mRNAs. Splicing can dramatically alter the function of a transcript (think of one isoform leading to a stop codon while the other does not).

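In your first case the two isoforms differ by only 4 nt at one exon boundary (732013 vs 732017), so that is probably a small change. It might not be, though: 4 nt is not a multiple of three, so inside a coding region it would shift the reading frame (google microexons if you want to know more about small splicing changes). The second one is huge: the actual transcript lengths are ~2900 nt and ~8700 nt respectively! (Please also note that there are more differences between the latter two isoforms than the one you highlight: exon 8 also ends at a different position.)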
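The lengths can be checked directly from the exon coordinates in the question. This is a small sketch, not part of StringTie: the helper name is made up, and the exon lists are copied by hand from the STRG.1 records above. GTF coordinates are 1-based and inclusive, so each exon contributes end - start + 1 bases.

```python
def spliced_length(exons):
    """Sum of exon lengths for 1-based, inclusive (start, end) pairs."""
    return sum(end - start + 1 for start, end in exons)

# Exon coordinates copied from the STRG.1.1 and STRG.1.2 records in the question.
strg_1_1 = [
    (13483, 15038), (15796, 15947), (16607, 16765), (16858, 17055),
    (17233, 17368), (17606, 17742), (17915, 18061), (18268, 18366),
    (24738, 24891), (29534, 29654),
]
strg_1_2 = [
    (13483, 15038), (15796, 15947), (16607, 16765), (16858, 17055),
    (17233, 17368), (17606, 17742), (17915, 18061), (18268, 18369),
    (18913, 24891), (29534, 29654),
]

print(spliced_length(strg_1_1))  # 2859
print(spliced_length(strg_1_2))  # 8687

# Exons present in only one of the two isoforms.
differing = sorted(set(strg_1_1) ^ set(strg_1_2))
print(differing)
```

The last line lists four exons: both versions of exon 8 and both versions of exon 9, which is why the two isoforms differ in more than one place.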

You should definitely keep them!

Having more transcripts will not change gene-level quantification (since you just sum the transcript-level TPM/counts to get the gene-level value).
Having more transcripts will allow you to characterise them functionally. In your second example, what if it is mainly the short one or mainly the long one that is expressed? Those will most likely have very distinct functions.
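The gene-level sum in the first point can be sketched as follows. This is only an illustration: the function name is made up, and the four input lines are the transcript records from the question, with columns split on whitespace because the tabs did not survive pasting. On a real file you would read the lines from the GTF instead.

```python
import re
from collections import defaultdict

def gene_level_tpm(gtf_lines):
    """Sum the TPM attribute of every 'transcript' record per gene_id."""
    totals = defaultdict(float)
    for line in gtf_lines:
        # First 8 columns never contain whitespace; column 9 holds the attributes.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        gene = re.search(r'gene_id "([^"]+)"', fields[8])
        tpm = re.search(r'TPM "([^"]+)"', fields[8])
        if gene and tpm:
            totals[gene.group(1)] += float(tpm.group(1))
    return dict(totals)

# Transcript records copied from the question.
lines = [
    'chr1 StringTie transcript 729898 732218 1000 - . gene_id "STRG.28"; transcript_id "STRG.28.1"; cov "9.134856"; FPKM "1.605420"; TPM "2.896626";',
    'chr1 StringTie transcript 729898 732218 1000 - . gene_id "STRG.28"; transcript_id "STRG.28.2"; cov "3.270196"; FPKM "0.574726"; TPM "1.036966";',
    'chr1 StringTie transcript 13483 29654 1000 - . gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "27.115095"; FPKM "4.765386"; TPM "8.598089";',
    'chr1 StringTie transcript 13483 29654 1000 - . gene_id "STRG.1"; transcript_id "STRG.1.2"; cov "4.613946"; FPKM "0.810885"; TPM "1.463064";',
]

totals = gene_level_tpm(lines)
for gene, tpm in sorted(totals.items()):
    print(gene, round(tpm, 6))
```

Summing this way gives the same gene-level TPM whether the two isoforms are kept separate or merged, so keeping both costs nothing at the gene level.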

Hope this helps.

Cheers Kristoffer