Trinity Transcriptome Assembly
4 • 11.5 years ago

After the Trinity assembler finished, I calculated the basic statistics of the assembly, which are shown below.

File:  Trinity.fasta
Number:  158863
Total size:  176660784
Min size:  201
Max size:  22887
Average size:  1112.03
Median size:  665
N50:  1863
size @ 1Mbp:  11440  
Number @ 1Mbp:  65
size @ 2Mbp:  8461
Number @ 2Mbp:  170
size @ 4Mbp:  7088
Number @ 4Mbp:  430
size @ 10Mbp:  5424
Number @ 10Mbp:  1417

Now my question is: do these values look reasonable? Although the N50 looks good, I am worried that roughly 60% of the transcripts are shorter than 1 kb. Is this normal for Trinity?
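
(For reference, these numbers, including the fraction under 1 kb, can be recomputed from the FASTA with a short script; a minimal sketch in plain Python, assuming only that Trinity.fasta is in the working directory:)

    # Minimal sketch: recompute length statistics from a Trinity FASTA.
    # Assumes a standard FASTA file named "Trinity.fasta" in the working directory.

    def read_lengths(path):
        lengths, current = [], 0
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):
                    if current:
                        lengths.append(current)
                    current = 0
                else:
                    current += len(line.strip())
        if current:
            lengths.append(current)
        return lengths

    def n50(lengths):
        total, running = sum(lengths), 0
        for length in sorted(lengths, reverse=True):
            running += length
            if running * 2 >= total:
                return length

    lengths = read_lengths("Trinity.fasta")
    short = sum(1 for l in lengths if l < 1000)
    print("Number:", len(lengths))
    print("Total size:", sum(lengths))
    print("N50:", n50(lengths))
    print("Under 1 kb: %.1f%%" % (100.0 * short / len(lengths)))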

Also, how do people normally do downstream analysis on the assembly to select the best transcripts? I ask because the number of transcripts is far higher than the expected number of genes in related species.

Thanks

trinity transcriptome denovo • 10k views
0

Please fix the formatting; it's very difficult to read the tables.

5 • 11.5 years ago • Ketil 4.1k

Is it reasonable as transcript-assembler output? Possibly. Is it reasonable as an estimate of the real genes? Probably not. When doing transcript assembly you invariably get a lot of junk: fragmented genes, merged genes, and non-coding transcript fragments ("junk" RNA). It's hard to tell from sizes and numbers alone, since these vary with species.

Did you compare with any other tools? Did you map the reads back to the transcripts and count pairs and mapping percentages? Did you map the transcripts to related transcriptomes to estimate errors and coverage? These are all things you can do to evaluate the assembly.
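
For the read-mapping check, something along these lines would do (a rough sketch, assuming pysam and a BAM of the reads aligned back to the transcripts with your aligner of choice; the file name is just a placeholder):

    # Rough sketch: mapping rate and properly-paired rate for reads mapped
    # back to the transcripts. Assumes pysam and a BAM produced by your
    # aligner of choice (file name here is just a placeholder).
    import pysam

    total = mapped = proper = 0
    with pysam.AlignmentFile("reads_vs_transcripts.bam", "rb") as bam:
        for read in bam:
            if read.is_secondary or read.is_supplementary:
                continue  # count each read only once
            total += 1
            if not read.is_unmapped:
                mapped += 1
                if read.is_proper_pair:
                    proper += 1

    print("reads: %d" % total)
    print("mapped: %.1f%%" % (100.0 * mapped / total))
    print("properly paired: %.1f%%" % (100.0 * proper / total))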

Finally, it's difficult to suggest how to select the "best transcripts", since it's not clear what you mean by "best". Are you looking for something in particular?

0

Thanks, Ketil, for your response. I have redone the table and that is the best I can do. Anyway, I mapped all my Trinity transcripts to my reference genome and more than 99% of them map, and when I BLASTed the Trinity transcripts against a reference transcriptome (from a different accession) I got hits for ~84%, so it looks like they are all genuine. But the problem now is how to deal with the 60% of transcripts that are shorter than 1 kb. How do I know whether or not they are fragmented?
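
(I suppose one way to check would be to see how much of its best reference hit each transcript covers; a rough sketch, assuming the BLAST run used -outfmt '6 qseqid sseqid pident length qlen slen' and wrote to a tabular file, here called trinity_vs_reference.tab:)

    # Rough sketch: flag transcripts whose best hit covers only part of the
    # reference sequence, i.e. likely fragments. Column layout assumed:
    #   qseqid sseqid pident length qlen slen
    best_cov = {}
    with open("trinity_vs_reference.tab") as fh:
        for line in fh:
            qseqid, sseqid, pident, length, qlen, slen = line.split()[:6]
            cov = int(length) / float(slen)   # fraction of the reference covered
            if cov > best_cov.get(qseqid, 0.0):
                best_cov[qseqid] = cov

    likely_fragments = [q for q, cov in best_cov.items() if cov < 0.8]
    print("%d of %d transcripts cover <80%% of their best reference hit"
          % (len(likely_fragments), len(best_cov)))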

1

I use my own tools (first asmeval, then bamstats, both linked from http://blog.malde.org/ ) to map reads to transcripts and then calculate statistics on various things. One recent addition to the latter is counting "splits", that is, read pairs that span chromosomes (or rather contigs, or in this case putative transcripts). But you can probably do this easily with your own tools; it's not rocket surgery.
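
In pysam terms it might look something like this (a rough sketch, not what bamstats actually does; it assumes a BAM of paired-end reads mapped to the putative transcripts):

    # Rough sketch: count read pairs whose mates map to different putative
    # transcripts ("splits"). Assumes pysam and a BAM of paired-end reads
    # mapped to the transcripts (file name is a placeholder).
    import pysam

    pairs = splits = 0
    with pysam.AlignmentFile("reads_vs_transcripts.bam", "rb") as bam:
        for read in bam:
            if read.is_secondary or read.is_supplementary or not read.is_read1:
                continue  # look at each pair only once
            if read.is_unmapped or read.mate_is_unmapped:
                continue
            pairs += 1
            if read.reference_id != read.next_reference_id:
                splits += 1

    print("pairs with both mates mapped: %d" % pairs)
    print("split pairs (mates on different transcripts): %d" % splits)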

0

The link to the blog is not working. Could you please fix it? Thanks.

0

Sorry! Fixed now - the markup parser had included the closing parenthesis in the URL :-) Thanks for pointing it out!

0

+1 for "it's not rocket surgery" :)

2 • 10.7 years ago • William ★ 5.3k

Here is a presentation about the different metrics you can (and should) use to assess the quality of your transcriptome assembly:

http://www.abrf.org/Committees/Education/Activities/ABRF2013_SW1_oneil_DeNovo-transccript-Assembly.pdf

0

Interesting ways to show metrics...

1 • 11.5 years ago • Biojl ★ 1.7k

You are expected to get a lot more transcripts than genes. Along with what Ketil says, you must also take alternative splicing into account. If you are working with a eukaryotic species and mapping to a decent genome assembly (human, mouse, etc.), you should expect to find different isoforms coming from the same gene. Try calculating how many unique IDs you hit for genes and for transcripts; see the sketch below.
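
A quick related check, straight from the Trinity headers themselves, is to count how many distinct "gene" (component) IDs the isoforms collapse to. A sketch assuming the classic compX_cY_seqZ naming (adjust the parsing if your Trinity version names transcripts differently):

    # Sketch: count transcripts and the distinct Trinity "genes" (components)
    # they belong to. Assumes headers like ">comp0_c0_seq1 ..." where the
    # gene/component ID is everything before "_seq".
    genes, transcripts = set(), 0
    with open("Trinity.fasta") as fh:
        for line in fh:
            if line.startswith(">"):
                transcripts += 1
                name = line[1:].split()[0]
                genes.add(name.rsplit("_seq", 1)[0])

    print("transcripts:", transcripts)
    print("unique genes (components):", len(genes))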

As for the short transcripts, you could filter them for an ORF longer than, say, 100 bp, or whatever value best fits your data; see the sketch below.
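
A minimal version of such a filter might look like this (a sketch only; it scans all six frames for the longest ATG-to-stop ORF, whereas Trinity's own ORF extraction pipeline does considerably more):

    # Sketch: keep transcripts whose longest ORF (ATG..stop, any of the six
    # frames) is at least MIN_ORF_NT nucleotides. Real ORF pipelines do more.
    MIN_ORF_NT = 100

    def revcomp(seq):
        return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

    def longest_orf(seq):
        best = 0
        for strand in (seq, revcomp(seq)):
            for frame in range(3):
                codons = [strand[i:i + 3] for i in range(frame, len(strand) - 2, 3)]
                start = None
                for i, codon in enumerate(codons):
                    if codon == "ATG" and start is None:
                        start = i
                    elif codon in ("TAA", "TAG", "TGA") and start is not None:
                        best = max(best, (i - start) * 3)  # ORF length, stop excluded
                        start = None
        return best

    def read_fasta(path):
        name, seq = None, []
        with open(path) as fh:
            for line in fh:
                if line.startswith(">"):
                    if name:
                        yield name, "".join(seq)
                    name, seq = line[1:].split()[0], []
                else:
                    seq.append(line.strip().upper())
        if name:
            yield name, "".join(seq)

    kept = [name for name, seq in read_fasta("Trinity.fasta")
            if longest_orf(seq) >= MIN_ORF_NT]
    print("transcripts passing the ORF filter:", len(kept))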

0

Thanks to both of you for your suggestions and comments. Over the last few days I have learnt a lot about the Trinity output by going through forums (the Trinity website explains very little about this). As noted above, I found that my Trinity output contains lots of isoforms; one component even had 202 isoforms, so the number of transcripts is not surprising at all. Also, when I checked my reference transcriptome, ~52% of the genes are shorter than 1 kb, so I am fairly happy with my Trinity assembly.

I plan to do something like the following with the Trinity output to select the best transcripts (copying this pipeline from another forum):

  1. Expression-based: after running abundance estimation, retain transcripts with at least some minimum FPKM value (such as 1).

  2. Run the ORF extraction pipeline included with Trinity (don't restrict it to complete ORFs; get both complete and partial ones) and retain transcripts that encode long ORFs (e.g. 200 aa).

  3. BLASTX the Trinity transcripts against UniRef90 and retain those with homology to known proteins (E <= 1e-10).

Take the union of {1,2,3} above and call it 'best' (see the sketch below).
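
Once each of the three steps writes out its list of transcript IDs, taking the union is simple; a sketch, with the input file names just placeholders for whatever each step produces:

    # Sketch: union of the three "keep" lists (expression, long ORF, blastx
    # homology). The input file names are placeholders, one transcript ID
    # per line in each.
    def load_ids(path):
        with open(path) as fh:
            return {line.strip() for line in fh if line.strip()}

    best = (load_ids("ids_fpkm_ge1.txt")
            | load_ids("ids_long_orf.txt")
            | load_ids("ids_uniref90_hits.txt"))

    with open("best_transcript_ids.txt", "w") as out:
        out.write("\n".join(sorted(best)) + "\n")

    print("transcripts in the 'best' set:", len(best))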

Thanks again for your help, guys.

I will update you once I finish the analysis.

1 • 10.7 years ago • arnstrm ★ 1.8k

Also, I would like to point out the article "Optimizing de novo assembly of short-read RNA-seq data for phylogenomics". Although it is aimed at phylogenomics, the methods can be applied to other studies as well.
