EvidentialGene reduces redundancy of de novo transcriptome assembly?
6.8 years ago
crimsontabaq ▴ 70

Hi!

Since our Trinity output assembly ('original') had a lot of duplicated BLASTx matches, we decided to try reducing redundancy with the tr2aacds script from the EvidentialGene project. tr2aacds filters and merges contigs according to their coding potential and percent identity, which sounds more legitimate than the blast2cap3 approach or simple duplicate removal.

To compare the original and filtered assemblies, we ran some checks with BUSCO and BLASTx. The result: yes, the number of duplicates decreased (BUSCO), but the number of missing and fragmented contigs increased. The non-redundant assemblies also still give some duplicated BLAST results, albeit far fewer. We're afraid of losing biologically meaningful data, but redundancy also causes problems in downstream analysis.

Does anybody use tr2aacds to reduce redundancy in de novo assemblies?
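One way to put a number on the "duplicated BLAST results" for each assembly is to count how many reference proteins are the best hit of more than one transcript. The sketch below is only an illustration: it assumes standard 12-column BLAST tabular output (`-outfmt 6`, or `-outfmt 7` with `#` comment lines), and the function names are made up for this example.

```python
from collections import Counter


def best_hits(blast_lines):
    """Return {query_id: (subject_id, bitscore)}, keeping the top-bitscore hit per query.

    Assumes the standard 12-column tabular layout:
    column 1 = query, column 2 = subject, column 12 = bitscore.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and -outfmt 7 comments
        cols = line.split("\t")
        if len(cols) < 12:
            continue  # not a standard tabular row
        query, subject, bits = cols[0], cols[1], float(cols[11])
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return best


def duplicated_subjects(best):
    """Return {subject_id: n} for subjects that are the best hit of two or more queries."""
    counts = Counter(subject for subject, _ in best.values())
    return {subject: n for subject, n in counts.items() if n > 1}
```

Running both functions on the BLASTx table of the original assembly and on that of the reduced assembly gives two counts that can be compared directly. A shared best hit is not proof of redundancy (paralogs and isoforms also share hits), so treat it as a rough indicator.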

RNA-Seq Assembly transcriptome redundancy trinity • 3.6k views

Hello crimsontabaq,

I have some questions about how you use that tool. I looked for a way to send you a private message, but I don't think that is possible on this forum, so I have to ask here (sorry). How did you apply the EG approach? Did you edit any of the configuration files or not?

Thank you for your time.


Hello Pablo! We just used one of the Evigene scripts supplied with the project. We looked through the configs and didn't find anything related to our job, so we simply passed the needed options to the script itself at run time.


Thank you for the clarification; I have done the same. In our case we really need to reduce the redundancy of our transcriptome, because we obtained more than 1,000,000 transcripts and CD-HIT-EST didn't help (it reduced the dataset, but we still had 900k transcripts). For that reason we haven't checked for the effects you found. Maybe we have the same issue, maybe not. I'll try to check, but since we used the same assembler I expect the same "problems".


Wow, one million. Just a wild guess: have you changed the minimum contig size in Trinity? We set this value to three times the length of the smallest protein of a related species. Maybe not the most rigorous approach, but the resulting assembly is quite OK apart from the issues I described earlier.
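For illustration, the arithmetic behind that cutoff could look like the sketch below: read a protein FASTA for the related species, take the shortest protein, and multiply its length by 3 (three nucleotides per codon). The function names are hypothetical, and whether to add 3 more nucleotides for the stop codon is left open. The resulting number would then be passed to Trinity's `--min_contig_length` option.

```python
def fasta_lengths(lines):
    """Yield (record_id, sequence_length) for each record in a FASTA file."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # record id = first whitespace-delimited token of the header
            name, length = (line[1:].split() or [""])[0], 0
        elif line and name is not None:
            length += len(line.rstrip("*"))  # ignore a trailing stop symbol
    if name is not None:
        yield name, length


def min_contig_length(protein_fasta_lines):
    """Return 3 x the length (in amino acids) of the shortest protein."""
    shortest = min(length for _, length in fasta_lengths(protein_fasta_lines))
    return 3 * shortest
```

With a real file this would be called as `min_contig_length(open("related_species.faa"))`, where the file name is a placeholder.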


No, we kept the default config; I think it is something like a 200 nt minimum size. In my opinion your way of doing it is fine, but my supervisor is paranoid about losing biologically relevant information, even though we end up with an assembly containing that huge amount of highly redundant data (and, for sure, a lot of artifacts too).


Hi, I am new to the bioinformatics field... I am trying to remove redundancy and came across your post. I used tr2aacds.pl from EvidentialGene and got a problematic FASTA file in which the transcripts have an additional line after the ">" line, as follows:

>new.strg.14074.1
evgclass=noclass,okay; aalen=128,99%,partial;
cagcagcagcaggagggacTGGTCCGGCTGACCGAGATCGTGATGA
>new.strg.2033.5
evgclass=main,okay,match:new.strg.2033.2,pct:100/87/.; aalen=714,92%,complete;
agcaggagggacTGGTCCGGC

I took the file from the okayset. Did I do something wrong while running it? Thanks!!! Reut
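If the file really does contain the annotation on its own line (and it is not just a long header that got wrapped when pasted here), one possible workaround is to fold that line back into the preceding header before passing the file to other tools. This is a minimal sketch under that assumption: it treats any line that directly follows a ">" header and starts with `evgclass=` as part of the header. The function name and the `marker` parameter are invented for this example.

```python
def rejoin_headers(lines, marker="evgclass="):
    """Append annotation lines that directly follow a '>' header back onto that header.

    Assumption: annotation lines start with `marker`; every other line is kept as-is.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if out and out[-1].startswith(">") and line.startswith(marker):
            out[-1] = out[-1] + " " + line  # fold the annotation into the header
        else:
            out.append(line)
    return out
```

Applied to the first record above, this would give a single header line `>new.strg.14074.1 evgclass=noclass,okay; aalen=128,99%,partial;` followed by the sequence.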


Please do not add comments via the answer field. Use Add comment/reply instead. Also please use the code option 10101 to highlight code.

6.8 years ago

This "..increased number of missing and fragmented contigs.." could be due to various things, but two I know or surmise from experience are part of your results:

  1. Trinity, and other assemblers, produce joined genes (fusions, chimeras) that can be scored as existing/full genes by BLASTx or whatever BUSCO software you use, because of the way those measures work. However, a transcript made up of two or more gene loci isn't what Evigene considers accurate. You can instead make protein translations of your transcripts (as Evigene does), then measure with BLASTp against reference proteins to count valid proteins. Or else check your BLASTx results for cases of joined genes (before Evigene reduction).

     1b. Using several gene assemblers, such as Velvet/Oases, idba_tran, Soap_Trans, with multi-kmer options, will produce a more complete gene set from your RNA than using Trinity alone (those others resolve gene joins and fragments better for loci where Trinity fails). That is what Evigene was designed for: reducing many gene assemblies to the best coding gene subset.

  2. Some settings for tr2aacds may be changed to return more of the smaller proteins, if those are what are now missing from your reduced transcript set. You can check which genes are missing, and if they are small ones (e.g. 30 to 60 aminos, or smaller), resetting some of tr2aacds' minimum protein size settings will recover those. As an alternative, you can add reference blast scores to tr2aacds for each input transcript, to retain those with good reference alignments. As it sounds like you already have blast scores for your input transcripts, make an aablast table of those (trid <tab> refid <tab> blast_bitscore <newline>), then run tr2aacds with that option (tr2aacds.pl -ablastab aablast.tab) [I think it will also read standard blastp/blastx -outfmt 7 tables].
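As a rough illustration of the table described in point 2, the sketch below converts standard 12-column BLAST tabular output into the three-column layout (trid, refid, bitscore). Keeping only the best-scoring reference per transcript is an assumption made here, not something tr2aacds is known to require, and the function name is hypothetical.

```python
def aablast_rows(blast_lines):
    """Return 'trid<TAB>refid<TAB>bitscore' rows, one per transcript (its best hit).

    Assumes the standard 12-column tabular layout (-outfmt 6, or -outfmt 7 with
    '#' comment lines) with the transcript as query:
    column 1 = transcript, column 2 = reference, column 12 = bitscore.
    """
    best = {}
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and -outfmt 7 comments
        cols = line.split("\t")
        if len(cols) < 12:
            continue  # not a standard tabular row
        trid, refid, bits = cols[0], cols[1], cols[11].strip()
        if trid not in best or float(bits) > float(best[trid][1]):
            best[trid] = (refid, bits)
    return [f"{trid}\t{refid}\t{bits}" for trid, (refid, bits) in best.items()]
```

Writing the returned rows to a file, one per line, gives something that can be supplied as `aablast.tab` in the command above.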

  • Don Gilbert
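The suggestion in point 1 to check BLASTx results for joined genes could be approximated with a heuristic like the one below. It is only a sketch, not Evigene's own method: it flags a transcript when two sufficiently strong hits to different reference proteins cover largely non-overlapping parts of the transcript. It assumes standard 12-column tabular output (query start and end in columns 7 and 8, bitscore in column 12). The thresholds `min_bits` and `max_overlap` are arbitrary illustrative values, and all names are hypothetical.

```python
from collections import defaultdict


def query_span(qstart, qend):
    """Return (low, high) query coordinates; minus-strand BLASTx hits have qstart > qend."""
    return (min(qstart, qend), max(qstart, qend))


def overlap_fraction(a, b):
    """Overlap length divided by the length of the shorter interval (inclusive coordinates)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0]) + 1
    shorter = min(a[1] - a[0], b[1] - b[0]) + 1
    return max(overlap, 0) / shorter


def candidate_joins(blast_lines, min_bits=100.0, max_overlap=0.1):
    """Return {transcript_id: set of reference ids} for possible joined-gene transcripts."""
    hits = defaultdict(list)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 12 or float(cols[11]) < min_bits:
            continue  # skip malformed rows and weak hits
        hits[cols[0]].append((cols[1], query_span(int(cols[6]), int(cols[7]))))

    flagged = {}
    for query, qhits in hits.items():
        for i, (ref_a, span_a) in enumerate(qhits):
            for ref_b, span_b in qhits[i + 1:]:
                if ref_a != ref_b and overlap_fraction(span_a, span_b) <= max_overlap:
                    flagged.setdefault(query, set()).update((ref_a, ref_b))
    return flagged
```

Flagged transcripts are candidates only: two references can also be fragments or homologs of the same gene, so each case would still need a manual look.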

Hello, and thanks for the expert comment!

I get your points about the missing proteins, but could you comment on the excessive annotation that we're trying to reduce? Is tr2aacds really suited to this job? We set the options for the Evigene script according to our task, but we didn't know it was possible to feed BLAST results to it. Thanks!


Just to point it out, (I think) he is the author of the tool ;) (congrats Mr. Gilbert).

