Question

How do you identify the contigs from trinity assembly?

0

5.6 years ago

MAPK ★ 2.1k

I am trying to get read counts for DESeq2 analysis from metagenomic data. I have assembled contigs with Trinity for all organisms, and I would like to map the reads from each sample to these contigs to get the counts. Normally for RNA-seq we would use a GFF file to assign the reads to annotated loci, but for metagenomic data I can't use one specific genome, so I want to use the Trinity-assembled contigs as the reference for mapping. However, before proceeding with the read mapping, I would like to annotate each contig from Trinity. I wonder if I can do a BLAST search against nr. What would be the easiest way to do this? Thanks for your help!

Trinity Assembly blast • 2.5k views

ADD COMMENT • link 5.6 years ago by MAPK ★ 2.1k

1

To get counts for each, you don't strictly need to identify them up-front. You could identify the DE ones first and only ID those :-)
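A minimal sketch of the "identify only the DE ones" idea, assuming you already have the IDs of the differentially expressed contigs from DESeq2 and a Trinity FASTA file. The function names and the example contig IDs are hypothetical, made up for illustration; the only assumption about Trinity is that the contig ID is the first whitespace-delimited token of each header line.

```python
import io
from typing import Dict, Iterable, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (contig_id, sequence) pairs from an open FASTA file."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first token of the header as the contig ID.
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def subset_fasta(handle: TextIO, wanted_ids: Iterable[str]) -> Dict[str, str]:
    """Return only the contigs whose IDs are in wanted_ids."""
    wanted = set(wanted_ids)
    return {name: seq for name, seq in read_fasta(handle) if name in wanted}


# Hypothetical example data; in practice, open your Trinity FASTA instead.
example = io.StringIO(
    ">TRINITY_DN1_c0_g1_i1 len=8\nACGT\nACGT\n"
    ">TRINITY_DN2_c0_g1_i1 len=4\nTTTT\n"
)
de_contigs = subset_fasta(example, ["TRINITY_DN2_c0_g1_i1"])
```

The resulting dictionary holds just the DE contigs, so only those need to go through the (expensive) database search.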

You could follow these directions from Trinity for identification.

Edit: Since this is a metagenomic dataset, these directions are not useful.

ADD REPLY • link 5.6 years ago by GenoMax 141k

0

That is right, I was planning to do it the way you suggested, but identifying the DE ones afterwards would be a rather elaborate process. I thought identifying them at the beginning would reduce the work later.

ADD REPLY • link 5.6 years ago by MAPK ★ 2.1k

0

So rather than identification per se you are looking to reduce redundancy so you don't have the same sequence represented multiple times?

Did you use TriMetAss (http://microbiology.se/software/trimetass/) instead of Trinity? That appears to be for metagenomic data.

ADD REPLY • link 5.6 years ago by GenoMax 141k

0

No, these are not overlapping sequences so I wanted to map them to the assembled reference. I haven't used TriMetAss, but will give it a try. Thanks!

ADD REPLY • link 5.6 years ago by MAPK ★ 2.1k

0

Additionally, I just wanted to get the loci identified (as which gene, CDS, etc.) for each cluster of reads after mapping.

ADD REPLY • link 5.6 years ago by MAPK ★ 2.1k

1

Since this is bacterial data you would expect the entire sequence to be coding. It may not be full length or start at the ATG depending on how well the assembly worked.

As suggested, it should be OK to search with DIAMOND against nr (or the RefSeq bacterial database) to identify the contigs. It works well, but you would need ~80-100 GB of RAM for this search. You could also try Magic-BLAST from NCBI.
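A rough sketch of how the DIAMOND search could be driven from Python, assuming `diamond` is installed and a protein database has already been built with `diamond makedb`. It uses `blastx` because the contigs are nucleotide sequences searched against a protein database. The file names, the helper function names, the e-value cutoff and the accessions in the example are all assumptions for illustration, not recommendations from this thread. The parser relies on the default 12-column tabular layout of `--outfmt 6`.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple

# Default column order of DIAMOND/BLAST tabular output (--outfmt 6).
BLAST_TAB_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def diamond_blastx_cmd(contigs: str, db: str, out: str,
                       threads: int = 8, evalue: float = 1e-5) -> List[str]:
    """Build a DIAMOND blastx command line (paths are placeholders)."""
    return [
        "diamond", "blastx",
        "--query", contigs,
        "--db", db,
        "--out", out,
        "--outfmt", "6",
        "--max-target-seqs", "1",
        "--evalue", str(evalue),
        "--threads", str(threads),
    ]


def best_hits(handle: TextIO) -> Dict[str, Tuple[str, float]]:
    """Map each contig to its (subject ID, bitscore) with the highest bitscore."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < len(BLAST_TAB_FIELDS):
            continue  # skip blank or malformed lines
        hit = dict(zip(BLAST_TAB_FIELDS, row))
        score = float(hit["bitscore"])
        if hit["qseqid"] not in best or score > best[hit["qseqid"]][1]:
            best[hit["qseqid"]] = (hit["sseqid"], score)
    return best


# To actually run the search (needs DIAMOND and a built database):
#   import subprocess
#   subprocess.run(diamond_blastx_cmd("Trinity.fasta", "nr", "hits.tsv"), check=True)

# Hypothetical output rows, only to show the parsing step.
example_rows = io.StringIO(
    "c1\tWP_1\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t180.0\n"
    "c1\tWP_2\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-60\t200.5\n"
)
annotations = best_hits(example_rows)
```

Keeping one best hit per contig gives a simple contig-to-protein lookup table that can be joined to the DESeq2 results afterwards.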

ADD REPLY • link 5.6 years ago by GenoMax 141k

0

Thanks! I have used DIAMOND before, so yes, that makes sense.

ADD REPLY • link 5.6 years ago by MAPK ★ 2.1k

1

Out of sheer curiosity: what was your rationale for using Trinity? My apologies if this question merely reflects my inexperience with Trinity: why would you BLAST contigs against nr? Or do you get proteins? Is Trinity able to define gene boundaries in prokaryotic RNA-seq data? Also, I think your GFF approach should work; you can handle contigs in a metagenome just like any other genome.

For contig annotation, Kraken is an excellent tool (though it lacks a good taxonomic binning algorithm, afaik), and as a faster blastp alternative I recommend DIAMOND.

ADD REPLY • link 5.6 years ago by Carambakaracho ★ 3.2k

0

I just wanted to annotate the contigs, and I also don't think BLAST would be the best solution, which is why I asked the question here. Since this is metatranscriptome data, I am not sure whether I would be able to use GFF file(s). I am using the Trinity assembly as a reference genome to get read counts from the metatranscriptome data I have.
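For what it's worth, here is a minimal sketch of the last step described above: combining per-sample read counts per contig into one matrix of the shape DESeq2 expects (contigs as rows, samples as columns, raw integer counts). It assumes the counting itself has already been done by whatever mapper or counting tool you use; the function names, sample names and contig IDs are hypothetical.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def build_count_matrix(
    sample_counts: Dict[str, Dict[str, int]]
) -> Tuple[List[str], List[list]]:
    """Combine {sample: {contig: count}} into rows of [contig, count per sample]."""
    samples = sorted(sample_counts)
    contigs = sorted({c for counts in sample_counts.values() for c in counts})
    # Contigs with no mapped reads in a sample get a count of 0.
    rows = [[c] + [sample_counts[s].get(c, 0) for s in samples] for c in contigs]
    return samples, rows


def write_matrix(samples: List[str], rows: List[list], handle: TextIO) -> None:
    """Write the matrix as tab-separated text with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["contig"] + samples)
    writer.writerows(rows)


# Hypothetical counts for two samples.
counts = {"s1": {"c1": 5, "c2": 1}, "s2": {"c2": 7}}
samples, rows = build_count_matrix(counts)
buffer = io.StringIO()
write_matrix(samples, rows, buffer)
```

The tab-separated table can then be read into R and passed to DESeq2 as the count matrix, with the contig column used as row names.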

ADD REPLY • link 5.6 years ago by MAPK ★ 2.1k

0

Hi, I was just wondering if you ended up finding a way to annotate the contigs from Trinity?

ADD REPLY • link 5.3 years ago by CC ▴ 50