Question

DGE on transcriptomic data, focus on transcript sequences or in derivated ORFs?

0

Entering edit mode

6.4 years ago

pablo61991 ▴ 90

Hi Community,

I have the next doubt, if I'm only interested in protein coding genes it's a good practice to obtain the ORFs from my transcriptome (de novo assembled) using for example Transdecoder or EMBOSS tools and then quantify the levels of expression?? The pipeline could be something like:

Assembly using Trinity+Velvet+SOAP
Merge and redundancy reduction (ex. EvidentialGenes)
ORF obtention (maybe Transdecoder or other tool)
Blastx agaisnt Swissprot
Delete ORFs without hit to construct my final transcriptome
Quantification using Kallisto
Tximport to could use my data on DESeq2

Are there some important caveats here? Should I work with whole my transcriptome (fresh) instead of process it a little bit to take out no "uncoding" or unidentified transcripts?

Thank you for your time

Pablo

differential-gene-expression RNA-Seq ORFs • 2.3k views

ADD COMMENT • link updated 13 days ago by Ram 43k • written 6.4 years ago by pablo61991 ▴ 90

score 2 · Answer 1 · 2017-11-17

EvidentialGene computes ORFs (proteins and coding sequences of those), and its method is drawn on Brian Haas's ORF computations, which also form the Transdecoder package now. ORF computation is fairly straight-forward, the only differences among methods will be at the edges for complicated, unusual cases. I've recently looked at results from Transdecoder versus Evigene, and I don't think Transdecoder is giving you improvements, it may well be reducing the number of best orthology proteins using its Predict variant. The initial TransDecoder.LongOrfs gives way to many results to be useful without the sort of filtering that Evigene does.

Evigene gives you a single longest ORF per transcript, which are checked and filtered for redundancy, with the non-redundant 'okayset' that contains proteins, CDS and transcripts for the non-redundant genes, and those alternate transcripts per gene that have different CDS from longest. So my suggestion is that you will get better results using the Evigene okayset of proteins for orthology computations. I suggest also that using BLASTp of those proteins by reference set (e.g. from Swissprot) will give more accurate results than using BLASTx of the transcript or CDS nucleic sequences to reference proteins. This later method will hide errors in the transcripts (indels, inner stop codings, fragmented CDS). More important to you perhaps, the protein x protein BLASTp is more sensitive in finding significant homology.

The results of ignoring genes that lack homology to your reference proteins is variable: it depends on how complete the reference gene set is w/ respect to all the genes your organism is expressing, and how close in phylogeny they are. I've found some large numbers (1000s) of putative recent orthology genes in some fishes, water fleas, plants and other species that exist across a narrow phylogenetic span at least, but are not found in more distant model species. These recently evolved genes can be among those with active differential expression, active in response to environmental stresses unique to those species. Ignoring these recent genes means possibly ignoring important gene responses in your organism.

In terms of measuring differential expression, there are definite effects among the alternate transcripts, which share large portions of same exons, but differ at certain exons (which may be where your differential effects are). It is valuable, but harder, to measure DE among alternates of same locus due to the high portion of shared reads.

There are also definite effects in non-coding regions, including non-coding genes but also long UTR and intergenic "ambiguous" expression that is hard to define as genic. Whether you measure that or not is your decision.

I recommend measuring expression of all your genes, then report effects in those broad classes of (a) coding genes with homology, (b) coding without definite homology, (c) non-coding.

-- Don Gilbert, author of EvidentialGene

score 0 · Answer 2 · 2017-11-17

0

Entering edit mode

6.4 years ago

h.mon 35k

Delete ORFs without hit to construct my final transcriptome

I think this is a bad idea, you may have novel transcripts with good support from your data, but no hits on the database. You will be biasing your results against novel transcripts.

ADD COMMENT • link 6.4 years ago by h.mon 35k