Question

Differential Expression Analysis Using Two Different De Novo Assembled Transcriptomes

0

Entering edit mode

10.2 years ago

Birdman ▴ 20

Hi,

I'm trying to do DE analysis using DEseq (or DEseq2) to find species differences between RNA-seq samples (n=8) of two very closely related bird species.

In summary, I assembled two 'reference' transcriptomes out of all the Illumina reads (100bp PE) from every species using Trinity. I then blasted the two transcriptomes to obtain gene names for the assembled contig. Finally, I aligned the samples individually with their corresponding transcriptomes using BWA and obtained raw counts with eXpress. I'm now analyzing expression in R with DeSEQ.

The problem is that I sometimes have very similar sequences between both species that are have different gene names (according to a uniprot_sprot + nr blast, evalue < 1e-5, 1 result per query - I then blasted transcriptomes together to see if genes were corresponding to each other). Interestingly, it's usually only a subset of sequences (within the same "compxxxxx") that give different gene names, the other sequences have the same names. I'm not sure why this happens (alternative splicing/sequencing errors/other?) but it surely is a problem when I try to analyze gene expression. Some genes that appear to be 'overexpressed' in one species, when blasted against the other species' transcriptome, are very similar to other genes that, not surprisingly, are 'overexpressed' in the other species. It is most likely that these are in fact the same genes that are similarly expressed in both species.

I was thinking to remove all those ambiguous contigs, but it represents a lot of them (~50% of all contigs). Another idea I had was to blast a transcriptome with the other to obtain gene names instead of blasting against whole databases for every transcriptome.

What do you think would be the best approach to deal with this?

differential-expression rna-seq • 5.7k views

ADD COMMENT • link updated 10.2 years ago by Charles Warden 8.2k • written 10.2 years ago by Birdman ▴ 20

score 2 · Answer 1 · 2014-02-10

Yes - I think this is a tricky task for the reasons that you mentioned.

Here are a couple threads you may be interested in:

How to Compare 2 Differential expressed transcripts from 2 different de novo assembly?

C: Creating reference de novo assembly using velvet/oases

I'm not certain how to get around this problem with assembled transcripts. When I had two samples, I ranked the contigs (not transcripts - I actually found that the transcripts seemed less reliable than the normal transcripts) by coverage, BLASTed the top X contigs, and saw that CLC Bio de novo produced the results that make the most sense with respect to the different tissues of origin (adipose versus muscle).

Also, the first link also mentions some tools that can identify differentially expressed genes by first looking for differentially distributed kmers and then focusing on assembly of those results. In principle, this sounds like the best idea that I have heard for most precisely defining differential expression.