Question

gene duplication and differential gene expression analysis

7.9 years ago

amoltej ▴ 100

Hello everyone,

I have an insect genome with no annotation, so I mapped annotated NCBI protein sequences from the closest relative to the genome and built my own transcriptome annotation (FASTA and GFF files). After analyzing the data for differential gene expression with edgeR, I found that featureCounts does not count reads for genes that overlap or share the same reading frame. This happened in one case: a gene that should be expressed showed no expression, but I later found that there are reads corresponding to that transcript.

Then I checked my whole transcript set, which is based on the closest relative's annotation, with CD-HIT. Even with a 100% similarity cut-off, I still see at least 4000-5000 sequences that are 100% identical to another sequence in the same set. The number of redundant sequences grows as I relax the similarity cut-off.

Given this, I am unsure whether my analysis is missing something or giving me wrong information.

If the similar sequences represent gene duplication events, how should I handle them in the differential expression analysis?

Can anyone explain whether I am doing anything wrong, or what I should do?

Thank you for your time and consideration.

Amol

RNA-Seq gene duplication dfferential gene exp • 2.5k views

ADD COMMENT • link 7.9 years ago by amoltej ▴ 100

I think part of the problem is with the featureCounts parameters. Are you running it in Galaxy? Could you try a different counting method? How did you map the proteins to the genome (with exonerate?)? I didn't follow your second paragraph on transcript clustering: what did you give as input there?
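For reference, featureCounts by default does not count a read that overlaps more than one feature, and the `-O` flag allows such reads to be assigned to every overlapping feature. Below is a minimal sketch of building such a command from Python. The file names are hypothetical, and the `-t`/`-g` values are assumptions that must match the feature type and attribute name actually present in the custom GFF.

```python
import shlex

def build_featurecounts_cmd(annotation, bam_files, output,
                            allow_multi_overlap=True,
                            feature_type="exon", group_attr="gene_id"):
    """Assemble a featureCounts command line as a list of arguments.

    feature_type and group_attr are assumptions; check them against the
    third column and the attribute column of your own annotation file.
    """
    cmd = ["featureCounts",
           "-a", annotation,      # annotation file (GTF/GFF)
           "-o", output,          # output count table
           "-t", feature_type,    # feature type to count over
           "-g", group_attr]      # attribute used to group features into genes
    if allow_multi_overlap:
        cmd.append("-O")          # count reads overlapping more than one feature
    cmd.extend(bam_files)
    return cmd

# Hypothetical file names, for illustration only.
cmd = build_featurecounts_cmd("custom_annotation.gff", ["sample1.bam"], "counts.txt")
print(shlex.join(cmd))
```

Comparing the count table with and without `-O` for the affected gene is one way to see whether overlapping features explain the missing counts.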

ADD REPLY • link 7.9 years ago by Michael 54k

Yes, I mapped with exonerate.

Basically, the custom annotated sequences that I got from the exonerate mapping are the input for CD-HIT clustering. The input file has 22K sequences, and after clustering I get about 17K clusters. This means CD-HIT finds ~5K sequences that are 100% similar to other sequences in the same data.
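As a cross-check on the clustering result, exact full-length duplicates can be counted directly. This is a minimal sketch that groups FASTA records by identical sequence. It is stricter than a clustering tool, which may also group a shorter sequence with a longer one that contains it. The record names and sequences are made up for illustration.

```python
from collections import defaultdict

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)

def exact_duplicate_groups(records):
    """Return lists of record names whose sequences are identical over their full length."""
    groups = defaultdict(list)
    for name, seq in records:
        groups[seq].append(name)
    return [names for names in groups.values() if len(names) > 1]

# Toy input: t1 and t2 are identical, t3 merely contains t1 as a prefix.
example = """>t1
ATGGCC
>t2
ATGGCC
>t3
ATGGCCTAA
""".splitlines()

dups = exact_duplicate_groups(read_fasta(example))
print(dups)
```

If far fewer sequences are exact full-length duplicates than CD-HIT reports as clustered, the remainder are likely fragments or contained sequences rather than identical copies.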

ADD REPLY • link 7.9 years ago by amoltej ▴ 100

It might be due to duplicated regions in the assembly, possibly caused by ploidy or fragmentation. For a true gene duplication, 100% nucleotide identity seems unlikely. Extract the full gene sequence of the 100% identical sequences, including the introns plus some flanking sequence. Then check whether these sequences are also 100% identical, with 100% coverage and the same length. It might well be that CD-HIT has clustered different fragments of the same sequence.
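The check described above can be sketched as follows, assuming the genome is already loaded as a dictionary of contig name to sequence and that coordinates are 1-based and inclusive, as in GFF. The function names and the toy contig are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def extract_locus(genome, seqid, start, end, strand="+", flank=0):
    """Return the genomic sequence of a locus plus `flank` bases on each side.

    Coordinates are 1-based inclusive; flanks are clipped at contig ends.
    """
    contig = genome[seqid]
    lo = max(0, start - 1 - flank)
    hi = min(len(contig), end + flank)
    seq = contig[lo:hi]
    return reverse_complement(seq) if strand == "-" else seq

def compare_loci(genome, a, b, flank=0):
    """Compare two loci given as (seqid, start, end, strand) tuples."""
    sa = extract_locus(genome, *a, flank=flank)
    sb = extract_locus(genome, *b, flank=flank)
    return {
        "same_coordinates": a[:3] == b[:3],
        "same_length": len(sa) == len(sb),
        "identical": sa == sb,
    }

# Toy contig with the same 8-base locus at two positions but different flanks.
genome = {"scaf1": "GGATGCATGAACCATGCATGATT"}
locus_a = ("scaf1", 3, 10, "+")
locus_b = ("scaf1", 14, 21, "+")

print(compare_loci(genome, locus_a, locus_b, flank=0))
print(compare_loci(genome, locus_a, locus_b, flank=2))
```

Loci that are identical without flanks but differ once flanks are included are distinct genomic copies. Loci with the same coordinates are the same region annotated more than once.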

ADD REPLY • link 7.9 years ago by Michael 54k

Hi Michael,

I checked, and the sequences are exactly the same. They also map to the same location. On the NCBI website they have different annotations based on their gene sequence. See the following link, under "Genomic regions, transcripts, and products": there are four transcripts based on their nucleotide sequence, and they have different lengths. http://www.ncbi.nlm.nih.gov/gene/?term=XP_008178725.1

When I map their protein sequences, which are exactly the same, I basically map the same gene to the same location multiple times. So I believe I am losing the signal needed to discriminate between genes that may have different regulatory elements involved in their expression. How should I look into this problem? Is it a problem at all? If I map the nucleotide sequences instead, the result will be less trustworthy than protein sequence mapping, so I think there is a trade-off. Is it significant? Is there any way to get around it?
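To see how often this happens across the whole annotation, features that share identical coordinates can be listed directly from the GFF. This is a minimal sketch: the `gene` feature type and the attribute strings in the example are assumptions, not taken from the actual exonerate output.

```python
from collections import defaultdict

def features_by_locus(gff_lines, feature_type="gene"):
    """Group GFF features of one type by (seqid, start, end, strand).

    Returns only the loci annotated by more than one feature, mapped to the
    attribute strings of those features.
    """
    loci = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        key = (cols[0], int(cols[3]), int(cols[4]), cols[6])
        loci[key].append(cols[8])
    return {k: v for k, v in loci.items() if len(v) > 1}

# Hypothetical GFF lines: protA and protB map to the same locus.
lines = [
    "scaf1\texample\tgene\t100\t900\t.\t+\t.\tsequence protA",
    "scaf1\texample\tgene\t100\t900\t.\t+\t.\tsequence protB",
    "scaf1\texample\tgene\t2000\t2600\t.\t-\t.\tsequence protC",
]

shared = features_by_locus(lines)
for locus, attrs in shared.items():
    print(locus, attrs)
```

Each locus reported here is a single genomic region carrying several annotation entries, so reads from it cannot be assigned to one entry unambiguously.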

ADD REPLY • link 7.9 years ago by amoltej ▴ 100