Diploid genome gene annotation
5.1 years ago
Morgan S. ▴ 80

Hi guys,

I am working with several diploid fungal genomes and am unsure how to deal with the duplicated genes. I started by assembling the diploid genomes using dipSPAdes, then ran gene finding with MAKER. The reported number of genes is ~20,000, which is about twice as many as are reported for haploid genomes of the same genus. So my question is: is it okay to go forward with functional gene annotation, or do I need to somehow remove the duplicate genes from the genome? I am also unsure whether it is appropriate to publish the diploid version of the genome, or whether it is necessary to report the haploid version. I hope this makes sense!

Thanks in advance, Morgan

genome diploid gene annotation • 1.4k views
Is the species you're working on (highly) heterozygous? If not, then dipSPAdes is unfortunately not the most appropriate choice of assembler.

I do not believe so (though I'm not 100% sure), because they exhibit both haploid and diploid cells; I observed this myself under the microscope after DAPI staining. Also, the diploid assembly was of much higher quality in terms of number of contigs, total size, and N50 than the regular SPAdes assembly. BUSCO reported that approximately 70% of the single-copy orthologs were duplicated. Can you recommend an assembler so that I could compare them?
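
For anyone comparing assemblies on the same metrics, here is a minimal sketch of how contig count, total size, and N50 can be computed from a FASTA file. It assumes a plain (possibly multi-line) FASTA; the function names `fasta_lengths` and `n50` are illustrative and do not come from SPAdes or any other tool mentioned here.

```python
def fasta_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # running length of the record being read
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return N50: the contig length at which the cumulative sum of
    lengths, taken from longest to shortest, first reaches half of
    the total assembly size."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
```

With a file handle `fh` opened on an assembly, `lengths = fasta_lengths(fh)` gives the contig count as `len(lengths)` and the total size as `sum(lengths)`, and `n50(lengths)` gives the N50, so two assemblies can be compared side by side.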

It's true that it is not common to find diploid genome annotations in the databases. I don't think it makes any difference to the EBI or NCBI submission pipelines whether an annotation is haploid or diploid, but you should contact them to find out the best way to submit your data. I'm looking forward to hearing more about it. One problem I can see is that the two alleles of a locus will have two different gene identifiers in your MAKER annotation. That means you will have two locus identifiers for a single locus, which would be a bit weird.

Thanks for the advice, I'll contact the databases and can post an update here. I should have thought about this sooner before proceeding with assembly and annotation :/ I just wonder if there is a way to "fix" this with the gene predictions instead of having to start from the beginning with the assemblies.

If you know which contigs are part of which assembly (primary or secondary) then it's not a problem to filter your annotation.
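
A minimal sketch of that kind of filtering, assuming you have a FASTA containing only the contigs you want to keep (for example the primary set) and a GFF3 annotation whose first column is the contig name. The helper names `contig_ids_from_fasta` and `filter_gff3` are hypothetical, and stopping at a `##FASTA` directive is an assumption about how the GFF3 is laid out, so check it against your own files.

```python
def contig_ids_from_fasta(lines):
    """Collect sequence IDs (first token after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def filter_gff3(lines, keep_contigs):
    """Yield GFF3 lines whose seqid (column 1) is in keep_contigs.

    Comment and directive lines are passed through, except
    '##sequence-region' lines for dropped contigs. Reading stops at
    a '##FASTA' directive (assumption: embedded sequence is not wanted).
    """
    for line in lines:
        if line.startswith("##FASTA"):
            break
        if line.startswith("##sequence-region"):
            parts = line.split()
            if len(parts) > 1 and parts[1] in keep_contigs:
                yield line
        elif line.startswith("#"):
            yield line
        elif line.strip():
            seqid = line.split("\t", 1)[0]
            if seqid in keep_contigs:
                yield line
```

In use, you would build `keep_contigs` by passing an open handle on the retained-contig FASTA to `contig_ids_from_fasta`, then write each line yielded by `filter_gff3` to a new GFF3 file. This only drops features on removed contigs; it does not decide which contigs are redundant.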

That is good to hear. Do you recommend any program that can do this? Would it basically be some sort of alignment program that can detect the duplicated genes?

Usually it is the assembler that gives you the phased genome, but I don't know what the dipSPAdes outputs look like.

Update:

I decided to go forward with the haploid genomes instead, and used the purge_haplotigs pipeline to produce them. The genomes were greatly reduced in both size and number of annotated genes.
