How to generate a single sequence from assembled contigs?
12 days ago
arriyaz.nstu ▴ 10

Hi,

We have sequenced chloroplast DNA from potato and assembled the reads with the SPAdes assembler. After assembly, we got 3 contigs.

Now we want to perform the following steps:

1. Generate a single DNA sequence from these 3 contigs.

Q. Would you please suggest a pipeline, tool, or method to do this?

2. Then we want to use this single DNA sequence for annotation, which should give us .gff, .gb, etc. files.

Q. Which tool should we use for this purpose?

assembly contigs genomics
A few points to get this thread started:

• If the data needed to join the contigs is not there, no tool will create a single DNA sequence. I don't know your specific situation, but you could also consider targeted approaches (e.g. design primers and do run-off sequencing to close the remaining gaps). If the reads that span the gaps are already in your sequencing pool, you can try gap-closing tools (search for that term).

• A chloroplast, though present in a eukaryotic organism, is not itself eukaryotic! Chloroplasts are of bacterial origin, so their genomes have prokaryotic rather than typical eukaryotic characteristics. There are also likely to be tools specifically for annotating chloroplast genomes (most work by similarity to known genes/proteins).

Last point: what have you tried/considered so far? Did you look for tools or approaches?

Thank you for pointing out the prokaryotic fact.

If we take a reference sequence from NCBI and align our contigs against it, can we generate a single consensus sequence?
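As a rough illustration of what reference-guided joining amounts to, here is a minimal Python sketch. It assumes you have already aligned each contig to the reference with some aligner and know where it starts and on which strand; the `Placement` record, the `scaffold` function, and the toy sequences are all made up for the example. It also assumes each contig covers exactly as many reference bases as its own length (no indels), which real alignments will not guarantee. Note that the result is a scaffold with N-filled gaps, not a gap-free consensus: the missing bases still have to come from data.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Placement:
    """Hypothetical record: where one contig aligns on the reference."""
    name: str
    ref_start: int  # 0-based start of the alignment on the reference
    strand: str     # "+" or "-"


def scaffold(contigs, placements):
    """Join contigs in reference order, padding uncovered stretches with N.

    Simplification: a contig is assumed to span len(contig) reference
    bases, so gap sizes are only as good as that assumption.
    """
    pieces = []
    cursor = 0  # first reference position not yet covered
    for p in sorted(placements, key=lambda pl: pl.ref_start):
        seq = contigs[p.name]
        if p.strand == "-":
            seq = reverse_complement(seq)
        gap = p.ref_start - cursor
        if gap > 0:
            pieces.append("N" * gap)  # unknown bases between contigs
        elif gap < 0:
            seq = seq[-gap:]          # drop the part overlapping the previous contig
        pieces.append(seq)
        cursor = max(cursor, p.ref_start) + len(seq)
    return "".join(pieces)


# Toy data, not real chloroplast sequence.
contigs = {"c1": "ACGTAC", "c2": "GGGTTT", "c3": "AAACC"}
placements = [
    Placement("c2", 10, "+"),
    Placement("c1", 0, "+"),
    Placement("c3", 18, "-"),
]
print(scaffold(contigs, placements))
```

Any N runs in the output mark exactly the regions that gap-closing or extra sequencing would still need to fill.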

Initially, we thought about using Prokka for annotation.

This review might be somewhat helpful.

12 days ago

I think a better way to join the contigs and finish the genome is to do some long-read sequencing (PacBio or Nanopore), build the assembly from the long reads, and correct it with the short reads (i.e. a hybrid assembly). You can also use the SPAdes scaffolds.fasta file instead of contigs.fasta; it may already contain joined contigs. There are a few tools for this purpose, such as CONTIGuator (a genome finishing tool for bacterial genomes; I am not sure whether it works for eukaryotic genomes, but I hope you can find others for eukaryotic genome finishing).
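To check whether the scaffolds file really has fewer, longer sequences than the contigs file, a quick summary is enough. Below is a small Python sketch with a plain FASTA reader; `read_fasta` and `summarize` are names made up for this example, and the file path in the closing comment is an assumption about where your SPAdes output lives.

```python
def read_fasta(lines):
    """Parse FASTA text (any iterable of lines) into {name: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep only the ID before the first space
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def summarize(records):
    """Return sequence count, total length, longest sequence, and N count."""
    lengths = [len(seq) for seq in records.values()]
    return {
        "n_sequences": len(lengths),
        "total_bp": sum(lengths),
        "longest_bp": max(lengths) if lengths else 0,
        "n_bases": sum(seq.upper().count("N") for seq in records.values()),
    }


# Toy input standing in for a real file.
example = [">scaffold_1 toy", "ACGTAC", "NNNNGG", ">scaffold_2", "TTTT"]
print(summarize(read_fasta(example)))

# With real output (path is an assumption), something like:
#   with open("spades_out/scaffolds.fasta") as handle:
#       print(summarize(read_fasta(handle)))
```

If the scaffolds file shows fewer sequences but a non-zero N count, SPAdes has joined contigs across gaps it could not fill.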

For gene prediction in eukaryotic genomes, there are a number of tools available, such as GENSCAN, AUGUSTUS, GeneMark, etc. (a quick search will give you plenty of hits).

I used AUGUSTUS to annotate a fungal genome and it was quite accurate, but I am not sure how well it works on a chloroplast genome. You can give it a try and post your experience for us :)