Question

Consensus length longer than the longest assembled contig


5.1 years ago

DanielC ▴ 170

Dear Friends,

I assembled phage DNA sequencing data using SPAdes. The longest contig was 88315 bp. To get a consensus from the assembled contigs (46 in total, of size about 88000), I used Gap4. When I saved the consensus, I saw that its size is 1260000! Could you please comment on how the consensus can be this long, and longer than the longest contig? Is it a good approach to use this consensus for gene prediction and annotation, or should I use the longest contig instead?

Thanks, DK

DNA consensus contigs assembly gene annotation • 1.3k views

updated 5.1 years ago by h.mon 35k • written 5.1 years ago by DanielC ▴ 170

score 0 · Answer 1 · 2019-03-27

For starters, use the SPAdes assembly for gene prediction and annotation, but you may want to filter out short contigs and very low / very high coverage contigs.
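A minimal sketch of that filtering step, assuming the contigs carry the usual SPAdes header format (`NODE_<n>_length_<len>_cov_<cov>`). The function names and the default thresholds (`min_len`, `min_cov`, `max_cov`) are placeholders for illustration, not recommended values; pick cutoffs after looking at the length and coverage distribution of your own assembly.

```python
import re

# Coverage is read from the SPAdes-style header, e.g. "NODE_1_length_88315_cov_35.2".
COV_RE = re.compile(r"_cov_([0-9.]+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_len=1000, min_cov=5.0, max_cov=500.0):
    """Keep contigs that are long enough and whose coverage is within range.

    The thresholds are hypothetical defaults, not tuned values.
    """
    kept = []
    for header, seq in records:
        match = COV_RE.search(header)
        if match is None:
            raise ValueError(f"no coverage field in header: {header!r}")
        coverage = float(match.group(1))
        if len(seq) >= min_len and min_cov <= coverage <= max_cov:
            kept.append((header, seq))
    return kept
```

To run it on a real assembly, something like `with open("contigs.fasta") as fh: kept = filter_contigs(read_fasta(fh))` should work, where the file name is whatever your SPAdes run produced.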

While I don't fully understand your problem (more on this later), using Gap4 to assemble the output of SPAdes makes no sense. Both SPAdes and Gap4 are genome assemblers, and each outputs a "consensus" FASTA representation of its assembly.

SPAdes had at its disposal all the reads to assemble the genome, and could use their information to break contigs at uncertain regions - for example, repeat regions.

Gap4 is also a genome assembler (developed to assemble Sanger sequencing reads), but if you try to assemble the SPAdes assembly with it, you will get either no improvement or, even worse, misassemblies, as Gap4 may join contigs across the repeat regions that SPAdes broke.

What I don't understand from your description: how can the longest contig (88315 bp) be longer than the total assembly (46 contigs of about 88000 bp)?
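One way to settle the numbers is to tally the contig lengths directly from the FASTA files. The sketch below is only an illustration (the function names are made up here); it reports the number of contigs, the total length, and the longest contig, so the SPAdes output and the Gap4 consensus can be compared on the same terms.

```python
def contig_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths):
    """Summarize an assembly: contig count, total bp, longest contig."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest_bp": max(lengths, default=0),
    }
```

For example, `with open("contigs.fasta") as fh: print(summarize(contig_lengths(fh)))`, then the same on the saved consensus file (both file names are assumptions). Since the total is a sum of contig lengths, `longest_bp` can never exceed `total_bp`, which makes it easy to spot which of the reported figures is the total and which is a single contig.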