Too many contigs after genome assembly with SPAdes
5.4 years ago
ahmad mousavi ▴ 800

Hi

I have sequenced a bacterial genome on an Illumina HiSeq (paired-end, 150 bp); my library contains 600k reads. After assembling with SPAdes (-k 21,33,55,77,99,111,127) the result is very poor: I got ~3400 contigs. I have no reference genome for my bacterium yet; we only know its family. The GC content is ~70%.

What do you suggest for decreasing the number of contigs? Are there any options better than SPAdes for bacterial genome assembly?

Thanks

assembly genome sequence • 4.7k views
ADD COMMENT
3

Not a bioinformatics solution, but your assembly could greatly improve by adding some long read sequencing data from Oxford Nanopore or PacBio, of which the former (MinION) can be reasonably cheap to obtain.

ADD REPLY
1

A GC content that high probably also means it's repetitive. It's likely to be a sequencing nightmare. Your only options are to sequence deeper and to use other technologies, as Wouter said.

ADD REPLY
0

We all agree with Wouter :), but would a high GC content not indicate a less repetitive genome? TEs (transposons?) are usually rather AT-rich, so they would lower the overall GC, no?

ADD REPLY
1

I was thinking more of consecutive repeats (e.g. GCGCGCGCGCG) than of IS elements etc.; such repeats would fail to be picked up properly by the sequencer.

Nevertheless, there are other issues with high GC - the increased strand separation energy might be an issue for library preps and the actual sequencing reaction.
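
To make the "consecutive repeats" idea concrete, here is a minimal sketch that scans a sequence (a read or an assembled contig) for short tandem repeats such as GCGCGC and reports its GC fraction. The function names, the unit length of 2 and the threshold of 5 copies are illustrative assumptions, not values from this thread, and a hit only flags a low-complexity stretch; it does not show that sequencing failed there.

```python
import re


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_simple_repeats(seq: str, unit_len: int = 2, min_copies: int = 5):
    """Return (start, unit, copies) for each run of a repeated short unit.

    unit_len and min_copies are arbitrary illustrative defaults.
    """
    # One unit of unit_len bases, followed by at least min_copies - 1
    # further identical copies (back-reference \1).
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        copies = len(match.group(0)) // unit_len
        hits.append((match.start(), match.group(1), copies))
    return hits


# Toy sequence: six copies of "GC" embedded in flanking bases.
example = "ATTA" + "GC" * 6 + "TTAC"
print(find_simple_repeats(example))
print(f"GC fraction: {gc_fraction(example):.2f}")
```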

ADD REPLY
0

ah, ok, yep agreed in that case.

and totally on the problems (regardless of the 'cause') when extracting/lib-prep/sequencing in high GC situations

ADD REPLY
0

Would it be possible to provide some more info on your project? E.g. the estimated genome size (what is the expected coverage)? Is it some 'weird/exotic' bacterium?

ADD REPLY
0

Sorry, I have no idea. We estimate the genome size at ~7 Mb, but that is just an estimate. We were aiming for 100x coverage.

ADD REPLY
0

So that will give you roughly 25x, which is on the low side but doable, I think.
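
The coverage estimate can be checked with a few lines of arithmetic. This sketch uses the figures quoted in the thread (600k reads, 150 bp, ~7 Mb); the function names are made up for illustration. The ~25x figure corresponds to reading "600k" as read pairs; if it means individual reads, the depth is closer to 13x.

```python
import math


def estimated_coverage(n_reads: int, read_length: int, genome_size: int,
                       paired: bool = False) -> float:
    """Expected depth = total sequenced bases / genome size.

    If paired is True, n_reads is interpreted as the number of read pairs.
    """
    total_reads = n_reads * 2 if paired else n_reads
    return total_reads * read_length / genome_size


def reads_needed(target_depth: float, read_length: int, genome_size: int) -> int:
    """Number of individual reads required to reach target_depth."""
    return math.ceil(target_depth * genome_size / read_length)


# Figures from the thread: 600k reads, 150 bp, genome estimated at ~7 Mb.
as_single_reads = estimated_coverage(600_000, 150, 7_000_000)
as_read_pairs = estimated_coverage(600_000, 150, 7_000_000, paired=True)
print(f"600k individual reads: {as_single_reads:.1f}x")  # 12.9x
print(f"600k read pairs:       {as_read_pairs:.1f}x")    # 25.7x
print(f"Reads needed for 100x: {reads_needed(100, 150, 7_000_000):,}")
```

Either way, the data fall well short of the intended 100x, which by this estimate would need roughly 4.7 million 150 bp reads.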

ADD REPLY
0

It seems you have used several k-mer sizes. Is the number of contigs the same across all k-mer sizes? ahmad mousavi

ADD REPLY
0

SPAdes lets you specify several k-mer sizes and automatically selects one based on the data, so I get a single, constant number of contigs.

ADD REPLY
0

Did you have a look at the fastg files and the number of contigs for each k-mer size? You can also check how good your assembly is with Bandage: https://github.com/rrwick/Bandage. ahmad mousavi

ADD REPLY
0

No, I don't understand how the fastq file is relevant.

With smaller k-mer sizes I got more contigs.

ADD REPLY
0

Not fastq, fastg (I updated the post). SPAdes outputs contigs for each k-mer size. With a higher k, the number of contigs goes down, but whether such an assembly is actually better is open to question. For that reason, you may need to use software like Bandage or QUAST/Icarus to identify the relevant assembly.
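
As a quick first look before running Bandage or QUAST, the contig count and N50 of each FASTA file can be compared with a short script. This is a rough sketch, not a replacement for those tools: the `spades_out` directory name and the `K*/final_contigs.fasta` glob are assumptions about the output layout and should be adjusted to whatever files your run actually produced.

```python
from pathlib import Path


def contig_lengths(lines):
    """Return the length of every record in an iterable of FASTA lines."""
    lengths = []
    current = None  # None until the first header has been seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(fasta_path):
    """Contig count, total size, N50 and longest contig for one FASTA file."""
    with open(fasta_path) as handle:
        lengths = contig_lengths(handle)
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "n50": n50(lengths),
        "longest": max(lengths, default=0),
    }


# Hypothetical layout: per-k folders plus a top-level contigs.fasta.
out_dir = Path("spades_out")  # assumed output directory name
if out_dir.is_dir():
    candidates = sorted(out_dir.glob("K*/final_contigs.fasta"))
    candidates.append(out_dir / "contigs.fasta")
    for fasta in candidates:
        if fasta.is_file():
            print(fasta, summarize(fasta))
```

Fewer contigs and a higher N50 are only a rough indication; they say nothing about misassemblies, which is why inspecting the graph is still worthwhile.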

ADD REPLY
1

SPAdes automatically chooses optimal k-mer sizes. The contigs.fasta in the top-level output directory (i.e. not inside one of the K*** folders) should be the 'optimal' assembly (if I remember correctly).

Optimal doesn’t necessarily mean fewest contigs though.

ADD REPLY
