Question: Too many contigs after genome assembly with SPAdes
0

Hi

I have sequenced a bacterial genome on an Illumina HiSeq (paired-end, 150 bp); my library contains 600k reads. After assembling with SPAdes (-k 21,33,55,77,99,111,127) the result is poor: I got ~3400 contigs. I have no reference genome for this bacterium yet; we only know its family. The GC content is ~70%.

What do you suggest for decreasing the number of contigs? Are there any options better than SPAdes for bacterial genome assembly?

Thanks

15 months ago ahmad mousavi • 430
3

Not a bioinformatics solution, but your assembly could greatly improve by adding some long read sequencing data from Oxford Nanopore or PacBio, of which the former (MinION) can be reasonably cheap to obtain.

15 months ago
WouterDeCoster
39k
1

A GC content that high probably also means it's repetitive. It's likely to be a sequencing nightmare. Your only options are to sequence deeper and to use other technologies, as Wouter said.

15 months ago
Joe
12k
0

We all agree with Wouter :), but wouldn't a high GC content indicate a less repetitive genome? TEs (transposons?) are usually rather AT-rich, so they would lower the overall GC, no?

15 months ago
lieven.sterck
5.1k
1

I was thinking more of consecutive repeats (e.g. GCGCGCGCGCG), which would fail to be picked up properly by the sequencer, rather than IS elements etc.

Nevertheless, there are other issues with high GC - the increased strand separation energy might be an issue for library preps and the actual sequencing reaction.
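
To get a rough feel for how common such short consecutive repeats are in a set of contigs or reads, something like the following could be used. It is only a sketch: the function names, the maximum unit size of 3 bp and the minimum of 4 copies are illustrative assumptions, and dedicated repeat finders do this far more thoroughly.

```python
import re

def gc_fraction(seq):
    """Fraction of G/C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def find_simple_repeats(seq, max_unit=3, min_copies=4):
    """Yield (start, unit, copies) for tandem runs of a short repeat unit.

    The lazy quantifier makes the regex try the shortest unit first,
    so AAAAAA is reported as unit 'A' rather than unit 'AA'.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        yield match.start(), unit, len(match.group(0)) // len(unit)

# Toy sequence (made up for illustration): a CG dinucleotide run inside other bases
toy = "ATTA" + "CG" * 5 + "TT"
print(f"GC fraction: {gc_fraction(toy):.2f}")
print(list(find_simple_repeats(toy)))
```

Running this over the contig ends of a fragmented assembly can hint at whether contigs tend to break at such runs, although it says nothing about longer repeats such as IS elements.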

15 months ago
Joe
12k
0

ah, ok, yep agreed in that case.

and totally on the problems (regardless of the 'cause') when extracting/lib-prep/sequencing in high GC situations

15 months ago
lieven.sterck
5.1k
0

Would it be possible to provide some more info on your project? E.g. the estimated genome size (what is the expected coverage)? Is it some 'weird/exotic' bacterium?

15 months ago
lieven.sterck
5.1k
0

Sorry, I have no idea. We estimate the genome size at ~7 Mb, but that is just an estimate. We aimed for 100x coverage.

15 months ago
ahmad mousavi
• 430
0

So that will give you roughly 25x, which is on the low side but doable I think.
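
The arithmetic behind that estimate can be sketched as below. This is a minimal sketch, not a definitive calculation: the function name is made up for illustration, and it assumes that "600k reads" means 600k read pairs of 150 bp each and that the ~7 Mb genome size estimate holds. If 600k is instead the total number of individual reads, the figure halves.

```python
def estimated_coverage(n_reads, read_length, genome_size, paired=True):
    """Expected average depth = total sequenced bases / genome size."""
    reads_per_fragment = 2 if paired else 1
    total_bases = n_reads * reads_per_fragment * read_length
    return total_bases / genome_size

# Assumption: 600k read *pairs*, 150 bp per read, ~7 Mb genome
pairs = estimated_coverage(600_000, 150, 7_000_000, paired=True)
# Alternative assumption: 600k individual reads in total
singles = estimated_coverage(600_000, 150, 7_000_000, paired=False)

print(f"600k pairs: {pairs:.1f}x")
print(f"600k reads: {singles:.1f}x")
```

Under the same assumptions, reaching the intended 100x on a ~7 Mb genome would take roughly 2.3 million 150 bp read pairs.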

15 months ago
lieven.sterck
5.1k
0

ahmad mousavi: it seems you have used several k-mer sizes. Is the contig number the same across all of them?

15 months ago
cpad0112
11k
0

SPAdes lets you define several k-mers and it automatically selects one based on the data structure, so I have a constant number of contigs.

15 months ago
ahmad mousavi
• 430
0

ahmad mousavi: did you have a look at the fastg files and the number of contigs for each k-mer? You can also check how good your assembly is with Bandage: https://github.com/rrwick/Bandage

15 months ago
cpad0112
11k
0

No, I don't understand how the fastq file is related.

With a smaller k-mer I got more contigs.

15 months ago
ahmad mousavi
• 430
0

Not fastq; it is fastg (I updated the post). SPAdes outputs contigs for each k-mer. With a higher k-mer, the contig number goes down, but the relevance of such an assembly is in question. For that reason, you may need to use software like Bandage or QUAST/Icarus to identify the relevant assembly.
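
For a quick first look before running those tools, the per-k-mer contig files can be compared on contig count, total length and N50 with a few lines of Python. This is a rough sketch only: the function names are made up, and the glob pattern for locating FASTA files inside the K* folders is an assumption that may need adjusting to the actual SPAdes output layout. It is no substitute for QUAST or for inspecting the graph in Bandage.

```python
from pathlib import Path

def read_contig_lengths(fasta_path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths

def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

def summarize(fasta_path):
    """Contig count, total size, longest contig and N50 for one FASTA file."""
    lengths = read_contig_lengths(fasta_path)
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }

def compare_assemblies(spades_dir, pattern="K*/*contigs.fasta"):
    """Summarize every FASTA matching `pattern` (an assumed layout) under spades_dir."""
    base = Path(spades_dir)
    return {
        str(path.relative_to(base)): summarize(path)
        for path in sorted(base.glob(pattern))
    }

# Example call ("spades_out" is a placeholder for your own output directory):
#   for name, stats in compare_assemblies("spades_out").items():
#       print(name, stats)
```

An assembly with fewer contigs but a much smaller total length or N50 is not necessarily the better one, which is why the count alone should not drive the choice.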

15 months ago
cpad0112
11k
1

SPAdes automatically chooses optimal k-mers. The contigs.fasta in the output that is not inside one of the K*** folders should be the 'optimal' assembly (if I remember correctly).

Optimal doesn’t necessarily mean fewest contigs though.

15 months ago
Joe
12k
