Judging best assembly
6.0 years ago
deepti1rao ▴ 50

I used Velvet to assemble genomic data from a plant and plotted a coverage histogram and a length-weighted coverage histogram, as suggested in the manual. The reads are 150 bp paired-end Illumina. Several k-mer values were tried and k = 115 was picked. What would be a good coverage cut-off to use, considering that I have a small peak at 7? Please find 3 attachments. The expected coverage calculated by Velvet is 23. With the default coverage cut-off (half of the expected coverage), I get the following assembly:

Nodes = 412,915
N50 = 21,497
Max length = 185,793
Total = 362 Mb
No. of contigs = 48,614

I wanted to use a lower cut-off to include the k-mers in the smaller peak, so I tried a coverage cut-off of 3 and got the following:

Nodes = 513,117
N50 = 20,630
Max length = 185,793
Total = 384 Mb
No. of contigs = 56,475
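For reference, the two runs were produced with commands of roughly the following form (a sketch only; the read file names, insert length and output directory are placeholders, not taken from the thread):

    # hash the reads at k = 115 (file names are placeholders)
    velveth asm_k115 115 -fastq -shortPaired -separate reads_R1.fastq reads_R2.fastq

    # run 1: default cut-off ("auto" sets it to half of the expected k-mer coverage)
    velvetg asm_k115 -exp_cov auto -cov_cutoff auto -ins_length 350

    # run 2: lower cut-off to keep the k-mers in the small peak at ~7
    velvetg asm_k115 -exp_cov auto -cov_cutoff 3 -ins_length 350

Note that Velvet reports k-mer coverage rather than read coverage: Ck = C * (L - k + 1) / L, so Ck = 23 at k = 115 and L = 150 corresponds to roughly 96x read coverage.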

The expected genome size is 370-390 Mb. Since the genome is expected to contain about 50-60% repeats, I do not expect the reads to cover it entirely, which is also evident from the SAM/BAM files obtained by aligning the reads to a closely related genome: about 10 Mb of the reference is not covered.

Which of the two assemblies looks better?

[Figure: k-mer coverage histogram]

[Figure: length-weighted k-mer coverage histogram]

[Figure: BAM coverage across the reference genome of a closely related variety of the same species]

Tags: velvet • assembly • k-mer coverage cut-off

I would definitely run more than one assembler, preferably with multiple k-mer values, and then compare the assemblies using QUAST.
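For example, a single QUAST run can compare the two Velvet assemblies (and any others) side by side; the contig paths and settings below are placeholders:

    # compare assemblies from the two cut-offs (and any other assemblers/k-mers)
    quast.py velvet_cutoff_auto/contigs.fa velvet_cutoff_3/contigs.fa \
        --est-ref-size 380000000 \
        -t 8 -o quast_comparison

    # optionally add "-r related_genome.fasta" to also get NG50 and approximate
    # misassembly counts against the closely related reference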

You can also look at KAT (https://kat.readthedocs.io/en/latest/walkthrough.html#genome-assembly-analysis-using-k-mer-spectra) to compare the k-mer spectrum of the reads with the k-mer spectrum of the assembly. I am not sure whether your plant has high ploidy or not. It would also be worth assessing BUSCO scores for the different assemblies and, if RNA-seq data are available, their mapping rates.
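A rough sketch of both checks (read/assembly file names, thread counts and the BUSCO lineage below are assumptions, not from the thread):

    # compare the read k-mer spectrum against the assembly with KAT
    kat comp -t 8 -o reads_vs_asm 'reads_R1.fastq reads_R2.fastq' velvet_cutoff_auto/contigs.fa

    # re-plot the copy-number spectrum from the matrix KAT writes, if needed
    kat plot spectra-cn -o reads_vs_asm.spectra-cn.png reads_vs_asm-main.mx

    # gene-space completeness with BUSCO (embryophyta lineage is a guess for a plant genome)
    busco -i velvet_cutoff_auto/contigs.fa -l embryophyta_odb10 -m genome -c 8 -o busco_cutoff_auto

Running the same commands on each candidate assembly makes missing or duplicated k-mer content and BUSCO completeness directly comparable.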

This is the best option: multiple assemblers and multiple k-mer values. Decreasing the coverage cut-off to gain contiguity doesn't help, as it increases the chance of erroneous overlaps.

Hello deepti1rao,

The link you’ve added points to the page that contains the image, not to the image itself. On the ibb.co site, right-click (or Ctrl-click on a Mac) on the image and select Copy Image Address (or the equivalent option). Use that link to embed the image instead of the one you used.

Thanks, will do from next time onwards.

6.0 years ago
Rohit ★ 1.5k

Contig-ordering tools can help with orientation using a reference. However, contiguity metrics such as N50, NG50, L50 and LG50 alone do not mean that an assembly is the best one. The quality of the assembly matters too, and that can be compared using CEGMA or BUSCO metrics. In the end, what matters is the kind of downstream analysis planned for your project. Also, the 10 Mb that appears to be missing might be due to mapping biases, and do not forget that de Bruijn graph based assemblers are prone to misjoins.
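As a hedged example of reference-guided ordering, a scaffolder such as RagTag can place and orient the contigs against the related genome (paths are placeholders, and ordering against a different species will inherit that species' large-scale structure):

    # order and orient contigs against the closely related reference
    ragtag.py scaffold related_genome.fasta velvet_cutoff_auto/contigs.fa -o ragtag_out -t 8

    # the ordered scaffolds and an AGP file describing the placements are written to ragtag_out/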

6.0 years ago
5heikki 11k

If there's a closely related genome available, why aren't you doing a reference-guided assembly? It might also be a good idea to try different assemblers. As for your two assemblies, they're essentially the same, except that the bigger one includes more short contigs that may or may not be "good". I would go with the first one, although I doubt choosing either one will make any difference whatsoever to anything downstream.

We're not doing a reference-based assembly with the reads, in order to avoid introducing reference bias at the read level itself.

Any clues as to how I can go about putting the contigs together with the help of a reference?
