What is "Total Length" in QUAST?
8.2 years ago
' ▴ 300

I have assembled a genome from an interleaved FASTQ file using several assemblers (Velvet, ABySS, Minia, SPAdes, etc.) and have a "contigs.fasta" file from each of them. In total I have run over 50 assemblies with different parameters and options, and I have processed each "contigs.fasta" file with QUAST. I know that the genome I am trying to assemble is 200,000 bp long. However, the "Total length" and "Total length (>= 0 bp)" values that QUAST reports for 95% of my assemblies (i.e. contigs.fasta files from different assemblers) are close to 390,000. What is the problem? Does "Total length" in QUAST refer to something different? Why can't I get a total length near the expected 200,000? I have tried many combinations of k-mer size, coverage cutoff, and expected coverage.

quast • 3.7k views
8.2 years ago
thackl ★ 3.0k

Total length in QUAST does not refer to "something else"; it simply gives you the total number of bases present in your assembly (the sum of the lengths of all sequences).
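To make that definition concrete, here is a minimal sketch of the same calculation in plain Python. It is not QUAST's own code; the function name `total_length` and the `min_len` parameter are made up for illustration. The `min_len` filter mirrors the difference between "Total length (>= 0 bp)", which counts every contig, and the plain "Total length" row, which only counts contigs above QUAST's minimum contig length (500 bp by default, as far as I know).

```python
def total_length(fasta_lines, min_len=0):
    """Sum the lengths of all sequences that are at least min_len bases long.

    fasta_lines: any iterable of lines in FASTA format (e.g. an open file).
    min_len:     hypothetical threshold, analogous to a minimum contig length.
    """
    lengths = []
    current = None  # length of the sequence being read, None before first header
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous sequence, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return sum(n for n in lengths if n >= min_len)


# Toy input: contig_1 has 14 bases (spread over two lines), contig_2 has 4.
example = """>contig_1
ACGTACGTAC
GGCC
>contig_2
TTTT
""".splitlines()

print(total_length(example))             # all contigs
print(total_length(example, min_len=5))  # only contigs of at least 5 bases
```

For a real assembly you would pass an open file instead of the toy list, e.g. `with open("contigs.fasta") as fh: print(total_length(fh))`.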

What kind of sample are you trying to assemble, and how do you know that the total assembly size should be 200 kbp? Is it simulated data?

If not, my guess would be that your sample either also contained "something else", e.g. minor contamination, or that you have a high level of variation (and probably excessive coverage) in your read data.

More information about your actual sample would help a lot.


Yes! It is simulated data (probably generated with MATLAB, but I am not sure about that), and all I know about it is that the original length of the genome is 200,000 and the coverage is 50.


Okay, I can see why you are in doubt about the 200 kbp :). If the data was simulated, some form of errors/heterogeneity (or maybe repeats) must have been introduced into the data; otherwise the assemblers would not have such a hard time with such a small data set. If you cannot find out exactly how the set was generated, you could run a k-mer analysis to a) estimate the expected genome size and b) determine the level of noise, something along these lines: http://koke.asrc.kanazawa-u.ac.jp/HOWTO/kmer-genomesize.html
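As a rough illustration of the idea behind such a k-mer analysis (a simplified sketch, not the exact procedure from the linked how-to): count every k-mer in the reads, ignore low-depth k-mers that are most likely sequencing errors, take the most common depth as the k-mer coverage, and divide the total number of remaining k-mer occurrences by that depth. The function names and the `min_depth` cutoff below are assumptions made for this example; for real read sets you would normally use a dedicated k-mer counter such as Jellyfish rather than pure Python.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Estimate genome size as (total k-mer occurrences) / (peak k-mer depth).

    kmer_counts: mapping of k-mer -> number of occurrences in the reads.
    min_depth:   hypothetical cutoff; k-mers seen fewer times are treated
                 as error k-mers and ignored.
    Returns (estimated_size, peak_depth).
    """
    # Histogram: depth -> number of distinct k-mers seen at that depth.
    histogram = Counter(kmer_counts.values())
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Peak = depth shared by the most distinct k-mers (ties go to higher depth).
    peak_depth = max(solid, key=lambda d: (solid[d], d))
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth, peak_depth


# Toy data: a 20 bp "genome" covered by five identical error-free reads.
genome = "ACGTTGCAAGCTTAGGCTAA"
reads = [genome] * 5
size, depth = estimate_genome_size(count_kmers(reads, k=5))
print(size, depth)
```

Note that the estimate is in units of k-mers (genome length minus k plus 1 for a single linear sequence), which is negligible for a 200 kbp genome. A k-mer that occurs at twice the peak depth, e.g. from a repeat, contributes twice to the estimate, which is what you want for a genome size.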
