How To Assess The Quality Of An Assembly? (Is There No Magic Formula?)
11.2 years ago
diltsjeri ▴ 470

Hi,

I'm having a difficult time finding a consensus method for assessing the quality of an assembly.

Are there "best" methods to use based on the organism type, technology, and sequence quality? I know N50 is a value I should use to assess assembly quality, but is this the only metric?

Thanks.

assembly quality next-gen

'Quality' can be a very subjective thing. The Assemblathons, as well as contests like GAGE and dnGASP, seem to indicate that assemblies can be high quality in a few areas of interest, but it is hard to make an assembly that excels in all aspects of quality. If you are only interested in one aspect of assembly quality, e.g. finding genes in a genome assembly, then it may not matter whether scaffolds are really long (e.g. > 10 Mbp), only that scaffolds mostly contain whole genes.

N50 can tell you something about the average length of scaffolds and/or contigs. It is meaningless to compare the N50 values of any two assemblies unless they are the same size. It is also possible to artificially raise N50 by deliberately excluding short contigs/scaffolds and/or increasing the padding of Ns within scaffolds. One of the figures we include in the Assemblathon 2 paper suggests that N50 can be a semi-useful predictor of assembly quality. Some of the most highly-ranked assemblies had high N50 values...but not all of them did, and some which had high N50 values did not rank as highly.
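The caveats above are easy to see with a toy calculation. Below is a minimal Python sketch (the contig lengths are made up for illustration) showing how N50 is computed, and how simply discarding short contigs inflates it:

```python
def n50(lengths):
    """N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Hypothetical assembly: many short contigs plus a few long ones.
contigs = [100] * 50 + [900, 800, 700]
print(n50(contigs))                           # N50 of the full assembly
print(n50([c for c in contigs if c >= 500]))  # same assembly, short contigs dropped
```

The second call reports a much larger N50, even though no new sequence was assembled, which is exactly the kind of gaming described above.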

To give you a succinct, but somewhat disappointing, answer to your question, I would say:

There is no magic formula.


Lately I have been following the methods listed here:

  • BUSCO/CEGMA for checking the core genes
  • Map RNA-Seq reads and unigenes derived from a transcriptome assembly
  • Map Proteins from closely related species
  • Map constituent reads that were used to form the assembly and check their depth and mappability
  • Distribution of NGx (10,50,70,90 etc)
  • Distribution of contig lengths
  • Check presence of duplicate contigs and other contaminants (easiest way is to submit the genome to NCBI)
  • Bases constituting the assembly.
11.2 years ago

N50 is most definitely not the only thing to look at. How you should assess an assembly basically depends on what you want to do with it.

You could check out this paper, recently submitted to the arXiv:

http://arxiv.org/pdf/1301.5406

"Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species"

Keith R. Bradnam, Joseph N. Fass, Anton Alexandrov, Paul Baranay, Michael Bechner, İnanç Birol, Sébastien Boisvert, Jarrod A. Chapman, Guillaume Chapuis, Rayan Chikhi, et al. (55 additional authors not shown)

and also the previous Assemblathon paper. Also check out papers by Steven Salzberg and Mihai Pop on this subject, plus the references within all of the above. There are many others that I can't think of off the top of my head; I'm sure others will suggest some.

best Zam


As you mentioned GAGE, I am actually concerned with this evaluation. For small genomes, the authors intentionally mix 50% short-insert reads and 50% long-insert reads by thinning the source data. When assembling, they largely treat the two types of reads the same, apart from orientation and insert size. If the assembler does not account for the exceptionally high chimeric rate of long-insert reads, its performance will be very bad, as is shown in the table. In practice, however, short-insert reads are cheaper and of much better quality than long-insert reads. A better approach would be to sequence more short-insert reads, assemble them first, and then use the long-insert reads only to build scaffolds. As such, GAGE may be evaluating a scenario that does not represent best practice.

Assemblathon 1/2, on the other hand, is truly amazing; I like it a lot.

11.2 years ago

I like the paper answer above, but if you're just looking for some additional measuring sticks besides N50, you could also think about:

  • number of contigs
  • Length of longest/shortest contigs
  • Average length of contigs
  • Total length of all contigs
  • Length of 10/100/1000/10000 longest contigs
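For what it's worth, all of these length-based measuring sticks are only a few lines of Python (the lengths below are made up for illustration):

```python
# Hypothetical contig lengths, for illustration only.
lengths = [12000, 9000, 7000, 4000, 2500, 1200, 800, 300]

stats = {
    "number of contigs": len(lengths),
    "longest contig": max(lengths),
    "shortest contig": min(lengths),
    "average length": sum(lengths) / len(lengths),
    "total length": sum(lengths),
    "length of 3 longest": sum(sorted(lengths, reverse=True)[:3]),
}
for name, value in stats.items():
    print(f"{name}: {value}")
```

None of these is meaningful in isolation; they are most useful when comparing assemblies of the same data side by side.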
11.2 years ago

I would add: the number of annotations you can grab from your contigs or ORFs you can predict as "information content" estimates.

11.2 years ago
Rayan Chikhi ★ 1.5k

QUAST and FRCurve are two recent tools that should definitely be considered when evaluating assemblies.

QUAST computes a comprehensive set of classical metrics. It can reproduce the GAGE benchmark.

FRCurve computes newer metrics related to correctness.

11.2 years ago
earonesty ▴ 250

I use a dup-mer-21 calculation to compare assemblies, based on this conversation:

http://www.homolog.us/blogs/2012/06/26/what-is-wrong-with-n50-how-can-we-make-it-better-part-ii/

Source code:

http://ea-utils.googlecode.com/svn/trunk/clipper/contig-stats

This lets you know if there is excessive chimerism ... a common error.
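For intuition, here is a rough Python sketch of the dup-mer idea (this is not the actual contig-stats implementation; k=21 and the toy sequences are just for illustration): count what fraction of distinct 21-mers occur more than once across the assembly.

```python
from collections import Counter

def dup_mer_fraction(contigs, k=21):
    """Fraction of distinct k-mers that occur more than once in the
    assembly. A value far above what a reference assembly shows can
    flag over-aggressive joins or chimeric contigs."""
    counts = Counter()
    for seq in contigs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    distinct = len(counts)
    duplicated = sum(1 for c in counts.values() if c > 1)
    return duplicated / distinct if distinct else 0.0

# Toy example: a highly repetitive contig duplicates nearly every 21-mer.
print(dup_mer_fraction(["ACGTACGT" * 6]))
```

Real tools stream k-mers and canonicalize reverse complements; this sketch skips both for brevity.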


The article correctly points out that evaluating N50 alone is frequently misleading, but the last paragraph is questionable. When there is ambiguity about whether A should be connected to B or to C, the right decision is not to perform any join at all. If we force a join, we get a longer N50 at the cost of a high error probability at the junction. An aggressive assembler will achieve a longer N50 but produce more misassemblies in that case.


Which is what the dup-mer-21 will detect... overaggressive assemblers. You should see the same kmer represented in multiple locations when the assembler is more aggressively calling connections in its graph than it should.

It's easy to produce a single contig. It's hard to get it right.


@earonesty: Could I please ask how to interpret dup-mer-cnt and dup-pct-21 when comparing assemblies? Should they be high or low?


They should be "comparable to expected". In other words, you should benchmark against an existing quality assembly. Some k-mer duplication is, of course, expected. What the "correct" number is varies from organism to organism. As a rule, I would expect longer genomes to have more.

5.9 years ago

You can use QUAST (QUality ASsessment Tool), which evaluates genome assemblies by computing various metrics, including:

  1. N50: length for which the collection of all contigs of that length or longer covers at least 50% of assembly length
  2. L50: The minimum number X such that X longest contigs cover at least 50% of the assembly
  3. NG50: like N50, but relative to the reference genome length rather than the total assembly length
  4. NA50 and NGA50: like N50 and NG50, but computed over aligned blocks rather than whole contigs
  5. Number of N's per 100 kbp and GC %
  6. Misassemblies: misassembled and unaligned contigs or contig bases
  7. genes and operons covered

A clear report will be generated, which helps you assess your genome assembly.
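As a sanity check on the definitions above, L50 and NG50 can be sketched in a few lines of Python (the contig lengths and reference size here are toy values, not from any real assembly):

```python
def l50(lengths):
    """L50: the minimum number of longest contigs whose combined
    length covers at least 50% of the total assembly length."""
    total = sum(lengths)
    running = 0
    for i, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 2 >= total:
            return i
    return len(lengths)

def ng50(lengths, genome_size):
    """NG50: like N50, but the 50% threshold is taken against the
    reference genome size instead of the assembly size."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= genome_size:
            return length
    return 0

contigs = [8000, 6000, 4000, 2000]
print(l50(contigs))
print(ng50(contigs, genome_size=30000))
```

Note that NG50 drops below N50 when the assembly is shorter than the reference, which is why QUAST reports both.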

Good Luck

11.2 years ago
SES 8.6k

Regardless of your biological question, I think looking at length statistics alone can be very misleading and uninformative because 1) the percentage of Ns in scaffolds may be very high and 2) there is always some level of contamination (from organelles, but also other species, possibly) in draft shotgun assemblies, in my experience. How you define "quality" is important to your assessment of the assembly, but the common goal is to try and represent the actual genomic sequence of an organism, so some things to check are:

  • Sequence content of contigs/scaffolds.
  • Levels of contamination (aside from sequence contamination, there are also assembly artifacts to be aware of, as others mentioned).
  • Gene content/accuracy.

The last two points can be assessed by looking at the reference genome or gene models, respectively, of your species or a closely related species. There are many recent papers on comparing genome assemblies so I won't list any paper or tools (too easy to google), but I will mention a method for inferring the gene content. CEGMA is a set of conserved genes in eukaryotes and may be biologically informative, especially if your organism is a non-model species and you have no transcriptome or even closely related species for comparison.

10.5 years ago
Prakki Rama ★ 2.7k

You can also check this thread: Assessing The Quality Of De Novo Assembled Data

5.8 years ago
alslonik ▴ 310

We also use BUSCO (https://busco.ezlab.org/), along with QUAST (already mentioned) and statistics such as scaffold sizes, percentage of gaps, N50, etc.

2.2 years ago
WANG ▴ 10

I have a question: how can I assess the quality of each individual contig? I am testing an assembly method on a simulated sample. Are there any methods that directly compare the assembled contigs to the ground truth and provide measurement scores at the sequence level?
