Tool Recommendations For Human Genome Assembly
11.1 years ago
Chris Cole ▴ 800

Hi all,

I'm a bioinformatician with lots of NGS experience, but mostly with RNAseq and exomes. I'm looking at an upcoming project which involves the assembly of one or more human genomes. I have no experience in assembly, so what would people suggest I try?

I realise de novo assembly is non-trivial with such a large genome, but what about a reference-guided assembly? Can the typical tools help with that, e.g. abyss, velvet or cortex?

Also, what kind of hardware would I need? The biggest box I have access to has 24 cores and 128 GB of RAM.

Any suggestions gratefully received.

Cheers,
Chris

assembly human-genome • 6.6k views
11.1 years ago
Rayan Chikhi ★ 1.5k

Minia will give you an accurate contig assembly using very reasonable amounts of time (about 1 day) and memory (~6 GB). It is not a complete assembly pipeline (paired-end and mate-pair information is not used), but you can run a scaffolder (such as SSPACE) after it.

11.1 years ago

Looking at the options you mention:

  • Velvet won't handle a whole human genome.

  • Cortex doesn't do whole-genome consensus assembly, so if that's what you want then Cortex is the wrong tool. If you want to find variants in this genome, or between it and other samples or the reference, then Cortex can do the job, and I think you're better off using Cortex than doing a WG assembly and then mapping to it. The exception is very large (>10 kb) heterozygous structural variants: Cortex has low power there, whereas an assembly might find them, or at least hints of them (with a potentially high FDR, so you'd need to do a decent job of validating/studying/interpreting the results).

  • Abyss, SGA or AllPaths-LG would be my tools of choice for standard WG human assembly. Or I might try fermi, but I'm embarrassed to admit I don't really know what kind of results fermi would give or how they would compare. Heng can comment more usefully on that.

The key is not to go out and look for the best assembler of all - nothing out there is the best at everything. Work out what you want to achieve with the assembly, and then go and assess the possible tools.


Fermi is not designed as a complete assembly package. It uses short-insert paired-end reads when assembling contigs, but it does not do scaffolding or use long-insert mate-pair reads. For contig assembly on the NA12878 data set, it is comparable to SGA and abyss in terms of N50 and misassembly rate (see preprint). With 128GB RAM, fermi should work with 45X coverage, at most 50X I would guess. SGA is more memory efficient. Another good assembler to try is SOAPdenovo2. When it is set up to use the sparse de Bruijn graph (a graph not keeping all k-mers), it can also assemble deep coverage in 128GB RAM.
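
For readers unfamiliar with the N50 figure used in comparisons like this one: it is the contig length L such that contigs of length L or longer together contain at least half of the assembled bases. A minimal Python sketch follows; the function name and the contig lengths are made up for illustration and do not come from any real assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths (bp), not taken from any real assembly.
example = [100, 80, 50, 30, 20, 10, 10]
print(n50(example))  # 80: the two longest contigs already cover 180 of 300 bp
```

Note that N50 says nothing about correctness, which is why it is usually reported together with a misassembly rate.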


Thanks very much Heng.


Oops. AllPaths-LG won't fit into 128 GB of RAM (at least, it needed 512 GB of RAM in their paper). SGA, Abyss and fermi would all fit in 128 GB of RAM.


Many thanks for these replies, they're very helpful.

The aim is to produce a genome for one or more individuals, not so much to discover anything specific, from a short-read sequencing run. It'll be a couple of lanes of HiSeq, so ~20x.
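
As a rough check on a depth estimate like this, expected mean coverage is total sequenced bases divided by genome size. The sketch below uses hypothetical per-lane numbers (read pairs per lane and read length are assumptions, not figures from this thread); substitute the yields your sequencing facility actually reports.

```python
def expected_coverage(read_pairs, read_length, genome_size):
    """Expected mean depth = total sequenced bases / genome size."""
    total_bases = read_pairs * 2 * read_length  # two reads per pair
    return total_bases / genome_size


# All values below are illustrative assumptions, not from the thread.
HUMAN_GENOME_SIZE = 3.1e9      # approximate haploid human genome size in bp
read_pairs_per_lane = 150e6    # hypothetical yield of one lane
read_length = 100              # hypothetical read length in bp
lanes = 2

depth = expected_coverage(lanes * read_pairs_per_lane, read_length, HUMAN_GENOME_SIZE)
print(f"expected depth: {depth:.1f}x")
```

This is a raw figure; usable depth after duplicate removal and filtering will be somewhat lower.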

11.1 years ago
Fabian Bull ★ 1.3k

More of a comment than an answer, but I thought it could help you anyway.

The amount of computational power you need depends on the sequencing depth you are going to use. There are assemblers which will need much more RAM than you have. I remember reading in the AllPaths-LG manual about the time and RAM they needed for a human genome. Additionally, some assemblers cannot handle paired-end data. My favorite assembler is CLC (unfortunately not free). If you are really getting serious, nothing stops you from trying several assemblers and using the one that produces the best output.

Additional tip: the most important thing in genome assembly (and maybe in all of bioinformatics) is preprocessing. You have to trim for quality, correct sequencing errors (e.g. with tools like Quake), check for contamination, and check whether the insert size provided by the sequencing lab is correct.
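
To make the "trim for quality" step concrete, here is a minimal sketch of 3'-end trimming on a single read. It assumes Phred+33 quality encoding and a hypothetical threshold of Q20; the helper name is made up for illustration, and real trimmers apply more refined rules (sliding windows, adapter removal).

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Cut low-quality bases from the 3' end of a read.

    seq  -- base string
    qual -- quality string of the same length (Phred+33 assumed)
    Bases are removed from the end until one has quality >= min_q.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Toy read: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
seq, qual = trim_3prime("ACGTACGT", "IIIII###")
print(seq, qual)  # ACGTA IIIII
```

Whether trimming helps depends on the assembler, so it is worth comparing assemblies with and without it.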

On the assembly itself: there is no reason why you would have to do a de novo assembly. People have spent millions of dollars to come up with a human genome, so use it. A reference assembly might also be much less computationally intensive.


I am interested in third-party evaluations of CLC on human data. I heard from friends two years ago that CLC was overstating its performance at that time. How about now? Also, Jared published in the SGA paper that for de novo assembly, trimming leads to shorter N50. My experience is the same. The right strategy is to do quality-aware error correction as much as possible if the assembly algorithm itself cannot handle errors well. Most practical assemblers provide error correction tools (e.g. soapdenovo, sga, allpaths-lg, cortex and fermi; if I am right, none of them trim reads) or handle errors well (e.g. celera-assembler).


IMHO, the advantage of SGA is a desirable scaling behavior and not better assembly performance.

I have never compared assemblers in much detail (my boss did it in his thesis). The only thing I can say is that CLC produces by far the best N50 values.


According to Assemblathon 1, SGA is one of the best assemblers overall.


Thanks.

I agree a reference assembly is what I would like to do, but I don't know how. Abyss and fermi look like candidates. I don't have access to CLC bio, so that's not an option. Plus, I'd rather stick with open source :)


Before NGS, "reference assembly" more often referred to reference-guided assembly. The strategy usually required de novo assembly as a step and used a reference genome for orientation. Nowadays, by reference assembly, we typically mean mapping short reads to the reference genome and then running a SNP caller to call each base. Strictly speaking (at least in my view), this is not "assembly".
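
As a toy illustration of "map reads, then call each base": the sketch below takes reads that are already placed on the reference (a start position plus a sequence, no indels) and emits the majority base at each position, keeping the reference base where nothing covers. This is a deliberate simplification; real workflows use a read mapper and a statistical variant caller, and all names and data here are hypothetical.

```python
from collections import Counter


def consensus_from_alignments(reference, alignments):
    """Build a per-base consensus from gapless read placements.

    reference  -- reference sequence string
    alignments -- iterable of (start, read_sequence), 0-based start
    Positions with no coverage keep the reference base.
    """
    piles = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            if 0 <= pos < len(reference):
                piles[pos][base] += 1
    called = []
    for ref_base, pile in zip(reference, piles):
        if pile:
            called.append(pile.most_common(1)[0][0])
        else:
            called.append(ref_base)
    return "".join(called)


# Toy data: three reads support a T at position 3 where the reference has G.
ref = "ACGGTTAC"
reads = [(0, "ACGT"), (1, "CGTT"), (2, "GTTT"), (4, "TTAC")]
print(consensus_from_alignments(ref, reads))  # ACGTTTAC
```

A majority vote like this ignores base qualities, mapping qualities and heterozygosity, which is exactly what a proper SNP caller models.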


Oh, is that it? I was considering doing that, but thought it too simplistic...

I agree it isn't an assembly, but should suit my needs in this case.

7.0 years ago
always_learning ★ 1.1k

Hello Friends,

What's the latest update on tool recommendations for human genome assembly?

Any suggestions?

Thanks


GATK standard workflow is usually still considered the basic and gold-standard pipeline to use.


As far as I know GATK doesn't do Genome Assembly.


I misread your statement, my apologies. So you are looking to do de novo assembly? That is getting to be a pretty niche application in human genomes these days. Usually people do something like hybrid mapping/assembly protocols if they aren't just doing standard mapping.


Unless you are working with long reads (PacBio/Nanopore), I don't think genome assembly is beneficial, and mapping would be preferable. But perhaps you have good reasons to go for assembly?

11.1 years ago
DG 7.3k

For whole genomes you can do short-read mapping the same as with exome data, in which case use whatever you prefer from exome experience. There are of course situations where you may want to try de novo assembly to look for large or complex structural variations. I recently visited the BC Cancer Centre and, while they are obviously biased towards it, they have had a lot of success with ABySS. In their pipeline they mix short-read mapping and de novo assembly for analyses.
