Question

Tools For Combined Genome Variation Analysis Of Short Sequencing Reads For Several Genome Strains

11.2 years ago

14134125465346445 ★ 3.6k

What are the recommended tools for the combined genome variation analysis of short sequencing data for several genome strains of a given organism?

Traditionally, people have invested most of their budget in deep sequencing and assembling one specific strain (or, if possible, a single individual) to produce the reference genome assembly, and then used whatever money was left for additional sequencing to assess variability in the other important strains.

The only case a few years ago that wasn't like this was the Sanger sequencing of several strains of Drosophila simulans, all low coverage, that were pooled and used to define the simulans genome reference.

If one takes the approach of doing the same amount of sequencing for a group of strains without an existing reference genome, what would be the best tools to assess the genomic variability in the group of strains?

EDIT: for example, in this paper on a cattle pathogen, the authors resequenced 10 strains of a species that already had a reference genome. They did a very sound variation analysis by comparing the 10 resequenced strains after mapping them to the reference. My question is: what tools would someone use if the 10 sequenced strains were from a species without a reference genome?

assembly genome variation • 4.0k views

updated 10 months ago by Ram 43k • written 11.2 years ago by 14134125465346445 ★ 3.6k

I have the feeling that you would need to compile a reference first.

11.2 years ago by fo3c ▴ 450

Is what you are asking similar to metagenomics? But instead of collecting samples directly from the environment, you are talking about sequencing strains from culture media in the lab, without a reference genome, and comparing them?

11.2 years ago by Medhat 9.7k

Istvan Albert · Answer 1 · 2013-02-19

The Cortex variation assembler is designed for precisely this! The idea is to simultaneously de novo assemble a joint graph of all your samples and look for differences, without using a reference. You can then use population/segregation statistics to distinguish variants from repeats and errors. Finally, if you have a reference, you can use it to provide coordinates. It was first published here:

De novo assembly and genotyping of variants using colored de Bruijn graphs. Z Iqbal, M Caccamo, I Turner, P Flicek, G McVean, Nature Genetics (2012)

and then here, where we recently published on how to use it for microbial genomics, with a new pipeline wrapper to make it a lot more user friendly (give it an index file listing sample IDs and which fastq files belong to each, and it does all the assembly, error removal, variant discovery and genotyping, and makes a VCF):

High-throughput microbial population genomics using the Cortex variation assembler. Z Iqbal, I Turner, G McVean, Bioinformatics 2013
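To make the joint-graph idea a bit more concrete, here is a toy sketch in Python. It is not Cortex's implementation or API, just an illustration of the "colored k-mer" concept under simplifying assumptions: no sequencing errors, no reverse complements, no graph traversal or bubble calling. The function names and the two strains are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_colored_kmers(samples, k):
    """Map each k-mer to the set of samples ("colors") whose reads contain it."""
    colors = defaultdict(set)
    for sample_id, reads in samples.items():
        for read in reads:
            for kmer in kmers(read, k):
                colors[kmer].add(sample_id)
    return colors


def variable_kmers(colors, n_samples):
    """Return k-mers missing from at least one sample (candidate variant sites)."""
    return {kmer: sorted(c) for kmer, c in colors.items() if len(c) < n_samples}


# Toy data (hypothetical): two strains that differ by a single base, G vs T.
samples = {
    "strainA": ["ACGTAGCTAGGA"],
    "strainB": ["ACGTATCTAGGA"],
}

colors = build_colored_kmers(samples, k=4)
diff = variable_kmers(colors, n_samples=len(samples))

for kmer, owners in sorted(diff.items()):
    print(kmer, owners)
```

In this toy case a single SNP produces k sample-specific k-mers per allele, which is the kind of signal a colored graph exposes. A real tool additionally has to clean errors, handle both strands, and walk the graph to call and genotype variants.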

Here's an example of it being used in a longitudinal study looking at S. aureus

Evolutionary dynamics of Staphylococcus aureus during progression from carriage to disease. B Young, T Golubchik et al, Proc. Nat. Acad. Sci. (2012)

Sorry for the self-publicity, but it is an answer to your question! You do need to think carefully about how experimental design (number of samples, coverage per sample, read length) affects your power to discover variants. Assembly is typically less sensitive than mapping, although more specific.
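As a rough aid for the experimental-design point, the sketch below estimates per-sample base coverage and the corresponding k-mer coverage using the standard approximations C = N * L / G and C_k = C * (L - k + 1) / L. The read count, read length, genome size and k are made-up illustrative values, not recommendations, and the function names are hypothetical.

```python
def base_coverage(n_reads, read_len, genome_size):
    """Expected per-base coverage: total sequenced bases / genome size."""
    return n_reads * read_len / genome_size


def kmer_coverage(base_cov, read_len, k):
    """Expected k-mer coverage: each read of length L contributes L - k + 1 k-mers."""
    return base_cov * (read_len - k + 1) / read_len


# Hypothetical per-sample design: 1M reads of 100 bp, a ~2.8 Mb bacterial genome.
n_reads = 1_000_000
read_len = 100
genome_size = 2_800_000
k = 31

cov = base_coverage(n_reads, read_len, genome_size)
kcov = kmer_coverage(cov, read_len, k)

print(f"base coverage:  {cov:.1f}x")
print(f"k-mer coverage: {kcov:.1f}x (k={k})")
```

K-mer coverage is always lower than base coverage, and the gap widens with shorter reads or larger k, which is one reason read length matters for assembly-based discovery.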

Best, Zam