Question

The Impact Of Sequencing Error On Population Genetics Parameters

4

12.1 years ago

Lds ▴ 450

Hi fellows,

Does anyone know how to simulate the impact of sequencing error on population genetics parameters such as theta, Ne, and rho?

Thanks in advance

population sequencing • 2.3k views

updated 12.1 years ago by lh3 33k • written 12.1 years ago by Lds ▴ 450

score 3 · Answer 1 · 2012-03-16

This problem has been worked on, so you may want to look at these papers on how to simulate the effects of sequencing error on population genetic parameters (or on whether doing so is needed at all):

Population genetic analysis of shotgun assemblies of genomic sequences from multiple individuals. http://www.ncbi.nlm.nih.gov/pubmed/18411405

Population genetic inference from resequencing data: http://www.ncbi.nlm.nih.gov/pubmed/18984575

Estimation of nucleotide diversity, disequilibrium coefficients, and mutation rates from high-coverage genome-sequencing projects: http://www.ncbi.nlm.nih.gov/pubmed/18725384

Estimation of allele frequencies from high-coverage genome-sequencing projects: http://www.ncbi.nlm.nih.gov/pubmed/19293142

score 1 · Answer 2 · 2012-03-17

If possible, work on real data. Estimates of population parameters are mostly affected by artifacts that are not considered in simulations, such as inaccurate base qualities, correlated sequencing and mapping errors, and misalignments around hidden indels. Many purely theoretical methods have never been applied to real data, and the few times I have seen them applied, they did not work well. Sequencing errors are relatively easy to tackle. The hard part is systematic artifacts.
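To illustrate why plain sequencing errors are the tractable case, here is a minimal toy sketch, not a method from any of the papers mentioned in this thread. It assumes errors are uniform and independent, which is exactly the assumption that real data violate. The function names `watterson_theta` and `add_uniform_errors`, the 0/1 haplotype encoding, and all parameter values are made up for the example. With n samples, L sites and per-base error rate ε, roughly n·L·ε erroneous calls are expected, most of them showing up as false singletons that inflate Watterson's theta in a predictable way.

```python
import random

def watterson_theta(haplotypes):
    """Watterson's estimator: S / a_n, where a_n = sum_{i=1}^{n-1} 1/i."""
    n = len(haplotypes)
    # A site is segregating if more than one allele is observed in its column.
    seg_sites = sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)
    a_n = sum(1.0 / i for i in range(1, n))
    return seg_sites / a_n

def add_uniform_errors(haplotypes, error_rate, rng):
    """Flip each 0/1 allele call independently with probability error_rate."""
    return [[allele ^ 1 if rng.random() < error_rate else allele
             for allele in hap]
            for hap in haplotypes]

rng = random.Random(42)
n_samples, n_sites = 10, 10000

# Toy "true" data (hypothetical): 20 segregating sites, each derived allele
# carried by a random subset of the samples. No coalescent model is used here.
true_haps = [[0] * n_sites for _ in range(n_samples)]
for site in rng.sample(range(n_sites), 20):
    carriers = rng.sample(range(n_samples), rng.randint(1, n_samples - 1))
    for sample in carriers:
        true_haps[sample][site] = 1

# Expected number of erroneous calls: 10 * 10000 * 0.001 = 100.
noisy = add_uniform_errors(true_haps, 0.001, rng)

print("theta (true calls): ", watterson_theta(true_haps))
print("theta (with errors):", watterson_theta(noisy))
```

Because the inflation under this model depends only on n, L and ε, it can be modelled or corrected for. Correlated errors, mapping errors and misalignments around indels do not follow such a simple model, which is why they are the harder problem.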

If you care about practical problems, I recommend reading the "related works" cited in my paper, in particular a series of papers by Rasmus Nielsen in collaboration with BGI.

On simulation, my preference is to simulate from real data. For example, we can downsample reads, simulate short reads from long reads, or at least learn error profiles from real data and then simulate. If you want to do pure simulation anyway, at least simulate indels and use a mapper to map the simulated reads rather than assuming each read is aligned perfectly.
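As a rough sketch of two of these ideas (downsampling reads, and learning an error profile from real data before simulating), something like the following could serve as a starting point. Everything here is an assumption made for illustration: the helper names `downsample`, `learn_error_profile` and `simulate_read` are hypothetical, the input is a plain in-memory list of `(ref_base, read_base, qual)` tuples rather than any real file format, and only substitution errors are modelled. Indels and the mapping step, which the advice above treats as essential, are not covered.

```python
import random
from collections import defaultdict

def downsample(reads, fraction, rng):
    """Keep each read independently with probability `fraction`."""
    return [read for read in reads if rng.random() < fraction]

def learn_error_profile(aligned_bases):
    """Empirical mismatch rate for each reported base quality.

    aligned_bases: iterable of (ref_base, read_base, qual) tuples, assumed
    to come from positions believed to be non-variant.
    """
    counts = defaultdict(lambda: [0, 0])  # qual -> [mismatches, total]
    for ref, obs, qual in aligned_bases:
        counts[qual][0] += ref != obs
        counts[qual][1] += 1
    return {qual: mism / total for qual, (mism, total) in counts.items()}

def simulate_read(template, quals, profile, rng):
    """Copy `template`, substituting each base at the rate learned for its quality."""
    out = []
    for base, qual in zip(template, quals):
        # Qualities never seen in the training data are treated as error-free.
        if rng.random() < profile.get(qual, 0.0):
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

rng = random.Random(1)

# Hypothetical observations: 1 mismatch in 100 bases at Q30, 1 in 10 at Q10.
observed = ([("A", "A", 30)] * 99 + [("A", "C", 30)]
            + [("G", "G", 10)] * 9 + [("G", "T", 10)])
profile = learn_error_profile(observed)  # {30: 0.01, 10: 0.1}

read = simulate_read("ACGTACGT", [30] * 8, profile, rng)
subset = downsample(list(range(1000)), 0.25, rng)
print(profile, read, len(subset))
```

The point of learning the rates from real alignments is that the reported base quality does not have to be trusted. This sketch still treats errors as independent across bases, so it does not reproduce correlated or systematic artifacts.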

score 0 · Answer 3 · 2012-03-17

Here are some publications which are relevant:

The publications below are more about the error profiles of 454 and Illumina data, though they are probably important to consider for any population genetics measure. (NB: this list is in no way comprehensive, as there are numerous similar papers. Also, the first one, though highly cited, is a bit dated now.)