The Impact Of Sequencing Error On Population Genetics Parameters
12.1 years ago
Lds ▴ 450

Hi fellows,

Does anyone know how to simulate the impact of sequencing error on population genetics parameters such as theta, Ne, and rho?

Thanks in advance

population sequencing • 2.3k views
12.1 years ago

This problem has been worked on, so you may want to look into these papers for how to simulate the effects of sequencing error on population genetic parameters (or whether simulation is needed at all):

Population genetic analysis of shotgun assemblies of genomic sequences from multiple individuals: http://www.ncbi.nlm.nih.gov/pubmed/18411405

Population genetic inference from resequencing data: http://www.ncbi.nlm.nih.gov/pubmed/18984575

Estimation of nucleotide diversity, disequilibrium coefficients, and mutation rates from high-coverage genome-sequencing projects: http://www.ncbi.nlm.nih.gov/pubmed/18725384

Estimation of allele frequencies from high-coverage genome-sequencing projects: http://www.ncbi.nlm.nih.gov/pubmed/19293142

12.1 years ago
lh3 33k

If possible, work on real data. Estimates of population parameters are mostly affected by artifacts that are not considered in simulation, such as inaccurate base quality, correlated sequencing and mapping errors, and misalignments around hidden indels. Many purely theoretical methods have not been applied to real data, and the few times I have seen them applied, they did not work well. Sequencing errors are relatively easy to tackle. The hard part is systematic artifacts.
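
To make the "easy" case concrete, here is a minimal sketch of what independent, uniform sequencing errors do to two common estimators. It is only an illustration under loud assumptions: the haplotypes come from a toy generator (not a coalescent simulator), errors are independent 0/1 flips at a single rate, and every function name below is made up for this example. None of the systematic artifacts listed above are modeled.

```python
import random


def harmonic(n):
    """a_n = sum_{i=1}^{n-1} 1/i, the denominator of Watterson's estimator."""
    return sum(1.0 / i for i in range(1, n))


def simulate_haplotypes(n, length, n_seg, rng):
    """Toy 0/1 haplotypes (assumption: NOT a coalescent simulation).

    Places n_seg biallelic sites at random positions, each with a random
    derived-allele count between 1 and n - 1.
    """
    haps = [[0] * length for _ in range(n)]
    for site in rng.sample(range(length), n_seg):
        k = rng.randint(1, n - 1)
        for i in rng.sample(range(n), k):
            haps[i][site] = 1
    return haps


def add_errors(haps, error_rate, rng):
    """Flip each base independently with probability error_rate."""
    return [[b ^ 1 if rng.random() < error_rate else b for b in h] for h in haps]


def watterson_theta(haps):
    """Watterson's theta for the whole region: S / a_n."""
    n = len(haps)
    seg = sum(1 for col in zip(*haps) if 0 < sum(col) < n)
    return seg / harmonic(n)


def nucleotide_diversity(haps):
    """Mean pairwise differences for the whole region (Tajima's pi)."""
    n = len(haps)
    counts = (sum(col) for col in zip(*haps))
    return sum(2 * k * (n - k) / (n * (n - 1)) for k in counts)


if __name__ == "__main__":
    rng = random.Random(1)
    true_haps = simulate_haplotypes(n=20, length=10000, n_seg=50, rng=rng)
    noisy_haps = add_errors(true_haps, error_rate=0.001, rng=rng)
    for label, haps in (("true", true_haps), ("with errors", noisy_haps)):
        print(label, watterson_theta(haps), nucleotide_diversity(haps))
```

Random errors mostly show up as extra singletons, so they inflate the segregating-site count (and hence Watterson's theta) more strongly than pi. That predictable behavior is why this kind of error is the tractable part.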

If you care about practical problems, I recommend reading the "related works" cited in my paper, in particular a series of papers by Rasmus Nielsen in collaboration with BGI.

On simulation, my preference is to simulate from real data. For example, we can downsample reads, simulate short reads from long reads, or at least learn error profiles from real data and then simulate. If you want to do pure simulation anyway, at least simulate indels and use a mapper to map simulated reads rather than assume each read is aligned perfectly.
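
As a rough illustration of the "downsample reads" option, the sketch below keeps each read of a FASTQ file independently with a fixed probability. It assumes plain single-end FASTQ with exactly four lines per record; the function names are hypothetical, and paired-end data would need the same keep/drop decision applied to both mates.

```python
import io
import random


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the iteration
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def downsample(records, fraction, rng):
    """Keep each read independently with probability `fraction`.

    Read order is preserved; the number kept varies around fraction * total.
    """
    return [rec for rec in records if rng.random() < fraction]


if __name__ == "__main__":
    # Made-up reads, only to show the call pattern.
    text = "".join("@read%d\nACGT\n+\nIIII\n" % i for i in range(1000))
    reads = list(read_fastq(io.StringIO(text)))
    kept = downsample(reads, fraction=0.1, rng=random.Random(0))
    print(len(reads), "reads in,", len(kept), "reads kept")
```

Re-running the variant-calling and estimation pipeline on the downsampled reads then shows how the estimates degrade with coverage while keeping the real error structure of the data.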

12.1 years ago
SES 8.6k

Here are some publications which are relevant:

The publications below are more about the error profile of 454 and Illumina data though probably important to consider in light of any population genetics measures (NB: this is in no way comprehensive, as there are numerous similar papers. Also, the first one, though highly cited, is a bit dated now.)
