Question

Considering Sub-Population Structure When Imputing SNPs From Haplotypes

4


12.9 years ago

Fayue1015 ▴ 210

Current imputation methods do not seem to consider sub-population structure in HapMap datasets. For example, even within the Chinese population, Taiwanese, Beijing Chinese and Singapore Chinese may differ. So is it reasonable to ask whether, given HapMap data for a certain population, we should also run a cluster analysis to find sub-population structure in that data and then analyse each sub-population separately?

snp imputation population structure • 2.6k views

updated 12.9 years ago by Pablo Marin-Garcia ★ 2.0k • written 12.9 years ago by Fayue1015 ▴ 210

score 4 · Answer 1 · 2011-07-14

HapMap data is meant to be used as a reference: a template against which to lay your own population findings. Sure, HapMap populations can be questioned not only in terms of their exact origin, but also in terms of size, relationships covered, ... The only way anyone can be sure about stratification biases in a given dataset is by running a structure test. Although I can't point to a paper that shows it right now, I am almost certain that no stratification biases have been published to date for HapMap data (always talking about data inside each population, not among populations).

Right now there are 11 populations you can query through HapMap, and the 2 Chinese ones are CHB (Chinese from Beijing) and CHD (people of Chinese origin in Denver). The JPT population (Japanese from Tokyo) is the third Asian group that "fulfils" HapMap's East Asia coverage. One can easily check how these populations differ among themselves, and how consistent the data are within each of them, before using them as a reference for your own experiment.

If you are working with your own Chinese populations and you need some reference data from HapMap, I would suggest using the CHB data only. It would be the most suitable set for you, and you would not have to worry about stratification effects at all, at least within HapMap's reference data.
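As a rough illustration of the kind of structure check discussed in this answer, here is a minimal sketch that clusters individuals by genotype similarity. It is not the STRUCTURE program or a PCA; it is a toy average-linkage clustering on an allele-sharing distance, and the function names and the genotype matrix are hypothetical, made up for the example. Real analyses would normally use dedicated tools on many more SNPs.

```python
from itertools import combinations

def allele_sharing_distance(g1, g2):
    """Mean absolute difference in alternate-allele count (0, 1 or 2) across SNPs."""
    return sum(abs(a - b) for a, b in zip(g1, g2)) / len(g1)

def average_linkage_clusters(genotypes, n_clusters):
    """Greedy agglomerative clustering; returns sorted lists of individual indices."""
    n = len(genotypes)
    # Pairwise distances, keyed by (smaller index, larger index).
    dist = {(i, j): allele_sharing_distance(genotypes[i], genotypes[j])
            for i, j in combinations(range(n), 2)}

    def linkage(c1, c2):
        # Average distance over all cross-cluster pairs.
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return sorted(clusters)

# Hypothetical toy data: 6 individuals x 6 SNPs, coded as alternate-allele counts.
genotypes = [
    [0, 0, 1, 2, 2, 2],
    [0, 1, 0, 2, 2, 1],
    [0, 0, 0, 2, 1, 2],
    [2, 2, 1, 0, 0, 0],
    [2, 1, 2, 0, 0, 1],
    [2, 2, 2, 0, 1, 0],
]
clusters = average_linkage_clusters(genotypes, n_clusters=2)
print(clusters)
```

If the individuals of a single HapMap population split into well-separated clusters like this, that would hint at substructure worth a closer look; if they do not, treating the population as one reference set is easier to justify.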

score 4 · Answer 2 · 2011-07-18

The effectiveness of imputation relies on the sample size. The current experience from 1000g is that pooling populations leads to better imputation. I think the IMPUTE group has published something on this topic, but I forget where. Nonetheless, it should be noted that this experience comes from sequencing data and is not necessarily applicable to your case.

score 2 · Answer 3 · 2012-03-14

Some years ago you needed to find a reference population close to your sample in order to impute accurately, and this sometimes meant mixing HapMap individuals from different populations. Nowadays, however, some imputation programs (like Marchini's IMPUTE2) do not care about population substructure but instead use the concept of a 'surrogate family': for each individual, at each iteration, the program selects the K closest haplotypes (K being a number you pass as a parameter to the program). With imputation programs that use the 'surrogate family' approach, the best results come when your reference panel is as large and diverse as possible (put all your HapMap populations together if your computer resources allow it). Howie and Marchini have documented this here: http://www.g3journal.org/content/1/6/457.full and you can watch a nice talk by Marchini at the Newton Institute in Cambridge here: http://www.newton.ac.uk/programmes/CGR/seminars/071411001.html
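To make the 'closest K haplotypes' idea concrete, here is a minimal sketch, assuming that closeness is measured by Hamming distance (the number of sites where two haplotypes differ). This is only an illustration of the selection step, not IMPUTE2's actual code; the function names and the toy panel are hypothetical.

```python
from typing import List, Sequence

def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of sites at which two haplotypes carry different alleles."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same sites")
    return sum(x != y for x, y in zip(a, b))

def closest_k_haplotypes(target: Sequence[int],
                         reference: List[Sequence[int]],
                         k: int) -> List[int]:
    """Indices of the k reference haplotypes nearest to the target.

    Ties are broken by panel order, so the result is deterministic.
    """
    order = sorted(range(len(reference)),
                   key=lambda i: (hamming(target, reference[i]), i))
    return order[:k]

# Hypothetical pooled panel: haplotypes 0, 1, 4 resemble one background,
# haplotypes 2 and 3 another. No population labels are used anywhere.
reference = [
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0, 1],
]
target = [0, 0, 1, 1, 0, 1]  # current haplotype estimate for one study individual

chosen = closest_k_haplotypes(target, reference, k=3)
print(chosen)
```

Because the selection is driven by the haplotypes themselves, a large and diverse pooled panel simply gives each individual more good candidates to choose from, which is the intuition behind pooling all populations.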