Question

Impute2 1000Genomes Imputation, Optimal Sliding Windows

10.2 years ago

Tafelplankje ▴ 120

I have a large data set that needs >20 GB per core when imputing 5 Mb regions, so I would like to use regions of 1 Mb or less if possible, or at least make the regions as small as possible without losing imputation accuracy.

Impute2 uses the -int option to specify the region you want to impute; the standard is chunks of ~5 Mb. According to http://genome.sph.umich.edu/wiki/IMPUTE2:_1000_Genomes_Imputation_Cookbook it is also recommended to take the following into account:

First, if the chunks are all of the same size (in Mb), it is possible that a few chunks will have too many (or too few) SNPs to analyze. Second, in intervals that cross the centromere (which typically corresponds to large gap in 1000 Genome SNPs), it may happen that only a few SNPs from one arm are chunked together with SNPs from a different chromosome arm. To overcome these two concerns, we suggest that you merge adjacent chunks when one has very few (say, less than 200) SNPs in a chunk and that you try to avoid creating region splits that span the centromere. If a merged chunk extends further than 7-Mb, you will have to use the setting -allow_large_regions when you run IMPUTE2.

Questions:

1) Is it possible to use 1 Mb chunks instead of 5 Mb chunks without loss of imputation accuracy?

2) Has anyone made an algorithm or chunk list that takes the above into account (size of chromosomes/genetic maps, centromeres, number of SNPs) to list the optimal genomic positions to impute, and would be willing to share it?

imputation gwas 1000genomes • 4.3k views

ADD COMMENT • link 10.2 years ago by Tafelplankje ▴ 120

I would stick with centromeres; most are less than 5 Mb, see this file. For the ones that are more than 5 Mb, chunk them up into pieces of up to 5 Mb. Then reduce the number of samples per imputation run (~100-200) and run the batches in parallel.
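One possible reading of this, combined with the cookbook rule quoted in the question, can be sketched in Python: treat each chromosome arm separately, cut it into windows of at most 5 Mb, and merge any window with fewer than 200 SNPs into its neighbour. This is only a sketch under assumptions: the function names, the toy centromere coordinates, and the toy SNP positions are all made up for illustration. Only `-int` and `-allow_large_regions` come from the thread.

```python
from bisect import bisect_left, bisect_right

LARGE_REGION_BP = 7_000_000  # cookbook: chunks beyond 7 Mb need -allow_large_regions


def chunk_interval(snp_positions, start, end, chunk_bp=5_000_000, min_snps=200):
    """Cut [start, end] into windows of at most chunk_bp, merging sparse windows.

    Returns a list of (start, end, n_snps) tuples with 1-based inclusive bounds.
    """
    positions = sorted(snp_positions)

    def n_snps(lo, hi):
        # Number of SNPs with lo <= position <= hi.
        return bisect_right(positions, hi) - bisect_left(positions, lo)

    chunks = []
    lo = start
    while lo <= end:
        hi = min(lo + chunk_bp - 1, end)
        sparse = n_snps(lo, hi) < min_snps
        prev_sparse = bool(chunks) and n_snps(*chunks[-1]) < min_snps
        if chunks and (sparse or prev_sparse):
            # Extend the previous chunk instead of emitting a sparse one.
            chunks[-1] = (chunks[-1][0], hi)
        else:
            chunks.append((lo, hi))
        lo = hi + 1
    return [(lo, hi, n_snps(lo, hi)) for lo, hi in chunks]


def chunk_chromosome(snp_positions, chrom_len, cen_start, cen_end, **kwargs):
    """Chunk each arm on its own so that no chunk spans the centromere."""
    p_arm = chunk_interval(snp_positions, 1, cen_start - 1, **kwargs)
    q_arm = chunk_interval(snp_positions, cen_end + 1, chrom_len, **kwargs)
    return p_arm + q_arm


def int_args(lo, hi):
    """Build the region part of an IMPUTE2 command line for one chunk."""
    args = f"-int {lo} {hi}"
    if hi - lo + 1 > LARGE_REGION_BP:
        args += " -allow_large_regions"
    return args


if __name__ == "__main__":
    # Toy data only: a 20 Mb chromosome with a made-up centromere at 9-11 Mb.
    snps = (
        list(range(1_000, 301_000, 1_000))          # 300 SNPs in the first window
        + list(range(6_000_000, 6_000_010))         # 10 SNPs, a sparse window
        + list(range(12_000_000, 12_000_300))       # 300 SNPs on the other arm
        + list(range(17_000_000, 17_000_300))       # 300 SNPs on the other arm
    )
    for lo, hi, n in chunk_chromosome(snps, 20_000_000, 9_000_001, 11_000_000):
        print(n, int_args(lo, hi))
```

Whether smaller values of `chunk_bp` (for example 1 Mb) cost accuracy is not answered by this sketch; it only produces the region list.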

ADD REPLY • link 10.2 years ago by zx8754 11k

What is the best strategy for splitting the samples into smaller sets after phasing, running impute on each set, and combining them again afterwards?

ADD REPLY • link 10.2 years ago by Tafelplankje ▴ 120

If you have 1000 samples, divide them into 5 chunks of 200 samples (the sample chunk size depends on the number of cores and the memory per core). Impute the chunks in parallel, per centromere per chromosome. Then use GTOOL to merge the output into chromosomes.
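The sample split can be sketched roughly as follows. This assumes the phased haplotypes are in a whitespace-separated file with five leading SNP columns followed by two haplotype columns per sample (the SHAPEIT-style .haps layout); if your files differ, adjust `n_leading`. The function names are hypothetical, and the GTOOL merge step is not shown.

```python
def split_into_batches(sample_ids, batch_size=200):
    """Split a list of sample IDs into consecutive batches of batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        sample_ids[i:i + batch_size]
        for i in range(0, len(sample_ids), batch_size)
    ]


def subset_haps_line(line, sample_indices, n_leading=5):
    """Keep the leading SNP columns plus both haplotype columns of each sample.

    sample_indices are 0-based positions of the samples in the original file.
    The column layout is an assumption, see the note above.
    """
    fields = line.split()
    kept = fields[:n_leading]
    for i in sample_indices:
        col = n_leading + 2 * i
        kept.extend(fields[col:col + 2])
    return " ".join(kept)


if __name__ == "__main__":
    samples = [f"S{i}" for i in range(1000)]
    batches = split_into_batches(samples, batch_size=200)
    print(len(batches), [len(b) for b in batches])

    # Toy line with three samples; keep the first and the third.
    print(subset_haps_line("1 rs1 100 A G 0 1 1 0 0 0", [0, 2]))
```

Each batch can then be imputed as a separate job per region, with the batch size chosen to fit the memory available per core.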

ADD REPLY • link 10.2 years ago by zx8754 11k