Question

Variant in earlier builds of 1000 genomes not found in later builds

8.8 years ago

jasonlouisstein ▴ 10

Hi Biostars,

I would like to calculate LD for the genome-wide significant variants reported in the most recent schizophrenia GWAS paper (see Supp Table 2), using the newest release of 1000 Genomes, the 20130502 release. Note that this is not the version of 1000 Genomes that was used for imputation in that paper; it is a newer one.

Some genome-wide significant variants are not found in the 20130502 release but are found in the older 20110521 release. For example, the variant chr2_200825237_I is found in release 20110521:

tabix -p vcf ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr2.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 2:200825236-200825237 | awk '{print $1,$2,$3,$4}'
2 200825237 rs199532108 A

but is not found in the newer release:

tabix -p vcf ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 2:200825236-200825237 | awk '{print $1,$2,$3,$4}'

This last command returns nothing (no variants found in that range). I don't understand how a variant can be found in a smaller sample (older version) but not found in the larger and overlapping sample (newer version). Any help would be greatly appreciated!

Thanks,
Jason


Ram · Answer 1 · 2015-07-30

8.8 years ago

karl.stamm 4.1k

Take a look at the SNP in question: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=199532108

It's an indel in a homopolymer region, just the kind of thing that shows up as a false positive in older toolchains and is filtered or clarified in later updates.

I can't guarantee that's the case, but it's a possibility. I do believe the 1000 Genomes Project used different tools in each phase, so the datasets are expected to differ.
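One way to probe this is to list every record in a small window around the position in both releases, since an indel in a homopolymer run can be reported at a shifted position or under a different ID. The following is only a rough sketch: `region_around` and `compare_releases` are hypothetical helper names, the 50 bp flank is an arbitrary choice, and the two URLs are the ones from the question.

```shell
# URLs copied from the question (older and newer 1000 Genomes releases).
OLD_VCF="ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr2.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
NEW_VCF="ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"

# Hypothetical helper: build a tabix region string CHR:START-END around a position.
region_around() {
  chrom=$1; pos=$2; flank=$3
  start=$((pos - flank))
  if [ "$start" -lt 1 ]; then start=1; fi   # region coordinates are 1-based
  printf '%s:%d-%d\n' "$chrom" "$start" "$((pos + flank))"
}

# Hypothetical helper: print CHROM, POS, ID, REF, ALT for each record
# in the window, once per release (needs tabix and network access).
compare_releases() {
  region=$(region_around 2 200825237 50)
  for url in "$OLD_VCF" "$NEW_VCF"; do
    echo "== $url"
    tabix "$url" "$region" | awk '{print $1, $2, $3, $4, $5}'
  done
}
```

Calling `compare_releases` prints the two listings one after the other. If the newer release has a nearby indel with compatible alleles, the variant may have been re-represented rather than dropped; if the window is empty there too, it was more likely filtered out.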


Thanks! I guess this raises the question of whether this variant is a technical artifact. And if it is, is the case vs. control signal observed at this variant also a technical artifact? Perhaps a result of poor imputation?

Reply • written 8.8 years ago by jasonlouisstein ▴ 10

Imputation is data driven and assumes all inputs are high quality. If you're using linkage then it won't hurt to lose some markers, so go ahead and filter away all indels, given their chance of being false positives. Note that I haven't looked at the GWAS paper you reference.

Reply • written 8.8 years ago by karl.stamm 4.1k
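The suggestion to filter away indels can be sketched with plain awk on a VCF stream. This is a minimal illustration, not a vetted pipeline: `drop_indels` is a hypothetical helper name, and it assumes tab-separated VCF records with upper-case alleles. It keeps header lines and any record whose REF and every ALT allele is a single base, so indels, symbolic alleles such as `<DEL>`, and records with no ALT are all dropped.

```shell
# Hypothetical helper: read VCF text on stdin, write headers plus SNV-only records.
drop_indels() {
  awk -F'\t' '
    /^#/ { print; next }                    # pass VCF header lines through
    {
      if ($4 !~ /^[ACGTN]$/) next           # REF is not a single base: drop
      n = split($5, alts, ",")              # ALT may list several alleles
      for (i = 1; i <= n; i++)
        if (alts[i] !~ /^[ACGTN]$/) next    # any non-single-base ALT: drop
      print
    }'
}
```

For example, `tabix -h file.vcf.gz 2:200825000-200826000 | drop_indels` would stream a region with its header through the filter (the file name and region here are placeholders). If bcftools is available, `bcftools view -v snps` is a more standard way to get a similar result.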