Variant in earlier builds of 1000 genomes not found in later builds
8.8 years ago

Hi Biostars,

I would like to calculate LD of genome-wide significant variants found within the most recent schizophrenia GWAS paper (see Supp Table 2) using the newest release of 1000 genomes: 20130502 release found here. Note that this is not the version of 1000 genomes that was used for imputation in this paper but is a newer version.

Some variants which are genome-wide significant are not found in the 20130502 release, but are found in the older release 20110521. For example, this variant chr2_200825237_I, is found in release 20110521:

tabix -p vcf ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr2.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 2:200825236-200825237 | awk '{print $1,$2,$3,$4}'
2 200825237 rs199532108 A

but is not found in the newer release:

tabix -p vcf ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 2:200825236-200825237 | awk '{print $1,$2,$3,$4}'

This last command returns nothing (no variants found in that range). I don't understand how a variant can be found in a smaller sample (older version) but not found in the larger and overlapping sample (newer version). Any help would be greatly appreciated!

Thanks,
Jason

8.8 years ago

Take a look at the SNP in question: http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=199532108

It's an indel in a homopolymer region, just the kind of thing that shows up as a false positive in older toolchains and is filtered or clarified in later updates.

I can't guarantee that's the case, but it's a possibility. I believe the 1000 Genomes Project used different tools in each phase, so the datasets are expected to differ.
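
If you want to check whether the record was dropped or just moved, a rough sketch follows. It queries a window around the position instead of a 2 bp range, since an indel in a homopolymer run can be reported at a different coordinate depending on how it was aligned (that this happened here is an assumption, not something I've verified). `summarise_vcf` and `query_window` are made-up helper names, and the 50 bp default flank is arbitrary.

```shell
# Summarise VCF records read from stdin: CHROM, POS, ID, REF, ALT and a
# rough type label. Header lines (starting with '#') are skipped.
summarise_vcf() {
  awk -F'\t' '!/^#/ {
    type = (length($4) == 1 && length($5) == 1) ? "SNV" : "INDEL_OR_OTHER"
    print $1, $2, $3, $4, $5, type
  }'
}

# Hypothetical helper: list every record within +/- flank bp of a position
# in a remote, tabix-indexed VCF. Needs tabix and network access.
query_window() {
  url=$1; chrom=$2; pos=$3; flank=${4:-50}
  start=$((pos - flank))
  [ "$start" -lt 1 ] && start=1
  tabix "$url" "${chrom}:${start}-$((pos + flank))" | summarise_vcf
}

# Example call (not run here), using the newer release from the question:
# query_window ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz 2 200825237
```

If an indel with a matching REF/ALT pattern turns up a few bases away, it is probably the same variant under a different representation. If nothing turns up, that is consistent with it having been filtered out.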


Thanks! I guess this raises the question of whether this variant is a technical artifact. And if it is, is the case vs. control signal observed at this variant also a technical artifact, perhaps a result of poor imputation?

Imputation is data-driven and assumes all inputs are high quality. If you're computing linkage (LD), it won't hurt to lose some markers, so go ahead and filter out all indels, given their higher false-positive risk. Note that I haven't looked at the GWAS paper you reference.
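
For the filtering step, a minimal sketch in plain awk, assuming an uncompressed VCF stream on stdin. `drop_indels` is a made-up name, and it also drops symbolic alleles such as `<DEL>`, which may or may not be what you want.

```shell
# Keep header lines and records whose REF and every ALT allele are a
# single base; everything else (indels, MNPs, symbolic alleles) is dropped.
drop_indels() {
  awk -F'\t' '
    /^#/ { print; next }
    length($4) == 1 {
      n = split($5, alt, ",")
      keep = 1
      for (i = 1; i <= n; i++) if (alt[i] !~ /^[ACGTN]$/) keep = 0
      if (keep) print
    }'
}

# Example (file names are placeholders):
# zcat in.vcf.gz | drop_indels | bgzip > out.snps.vcf.gz
```

If bcftools is installed, `bcftools view --exclude-types indels` should do much the same job without the hand-rolled awk.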
