Hot to retrieval a SNP from a very big vcf file by position with tabix
1
0
Entering edit mode
4.9 years ago
vw ▴ 40

Hi all,

I want to retrieval SNPs in a very large vcf file.

I stored the chromosome and position of my interested SNPs in a temp file as below:

chr1    2487663 rs2227312       C       A       100.0   PASS    DP=1825;ASP=true;CAF=0.4008,0.5992;COMMON=1;G5=true;G5A=true;GENEINFO=LOC115110:115110|TNFRSF14:8764;GNO=true;HD=true;KGPhase1=true;KGPhase3=true;R5=true;RS=2227312;RSPOS=24

I used to retrieval records from the source vcf with tabix by inputting a range:

tabix source.vcf.gz chr1: 222-245

But this time, since it a snp, I can only input a begin site:

tabix source.vcf.gz chr1: 2,487,663

or

tabix source.vcf.gz chr1:2,487,663-2,487,663

But it doesn't work. Furthermore, not all of SNPs have dbSNP ID, so, I cannot retrieval them by ID.

Could you give me some suggestions?

Thank you!

tabix vcftools vcf SNP • 2.7k views
ADD COMMENT
2
Entering edit mode
4.9 years ago

source.vcf MUST be compressed with bgzip and indexed with tabix or bcftools

bgzip source.vcf 
tabix -p vcf  source.vcf.gz
tabix source.vcf.gz "chr1:2487663-2487663"

(no space , no comma in "chr1:2487663-2487663" )

so, I cannot retrieval them by ID.

gatk select variants: https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

--keepIDs / -IDs

List of variant IDs to select

If a file containing a list of IDs is provided to this argument, the tool will only select variants whose ID field is present in this list of IDs. The matching is done by exact string matching. The expected file format is simply plain text with one ID per line.

ADD COMMENT
0
Entering edit mode

Thanks for your response. I have compressed and indexed the vcf file. A few first lines of the source.vcf.gz appear as below:

gzip -cd source.vcf.gz | head -60 | tail -4

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10019   rs775809821     TA      T       .       .       RS=775809821;RSPOS=10020;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP

I try to retrieval above SNP from the source.cvf.gz:

tabix source.vcf.gz "chr1:10019-10019"

But it doesn't return anything. Did I make any mistake?

ADD REPLY
0
Entering edit mode

Did I make any mistake?

yes it's "1:10019-10019" not "chr1:10019-10019"

ADD REPLY
0
Entering edit mode

I figure out the issue:

In the source vcf file, the content in the chromosome column is a plant number without prefix "chr". So, my command line should be:

tabix source.vcf.gz "1:10019-10019"

It works well now. Thank you so much!

ADD REPLY

Login before adding your answer.

Traffic: 2154 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6