Extraction of SNPs from the 1000G vcf file for a set of regions
0
0
Entering edit mode
8.2 years ago
SGMS ▴ 130

Hi all,

I am trying to extract SNPs from the vcf file of the 1000G using bcftools. I managed to extract the SNPs in a separate file, but I want to get the SNPs only in the regions which I'm interested in. I found that -R option provides me with this but I tried it in several ways and I didn't make it work. So the following works:

bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz > extraction.txt

But then the following does not:

bcftools query -f'%CHROM\t%POS\t%ID\t%REF,%ALT\n' -R regions.bed ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz > extraction.txt

I'm suspecting that the problem is the bed file which initially was a .txt file. The original format was:

1:150748683-150800917
1:151758546-151824348
1:154520256-154572353

and now:

1:150718683-150830917   150718682       150830917
1:151728546-151854348   151728545       151854348
1:154490256-154602353   154490255       154602353

If someone could give me a hint, that would be great.

Thank you!

bcftools SNP 1000G • 3.0k views
ADD COMMENT
0
Entering edit mode

Try tab-delimited chrom, start, end and possibly identifier.

1   150718682       150830917 1:150718683-150830917
1   151728545       151854348 1:151728546-151854348
1   154490255       154602353 1:154490256-154602353
ADD REPLY
0
Entering edit mode

Did you look at tabix?

Usage:   tabix [OPTIONS] [FILE] [REGION [...]]
tabix -h file.vcf.gz 1:150718683-150830917

Loop over a set of regions in your bed file.

ADD REPLY
0
Entering edit mode

Sorry, I'm not sure whether I got this.. My initial file has the chr:start-end format and I don't know how to turn the bed file into a format accepted by bcftools..

ADD REPLY
0
Entering edit mode

@StatGen Its not for bcftools solution. Its is for your entire task to create a new VCF file with a set of regions in bed file.

ADD REPLY
0
Entering edit mode

Hi venu,

Thanks.. The task though is not to create a new vcf with a set of regions in a bed file.. I want to extract the SNPs from the 1000G vcf, whose positions overlap with the regions I will specify.. My regions though are in a text format which I turned into a bed format which is most probably not acceptable by the bcftools..

ADD REPLY
0
Entering edit mode

Can you just show how you are expecting the output file?

ADD REPLY
0
Entering edit mode

After I use bcftools I'm expecting:

1       10177   rs367896724     A,AC
1       10235   rs540431307     T,TA
1       10352   rs555500075     T,TA
1       10505   rs548419688     A,T
1       10506   rs568405545     C,G
1       10511   rs534229142     G,A

i.e CHR POS ID REF,ALT

Thanks

ADD REPLY
0
Entering edit mode
tabix file.vcf.gz 1-10177-10177 | cut -f 1-5 > out.vcf

Select required fields with cut.

ADD REPLY

Login before adding your answer.

Traffic: 2629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6