Extracting certain columns from VCF file
5.9 years ago

Hello all,

I've recently been trying to extract only certain columns from an ANNOVAR-annotated VCF file with vcftools. I ran the following command:

vcftools --vcf file_ANNOVAR.vcf --recode-INFO ExAC_SAS_AF --recode-INFO rs_dbSNP147 --out OUTPUT.vcf

but unfortunately it isn't working. Does anyone have any tips on what else I could try? I don't know what the column # is because the file is too big to open on my computer (I'm doing everything via SSH).


If you want a GUI-based program, this is the one to use.

Please post the input VCF (with headers and a few example records) and the columns you want to extract, @OP.

Hey guys, I ended up using some Perl scripting to fix my issue. I realized that everything was being printed in the 9th column, i.e. Exac|gnomad|..|..|, so I ended up splitting that column and then pasting / joining the ones I needed. :) Thank you all for the help!

You're welcome dude

5.9 years ago

You need to switch from VCFtools to BCFtools, in particular, bcftools query.

It looks like you not only want certain columns but also certain key-value pairs within the primary VCF columns, which are tab-delimited.
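
To make that structure concrete, here is a minimal sketch using only standard shell tools. The file toy.vcf and its two records are invented for illustration. INFO is the 8th tab-delimited column, and its key-value pairs are separated by semicolons:

```shell
# Build a tiny, made-up VCF for illustration (not real data).
printf '##fileformat=VCFv4.2\n' > toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy.vcf
printf '1\t100\t.\tA\tG\t50\tPASS\tDP=10;AF=0.5\n' >> toy.vcf
printf '1\t200\t.\tT\tC\t60\tPASS\tDP=20;AF=0.25\n' >> toy.vcf

# Skip header lines (those starting with '#'), take column 8 (INFO)
# of the first record, and print one key=value pair per line.
info_pairs=$(grep -v '^#' toy.vcf | head -1 | cut -f8 | tr ';' '\n')
echo "$info_pairs"
```

This only shows the layout; for real files, bcftools query handles the same extraction without manual splitting.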

Here are examples that will assist you from one of my own VCFs:

bcftools query -f'[%CHROM:%POS %GT\n]' 2701.snvindel.var.vcf.gz | head -5
1:69511 1/1
1:69761 0/1
1:752721 0/1
1:752894 1/1
1:762273 0/1

bcftools query -f'[%CHROM:%POS:%REF:%ALT %SAMPLE %GT\n]' 2701.snvindel.var.vcf.gz | head -5
1:69511:A:G 2701 1/1
1:69761:A:T 2701 0/1
1:752721:A:G 2701 0/1
1:752894:T:C 2701 1/1
1:762273:G:A 2701 0/1

Should be fairly obvious what those are doing. To extract certain values from the INFO column, which is what you appear to have to do, you can do the following:

bcftools query -f'[%CHROM:%POS:%REF:%ALT %INFO/HaplotypeScore:%INFO/VQSLOD %SAMPLE %GT\n]' 2701.snvindel.var.vcf.gz | head -5
1:69511:A:G 0.9159:-6.231 2701 1/1
1:69761:A:T 0:-9.034 2701 0/1
1:752721:A:G 0:-1.447 2701 0/1
1:752894:T:C 0:-6.798 2701 1/1
1:762273:G:A 5.3647:-2.236 2701 0/1

Here, HaplotypeScore and VQSLOD are tags defined in my INFO field.
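
If you are not sure which tags your own INFO field defines, the header declares them in ##INFO=<ID=...> lines, so there is no need to open the whole file. A minimal sketch, assuming an uncompressed VCF (for a .vcf.gz, bcftools view -h prints just the header). The file toy_header.vcf and its descriptions are made up:

```shell
# Made-up header and record for illustration only.
printf '##fileformat=VCFv4.2\n' > toy_header.vcf
printf '##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Example tag">\n' >> toy_header.vcf
printf '##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="Example tag">\n' >> toy_header.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy_header.vcf
printf '1\t69511\t.\tA\tG\t50\tPASS\tHaplotypeScore=0.9159;VQSLOD=-6.231\n' >> toy_header.vcf

# Print the ID of every INFO tag declared in the header.
# awk exits at the first non-header line, so a large file is not read in full.
info_ids=$(awk '
  !/^#/ { exit }
  /^##INFO=<ID=/ {
    sub(/^##INFO=<ID=/, "")   # drop everything before the tag name
    sub(/,.*/, "")            # drop everything after the tag name
    print
  }
' toy_header.vcf)
echo "$info_ids"
```

Any name printed here can be used as %INFO/NAME in the format string.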

Kevin


I'm really new to bioinformatics, so thank you so much for your help! I tried doing that and it said that the column(s) didn't exist. I'm not sure whether it's because of how my VCF file is formatted? My info header looks like this:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature

Do you think the "|" is affecting anything?


Ah! In this case, the INFO key is called CSQ (%INFO/CSQ); Description is only the header's explanation of that key. So, bcftools query will only be able to extract the entire pipe-delimited string that contains all of your annotation.

You can still, nevertheless, do that and then do some post filtering with cut, sed, awk, or other commands. How is your experience with these commands?
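
As a rough sketch of that kind of post-filtering, the awk script below reads the subfield order from the CSQ "Format:" string in the header, then prints the chosen subfield from the first CSQ entry of each record. The toy file, its subfield list, its values, and the function name csq_field are all invented for illustration, and it assumes the header follows the usual VEP "Format: A|B|C" layout. If your bcftools build includes the split-vep plugin, that is probably the more robust route.

```shell
# Made-up VCF with a VEP-style CSQ annotation (illustration only).
printf '##fileformat=VCFv4.2\n' > toy_csq.vcf
printf '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|SYMBOL|gnomAD_AF">\n' >> toy_csq.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy_csq.vcf
printf '1\t100\t.\tA\tG\t50\tPASS\tDP=10;CSQ=G|missense_variant|GENE1|0.01\n' >> toy_csq.vcf
printf '1\t200\t.\tT\tC\t60\tPASS\tCSQ=C|synonymous_variant|GENE2|0.2,C|intron_variant|GENE3|0.2\n' >> toy_csq.vcf

# Print CHROM, POS and one named CSQ subfield (first CSQ entry per record).
# Usage: csq_field SUBFIELD_NAME FILE
csq_field() {
  awk -F'\t' -v want="$1" '
    /^##INFO=<ID=CSQ,/ {
      fmt = $0
      sub(/.*Format: /, "", fmt)      # keep only the subfield list
      sub(/".*/, "", fmt)             # drop the closing quote and the rest
      n = split(fmt, names, "|")
      for (i = 1; i <= n; i++) if (names[i] == want) idx = i
      next
    }
    /^#/ { next }                     # skip the remaining header lines
    {
      val = "."                       # printed when the subfield is missing
      m = split($8, info, ";")
      for (j = 1; j <= m; j++) {
        if (info[j] ~ /^CSQ=/) {
          csq = substr(info[j], 5)    # strip the leading "CSQ="
          sub(/,.*/, "", csq)         # keep the first entry only
          split(csq, parts, "|")
          if (idx > 0 && parts[idx] != "") val = parts[idx]
        }
      }
      print $1, $2, val
    }
  ' "$2"
}

csq_out=$(csq_field gnomAD_AF toy_csq.vcf)
echo "$csq_out"
```

Taking only the first comma-separated CSQ entry is a simplification; records annotated against several transcripts carry one entry per transcript, and you may want a different rule for choosing among them.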

ANNOVAR can output in CSV format, by the way. That would be much easier for you, surely?


Hi, I need to extract variants with gnomAD_AF information from the CSQ field. bcftools query returns only dots, even though I can check manually with less that there are values for gnomAD_AF. I would really appreciate help!

bcftools query -f '%CHROM %POS %INFO/gnomAD_AF\n' FILE | head -3
1 877831 .
1 949608 .
1 977156 .

Please show your VCF header, and also a few records from the VCF itself.
