Biostar Beta. Not for public use.
Filtering Vcf File
10
Entering edit mode
12 months ago
bioinfo • 700
New Zealand

I was wondering how to filter the vcf file based on a few input arguments ( DP>10, MQ>30 and QD>20 or GT = "1/1" etc)? I m planning to use simple command on the command line to extract the info and create a new filtered vcf file. I want to keep the 20 lines of vcf header INFO in new file as well. I can do it with perl but is there any other easy way? Last time I extracted my required info from vcf file using vcftools but I couldnt get a filtered vcf file.

My command

vcftools --vcf GMM_homo.vcf --depth --FILTER-summary --TsTv-by-count --site-mean-depth --SNPdensity 1000 --site-pi --minQ 30 --min-meanDP 5 --out homo_GMM

ADD COMMENTlink
0
Entering edit mode

I just tried

egrep '^#|"GT =1/1" | "DP>10","MQ>30"' my.vcf > filtered.vcf

Didn't work though.

ADD REPLYlink
0
Entering edit mode

I need to filter my vcf file to include variants with at least 30 individuals in each of the possible groups: major allele homozygote, heterozygote, and minor allele homozygotes; would be grateful for any input. Thanks!

ADD REPLYlink
0
Entering edit mode

ask this as a new question please.

ADD REPLYlink
32
Entering edit mode
4.7 years ago
Erik Garrison ♦ 2.1k
Somerville, MA

You can do exactly this with vcffilter in vcflib!

Here's how to select all variants with depth greater than 10, mapping quality greater than 30, and QD greater than 20:

vcffilter -f "DP > 10 & MQ > 30 & QD > 20" file.vcf >filtered.vcf

Now, to select only variants with homozygotes, you can strip every genotype that's not homozygous, fix up the file's AC and AF fields using the genotypes with vcffixup, and then remove all the AC = 0 sites (again, using vcffilter).

cat filtered.vcf | vcffilter -g "GT = 1/1" | vcffixup - | vcffilter -f "AC > 0" >results.vcf

The expression language is clunky (you have to put spaces in between the tokens, and parenthetical expressions also have to have spaces). There is also no != symbol, but as a workaround you can do ! ( expression ).

For instance, to pick up non-homozygous genotypes, you'd use:

vcffilter -g "! ( GT = 1/1 )"

I'd like to fix some of these things (and also add regex matching for strings) but this far it more than does the job for quick filtering operations, allowing me to do virtually any kind of filtering from the command line without having to drop into writing a custom script.

These are the supported operations: > < = | & !, and symbols: ( ). Strings are interpreted literally. There is some type checking using the VCF header, so you have to have a valid VCF file. The output is a valid VCF file, so you can stream the filter results into another filtering operation.

ADD COMMENTlink
0
Entering edit mode

Note that this will work for _any_ values in the INFO field or per-sample fields.

ADD REPLYlink
0
Entering edit mode

Does the vcffilter -f work with mutect vcf output? I tried it but does not seem to work. The vcf output of Mutect has a column as FILTER and I want to only keep the variants that have the value PASS for that column, ideally it should be like this

vcffilter -f "FILTER = PASS" file.vcf > filt_out.vcf

But this does not seem to work. Can anyone tell me where am getting it wrong?

ADD REPLYlink
0
Entering edit mode

problem solved, works well with epgrep command.. thanks

ADD REPLYlink
0
Entering edit mode

Hi, I have a mutect VCF file with the same FILTER column and PASS value I tried to run the vcffilter command but as you said it does not work. I saw that you solved the problem with grep. Please could you give me more information? Thanks

ADD REPLYlink
0
Entering edit mode

I have a problem with vcffilter. When I use it it removes variant info (Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type| ...). Here is my command:

vcffilter -k -f "( TYPE = ins | TYPE = del ) & FDP > 10 & HRUN < 6" -f "QUAL > 20" -g "FAO > 4 & GQ > 5" file.vcf | vcf-annotate --fill-AC-AN | vcffilter -f "AC > 0" > file.vcf.indelfilter.vcf"

Any idea where is the mistake and how to fix it?

ADD REPLYlink
4
Entering edit mode
14 months ago
France/Nantes/Institut du Thorax - INSE…

I wrote some tools to extract the fields from INFO and FORMAT. See: https://code.google.com/p/variationtoolkit/wiki/ExtractInfo and https://code.google.com/p/variationtoolkit/wiki/ExtractFormat

$ cat data.vcf.gz |\
   extractformat -t GT |\
   awk -F '        ' '($11=="1/1") |\
   extractinfo -t DP |\
  awk -F '        ' '(int($12)>10")'
ADD COMMENTlink
1
Entering edit mode

3.5 years later: this is wrong. Just filter the VCF using https://github.com/lindenb/jvarkit/wiki/VCFFilterJS or extract the fields using gatk varianttotable

ADD REPLYlink
0
Entering edit mode

@ Pierre Lindenbaum

can we convert the fpfilter out file which filters output of varscan for false postives to convert into vcf4.0 format? I tried vcf-annotate but to no avail. I was trying to write a script but does not help me out. I would like to know if you can any custom tool designed for it?

ADD REPLYlink
3
Entering edit mode
14 months ago
National Institutes of Health, Bethesda…

snpSift, a utility associated with snpEff, has several options for filtering and transforming from vcf to tab-delimited text.

ADD COMMENTlink
2
Entering edit mode
3.3 years ago
Adam • 990
United States

Don't you just need to add --recode to your command?

ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1