INPUT

Question

Remove variants that are homozygous-reference (0/0) in two VCF files

0

Entering edit mode

7.8 years ago

KJ ▴ 10

Dear all,

Currently I have two large VCF files that include calls that are homozygous for the reference allele. For easier analysis, I would like to remove the variants that are homozygous-ref in both VCF files (or, in VCF-speak, be "0/0" for both samples at the same locus). I can't be the first to want to do this, but wasn't able to find anything of use.

INPUT

sample1.vcf

20      60055   .       A       .       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  0/0:25:25:99:PASS
20      60056   .       G      A.       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  0/1:12,13:25:99:PASS,PASS
20      60057   .       T       .       35      PASS    DP=26;PF=20;MF=6;MQ=60;SB=0.769 GT:AD:DP:GQ:FL  0/0:26:26:99:PASS
20      60058   .       C      T       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  1/1:25:25:99:PASS

sample2.vcf

20      60055   .       A       .       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  0/0:25:25:99:PASS
20      60056   .       G       .       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  0/0:25:25:99:PASS
20      60057   .       T       .       35      PASS    DP=26;PF=20;MF=6;MQ=60;SB=0.769 GT:AD:DP:GQ:FL  0/0:26:26:99:PASS
20      60058   .       C      T       35      PASS    DP=26;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  1/1:26:26:99:PASS

to:

OUTPUT

sample1.vcf

20      60056   .       G      A.       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  0/1:12,13:25:99:PASS,PASS
20      60058   .       C      T       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  1/1:25:25:99:PASS

sample2.vcf

20      60056   .       G       .       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  0/0:25:25:99:PASS
20      60058   .       C      T       35      PASS    DP=26;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  1/1:26:26:99:PASS

Some notes:

The files are around 260 GB big
I would like to keep the files seperate (not joining together)
The is about 128GB memory available
The files are sorted on position (fortunately)

Does anyone have experience with something like this, or could point me into a useful direction? Many thanks.

sequencing next-gen perl awk • 7.0k views

ADD COMMENT • link updated 7.8 years ago by William ★ 5.3k • written 7.8 years ago by KJ ▴ 10

1

Entering edit mode

7.8 years ago

harold.smith.tarheel ★ 4.9k

grep -v '0/0:' FILE.VCF > NEWFILE.VCF

Edit: I misread the original question (thanks, @genomax2). The command above will remove all reference calls, not just the subset that are present in both.

egrep '##|0/0:' FILE.VCF > NEWFILE.VCF

You could use this egrep expression to obtain the VCFs of reference calls in each file, BEDtools 'intersect' to find the common calls, then BEDtools 'subtract' to remove the common calls from the original VCFs. Inelegant but functional.

ADD COMMENT • link 7.8 years ago by harold.smith.tarheel ★ 4.9k

1

Entering edit mode

This would work for the example snippet above but if the 0/0 entry for a sample is in only one file (but not the other) then it is not clear if @KJ wants to keep it.

ADD REPLY • link 7.8 years ago by GenoMax 141k

0

Entering edit mode

Sorry for the confusion, it is indeed my intention to keep loci that are "0/1" or "1/1" in at least one sample. The above solution removes to much data.

ADD REPLY • link 7.8 years ago by KJ ▴ 10

score 3 · Accepted Answer · 2016-07-01

Why don't you want to merge the VCF files?

Most analysis work with genotype matrixes in the end, not 2 separate genotype arrays. Also for this query you at least temporary somewere need to merge the two vcf files to a genotype matrix. If you want to do any other query on both files you again need to merge them at least temporary.

If you really want to keep the files separate I would do the following.

The first command below merges the two vcf files, then select the sites with at least one alt allele and writes a site only VCF file to disk. This site only VCF file contains the loci you are interested in and can be used for filtering both your files.

bcftools merge file_A.vcf.gz file_B.vcf.gz | bcftools view -c1 | bcftools view --drop-genotypess -O z -o sites_with_at_least_1_alternative_allele.vcf.gz

bcftools view -R  sites_with_at_least_1_alternative_allele.vcf.gz  file_A.vcf.gz -O z -o  file_A_at_least_one_alt_allele.vcf.gz 

bcftools view -R  sites_with_at_least_1_alternative_allele.vcf.gz  file_B.vcf.gz -O z -o file_B_at_least_one_alt_allele.vcf.gz