Remove variants that are homozygous-reference (0/0) in two VCF files
3
0
Entering edit mode
7.8 years ago
KJ ▴ 10

Dear all,

Currently I have two large VCF files that include calls that are homozygous for the reference allele. For easier analysis, I would like to remove the variants that are homozygous-ref in both VCF files (or, in VCF-speak, be "0/0" for both samples at the same locus). I can't be the first to want to do this, but wasn't able to find anything of use.

INPUT

sample1.vcf

20      60055   .       A       .       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  0/0:25:25:99:PASS
20      60056   .       G      A.       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  0/1:12,13:25:99:PASS,PASS
20      60057   .       T       .       35      PASS    DP=26;PF=20;MF=6;MQ=60;SB=0.769 GT:AD:DP:GQ:FL  0/0:26:26:99:PASS
20      60058   .       C      T       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  1/1:25:25:99:PASS

sample2.vcf

20      60055   .       A       .       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  0/0:25:25:99:PASS
20      60056   .       G       .       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  0/0:25:25:99:PASS
20      60057   .       T       .       35      PASS    DP=26;PF=20;MF=6;MQ=60;SB=0.769 GT:AD:DP:GQ:FL  0/0:26:26:99:PASS
20      60058   .       C      T       35      PASS    DP=26;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  1/1:26:26:99:PASS

to:

OUTPUT

sample1.vcf

20      60056   .       G      A.       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  0/1:12,13:25:99:PASS,PASS
20      60058   .       C      T       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  1/1:25:25:99:PASS

sample2.vcf

20      60056   .       G       .       35      PASS    DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  0/0:25:25:99:PASS
20      60058   .       C      T       35      PASS    DP=26;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL  1/1:26:26:99:PASS

Some notes:

  • The files are around 260 GB big
  • I would like to keep the files seperate (not joining together)
  • The is about 128GB memory available
  • The files are sorted on position (fortunately)

Does anyone have experience with something like this, or could point me into a useful direction? Many thanks.

sequencing next-gen perl awk • 7.0k views
ADD COMMENT
3
Entering edit mode
7.8 years ago
William ★ 5.3k

Why don't you want to merge the VCF files?

Most analysis work with genotype matrixes in the end, not 2 separate genotype arrays. Also for this query you at least temporary somewere need to merge the two vcf files to a genotype matrix. If you want to do any other query on both files you again need to merge them at least temporary.

If you really want to keep the files separate I would do the following.

The first command below merges the two vcf files, then select the sites with at least one alt allele and writes a site only VCF file to disk. This site only VCF file contains the loci you are interested in and can be used for filtering both your files.

bcftools merge file_A.vcf.gz file_B.vcf.gz | bcftools view -c1 | bcftools view --drop-genotypess -O z -o sites_with_at_least_1_alternative_allele.vcf.gz

bcftools view -R  sites_with_at_least_1_alternative_allele.vcf.gz  file_A.vcf.gz -O z -o  file_A_at_least_one_alt_allele.vcf.gz 

bcftools view -R  sites_with_at_least_1_alternative_allele.vcf.gz  file_B.vcf.gz -O z -o file_B_at_least_one_alt_allele.vcf.gz

See also bcftools view documentation

https://samtools.github.io/bcftools/bcftools.html#view

ADD COMMENT
1
Entering edit mode

The reason for not wanting to merge are the fact that 1) bcftools merge seems to output a file that tabix can not index anymore (maybe because of the size?), and 2) the script for the next analysis step already being ready, taking single-sample VCF's as input.

I ended up splitting the files by chromosome with tabix (this turned out to be necessary anyway) and doing a temporary merge using GNU join.

Your answer does the job though (and is the most logical approach is almost all cases) , so therefore accepted as answer, thanks!

ADD REPLY
1
Entering edit mode
7.8 years ago
grep -v '0/0:' FILE.VCF > NEWFILE.VCF

Edit: I misread the original question (thanks, @genomax2). The command above will remove all reference calls, not just the subset that are present in both.

egrep '##|0/0:' FILE.VCF > NEWFILE.VCF

You could use this egrep expression to obtain the VCFs of reference calls in each file, BEDtools 'intersect' to find the common calls, then BEDtools 'subtract' to remove the common calls from the original VCFs. Inelegant but functional.

ADD COMMENT
1
Entering edit mode

This would work for the example snippet above but if the 0/0 entry for a sample is in only one file (but not the other) then it is not clear if @KJ wants to keep it.

ADD REPLY
0
Entering edit mode

Sorry for the confusion, it is indeed my intention to keep loci that are "0/1" or "1/1" in at least one sample. The above solution removes to much data.

ADD REPLY

Login before adding your answer.

Traffic: 2426 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6