Biostar Beta. Not for public use.
intersect VCF files
3
Entering edit mode
15 months ago
Bogdan • 780
Palo Alto, CA, USA

Dear all,

please would appreciate a quick recommendation about the best way to intersect 2 VCF files. many thanks,

bogdan

myposts VCF • 5.6k views
ADD COMMENTlink
0
Entering edit mode

Do you intent to compare the VCFs for similarities? i.e. SNPs common to both?

ADD REPLYlink
0
Entering edit mode

thanks gentlemen !

ADD REPLYlink
8
Entering edit mode
14 months ago
Len Trigg ♦ 1.3k
New Zealand

Unfortunately, vcftools, bcftools, and bedops all completely ignore the fact that in many cases there are multiple equivalent ways to express the same variant, so your input vcfs may look different but actually be saying the same thing about the underlying haplotypes. You can alleviate some of this by applying normalization, but you should cut to the chase by using a haplotype-aware vcf comparison tools such as rtg vcfeval, hap.py, or vgraph. I would recommend vcfeval as it is the most straight forward for using in an intersection type use-case (much like bcftools isec, you get separate output VCFs containing the unique and shared variants, preserving the input representations and annotations).

ADD COMMENTlink
1
Entering edit mode

You should mention that you work for RTG when recommending the software - however its open source and really awesome, so thank you! :)

ADD REPLYlink
0
Entering edit mode

I tried using rtg vcfeval and dear god is it a pain to install the first time! curl -O on a HPC gives me no way of unzipping the file, no matter what tool I use, and once I download to local Mac+unzip+scp, the SDF file for the actual intersect operation is either to be created from scratch or the pre-formatted databases download at the rate of 1 gig/4 hours. If this is the initial investment that goes into using this tool, why would I use it?

ADD REPLYlink
1
Entering edit mode

Sorry to hear you had difficulty. I had no idea that curl has problems with bog standard github release download links (wget handles these fine). Apparently you need to give curl the -L option, e.g:

$ curl -L -O https://github.com/RealTimeGenomics/rtg-tools/releases/download/3.7.1/rtg-tools-3.7.1-nojre.zip
$ unzip -q rtg-tools-3.7.1-nojre.zip
$ ./rtg-tools-3.7.1/rtg  # answer y or n when prompted

Regarding the SDF creation, it's trivial difficult to make your own, and is no harder than creating a BWA reference index or faidx index, e.g:

$ ./rtg-tools-3.7.1/rtg format -o hs37d5.sdf /path/to/hs37d5.fa.gz
ADD REPLYlink
0
Entering edit mode

Thank you - I will try rtg the next time I have to intersect VCF files. In the meantime, I'll try and prep the SDF file.

ADD REPLYlink
0
Entering edit mode

I installed rtg-tools 3.8.4 using bioconda. When I run vcfeval the output contains 4 VCFs:

fn.vcf.gz
fp.vcf.gz
tp.vcf.gz
tp-baseline.gz

Judging from the slides and the preprint, the fn (false negative) file contains calls that are in the baseline but not in the calls file, the fp (false positive) file contains calls that are in the calls but not in the baseline files and the tp (true positive) contains variants which are represented in both. Is this correct?

What then is the tp-baseline file?

And I see from the log that variants with alleles over 1000 bases are skipped. Is this cutoff limit configurable somehow?

BTW I'm using this to compare variants in a bacterial genome.

ADD REPLYlink
0
Entering edit mode

tp.vcf contains the called variants which are present in both.
tp-baseline.vcf contains the baseline variants which are present in both.

So the tp.vcf is equivalent to the tp-baseline.vcf, but the former is using the calls representation, and the latter is using the baseline representation. If you haven't already looked, there is more information in the user manual section for vcfeval (which unfortunately you don't get with bioconda).

You can adjust the maximum allele length with --Xmax-length=N. --X flags usually require caution when using. In the case of allele length, synchronization points (since you've read the preprint) only occur where there is no variant in play, so if you have very long alleles as well as many short alleles within the same span (in either vcf), computational complexity can blow out. It's probably less of a concern with bacterial variants, since you'll presumably either have haploid variants or be using --squash-ploidy.

ADD REPLYlink
6
Entering edit mode
2.1 years ago
Mitch Bekritsky ♦ 1.1k
London, England

bcftools isec is very easy to use. With a sample command line like

bcftools isec -p isec_output -Oz 1.vcf.gz 2.vcf.gz

You can get four output files in gzipped VCFs

  1. isec_output/0000.vcf.gz would be variants unique to 1.vcf.gz
  2. isec_output/0001.vcf.gz would be variants unique to 2.vcf.gz
  3. isec_output/0002.vcf.gz would be variants shared by 1.vcf.gz and 2.vcf.gz as represented in 1.vcf.gz
  4. isec_output/0002.vcf.gz would be variants shared by 1.vcf.gz and 2.vcf.gz as represented in 2.vcf.gz

It's also faster than vcftools :)

ADD COMMENTlink
1
Entering edit mode

thanks Mitch !

ADD REPLYlink
4
Entering edit mode
15 months ago
venu 6.2k
Germany

Quickly recommended vcftools

ADD COMMENTlink
1
Entering edit mode

Just important to reiterate what Len mentioned below - vcftools, bcftools, and bedops all will NOT get the right answer in many cases! Even with careful normalization (for instance, using vt or vcfallelicprimitives), there will still be equivalent / matching variants that are not appropriately handled. As of today, the only tools that do intersections correctly are vgraph, vcfeval, and hap.py.

ADD REPLYlink
4
Entering edit mode
15 months ago
Seattle, WA USA

You could use BEDOPS bedops for this. There are a couple approaches, depending on what you mean by intersect.

There is the set operation on elements, which is probably what most mean by "intersect":

$ bedops --element-of 1 <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > answer.bed

An element-of operation is not a symmetric result — in this case, the result contains elements of one.vcf that overlap elements in two.vcf. You would reverse the order of inputs to get the other result.

Alternatively, and perhaps less commonly, you could do an operation on genomic regions:

$ bedops --intersect <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > answer.bed

The by-region operation is symmetric; inputs can be entered in any order to get the same result.

If you want VCF elements "common" to or overlapping between both input files, you can do the following series of operations:

$ bedops --intersect <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > common-regions.bed
$ bedops --everything <(vcf2bed < one.vcf) <(vcf2bed < two.vcf) > all-elements.bed
$ bedops --element-of 1 all-elements.bed common-regions.bed > common-elements.bed
ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1