Subset VCF file by comparing CHROM,POS,REF and ALT
1
2
Entering edit mode
6.4 years ago
Ram 43k

Hello everyone,

I've looked at vcftools, bcftools, bedops, bedtools and GATK SelectVariants, but I don't see how I can address my requirement - I hope the community can help.

I want to subset a VCF file based on 4 attributes: CHROM, POS, REF and ALT. I have 2 VCF files, and I wish to exclude all entries from VCF1 where it matches those 4 values in VCF2. Most of the above-mentioned tools work only on CHROM and POS, even if they accept a VCF file to filter by. There is no way they compare an exact match to ALT allele, even if both VCF inputs were processed to split all multi-allelic sites.

The closest I can get to is by using bcftools annotate, and by copying over an INFO attribute (say, INFO/AC) with a new name (say, INFO/DUMMY_AC) so I can filter by that new name. The manual on bcftools annotate states:

When REF and ALT are present, only matching VCF records will be annotated.

which works for me when my ALT alleles are split, but does not help me filter, only mark them.

Is there any subset tool that will help me compare by custom attributes or do I have to write my own script for it?

Thank you!

--
Ram

vcf subset • 4.9k views
ADD COMMENT
1
Entering edit mode

I'd say your requirements are sufficiently complicated to go to a scripting approach, for which I would use cyvcf2 (python).

ADD REPLY
0
Entering edit mode

HI Ram,

Extract and write records from A shared by both A and B using exact allele match

bcftools isec A.vcf.gz B.vcf.gz -p dir -n =2 -w 1

This should work !!

Thanks
Najeeb

ADD REPLY
0
Entering edit mode

Thank you! It's been a while since I did this exercise so I am not sure if I checked out bcftools isec (I most probably did and either didn't see this example or excluded it for a reason). Would you happen to know what exact allele match means? I don't see its definition anywhere in the documentation. Also, I'd like to exclude those positions, not pick the intersect (but maybe I could have piped it to another command that does the actual exclusion)

ADD REPLY
0
Entering edit mode

Just run bcftools isec on terminal and it gives exact "Extract and write records from A shared by both A and B using exact allele match" message.

ADD REPLY
0
Entering edit mode

How does this add to the conversation exactly? This is neither an explanation of the term "exact allele match" nor is it a pointer to excluding overlaps as opposed to picking them.

ADD REPLY
2
Entering edit mode
6.4 years ago

I've written http://lindenb.github.io/jvarkit/VcfIn.html with the option:

  -A, --allalt
      ALL user ALT must be found in VCF-database ALT

however, i've not much used this tool.

ADD COMMENT
2
Entering edit mode

Of course you've written a tool for this, Pierre! And referenced this post there. How do you do it so fast?

ADD REPLY
1
Entering edit mode

that's what she said.

ADD REPLY
1
Entering edit mode

And reference this post there.

@Pierre is cross-referencing Biostars threads on individual tool pages so their practical use can be documented.

This tool was already written and waiting for you :)

ADD REPLY
0
Entering edit mode

I made a typo - I meant "referenced", not "reference". He'd already cross-ref'd it and I was wondering how he did that so quickly! And yeah, I guessed the tool was already ready (given he didn't call it biostars287815 or some such, so it was the cross-referencing speed that I was shocked by :-)

ADD REPLY
1
Entering edit mode

Just a note for future visitors: I'm accepting this answer because it makes sense, but I haven't tested it either. I wrote my own algorithm using bcftools annotate and a grep.

ADD REPLY
0
Entering edit mode

@RamRS if you still have your solution, I would be happy to take a look :)

ADD REPLY

Login before adding your answer.

Traffic: 2360 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6