Biostar Beta. Not for public use.
Access array in bcftools
0
Entering edit mode
12 months ago
WCIP | Glasgow | UK

Hi- I'm filtering a VCF file with bcftools view according the a FORMAT field with two items. How can I access the n-th item in the m-th sample?

For example, given this vcf file:

##fileformat=VCFv4.0
...
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4

The AD tag has two elements which in the first sample are "9,1". I want to filter records where the second item in the first sample is greater than something (say 3). I thought this should work and it should exclude the record above because 1 < 3, but it doesn't:

bcftools view -i 'AD[0][1] > 3' recs.vcf # Access AD in first sample, second element

Is this possible at all with bcftools?

Thanks

ADD COMMENTlink
1
Entering edit mode
12 months ago
WCIP | Glasgow | UK

Better late then never... This works with bcftools 1.9 (but not with 1.5, I haven't tried other versions).

Access the AD tag of the first sample (0) and get the second item (1). Require the returned value to be > 10:

bcftools-1.9/bcftools view --include 'FORMAT/AD[0:1] > 10'

This is documented at https://samtools.github.io/bcftools/bcftools.html#expressions

ADD COMMENTlink
1
Entering edit mode
11 months ago
Republic of Ireland

Provided the formatting is consistent all the way through the VCF or BCF, I would avoid tools like bcftools for filtering and just use AWK:

Here is an example:

cat test.vcf 
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,3 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

Greater than 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] > 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

Less than 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] < 3) print $0}'
##fileformat=VCFv4.0
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,1 0/0:29,4

Greater than or equal to 3

cat test.vcf | awk '/^#/{print $0}; {split($10, format, ":"); split(format[2], ad, ","); if (ad[2] >= 3) print $0}'
##fileformat=VCFv4.0
##FORMAT=<ID=AD,Number=2,Type=Integer,Description="Allele depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  normal  tumour
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,3 0/0:29,4
chr1    14397   .   CTGT    C   .   PASS    SVTYPE=DEL  GT:AD   0/0:9,5 0/0:29,4

In these examples, I use AWK's split function in order to isolate the AD from column 10 (FORMAT), and then the second part of AD. Then it's just a simple if statement. The first part of the AWK command will ensure that the VCF header is output.

Kevin

ADD COMMENTlink
2
Entering edit mode

Hi Kevin, thanks for replying (+1).

I would avoid tools like bcftools for filtering and just use AWK

Actually, I would argue the opposite. I try to avoid parsing VCF records as raw strings as they are quite complex and it's easy to overlook some details. Since bcftools seems well tried and tested I'd use that instead.

ADD REPLYlink
1
Entering edit mode
12 months ago
France/Nantes/Institut du Thorax - INSE…

using vcffilterjdk: http://lindenb.github.io/jvarkit/VcfFilterJdk.html

 java -jar dist/vcffilterjdk.jar -e 'Genotype G=variant.getGenotype(0); return G.hasAD() && G.getAD().length>1 &&  G.getAD()[1]>3;' input.vcf
ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1