Filter a VCF by genotype posterior probabilities (GP) obtained when imputing with beagle
0
1
Entering edit mode
2.9 years ago
Alejandro ▴ 10

Hi all, I imputed a set of genotypes by beagle using the following function:

java -jar beagle.jar gt=target.vcf ref=ref.vcf gp=true out=imputed

With gp = true to obtain the genotypes posterior probabilities of 0|0,0|1,1|0 and 1|1. The result obtained look like this:

##fileformat=VCFv4.2
##source="beagle.18May20.d20.jar"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] and true REF dose">
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + 2*P(AA)]">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO   FORMAT   A1 A2 A3
1       41466   oar3_OAR1_38175 G       A       .       PASS    DR2=0.01;AF=0.1125;IMP  GT:DS:GP        0|0:0.2:0.81,0.18,0.01  0|0:0.2:0.81,0.18,0.01  0|0:0.27:0.75,0.24,0.02

Once imputed, I wanted to filter by the genotype posterior probability (GP), so I used an instruction that I found and modified with the help of bcftools, for example if I wanted filter by GP>=80 and exclude (-e) the genotypes with less probability (they will appear as "./."):

bcftools +setGT imputed.vcf.gz -- -t q -n . -e 'GP>0.80' | bcftools query -f'%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER\t%INFO[\t%GT]\n'

And the result should be something like that:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO   FORMAT   A1 A2 A3
1       41466   oar3_OAR1_38175 G       A       .       PASS    DR2=0.01;AF=0.1125;IMP  GT     0|0  0|0  ./.

Because the first and second animal (A1 & A2) has a highest genotype probability of 0.81 to 0|0 genotype, but A3 has less. The problem is that according to the probability that you select, the "exclude" command puts as missign or maintine with their value wrong genotypes. For example if I select GP>=85 the result obtained is:

1       41466   oar3_OAR1_38175 G       A       .       PASS    DR2=0.01;AF=0.1125;IMP  GT      ./.     ./.     ./. 

But if I select GP>=99 the program returns something wrong to my understanding, like this:

1       41466   oar3_OAR1_38175 G       A       .       PASS    DR2=0.01;AF=0.1125;IMP  GT      0|0     0|0     0|0 

These three genotypes should appear as missing, but it returns them to me as if they had passed the filter. I'm sure it is something that I'm not understanding well about the command, or perhaps there is another better command to do it, could you help me?. Thanks!

genotypes imputation vcf bcftools beagle • 2.2k views
ADD COMMENT
0
Entering edit mode

Try bcftools +setGT test.vcf -- -t q -n . -i'FORMAT/GP>=0.80'

ADD REPLY

Login before adding your answer.

Traffic: 2774 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6