Biostar Beta. Not for public use.
Extracting genotypes and filtering from a VCF file containing multiple exomes of families
0
Entering edit mode
18 months ago

Hi all, I'm currently trying to extract genotypes from a VCF file but I'm not really sure how to approach this. This is how the data line looks like:

12      2224727 .       C       T       6204.77 PASS    0.0010  rs377534958     CACNA1C GT:AD:DP:GQ:PGT:PID:PL:SAC      0/0:5,0:5:12:.:.:0,12,180 0/1:7,3:10:86:.:.:86,0,2
38:0,7,0,3      0/0:44,0:44:99:.:.:0,108,1620   0/0:35,0:35:99:.:.:0,99,1422    0/0:33,0:33:99:.:.:0,99,1262

I'm trying to filter based off GQ and then extract the genotypes of all the individuals with RSID (i.e. rs377534958 0/0 0/1 0/0 0/2) information. Does anyone have any suggestions on how to do this? I've never worked with VCF files containing multiple individuals so this is all very new to me. Thank you so much :)

ADD COMMENTlink
0
Entering edit mode

Thank you so much for the help! I'm trying to more so get my file to be like the following:

FamilyID IndividualID rs1 rs2 rs3 rs4 rs5 ... rsN fam1 1 0/0 0/1 0/1 0/0 0/0 fam1 2 0/2 0/1 0/1 0/1 0/0

if that makes sense - kind of like your first example but a bit more parallel. The thing is, I have to filter on the GQ before hand to prevent any false positives but since the vcf file contains ~200 individuals and I'd want multiple SNPs, that's what has been really confusing for me. I inherited the data from the previous student and I'm not used to this kind of formatting.

ADD REPLYlink
0
Entering edit mode

Is Family ID even encoded in your VCF? What exactly are you aiming to do?

ADD REPLYlink
1
Entering edit mode
13 months ago
Republic of Ireland

Yes, working with multi-sample VCFs can be difficult because most filtering methods only deal with single-sample VCFs. Also, the VCF format is not overly amenable to storing multiple samples. For example, the QUAL field can only hold a single value yet has to represent all of the samples. According to one of my mentors, this is why people then go off and develop custom solutions, as I repeatedly do with VCF data.

I believe what you want may be possible with a program called SnpEff, but here's my own solution:

1, filter for your rs ID of interest (must be set in ID field)

bcftools view --include 'ID=="rs71507462"' ProjectMerge.rsID.bcf 

1   762592  rs71507462  C   G   3684.68 VQSRTrancheSNP99.00to99.90;LowQual;SNP_SPECIFIC_FILTERS DB;Dels=0;FS=0;HaplotypeScore=0;MQ=41.99;MQ0=1;NEGATIVE_TRAIN_SITE;QD=27.98;VQSLOD=-3.612;culprit=MQ;set=FilteredInAll;BaseQRankSum=-0.72;MQRankSum=-0.72;ReadPosRankSum=-0.72;InbreedingCoeff=-0.1805;DP=2013;AF=0.81;MLEAC=34;MLEAF=0.81;AN=594;AC=514    GT:AD:DP:GQ:PL  ./.:.:.:.:. 1/1:1,5:6:15:196,15,0   1/1:0,15:15:30:350,30,0 0/1:4,2:6:47:47,0,124   1/1:0,5:5:12:145,12,0   1/1:0,5:5:12:132,12,0   1/1:0,5:5:12:143,12,0   1/1:0,6:6:15:179,15,0   1/1:0,6:6:12:166,12,0   1/1:0,8:8:18:213,18,0   1/1:0,8:8:15:183,15,0   1/1:4,14:18:27:335,27,0 0/1:3,7:10:79:163,0,79  ./.:.:.:.:. 1/1:0,8:8:15:168,15,0   1/1:1,14:15:33:389,33,0 0/1:6,3:9:54:54,0,149   1/1:3,5:8:12:150,12,0   1/1:0,6:6:15:195,15,0   1/1:1,4:5:6:72,6,0  1/1:0,3:3:6:52,6,0  1/1:0,7:7:18:207,18,0   ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. 1/1:1,4:5:6:63,6,0  1/1:1,3:4:6:80,6,0

2, 'melt' the the data into long format and ensure that we output the genotype (GT) and the genotype quality (GQ)

bcftools view --include 'ID=="rs71507462"' ProjectMerge.rsID.bcf | bcftools query -f'[%CHROM:%POS:%REF:%ALT:%SAMPLE:%GT:%GQ\n]'

1:762592:C:G:16422:./.:.
1:762592:C:G:16427:1/1:15
1:762592:C:G:16451:1/1:30
1:762592:C:G:16456:0/1:47
1:762592:C:G:17412:1/1:12
1:762592:C:G:17413:1/1:12
1:762592:C:G:17417:1/1:12
1:762592:C:G:17418:1/1:15
1:762592:C:G:17422:1/1:12
1:762592:C:G:17423:1/1:18
1:762592:C:G:17427:1/1:15
1:762592:C:G:17428:1/1:27
1:762592:C:G:17432:0/1:79
1:762592:C:G:17433:./.:.
1:762592:C:G:17437:1/1:15
1:762592:C:G:17438:1/1:33

3, use a simple awk filter on the 7th column (GQ value) to select those GQs > 10, then tabulate the genotypes

bcftools view --include 'ID=="rs71507462"' ProjectMerge.rsID.bcf | bcftools query -f'[%CHROM:%POS:%REF:%ALT:%SAMPLE:%GT:%GQ\n]' | awk -F ":" '$7>10' | cut -f6 -d":" | sort | uniq -c

      4 0/0
     51 0/1
    135 1/1

4, same as #3 but filter for GQ > 30

bcftools view --include 'ID=="rs71507462"' ProjectMerge.rsID.bcf | bcftools query -f'[%CHROM:%POS:%REF:%ALT:%SAMPLE:%GT:%GQ\n]' | awk -F ":" '$7>30' | cut -f6 -d":" | sort | uniq -c

     33 0/1
      5 1/1

-----------------------------------------------

--------------------------------------------

This script (above) is only valid for a single variant lookup. If your output data for more than one rs ID at the same time, you'll have to modify it. For example, you can 'tally up' (count up) the genotypes of each variant with GQ > 30 across all individuals as follows:

bcftools view ProjectMerge.rsID.bcf | head -100 | bcftools query -f'[%ID:%GT:%GQ\n]' | awk -F ":" '$3>30' | cut -f1-2 -d":" | sort | uniq -c

     31 rs140739101:0/0
      1 rs140739101:0/1
     16 rs142004627:0/0
      1 rs142004627:0/1
      1 rs150690004:1/1
     23 rs190717287:0/0
      1 rs190717287:0/1
      1 rs200505207:0/1
      1 rs62639105:0/0
     21 rs62639105:0/1
      4 rs75062661:0/0
      4 rs75062661:0/1
    214 rs75062661:1/1

.

bcftools view ProjectMerge.rsID.bcf | head -100 | bcftools query -f'[%CHROM:%POS:%REF:%ALT:%ID:%GT:%GQ\n]' | awk -F ":" '$7>30' | cut -f1-6 -d":" | sort | uniq -c

      1 1:66162:A:T:rs62639105:0/0
     21 1:66162:A:T:rs62639105:0/1
     31 1:69428:T:G:rs140739101:0/0
      1 1:69428:T:G:rs140739101:0/1
     16 1:69453:G:A:rs142004627:0/0
      1 1:69453:G:A:rs142004627:0/1
      1 1:69496:G:A:rs150690004:1/1
      4 1:69511:A:G:rs75062661:0/0
      4 1:69511:A:G:rs75062661:0/1
    214 1:69511:A:G:rs75062661:1/1
     23 1:69534:T:C:rs190717287:0/0
      1 1:69534:T:C:rs190717287:0/1
      1 1:69761:A:T:rs200505207:0/1

----------------------------

Other scripts of mine that you may find of use:

Kevin

ADD COMMENTlink
0
Entering edit mode
14 months ago
France/Nantes/Institut du Thorax - INSE…

using vcffilterjdk to select the called variant (not './.') having a GQ greater than 20. http://lindenb.github.io/jvarkit/VcfFilterJdk.html

java -jar dist/vcffilterjdk.jar -e 'return variant.getGenotypes().stream().allMatch(G->!G.isCalled() || G.getGQ()>20);' input.vcf.gz
ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1