How to add custom fields to the VCF files?
3
1
Entering edit mode
7.6 years ago
kirannbishwa01 ★ 1.6k

I need to add a context dependent custom tags in the FORMAT and SAMPLE column of a vcf file.

Here is my sample file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2ms01e  
2   5187    .   G   C   2535.90 PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   0/1:47,17:64:99:965,0,1925:1|0:.,.,.,.,.,.,.,.:662:|:0.5    
2   5207    .   T   A   2814.90 PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   0/1:41,30:71:99:955,0,1628:1|0:.,.,.,.,.,.,.,.:662:|:0.5        
2   5238    .   G   C   8321.94 PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   0/1:48,59:107:99:1986,0,1336:0|1:.,.,.,.,.,.,.,.:662:|:0.5      
2   5258    .   T   C   10415.90    PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC   0/1:75,45:120:99:1369,0,2434:1|0:.,.,.,.,.,.,.,.:662:|:0.5

Step 01: Look for phasing information at a particular variant site. For line #1, GT is 0/1 and PG is 1|0

Step 02: I need to take the information from the PG field (phased genotype) and update the vcf by creating a new field HP (haplotype phase) at the end of the FORMAT. The HP is the default output from GATK RBphasing but my other tool cannot output this field.

The output file should be this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2ms01e  
2   5187    .   G   C   2535.90 PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:HP 0/1:47,17:64:99:965,0,1925:1|0:.,.,.,.,.,.,.,.:662:|:0.5:5187-2,5187-1 
2   5207    .   T   A   2814.90 PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:HP 0/1:41,30:71:99:955,0,1628:1|0:.,.,.,.,.,.,.,.:662:|:0.5:5207-2,5207-1     
2   5238    .   G   C   8321.94 PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:HP 0/1:48,59:107:99:1986,0,1336:0|1:.,.,.,.,.,.,.,.:662:|:0.5:5238-1,5238-2       
2   5258    .   T   C   10415.90    PASS        GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:HP 0/1:75,45:120:99:1369,0,2434:1|0:.,.,.,.,.,.,.,.:662:|:0.5:5258-2,5258-1

Explanation of the process: 1) We need to look at the PG field first and see if the particular locus is phased with any variants before-after it. If its phased the value for the PG field will have a pipe (|). In the first line the GT (genotype) is 0/1 (heterozygous) and the PG is 1|0.

2) Identify the positions (POS) of the heterozygous site. For the first line its 5187. So, we need to pick this value and connect to value with '-' to the value obtained in step 3. - this steps store the value of the PG field as ':5187-'

3) So, with PG 1|0, the HP field should indicate the alternate allele (1) is reported first with value 2. - this step store the value for the field as ':5178-2'

4) Now, complete the HP field to indicate the value of another allele, with reported phase value as 1. - this step stores the value for the PG field as ':5178-2,5187-1'

So, GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC 0/1:47,17:64:99:965,0,1925:1|0:.,.,.,.,.,.,.,.:662:|:0.5
translates to GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC :HP 0/1:47,17:64:99:965,0,1925:1|0:.,.,.,.,.,.,.,.:662:|:0.5:5187-2,5187-1

Note: - At some site the variant might be 1/2 (or say 1|2 in PG), so this needs to be considered too. - I tried to looks for several variant annotation software including PESEQ, Annnovar, vcftools etc. but not finding any thing that can specifically do this. If there is any such tool that is even more great. - It would be best if the program can be in python with description of what is happening at each step. I am trying to learn python so that would be of great help too.

Thanks much in advance ! - Bishwa K.

vcf phasing SNP annotation manipulation • 11k views
ADD COMMENT
0
Entering edit mode

At the mean time I am trying to learn pyvcf. But, I would appreciate any programming help specifically in regards to the given question. I would want this solution sooner than later, so.

Thanks,

ADD REPLY
0
Entering edit mode

I provided a template.

ADD REPLY
3
Entering edit mode
7.6 years ago

It should be fairly simple with any VCF parsers. Look at pysam, pyVCF

Here is a template:

First you need to add HP to the header.

import pysam

#read the input file
myvcf=pysam.VariantFile("in.vcf","r")

# Add the HP field to header. Say its a string and can take any no. of values. It depends what format you want to give.
myvcf.header.formats.add("HP",".","String","Some description")

# An example showing how to add 'HP' information. Please explore further.    
for variant in myvcf:
    for sample in variant.samples:
        variant.samples[sample]['HP']="Hola"
    print variant

.

Input: GT:AD:DP:GQ:PL    0/1:12,13:25:99:358,0,318
Output: GT:AD:DP:GQ:PL:HP    0/1:12,13:25:99:358,0,318:Hola
ADD COMMENT
0
Entering edit mode

Thanks, but I think I need to add the HP field. There as a tab inserted infront of Hp field in my previous sample data, it was just a typo.

I will try what I can do with pyVCF. I really need to learn to do some python programming.

Thanks,

ADD REPLY
3
Entering edit mode
7.6 years ago
dshulgin ▴ 260

https://github.com/dshulgin2015/computing_python/blob/master/issue_1.py

Works well, if someone else wants to do the same.

ADD COMMENT
0
Entering edit mode
5.6 years ago
sm.hashemin ▴ 90

Is there a way to take a field from the INFO column and copy it or do a calcuation and feed it to sample column? I have one sample vcf files which to not contain format and sample column, and all data is in INFO column. really appreaciate if you could help a bro out! Mo

ADD COMMENT
1
Entering edit mode

It is surprising that you don't have SAMPLE and FORMAT. I suggest using awk to move the data around or extract the data out of your INFO column. If you are familiar with python that should make it even easier.

If you are fairly new to bioinformatics and not familiar with building your own scripts/code, I suggest using this tool: https://github.com/everestial/VCF-Simplify to convert data between TABULAR and VCF format. Let me know if you have any question while using this tool.

ADD REPLY
0
Entering edit mode

the tool was really helpful. I could use it as nonbiologist/bioinformaticain.

ADD REPLY

Login before adding your answer.

Traffic: 2447 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6