Question

Extract vcf file header and save it in Key-Value csv

0

Entering edit mode

5.4 years ago

tinyskinn • 0

Hi everyone,

If I have a vcf file with header like this:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA0000

How can I extract (with bcftools) and save/print these key-value pairs to a csv csv file with python. My expected output would look something like:

fileformat,VCFv4.0
fileDate,20090805

etc...

Thanks for the anticipated response

genome SNP next-gen-sequencing • 9.5k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 5.4 years ago by tinyskinn • 0

1

Entering edit mode

Hello tinyskinn ,

what is your final goal? When working with python on vcf files, you should take one of the existing modules like pysam or cyvcf2.

fin swimmer

ADD REPLY • link 5.4 years ago by finswimmer 16k

0

Entering edit mode

Hello Fin, my final goal is to be able to have an annotated metadata as in enter link description here. And this would require me to automate the extraction of the header Key-Value info. I am trying to implement this in enter link description here. And thank you for the suggestions.

ADD REPLY • link 5.4 years ago by tinyskinn • 0

1

Entering edit mode

I don't understand how your expected output is comma separated for the first line, but separated by a = for the second line.

Also, what have you tried to solve this issue? Why do you want to use bcftools and python, exactly? Mentioning these suggest that you at least have an idea which tools/language can be used.

ADD REPLY • link 5.4 years ago by WouterDeCoster 47k

0

Entering edit mode

regarding your first question, I have corrected the error. Thanks! to your second question, I actually do not have to use bcftools or python. From my search, they were the ones that came up. And my goal is explained my above answer to @finswimmer

ADD REPLY • link 5.4 years ago by tinyskinn • 0

score 1 · Answer 1 · 2018-12-03

1

Entering edit mode

5.4 years ago

Nicolas Rosewick 11k

use classic shell for that (not tested though):

cat file.vcf | grep '^##' | sed 's/=/\,/g' | sed 's/#//g' > header.txt

ADD COMMENT • link 5.4 years ago by Nicolas Rosewick 11k

0

Entering edit mode

It returned the original header including the ''##'. How about if the vcf is also .vcf.gz ?

ADD REPLY • link 5.4 years ago by tinyskinn • 0

0

Entering edit mode

I added an additional sed. But I didn't try it though.

ADD REPLY • link 5.4 years ago by Nicolas Rosewick 11k

0

Entering edit mode

For .gz use gzip -d -c instead of cat

ADD REPLY • link 5.4 years ago by Nicolas Rosewick 11k

0

Entering edit mode

Or just zgrep instead of gzip+grep.

ADD REPLY • link 5.4 years ago by finswimmer 16k

0

Entering edit mode

I'd use grep '^#' instead to capture the header for the table

ADD REPLY • link 3.8 years ago by timing ▴ 20

score 0 · Answer 2 · 2022-08-25

0

Entering edit mode

20 months ago

tweep • 0

This will do :

tabix -H sample.gvcf.gz

ADD COMMENT • link 20 months ago by tweep • 0