Creation of a VCF file from scratch with python
5.2 years ago

Hello! I want to make a VCF file with a header line like "#CHROM POS REF ALT". Is it possible to create such a VCF file from scratch with Python?

Thanks in advance


You can do this easily with the pandas package.


Certainly. But what advantage are you going to gain by doing that? Are you trying to simulate data?


I want to create a header that allows me to save some data that is not in the template CSV. So I want either to create a new CSV in which I can save those fields, or to modify a template in order to add them.


Yes, it is possible in Python, Perl, Java, etc. ;)

Please elaborate on what exactly you are trying to do.

5.2 years ago

Assuming you're using Python, you can start with a template like so (make sure your chromosome lengths are correct for your assembly):

##fileformat=VCFv4.1
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chrM,length=16571>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    Path
%CHROMOSOME %POSITION   %ID %REF    %ALT    .   .   .

Then read this file in line by line. Keep each line in a list, extract the last line as a "template", and remove it from the list. Parse the template for its placeholders, then read in your variant data as a list of variant tuples. Pair each tuple's fields with the placeholders and substitute them. Stub example:

        # variant_template is the last line of the template file, as a string.
        # The placeholder order must match the field order of each variant tuple;
        # ID is included because the template line contains %ID.
        PLACEHOLDERS = ["%" + name for name in "CHROMOSOME,POSITION,ID,REF,ALT".split(",")]
        output = variant_template
        for placeholder, value in zip(PLACEHOLDERS, variant_tuple):  # Pair up placeholders and variant data
            output = output.replace(placeholder, str(value))  # str() so integer positions also work

Append output to the file. Do this for each variant.
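Putting the steps above together, a minimal sketch using only the standard library might look like this. The helper names `fill_template` and `write_vcf_from_template`, the file paths you pass in, and the example variants are all hypothetical. It assumes the last line of the template is the placeholder line, that fields are tab-separated, and that every variant tuple is ordered as (chromosome, position, ID, REF, ALT).

```python
from pathlib import Path

# Placeholder names, in the same order as the fields of each variant tuple.
# Assumption: the template's last line uses exactly these names.
PLACEHOLDERS = ["%CHROMOSOME", "%POSITION", "%ID", "%REF", "%ALT"]


def fill_template(variant_template, variant):
    """Substitute one variant's fields into the placeholder line."""
    output = variant_template
    for placeholder, value in zip(PLACEHOLDERS, variant):
        output = output.replace(placeholder, str(value))
    return output


def write_vcf_from_template(template_path, out_path, variants):
    """Copy the header from the template, then write one line per variant."""
    lines = Path(template_path).read_text().splitlines()
    # Everything except the last line is header; the last line is the template.
    header, variant_template = lines[:-1], lines[-1]
    records = [fill_template(variant_template, v) for v in variants]
    Path(out_path).write_text("\n".join(header + records) + "\n")


# Hypothetical example data: (chromosome, position, ID, REF, ALT).
example_variants = [
    ("chr1", 12345, "var1", "A", "G"),
    ("chr2", 67890, ".", "C", "T"),
]
```

Calling `write_vcf_from_template("template.vcf", "out.vcf", example_variants)` would then produce the header followed by two data lines. Writing everything in one pass avoids reopening the output file for every variant.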


Thank you! I will try this.

5.2 years ago

A VCF file is a plain text file that follows the rules of the VCF specification. As long as you respect those rules, you can create the file however you want.

Be careful: many tools are satisfied if the VCF contains just the one header line holding the column names: #CHROM POS ID REF ALT QUAL FILTER INFO. That is not enough for a fully valid VCF. For that, the header should also include:

  • information about the file format version: ##fileformat=VCFv4.3
  • information about contig length for each contig used in the file, e.g. ##contig=<ID=chr1,length=249250621>
  • information about each key used in the INFO or FORMAT column, e.g.:
    • ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
    • ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

If you take this into account right from the start, you will avoid problems when using different VCF tools later. bcftools in particular is very strict about the header values.
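As an illustration of the header elements listed above, here is a minimal sketch in plain Python that writes a file format line, contig lines, INFO definitions, the column line, and tab-separated records. The function names `build_header`, `format_record`, and `write_minimal_vcf` and the example record are hypothetical, and the sketch does not validate anything against the specification.

```python
import io


def build_header(contigs, info_fields):
    """Return header lines: file format, contigs, INFO definitions, column names."""
    lines = ["##fileformat=VCFv4.3"]
    for name, length in contigs:
        lines.append(f"##contig=<ID={name},length={length}>")
    for key, number, type_, description in info_fields:
        lines.append(
            f'##INFO=<ID={key},Number={number},Type={type_},Description="{description}">'
        )
    columns = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
    lines.append("\t".join(columns))  # VCF columns are tab-separated
    return lines


def format_record(chrom, pos, ref, alt, info, var_id=".", qual=".", filt="."):
    """Return one data line; info is a dict of INFO key/value pairs."""
    info_text = ";".join(f"{k}={v}" for k, v in info.items()) or "."
    return "\t".join([chrom, str(pos), var_id, ref, alt, qual, filt, info_text])


def write_minimal_vcf(handle, contigs, info_fields, records):
    """Write the header followed by one line per record to an open text handle."""
    for line in build_header(contigs, info_fields):
        handle.write(line + "\n")
    for record in records:
        handle.write(format_record(**record) + "\n")


# Hypothetical example: one contig, one INFO key, one variant.
buffer = io.StringIO()
write_minimal_vcf(
    buffer,
    contigs=[("chr1", 249250621)],
    info_fields=[("DP", 1, "Integer", "Raw read depth")],
    records=[{"chrom": "chr1", "pos": 12345, "ref": "A", "alt": "G", "info": {"DP": 30}}],
)
print(buffer.getvalue(), end="")
```

Every INFO key used in a record should have a matching `##INFO` line in the header; the sketch leaves that check to the caller. To write to disk instead of a buffer, pass a handle from `open("out.vcf", "w")`.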

When working with Python, you could also consider using one of the available modules for reading and creating VCF files, such as pysam or cyvcf.
