Biostar Beta. Not for public use.
Creation of a VCF file from scratch with python
0
15 months ago

Hello! I want to make a VCF file with a header line syntax like "#CHROM POS REF ALT". Is it possible to create such a VCF file from scratch with python?

Thanks in advance

VCF python • 338 views
1

You can do this easily with the pandas package.

0

Certainly. But what advantage are you going to gain by doing that? Are you trying to simulate data?

0

I want to create a header that lets me store some data that is not in the template CSV. So I want either to create a new CSV in which I can save those fields, or to modify a template in order to add them.

0

Yes, it is possible in Python, Perl, Java, etc. ;)

Please elaborate on what exactly you are trying to do.

2
3 months ago
Germany

A VCF file is a plain text file that follows the rules laid out in the VCF specification. As long as you respect these rules, you can create the file however you want.

Be careful: many tools are satisfied if the VCF contains just one header line holding the column names (#CHROM POS ID REF ALT QUAL FILTER INFO). This is not enough for a VCF that every tool will accept. For that, the header should also include:

  • information about the file format version: ##fileformat=VCFv4.3
  • information about contig length for each contig used in the file, e.g. ##contig=<ID=chr1,length=249250621>
  • information about each key used in the INFO or FORMAT column, e.g.:
    • ##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
    • ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

If you take this into account right from the start, you will avoid problems when using different VCF tools later. bcftools in particular is very strict about the header contents.
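The header rules listed above can be sketched in plain Python with no extra modules. This is only a minimal illustration, not a complete writer: the function name write_vcf, its argument layout, and the example variant (chr1, position 12345, A>G, DP=30) are made up for the example. It writes the fileformat line, one contig line per contig, one INFO line per declared key, and refuses to write a record that uses an INFO key missing from the header.

```python
import io


def write_vcf(handle, contigs, info_defs, records):
    """Write a small VCF whose header declares everything the body uses.

    contigs:   list of (name, length) tuples
    info_defs: list of (ID, Number, Type, Description) tuples
    records:   list of (chrom, pos, id, ref, alt, info_dict) tuples
    """
    handle.write("##fileformat=VCFv4.3\n")
    for name, length in contigs:
        handle.write(f"##contig=<ID={name},length={length}>\n")
    for key, number, vtype, desc in info_defs:
        handle.write(
            f'##INFO=<ID={key},Number={number},Type={vtype},Description="{desc}">\n'
        )
    # Column names must be separated by tabs, not spaces.
    columns = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
    handle.write("\t".join(columns) + "\n")

    declared = {definition[0] for definition in info_defs}
    for chrom, pos, vid, ref, alt, info in records:
        undeclared = set(info) - declared
        if undeclared:
            raise ValueError(f"INFO keys missing from header: {sorted(undeclared)}")
        # "." marks a missing value in VCF (here: QUAL, FILTER, or empty INFO).
        info_field = ";".join(f"{k}={v}" for k, v in info.items()) or "."
        fields = [chrom, str(pos), vid, ref, alt, ".", ".", info_field]
        handle.write("\t".join(fields) + "\n")


# Hypothetical example data, written to an in-memory buffer.
buf = io.StringIO()
write_vcf(
    buf,
    contigs=[("chr1", 249250621)],
    info_defs=[("DP", 1, "Integer", "Raw read depth")],
    records=[("chr1", 12345, ".", "A", "G", {"DP": 30})],
)
vcf_text = buf.getvalue()
print(vcf_text)
```

To write to disk instead, pass an open file handle, e.g. open("out.vcf", "w"), in place of the StringIO buffer.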

When working with Python, consider using one of the available modules for reading and writing VCF files, such as pysam or cyvcf.

0
12 months ago
Canada

Assuming you're using Python, you can start with a template like the one below. The contig lengths shown are those of hg19, so make sure they are correct for your assembly:

##fileformat=VCFv4.1
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chrM,length=16571>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO
%CHROMOSOME	%POSITION	%ID	%REF	%ALT	.	.	.

Then read this file in line by line. Keep each line in a list, take the last line out of the list and use it as the "template" row. Parse the placeholders out of the template row, then read in your variant data as a list of variant tuples. Pair each tuple's values with the placeholders and substitute them. Stub example:

        output = variant_template  # The last line of the template file, as a string
        # Placeholders to replace, in template order. You could also build this list
        # by splitting the last row you extracted from the template file.
        PLACEHOLDERS = ["%" + name for name in "CHROMOSOME,POSITION,ID,REF,ALT".split(",")]
        # Pair up placeholders and variant data (assumed to be in the same order).
        for placeholder, value in zip(PLACEHOLDERS, variant_tuple):
            output = output.replace(placeholder, str(value))  # str() so an integer POS works

Append output to the file. Do this for each variant.
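Putting these steps together, a self-contained sketch might look like the following. It is not the answer's exact code: it fills the template row field by field instead of calling str.replace, the function name fill_template and the two variants are made up for illustration, and the template is shortened to a single contig and embedded as a string rather than read from a file.

```python
# Shortened template; in practice this would be read from the template file.
TEMPLATE = (
    "##fileformat=VCFv4.1\n"
    "##contig=<ID=chr1,length=249250621>\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "%CHROMOSOME\t%POSITION\t%ID\t%REF\t%ALT\t.\t.\t.\n"
)


def fill_template(template_text, variants):
    """Return VCF text: the template header plus one filled row per variant."""
    lines = template_text.splitlines()
    # Everything except the last line is header; the last line is the row template.
    header, row_template = lines[:-1], lines[-1]
    fields = row_template.split("\t")
    n_placeholders = sum(1 for field in fields if field.startswith("%"))

    out = list(header)
    for variant in variants:
        if len(variant) != n_placeholders:
            raise ValueError(
                f"expected {n_placeholders} values per variant, got {len(variant)}"
            )
        values = iter(variant)
        # Replace each %PLACEHOLDER with the next value; keep fixed fields ('.') as-is.
        row = [str(next(values)) if field.startswith("%") else field for field in fields]
        out.append("\t".join(row))
    return "\n".join(out) + "\n"


# Hypothetical variants, ordered like the placeholders: CHROM, POS, ID, REF, ALT.
variants = [
    ("chr1", 12345, ".", "A", "G"),
    ("chr1", 67890, ".", "T", "C"),
]
result = fill_template(TEMPLATE, variants)
print(result)
```

Filling by field position avoids accidental replacements when a value happens to contain text that looks like a placeholder.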

0

Thank you! I will try this.
