Add contig lenght to VCF header in a robust way
4
4
Entering edit mode
7.8 years ago
William ★ 5.3k

I have some older VCF files that don't have the contig length set in the VCF header. This means that Picard and some other tools that are very strict with the VCF spec won't accept them.

The contig entries in the header should be

##contig=<ID=1,length=195471971>
##contig=<ID=2,length=182113224>

but they are

##contig=<ID=1>
##contig=<ID=2>

I know that I can manually fix this by doing the following steps.

  • unzipping the file
  • extracting the header
  • lookup the contig lenghts in a fasta.fa.fai file
  • adding the lenght to the contigs records in the header with vim
  • re-header with bcftools
  • bgzip and tabix the the re-headered vcf file

In my hands this works but if you make a slight VIM copy paste error you will have spend a lot of time reheadering and bgzipping a large VCF file for nothing.

Therefore I would like to have a more robust automatic solution where I just give the VCF file and the reference genome file and the header is automatically fixed and a new bgzipped VCF file is written out.

Is there a tool that can add the contig lenghts to the VCF header and write out a new bgzipped VCF file?

vcf • 9.9k views
ADD COMMENT
11
Entering edit mode
7.8 years ago

Use awk to insert the contig before the #CHROM line. Something like

awk '/^#CHROM/ { printf("##contig=<ID=1,length=195471971>\n##contig=<ID=2,length=182113224>\n");} {print;}' in.vcf > out.vcf

you can generate the printf line above from the ref.fai file:

  awk '{printf("##contig=<ID=%s,length=%d>\\n",$1,$2);}' ref.fai
ADD COMMENT
0
Entering edit mode

Great solution, particularly as there's an opportunity to double-check the lengths before writing them to the VCF :)

ADD REPLY
6
Entering edit mode
23 months ago
jena ▴ 290

Use bcftools and a reference fasta index

You can use the following bcftools command to add contig info into your VCF / BCF files:

bcftools reheader --fai ref.fa.fai -Oz input.vcf.gz > output.vcf.gz

It will also remove unused contigs and generally fix issues.

ADD COMMENT
3
Entering edit mode
7.8 years ago
William ★ 5.3k

A select all variants with the gatk framework also adds the contig lenght to the contigs in the header. Upside is that it's a one liner with a off the shelf tool:

gatk-framework -T SelectVariants -V input.vcf.gz -R /ref.fa -o input_header_fixed.vcf.gz

Downside side is that the full VCF is inflated to the GATK Variant/GenotypeContext objects and rewritten to a new VCF. So it's slower then AWK and or a c/c++ tool solution, and the new VCF is slightly different (different order or precision of Variant/Genotype attributes).

ADD COMMENT
0
Entering edit mode

Error: htsjdk.tribble.TribbleException: Contig 22 does not have a length field.

ADD REPLY
0
Entering edit mode
3.8 years ago
wood ▴ 10

use picard UpdateVcfSequenceDictionary tool, for detail see https://broadinstitute.github.io/picard/command-line-overview.html#UpdateVcfSequenceDictionary

ADD COMMENT

Login before adding your answer.

Traffic: 1816 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6