Question

Writing vcf file headers

0

Entering edit mode

6.2 years ago

spiral01 ▴ 110

I have downloaded variation data from the popfly database (http://popfly.uab.cat). However, the vcf headers are all over the place. If I try to index the vcf files I get segmentation fault: 11, and if I try to parse them using bcftools I get various errors in the header. I think the easiest solution would be to entirely rewrite the headers. Is there a tool that does this?

As an aside, is there any benefit to vcf files over bed files? I know many annotation tools take vcf as input (e.g. snpEff), but I can strip away the header and convert to big bed and still annotate with annovar for instance.

Here is the header and the first few lines:

##fileformat=VCFv4.1
##contig=<ID=1,length=23011544>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  RAL-149_Chr2L   RAL-335_Chr2L   RAL-357_Chr2L   RAL-399_Chr2L   RAL-42_Chr2L    RAL-461_Chr2L   RAL-491_Chr2L   RAL-703_Chr2L   RAL-721_Chr2L   RAL-783_Chr2L   USI02_Chr2L USI03_Chr2L USI16_Chr2L USI17_Chr2L USI24_Chr2L USI31_Chr2L USI33_Chr2L USI34_Chr2L USI35_Chr2L USI38_Chr2L USW_35_Chr2L    USW_37_Chr2L    USW_40_Chr2L    USW_49_Chr2L    USW_50_Chr2L    USW_54_Chr2L    USW_59_Chr2L    USW_66_Chr2L    USW_69_Chr2L    USW_74_Chr2L
2L  5039    .   C   N,T .   .   .   GT  0   1   1   2   1   0   1   0   1   1
2L  5076    .   G   N,T .   .   .   GT  0   1   1   1   1   0   1   0   1   1
2L  5092    .   C   N,T .   .   .   GT  0   1   2   2   2   0   1   0   1   1
2L  5095    .   T   N,A .   .   .   GT  0   1   2   2   2   0   1   0   1   1
2L  5317    .   G   N,A .   .   .   GT  0   0   0   0   0   0   1   0   1   0
2L  5372    .   T   N,A .   .   .   GT  0   1   0   0   2   0   1   0   1   0

SNP vcf • 13k views

ADD COMMENT • link updated 3.6 years ago by johnston.mike.j ▴ 40 • written 6.2 years ago by spiral01 ▴ 110

1

Entering edit mode

Can you post an example of the header that is giving you problems? I work with vcf daily, haven't had any problems!

ADD REPLY • link 6.2 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Hi, I have added an example header. Thanks.

ADD REPLY • link 6.2 years ago by spiral01 ▴ 110

1

Entering edit mode

header seems fine, what is the problem you get besides segmentation fault? Do you get more info?

ADD REPLY • link 6.2 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

When I try:

bcftools view -G file.vcf

I get:

[W::vcf_parse] contig '2L' is not defined in the header. (Quick workaround: index the file with tabix.)
Undefined tags in the header, cannot proceed in the sample subset mode.

Yet when I zip the file with bgzip and then try to tabix index it I get segmentation fault 11.

ADD REPLY • link 6.2 years ago by spiral01 ▴ 110

0

Entering edit mode

It seems that your file doesn't have all the contigs in the header. How did you generate such vcf file? can you post the generation command?

ADD REPLY • link 6.2 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

I didn't actually generate the file. It is from the pop fly drosophila database (http://popfly.uab.cat/). I confess that I am a primarily theoretical biologist and so do not generate my own data, so I am not particularly familiar with the vcf generation process.

ADD REPLY • link 6.2 years ago by spiral01 ▴ 110

1

Entering edit mode

Ok, we might be closer to the solution. How did you get the file? can you download it from there?

Quick workaround for the time being:

you can modify the header lines where the contigs are specified, adding one for each chromosome (2L, 2R, 3L, 3R, X) where you specify the length. Like this:

##fileformat=VCFv4.1
##contig=<ID=2L,length=insert_length>
##contig=<ID=2R,length=insert_length>
##contig=<ID=3L,length=insert_length>
##contig=<ID=3R,length=insert_length>
##contig=<ID=X,length=insert_length>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ...
2L  5039    .   C   N,T .   .   .   GT  0   1   1   2   1   0   1   0   1   1
2L  5076    .   G   N,T .   .   .   GT  0   1   1   1   1   0   1   0   1   1
2L  5092    .   C   N,T .   .   .   GT  0   1   2   2   2   0   1   0   1   1

ADD REPLY • link 6.2 years ago by Matteo Schiavinato ★ 3.6k

0

Entering edit mode

Thank you. Yes you can download the file from the link I sent above. Under resources you can download data. It is the vcf file for the AM population (I cannot link directly for some reason).

ADD REPLY • link 6.2 years ago by spiral01 ▴ 110

0

Entering edit mode

Have you checked the line endings and file encoding? Perhaps that is what is tripping up the software.

If you want to post the VCF in question I can take a closer look at it.

ADD REPLY • link 6.2 years ago by Dan D 7.4k

0

Entering edit mode

Thanks. I have edited to include both the header and the first few lines.

ADD REPLY • link 6.2 years ago by spiral01 ▴ 110

score 0 · Answer 1 · 2020-09-10

It seems that the contig header lines are incorrect. If you have access to the genome fasta index that was used for the analysis, the following script should remove the existing '##contig' lines and replace them with the correct details from the fasta index.

# Input / output filenames
index_file="/home/user/reference/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.fai"
vcf_bad_header="genotypes.incorrectHeaderContigs.vcf"
output_vcf="genotypes.vcf"

# Extract assembly name
# Ensure the file extension passed to 'basename' matches the suffix of the index file
assembly_name=$(basename $index_file .fna.fai)

# Build desired '##contig' lines from fasta index file
cut -f 1-2 "$index_file" |\
sed -r 's/^(chr)/##contig=<ID=\1/' |\
sed -r 's/\s([0-9]+)/,length=\1/' |\
sed -r "s/$/,assembly=$assembly_name>/" > "$assembly_name.vcfHeaderContigs.txt"

# Locate line numbers of VCF header containing '##contig' entries
start_line=$(grep -n -m 1 "##contig" "$vcf_bad_header" | cut -d ":" -f 1)
end_line=$(grep -n "##contig" "$vcf_bad_header" | tail -n 1 | cut -d ":" -f 1)

# Build VCF: (1) before bad contigs (2) new contig lines (3) continue after bad contigs
head -n $(($start_line-1)) "$vcf_bad_header" > "$output_vcf"
cat "$assembly_name.vcfHeaderContigs.txt" >> "$output_vcf"
tail -n +$(($end_line+1)) "$vcf_bad_header" >> "$output_vcf"