Write header of contigs in a vcf file
1
0
Entering edit mode
4.5 years ago
angelaparody ▴ 100

Hi,

I am having the following issue. I used STACKS to produce a .vcf file and it does not contain header for contigs, and I need this in order to edit the ID column which is all 0s at the moment (see below, well, it is a bit mixed up). This is how the vcf file looks like (header and a few more rows):

##fileformat=VCFv4.2                                                                                                        
##fileDate=20191104                                                                                                     
##source="Stacks v2.41"                                                                                                     
##INFO=<ID=AD,Number=R,Type=Integer,Description="Total Depth for Each Allele">                                                                                                      
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">                                                                                                       
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">                                                                                                      
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">                                                                                                      
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allele Depth">                                                                                                       
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">                                                                                                     
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">                                                                                                      
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">                                                                                                      
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">                                                                                                       
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">                                                                                                        
##INFO=<ID=loc_strand,Number=1,Type=Character,Description="Genomic strand the corresponding Stacks locus aligns on">                                                                                                        
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Lib1_3A Lib1_3B Lib1_3C Lib1_3D Lib1_3E Lib1_3F Lib1_4A Lib1_4B Lib1_4C Lib1_4D Lib1_4E Lib1_4F Lib1_7A Lib1_7A2    Lib1_7D Lib1_7D2    Lib1_7E Lib1_7F
scaffold_45 10  0   A   C   0

And when I use this option to edit the #IDs (I want IDs to be like scaffold_45_10, which is scaffold _CHROM_POS):

bcftools annotate --set-id +'%CHROM\_%POS' myfilename.vcf > new.vcf.

I get this error:

[W::vcf_parse] Contig 'scaffold_45' is not defined in the header. (Quick workaround: index the file with tabix.)
Encountered error, cannot proceed. Please check the error output above.

Therefore I would need to define contigs in the header (that is what the error says). To do so, I tried what is explain in the last message in this post: bcftools_Issue#766 Following exactly what it is said there I got a vcf file that looks like this:

    ##contig=<ID=scaffold_45, length=120>                                                                                                       
    ##contig=<ID=scaffold_62, length=125>                                                                                                       
    ##contig=<ID=scaffold_69, length=132>                                                                                                       
    ##contig=<ID=scaffold_98, length=172>                                                                                                       
    ##contig=<ID=scaffold_154, length=149>                                                                                                      
    ##contig=<ID=scaffold_210, length=184>                                                                                                      
  (there are more rows)                                                                                                     
    **VIM - Vi IMproved 8.0 (2016 Sep 12, compiled Jun 21 2019 04:10:35)**                                                                                                      
    scaffold_45 10  0   A   C   0   PASS

Which is wrong. The header that was there has been replaced by the list of contigs IDs and lengths (which is what I want to include in the header but without losing the header that was there). There is also one extra row (VIM - Vi IMproved 8.0 (2016 Sep 12, compiled Jun 21 2019 04:10:35) that I don't know how it appeared and should not be there.

In short, I need to be able to edit the IDs of the vcf file, and for that I need to add the contig info in the header of the vcf file. The reason I want the IDs in the vcf file is to filter SNPs by IDs with vcftools. Does someone know how to solve this issue?

Thanks in advance,

'Angela

PS: I may tray to include the header of my initial vcf file into the header.txt and see what happens...

vcf edit header • 4.0k views
ADD COMMENT
3
Entering edit mode
4.5 years ago
angelaparody ▴ 100

I found a solution! Maybe not the best, probably it could be done in a different -and faster- way, but I am not a bioinformatician and this worked for me :)

An output of STACKS is catalog.fa.gz and I obtained the .fai doing this:

gzip -cd ./catalog.fa.gz > catalog.fa
samtools faidx catalog.fa > catalog.fa.fai

Then I took this code from Biostars: Question: Add contig lenght to VCF header in a robust way

awk '{printf("##contig=<ID=%s,length=%d>\\n",$1,$2);}' ref.fai > ToEditHeader.txt

I have edited this txt file and used it to edit what is between ** (see below).

awk '/^#CHROM/ { printf("**##contig=<ID=1,length=195471971>\n##contig=<ID=2,length=182113224>\n**");} {print;}' inputfile.vcf > contigsInHeader.vcf

Then I used this (see below) from: Biostars: Question: How edit VCF file (specifically #CHROM and ID columns), which finally set IDs on my contigsInHeader.vcf file:

bcftools annotate --set-id +'%CHROM\_%POS' contigsInHeader.vcf > SnpsWithIDs.vcf

I hope this help to anyone having the same issue!

Cheers,

'Angela

ADD COMMENT

Login before adding your answer.

Traffic: 3072 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6