VCF file problem
2
0
Entering edit mode
7.0 years ago
lessismore ★ 1.3k

Hello everybody,

i have a doubt. i have been analyzing a vcf file and i wanted to check for some of the SNPs if there was correspondence between REF nucleotide in the vcf and the effective nucleotide in the genome from which the SNP was called. In all cases the nucleotide in the specific location i checked was corresponding to the ALT nucleotide in the vcf. Maybe it's a noobie question, but id be very curious to know why (i was expecting a correspondence to REF nucleotide).

Thanks in advance

vcf gbs • 2.9k views
ADD COMMENT
0
Entering edit mode

how have you checked this ? what was your command line.

ADD REPLY
0
Entering edit mode

-> i started from the: genomeassembly file vcf file a list of interesting SNPs (contig and location) that we identified by SVS to have an high FST index:

contig1 76832 
contig2 691177 
contig3 144034 
contig4 190995

-> i extracted the first 5 columns from the vcf for the SNPs

cat snps | grep -w -F -f - vcf_prova.vcf | cut -f1-5

python mygenome = SeqIO.to_dict(SeqIO.parse("genomeassembly.fasta","fasta")) 
contig1 = mygenome ['contig1'] 
print contig1 [76832]

where 76832 is the position of the SNP stored in the vcf

The result i get is that the location i find in the contig is always correspondent to the nucleotide in the ALT column.

ADD REPLY
1
Entering edit mode
7.0 years ago

The following cmd line scan the SNVs in a VCF and check the REF allele vs the

$ awk -F '\t' '/^#/ || !($4 ~ /^[ATGC]$/) {next;} {printf("%s,%s,%s\n",$1,$2,$4);}'  mutations.vcf  | while IFS="," read -a A; do echo -n "${A[2]} " && samtools faidx ref.fa "${A[0]}:${A[1]}-${A[1]}"  | grep -v '^>' ; done
A A
A A
T T
T T
C C
A A
T T
A A
A A
G G
T T
T T

(...)

ok REF vcf == REF fasta , to my great relief, people won't have to retract their papers :-)

ADD COMMENT
1
Entering edit mode
7.0 years ago

You can use bcftools norm for checking against a reference:

bcftools norm --fasta-ref "$genome_assembly_file" --check-ref  $ACTION "$vcf_file"

where $ACTION may be

  • w : warn
  • s : try to set/fix bad sites (make sure you verify your VCF file afterwards!)

(for additional actions see the manual)

ADD COMMENT
0
Entering edit mode

what version of bcftools? in my version is not present Version: 0.1.19-96b5f2294a

ADD REPLY
2
Entering edit mode

Well -- I don't know when the norm command or the --check-ref option were introduced but the Github history dates back until 2013. Don't you think your version is a little bit outdated?

ADD REPLY
0
Entering edit mode

Yes indeed. Thanks i just replaced it.

So the output i got is this for every entry: [W::vcf_parse] contig 'contig1' is not defined in the header. (Quick workaround: index the file with tabix.)

what does that mean?

ADD REPLY
1
Entering edit mode

Your VCF file does not adhere to the VCF specifications. In a proper VCF, every template sequence (i.e., chromosome or contig) must be defined in the header, like exemplified in Section 1.1 of the specification (look for the line starting with ##contig=<ID=20). Apart from your file not being proper VCF, this is a warning and my guess is that it can be ignored.

ADD REPLY
0
Entering edit mode

thanks a lot because im getting things clearer:

they used a reference genome but here in the vcf it says that "Reference allele is not known. The major allele was used as reference allele". So what that means? they did not set a reference genome but still they mapped again a reference and the vcf output in the column REF is imposed by the major allele??

screenshotScreenshot

ADD REPLY
2
Entering edit mode

I have no idea - but i guess yes: they somehow inferred variants and used the major allele as reference. Probably, you should contact the people that created the VCF file to ask for the methods...

Apart from that: without knowing the exact reference that was used (i.e., having the same FASTA file) I think bcftools norm will not be able to correctly annotated the reference allele.

ADD REPLY
0
Entering edit mode

yep, i just talked with the person. He created a new VCF from Tassel not defining the reference sequence e choosing the major allele as reference. Solved. Thanks a lot

ADD REPLY

Login before adding your answer.

Traffic: 1467 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6