Tool to fix reference allele nucleotide in vcf
3
2
Entering edit mode
5.4 years ago

I am looking for a tool/script/pony to correct the REF column in a vcf file whenever that nucleotide doesn't match the reference genome, as supplied in fasta format.

It sounds like a common task but I could not find something. I did find bcftools +fixref but that only works for SNPs. My vcf files are from structural variants.

Cheers,
W

vcf fasta pony • 5.4k views
ADD COMMENT
0
Entering edit mode
ADD REPLY
0
Entering edit mode

My vcf files are from structural variants.

what does it currently look like in the REF/ALT column ? (why do you need to fill the REF column (thinking of a very large deletion...) in your downstream analysis ? )

ADD REPLY
0
Entering edit mode

My main issue is that EVA complains that the nucleotides are incorrect. It seems their software only uses POS to check the REF nucleotide, and not END.

Just some lines from the vcf:

chr1    258046  93      G       <INS>   161     MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=258047;CIPOS=-10,1;CIEND=-10,1;SVLEN=126;RT=0,8,0;GAP=126;MAPQ=60,60;PID=61.386,1.698;PLENGTH=0.036,1.185;R
chr1    297535  96      A       <INS>   96      MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=297536;CIPOS=-1,1;CIEND=-1,1;SVLEN=68;RT=0,2,0;GAP=68;MAPQ=60,60;PID=6.857,456.896;PLENGTH=0.409,0.005;RLEN
chr1    350805  98      A       <INS>   649     MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=350806;CIPOS=-1,3;CIEND=-1,3;SVLEN=40;RT=0,11,0;GAP=40;MAPQ=36,36;PID=33.595,1.201;PLENGTH=0.094,2.548;RLEN
chr1    368841  107     G       <INS>   35      SVcluster;MapQual       IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=368842;CIPOS=-2,1;CIEND=-2,1;SVLEN=43;RT=0,4,0;GAP=43;MAPQ=42,42;PID=1.298,27.038;PLENGTH=0
chr1    369462  133     C       <INS>   131     SVcluster;MapQual;CIPOS;CIEND   IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=369463;CIPOS=-14,19;CIEND=-14,19;SVLEN=72;RT=0,6,0;GAP=72;MAPQ=56,56;PID=7.453,8.49
chr1    369520  135     A       <INS>   65      SVcluster;MapQual       IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=369521;CIPOS=-4,4;CIEND=-4,4;SVLEN=42;RT=0,5,0;GAP=42;MAPQ=35,35;PID=23.347,20.085;PLENGTH=

Essentially a single nucleotide would be sufficient - and in one file 98% is correct. The other file has N for every position so some more changes need to be made.

I'm already working on a python script (now posted as answer) but I thought someone else should have done it already ;p

ADD REPLY
0
Entering edit mode

For your narrow purpose, where all your insertions are symbolic rather than explicitly spelled out, the existing answers should be a satisfactory workaround to "bcftools +fixref"'s insistence on SNP-only data.

However, it's worth noting that there is a very good reason that bcftools insists on only SNPs. It's actually impossible to fix ordinary short indels: the only difference between a 1-base deletion and an insertion of the same base at the same position is the REF/ALT order.

ADD REPLY
5
Entering edit mode
5.4 years ago
JC 13k

Some code like this can help, but be aware that changing the reference column affects other fields in your VCF, such as the GT value.

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use fixRefVCF.pl <GENOME_FASTA> <VCF> <FIX_VCF>\n";

my $genome = shift @ARGV;
my $orig_vcf = shift @ARGV;
my $new_vcf = shift @ARGV;

my %seq = ();
my ($chr, $pos, $ref, $obs);

warn "loading genome sequence from $genome\n";
open (my $fh, "<", $genome) or die "cannot read $genome\n";
while (<$fh>) {
    chomp;
    if (/>(\w+)/) {
        $chr = $1;
    } else {
       $seq{$chr} .= $_;
    }
}
close $fh;

open (my $vh, "<", $orig_vcf) or die "cannot read $orig_vcf\n";
open (my $oh, ">", $new_vcf) or die "cannot write $new_vcf\n";
while (<$vh>) {
    if (/^#/) {
         print $oh $_;
    } else {
        chomp;
        my @var = split (/\t/, $_);
        $chr = $var[0];
        $pos = $var[1]; # VCF is 1-based, Perl string is 0-based
        $ref = $var[3];
        $obs = substr($seq{$chr}, $pos - 1, length $ref);
        if ($ref ne $obs) {
            warn "editing $chr:$pos $ref->$obs\n";
            $var[3] = $obs;
        }
        print $oh join "\t", @var;
        print $oh "\n";
    }
}
close $vh;
close $oh;
ADD COMMENT
1
Entering edit mode

Thanks! That seems to work beautifully well (and fast too).
Two comments: the warn "editing $chr:$pos $ref->$obs\n"; is off-by-one and the header is not written to the new vcf file ;-)

ADD REPLY
1
Entering edit mode

glad to help, I edited the code to show the VCF coordinates and print the header.

ADD REPLY
0
Entering edit mode

This script is nice to identify the variants whose REF allele is not correct and should be revised. It should be not used to change the reference alllele forcely with this script.

ADD REPLY
1
Entering edit mode

Ehm yes it should change those. That is exactly what I asked for.

ADD REPLY
0
Entering edit mode

So next step is apply plink --a1-allele to change the reference allele?

ADD REPLY
1
Entering edit mode

I did not use or need plink. I needed to modify the reference allele in a vcf file.

ADD REPLY
0
Entering edit mode

So what's the approach for you to modify the reference allele in a vcf. I find there are ~12% SNPs are REF/ALT reversed.

ADD REPLY
0
Entering edit mode

THIS SCRIPT FIXES THE REFERENCE ALLELES in a vcf file which are not correct when compared with a fasta file.

ADD REPLY
4
Entering edit mode
5.4 years ago

Hello WouterDeCoster ,

there is another bcftools plugin that should do this:

$ bcftools +fill-from-fasta input.vcf -- -c REF -f genome.fa > output.vcf

fin swimmer

ADD COMMENT
2
Entering edit mode
5.4 years ago

This is the python solution I came up with, using cyvcf (minimally requiring v0.10.2) and pyfaidx. It can use compressed vcf and fasta files.

ADD COMMENT

Login before adding your answer.

Traffic: 2350 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6