Question

Tool to fix reference allele nucleotide in vcf

2

Entering edit mode

5.4 years ago

WouterDeCoster 47k

I am looking for a tool/script/pony to correct the REF column in a vcf file whenever that nucleotide doesn't match the reference genome, as supplied in fasta format.

It sounds like a common task but I could not find something. I did find bcftools +fixref but that only works for SNPs. My vcf files are from structural variants.

Cheers,
W

vcf fasta pony • 5.4k views

ADD COMMENT • link updated 5.4 years ago by finswimmer 16k • written 5.4 years ago by WouterDeCoster 47k

0

Entering edit mode

tagging: Pierre Lindenbaum

ADD REPLY • link 5.4 years ago by GenoMax 141k

0

Entering edit mode

My vcf files are from structural variants.

what does it currently look like in the REF/ALT column ? (why do you need to fill the REF column (thinking of a very large deletion...) in your downstream analysis ? )

ADD REPLY • link 5.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

My main issue is that EVA complains that the nucleotides are incorrect. It seems their software only uses POS to check the REF nucleotide, and not END.

Just some lines from the vcf:

chr1    258046  93      G       <INS>   161     MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=258047;CIPOS=-10,1;CIEND=-10,1;SVLEN=126;RT=0,8,0;GAP=126;MAPQ=60,60;PID=61.386,1.698;PLENGTH=0.036,1.185;R
chr1    297535  96      A       <INS>   96      MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=297536;CIPOS=-1,1;CIEND=-1,1;SVLEN=68;RT=0,2,0;GAP=68;MAPQ=60,60;PID=6.857,456.896;PLENGTH=0.409,0.005;RLEN
chr1    350805  98      A       <INS>   649     MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=350806;CIPOS=-1,3;CIEND=-1,3;SVLEN=40;RT=0,11,0;GAP=40;MAPQ=36,36;PID=33.595,1.201;PLENGTH=0.094,2.548;RLEN
chr1    368841  107     G       <INS>   35      SVcluster;MapQual       IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=368842;CIPOS=-2,1;CIEND=-2,1;SVLEN=43;RT=0,4,0;GAP=43;MAPQ=42,42;PID=1.298,27.038;PLENGTH=0
chr1    369462  133     C       <INS>   131     SVcluster;MapQual;CIPOS;CIEND   IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=369463;CIPOS=-14,19;CIEND=-14,19;SVLEN=72;RT=0,6,0;GAP=72;MAPQ=56,56;PID=7.453,8.49
chr1    369520  135     A       <INS>   65      SVcluster;MapQual       IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=369521;CIPOS=-4,4;CIEND=-4,4;SVLEN=42;RT=0,5,0;GAP=42;MAPQ=35,35;PID=23.347,20.085;PLENGTH=

Essentially a single nucleotide would be sufficient - and in one file 98% is correct. The other file has N for every position so some more changes need to be made.

I'm already working on a python script (now posted as answer) but I thought someone else should have done it already ;p

ADD REPLY • link 5.4 years ago by WouterDeCoster 47k

0

Entering edit mode

For your narrow purpose, where all your insertions are symbolic rather than explicitly spelled out, the existing answers should be a satisfactory workaround to "bcftools +fixref"'s insistence on SNP-only data.

However, it's worth noting that there is a very good reason that bcftools insists on only SNPs. It's actually impossible to fix ordinary short indels: the only difference between a 1-base deletion and an insertion of the same base at the same position is the REF/ALT order.

ADD REPLY • link 5.4 years ago by chrchang523 10k

score 5 · Accepted Answer · 2018-11-05

5

Entering edit mode

5.4 years ago

JC 13k

Some code like this can help, but be aware that changing the reference column affects other fields in your VCF, such as the GT value.

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use fixRefVCF.pl <GENOME_FASTA> <VCF> <FIX_VCF>\n";

my $genome = shift @ARGV;
my $orig_vcf = shift @ARGV;
my $new_vcf = shift @ARGV;

my %seq = ();
my ($chr, $pos, $ref, $obs);

warn "loading genome sequence from $genome\n";
open (my $fh, "<", $genome) or die "cannot read $genome\n";
while (<$fh>) {
    chomp;
    if (/>(\w+)/) {
        $chr = $1;
    } else {
       $seq{$chr} .= $_;
    }
}
close $fh;

open (my $vh, "<", $orig_vcf) or die "cannot read $orig_vcf\n";
open (my $oh, ">", $new_vcf) or die "cannot write $new_vcf\n";
while (<$vh>) {
    if (/^#/) {
         print $oh $_;
    } else {
        chomp;
        my @var = split (/\t/, $_);
        $chr = $var[0];
        $pos = $var[1]; # VCF is 1-based, Perl string is 0-based
        $ref = $var[3];
        $obs = substr($seq{$chr}, $pos - 1, length $ref);
        if ($ref ne $obs) {
            warn "editing $chr:$pos $ref->$obs\n";
            $var[3] = $obs;
        }
        print $oh join "\t", @var;
        print $oh "\n";
    }
}
close $vh;
close $oh;

ADD COMMENT • link 5.4 years ago by JC 13k

1

Entering edit mode

Thanks! That seems to work beautifully well (and fast too).
Two comments: the warn "editing $chr:$pos $ref->$obs\n"; is off-by-one and the header is not written to the new vcf file ;-)

ADD REPLY • link 5.4 years ago by WouterDeCoster 47k

1

Entering edit mode

glad to help, I edited the code to show the VCF coordinates and print the header.

ADD REPLY • link 5.4 years ago by JC 13k

0

Entering edit mode

This script is nice to identify the variants whose REF allele is not correct and should be revised. It should be not used to change the reference alllele forcely with this script.

ADD REPLY • link 4.8 years ago by Shicheng Guo ★ 9.4k

1

Entering edit mode

Ehm yes it should change those. That is exactly what I asked for.

ADD REPLY • link 4.8 years ago by WouterDeCoster 47k

0

Entering edit mode

So next step is apply plink --a1-allele to change the reference allele?

ADD REPLY • link 4.8 years ago by Shicheng Guo ★ 9.4k

1

Entering edit mode

I did not use or need plink. I needed to modify the reference allele in a vcf file.

ADD REPLY • link 4.8 years ago by WouterDeCoster 47k

0

Entering edit mode

So what's the approach for you to modify the reference allele in a vcf. I find there are ~12% SNPs are REF/ALT reversed.

ADD REPLY • link 4.8 years ago by Shicheng Guo ★ 9.4k

0

Entering edit mode

THIS SCRIPT FIXES THE REFERENCE ALLELES in a vcf file which are not correct when compared with a fasta file.

ADD REPLY • link 4.8 years ago by WouterDeCoster 47k

score 4 · Accepted Answer · 2018-11-14

4

Entering edit mode

5.4 years ago

finswimmer 16k

Hello WouterDeCoster ,

there is another bcftools plugin that should do this:

$ bcftools +fill-from-fasta input.vcf -- -c REF -f genome.fa > output.vcf

fin swimmer

ADD COMMENT • link 5.4 years ago by finswimmer 16k

score 2 · Accepted Answer · 2018-11-05

2

Entering edit mode

5.4 years ago

WouterDeCoster 47k

This is the python solution I came up with, using cyvcf (minimally requiring v0.10.2) and pyfaidx. It can use compressed vcf and fasta files.

ADD COMMENT • link 5.4 years ago by WouterDeCoster 47k