Changing the reference haplotype in VCF
1
0
Entering edit mode
6.2 years ago

A VCF file describes how to translate the reference genome into (one or more) alternative haplotypes. I need a VCF file that translates one of these haplotypes back into the reference. In other words, I need to "invert" the VCF file I have.

If only SNPs were present, one would simply swap reference and alternative haplotype columns. Indels make it necessary to update the position. Of course, larger structural variants would also change the coordinates (sometimes dramatically), but for now I only care about SNPs and indels.

It does not seem too difficult to code, but would be quite time consuming (for me) to do it right.

Is there a tool for it?

alignment vcf snp indel • 2.7k views
ADD COMMENT
0
Entering edit mode

I guess you can use FastaAlternateReferenceMaker from GATK, but other way around.

ADD REPLY
0
Entering edit mode

Could you elaborate? This tool just "merges" reference fasta and a vcf file, and does not produce a new vcf file, as far as I can tell.

ADD REPLY
1
Entering edit mode
6.2 years ago

Ok, I think I've found the solution myself:

## get mouse chr19
curl ftp://ftp.ensembl.org/pub/release-91/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.19.fa.gz > chr19.fa.gz
zcat chr19.fa.gz > chr19.fa

## get BALBc SNPs
curl ftp://ftp-mouse.sanger.ac.uk/current_snps/strain_specific_vcfs/BALB_cJ.mgp.v5.indels.dbSNP142.normed.vcf.gz > dir.raw.vcf.gz
zcat dir.raw.vcf.gz > dir.raw.vcf

## save header
grep '^#' dir.raw.vcf > dir.vcf

## collapse to first haplotype (phased or not)
grep -v '^#' dir.raw.vcf | sed 's/,[GTCA]*//' > dir.hap.vcf

## clean up (ie turn CCCA>CA to CCA>A etc)
cat dir.hap.vcf | awk 'BEGIN{OFS="\t"} \
{ if(length($4)>length($5)) {min=length($5)} else { min=length($4) } } \
{ print $1,$2+min-1,$3,substr($4,min),substr($5,min),$6,$7,$8,$9,$10 }' \
> dir.hap.clean.vcf

## resort
sort -k1,1 -k2,2n dir.hap.clean.vcf > dir.hap.sort.vcf

## remove overlapping indels
grep -v '^#' dir.hap.sort.vcf | awk 'BEGIN{pos=0; chr=1} \
{ if($1!=chr) {chr=$1; pos=0} } \
{if($2>pos) { print $0; pos=$2-(length($5)-length($4)) } }' \
>> dir.vcf

## zip and index
bgzip -f dir.vcf && tabix -p vcf dir.vcf.gz

## convert ref to alt
bcftools consensus -f chr19.fa dir.vcf.gz > chr19.alt.fa

## invert vcf
zcat dir.vcf.gz | grep '^#' > alt.vcf &&     # get header
zcat dir.vcf.gz | grep -v '^#' | awk 'BEGIN{OFS="\t"; chr=1; i=0} \
{ if($1!=chr) {chr=$1; i=0} } \
{print $1,$2+i,$3,$5,$4,$6,$7,$8,$9,$10; i=i+length($5)-length($4)}' \
>> alt.vcf

## zip and index
bgzip -f alt.vcf && tabix -p vcf alt.vcf.gz

## convert alt to ref
bcftools consensus -f chr19.alt.fa alt.vcf.gz > chr19.dir2.fa

## check if orignal restored, no difference
cmp --silent chr19.dir2.fa chr19.fa && echo "Files are identical." || echo "Files are different."

This solutions deals with (i.e. removes) alternative haplotypes and overlapping indels (i.e. alternative haplotypes in practice). Of course, one might want to handle these differently, e.g. mask the offending (likely heterozygous) regions completely.

ADD COMMENT
0
Entering edit mode

Looks good - if you are certain that it does what you need, then I can move it to an Answer, and then you can Accept it. Let me know.

ADD REPLY
0
Entering edit mode

Sure, go ahead, thanks.

ADD REPLY
0
Entering edit mode

Done - you should be able to Accept the answer now

ADD REPLY

Login before adding your answer.

Traffic: 2017 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6