Filter out biallelic SNPs from multiple VCF files.
9.4 years ago

Hi, I have a set of VCF files, one per individual, with variants called for the HLA genes. HLA has many multiallelic SNPs, but in this case I need to keep only the biallelic SNPs, scanning across all the VCF files. Is there a specific tool for this?

Example:

sample 1  rs-xx  G  C
sample 1  rs-yy  T  C
sample 2  rs-xx  A  C
sample 2  rs-yy  T  C

In this case I want to get only rs-yy as the result.

Thanks a lot in advance.

HLA vcf bi allelic SNP • 6.3k views
9.4 years ago
Adam ★ 1.0k

I suggest first merging your VCF files into a single file. Once this is done, several tools can do the filtering you require (e.g. vcftools --min-alleles 2 --max-alleles 2).
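
As a rough sketch of the same idea without writing a merged file, the alleles seen at each site can be pooled across all per-sample VCFs and only the sites with exactly two distinct single-base alleles reported. This assumes uncompressed VCFs with the standard column order (CHROM, POS, ID, REF, ALT) and that the same site carries the same CHROM, POS and ID in every file; the function name biallelic_across_files is made up for illustration, and this is not a replacement for the tools mentioned above.

```bash
# Sketch: print CHROM:POS:ID for sites that are biallelic SNPs across ALL input VCFs.
# biallelic_across_files is a hypothetical helper name, not part of any tool.
biallelic_across_files() {
    awk -F'\t' '
        /^#/ { next }                              # skip header lines
        {
            site = $1 ":" $2 ":" $3                # key on CHROM, POS and ID
            if (!(site in count)) { order[++n] = site; count[site] = 0; snp[site] = 1 }
            k = split($4 "," $5, alleles, ",")     # REF plus every ALT allele
            for (i = 1; i <= k; i++) {
                a = alleles[i]
                if (a == ".") continue             # no ALT allele in this record
                if (a !~ /^[ACGT]$/) snp[site] = 0 # indel or symbolic allele: not a SNP
                if (!((site, a) in seen)) { seen[site, a] = 1; count[site]++ }
            }
        }
        END {
            # report sites in first-seen order
            for (j = 1; j <= n; j++) {
                s = order[j]
                if (snp[s] && count[s] == 2) print s
            }
        }
    ' "$@"
}
```

It could be called as biallelic_across_files sample_*.vcf > biallelic_sites.txt. With the example from the question, rs-xx collects three alleles (G, A, C) and is dropped, while rs-yy collects two (T, C) and is kept.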


Thanks a lot. Is merging over 500 samples feasible?


Yes, although htslib-based tools (e.g. bcftools merge) are likely to be much faster for such a task.

9.4 years ago
Garan ▴ 690

GATK SelectVariants has a BIALLELIC filter flag:

java -Xmx2g -jar GenomeAnalysisTK.jar -T SelectVariants \
-R human_g1k_v37.fasta \
-o out_biallelic.vcf \
--variant in.vcf \
-restrictAllelesTo BIALLELIC

You could put together a bash script to loop through an array of sample vcf file names and output the separate biallelic VCFs.

Something like:

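batch=("sample_1" "sample_2" "sample_3")
for sample in "${batch[@]}"
do
    # Each job runs in the background; with many samples, consider limiting concurrency.
    java -Xmx2g -jar GenomeAnalysisTK.jar -T SelectVariants \
        -R human_g1k_v37.fasta \
        -o "${sample}_out_biallelic.vcf" \
        --variant "${sample}_in.vcf" \
        -restrictAllelesTo BIALLELIC &
done
wait  # block until all background jobs have finished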
9.4 years ago
Prakki Rama ★ 2.7k

Is this something like what you want?

awk -F'\t' '$4 == "T" && $5 == "C"' file.vcf
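
If the aim is any biallelic SNP rather than one specific allele pair, a similar one-liner can test the REF and ALT columns instead. This is only a sketch, assuming an uncompressed, tab-delimited VCF with REF in column 4 and ALT in column 5; the function name biallelic_snps_in_file is made up for illustration. It looks at one file at a time, so it will not catch a site that only becomes multiallelic when samples are compared.

```bash
# Sketch: keep header lines plus records whose REF and ALT are each a single base
# (exactly one ALT allele), i.e. biallelic SNPs within this one file.
# biallelic_snps_in_file is a hypothetical helper name.
biallelic_snps_in_file() {
    awk -F'\t' '/^#/ || ($4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/)' "$1"
}
```

For example, biallelic_snps_in_file file.vcf > file_biallelic.vcf keeps the header and drops records with several ALT alleles (such as A,C) or with indel alleles.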