Question

How To Separate SNP Variants From Indel Variants In The Same VCF File

8

13.0 years ago

Jianfengmao ▴ 310

Background: Genomic variants are usually grouped into types such as SNPs, insertions, deletions, and transposable elements.

I would like to get some basic statistics, such as frequencies or counts, for each type of variant using vcftools. I also want to export the variant data for population genetic studies, for example exporting allele frequency data for SNPs only or for indels only.

My question: are there tools or strategies to separate SNP variants from indel variants in the same VCF file (it contains only SNPs and indels) and write them to different VCF files? My work relies on VCF and VCFTools.

Your suggestions would be really valuable to me and to others who depend on the VCF format. Thanks in advance.

This question was also asked on the VCFTools-help mailing list, but I have not received any replies.

vcf vcftools • 15k views

ADD COMMENT • link updated 24 months ago by Ram 43k • written 13.0 years ago by Jianfengmao ▴ 310

Ram · Answer 1 · 2011-04-11

19

13.0 years ago

Pierre Lindenbaum 161k

EDIT

Four years later: this awk script does not work with multiple ALT alleles. Now I would use bcftools filter with TYPE=snp, or my program VCFFilterJS.
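As a rough sketch of the bcftools route mentioned above, the helper below wraps two `bcftools view` calls that select records by variant type. The function name `split_with_bcftools` and the output file names are hypothetical, chosen only for illustration. It assumes bcftools is installed and on the PATH. A multi-allelic record carrying both a SNP and an indel allele is expected to appear in both outputs unless the file is first split with `bcftools norm -m -any`.

```shell
# Hypothetical helper (not part of bcftools): split a VCF by variant type.
#   $1 = input VCF, $2 = output prefix
split_with_bcftools() {
    in_vcf=$1
    prefix=$2
    # -v/--types keeps records of the listed type; -O v writes uncompressed VCF.
    bcftools view -v snps   -O v -o "${prefix}.snps.vcf"   "$in_vcf"
    bcftools view -v indels -O v -o "${prefix}.indels.vcf" "$in_vcf"
    # A filter expression is an alternative way to select SNPs:
    #   bcftools filter -i 'TYPE="snp"' "$in_vcf" -o "${prefix}.snps.vcf"
}

# Example call (requires bcftools on PATH):
# split_with_bcftools file.vcf split
```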

You could use the following AWK script:

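# Header lines: copy to both output files.
/^#/    {
    print $0 > "snv.vcf";
    print $0 > "indels.vcf";
    next;
    }

# REF (column 4) is a single nucleotide and ALT (column 5) is a single letter: SNV.
/^[^\t]+\t[0-9]+\t[^\t]*\t[atgcATGC]\t[a-zA-Z]\t/   {
    print $0 > "snv.vcf";
    next;
    }

# Everything else (including records with multiple ALT alleles) goes to indels.vcf.
    {
    print $0 > "indels.vcf";
    next;
    }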

The script saves the SNVs and the indels in two distinct files snv.vcf and indels.vcf.

The headers are saved in both files.

If a line has a reference base and an alternate base that are each a single nucleotide, the line is saved to snv.vcf; otherwise it is saved to indels.vcf.

awk -f file.awk file.vcf
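The script above tests only a single ALT allele. The following is a minimal sketch of one way to extend the same idea to comma-separated ALT alleles: a record goes to snv.vcf only if REF and every ALT allele are single bases. The example file `example.vcf` and its records are made up for illustration. Everything that is not a pure SNV record (mixed sites, MNPs, symbolic alleles) ends up in indels.vcf, so this is a simplification and not a full variant classifier.

```shell
# Build a tiny example VCF (hypothetical data, for illustration only).
printf '##fileformat=VCFv4.2\n' > example.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> example.vcf
printf 'chr1\t100\t.\tA\tG\t.\t.\t.\n' >> example.vcf
printf 'chr1\t200\t.\tA\tG,T\t.\t.\t.\n' >> example.vcf
printf 'chr1\t300\t.\tAT\tA\t.\t.\t.\n' >> example.vcf
printf 'chr1\t400\t.\tA\tG,AT\t.\t.\t.\n' >> example.vcf

awk -F '\t' '
# Header lines go to both output files.
/^#/ { print > "snv.vcf"; print > "indels.vcf"; next }
{
    # Split ALT on commas and require every allele to be one base.
    n = split($5, alt, ",")
    is_snv = ($4 ~ /^[ACGTNacgtn]$/)
    for (i = 1; i <= n && is_snv; i++) {
        if (alt[i] !~ /^[ACGTNacgtn]$/) { is_snv = 0 }
    }
    if (is_snv) { print > "snv.vcf" }
    else        { print > "indels.vcf" }
}' example.vcf
```

With this example input, the records at positions 100 and 200 are written to snv.vcf, and those at 300 and 400 to indels.vcf.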

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 13.0 years ago by Pierre Lindenbaum 161k

2

Pierre Lindenbaum, I learned a lot from your script. I am now learning sed and awk, and your scripts enlightened me. Thanks a lot.

ADD REPLY • link 13.0 years ago by Jianfengmao ▴ 310

0

Precisely. The script I was thinking of looks exactly like this one.

ADD REPLY • link 13.0 years ago by Jorge Amigo 14k

score 2 · Answer 2 · 2011-04-11

2

13.0 years ago

Jorge Amigo 14k

From the VCF specs, I would say that you only have to look for single-base changes to detect SNPs and treat the rest as indels. As described at the bottom of this page, SNPs are the only variants with a single base in both the REF and ALT columns, because indels have a multi-base string in either the REF or the ALT column. If you just check the string lengths in those columns, you should have a sufficient filter. Although I haven't tried this through vcftools, it should be straightforward to script such a filter and split your VCF file in two.
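The length check described here can be sketched in a few lines of awk, which also gives the per-type counts asked for in the question. This is only an illustration: the file `lengths_example.vcf` and its records are made up, and the rule assumes one ALT allele per record, so a multi-allelic ALT such as `G,T` would be counted as "Other".

```shell
# Tiny example VCF (hypothetical data, for illustration only).
printf '##fileformat=VCFv4.2\n' > lengths_example.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> lengths_example.vcf
printf 'chr1\t100\t.\tA\tG\t.\t.\t.\n' >> lengths_example.vcf
printf 'chr1\t200\t.\tAT\tA\t.\t.\t.\n' >> lengths_example.vcf
printf 'chr1\t300\t.\tA\tATT\t.\t.\t.\n' >> lengths_example.vcf
printf 'chr1\t400\t.\tC\tT\t.\t.\t.\n' >> lengths_example.vcf

# Count records whose REF and ALT are both one character long (SNPs)
# versus everything else (indels and other multi-base records).
awk -F '\t' '
!/^#/ {
    if (length($4) == 1 && length($5) == 1) { snp++ }
    else { other++ }
}
END { printf "SNPs: %d\nOther: %d\n", snp, other }
' lengths_example.vcf > variant_counts.txt

cat variant_counts.txt
```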

ADD COMMENT • link 13.0 years ago by Jorge Amigo 14k

0

Dear Jorge Amigo, thanks a lot. I am not good at programming, so I asked such a simple question here, but I have begun to learn programming. Thanks for your kind directions.

ADD REPLY • link 13.0 years ago by Jianfengmao ▴ 310

0

For this, you can use Pierre's awk script directly and you will get the desired results. For your future work, I would suggest that you keep learning awk basics, which will help you parse large files with almost no hassle.

ADD REPLY • link 13.0 years ago by Jorge Amigo 14k

0

Yes, I have benefited a lot from learning sed and awk. Pierre's script enlightened me, and I think I have made a great leap by following it. Thank you all.

ADD REPLY • link 13.0 years ago by Jianfengmao ▴ 310

score 2 · Answer 3 · 2019-05-28

2

4.9 years ago

rodd ▴ 230

You can separate single nucleotide variants from indels using vcftools with the flags --keep-only-indels or --remove-indels. Note that --out takes a prefix, so the command below writes output_file_with_indels_removed.recode.vcf:

vcftools --vcf input_file_containing_all_variants.vcf --remove-indels --recode --recode-INFO-all --out output_file_with_indels_removed
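To get both files in one go, the two flags can be run back to back. The sketch below wraps them in a helper. The function name `split_with_vcftools` and the output prefixes are hypothetical. It assumes vcftools is installed and on the PATH. For these flags vcftools treats as an indel any variant that changes the length of the REF allele.

```shell
# Hypothetical helper (not part of vcftools): write SNV-only and indel-only files.
#   $1 = input VCF, $2 = output prefix (vcftools appends ".recode.vcf")
split_with_vcftools() {
    in_vcf=$1
    prefix=$2
    # Drop indel sites, keeping everything else.
    vcftools --vcf "$in_vcf" --remove-indels --recode --recode-INFO-all --out "${prefix}.snvs"
    # Keep only indel sites.
    vcftools --vcf "$in_vcf" --keep-only-indels --recode --recode-INFO-all --out "${prefix}.indels"
}

# Example call (requires vcftools on PATH); this would write
# split.snvs.recode.vcf and split.indels.recode.vcf:
# split_with_vcftools input_file_containing_all_variants.vcf split
```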

ADD COMMENT • link 4.9 years ago by rodd ▴ 230