Biostar Beta. Not for public use.
VCF file filtration
23 months ago
jaafari.omid • 40
@jaafari.omid42947

Dear all, I have a VCF file from whole-genome data on a fish species. The VCF file was prepared with the bcftools pipeline. I have read a bit about filtering VCF files, but I couldn't make heads or tails of it. I know there is the GATK pipeline, but because I used bowtie2 for mapping I think I am not able to use it. So I would be very grateful if you could help me filter my VCF file with bcftools, vcftools or any other tool.

Thanks in advance for any help.

Regards, Omid

snp WGS VCF bcftools vcftools • 369 views

Hello jaafari.omid,

What's the goal of your filtering? With bcftools view you can create subsets based on nearly any criteria you like.

fin swimmer


Hi, actually I was looking for a straightforward pipeline for filtering. Should I just filter on minimum genotype depth and quality? Or is it better to consider other types of filtering as well, such as var filtering, or removing indels and copy-number variants?


Sorry to say, but there is no "straightforward pipeline for filtering".

This depends on the library prep, the sequencing platform, how you do alignment/mapping and variant calling, and of course what you are trying to find out.

It sounds a bit like you are asking how to remove false-positive variants from your file. But here it is important to know what matters more to you: specificity or sensitivity?

fin swimmer


So first of all I should consider those parameters. But I thought I could at least remove individuals above a certain level of missing data, and also filter on minimum genotype depth and quality. For mapping I used bowtie2 with the --very-sensitive option, and here is the command I used for variant calling:

 bcftools mpileup -Ou -q 20 -Q 20 -f ref.fa -b Bam-list.txt | bcftools call -Ov -mv > a.vcf


Actually I am looking for SNPs that are outliers between different groups. Of course removing the false-positive variants is a good idea, but I still can't understand the difference between specificity and sensitivity.


I would upload my .vcf files to Galaxy and filter them there with bcftools. For example, in Galaxy the default for DP is 10, and you can change that.


Thanks for your answer. Can I then also filter on MAF using Galaxy?


Sorry, I am not sure; I only started working with whole genome sequencing 2 weeks ago. But I found an R tool named MAFtools that looks very helpful, although I have not used it yet.


You are most welcome, best of luck


Do you mean the MAF in your own dataset or the MAFs from the large consortia, like 1000 Genomes?


I meant my own dataset: filtering the variants in my VCF file based on MAF.


Cool. For that, I recommend bcftools view, with the following option:

-q/Q, --min-af/--max-af <float>[:<type>]
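
As a minimal sketch of how that option might be used (the helper name, the file names and the cutoff are placeholders, not values from this thread): the :minor type makes the cutoff apply to the frequency of the minor allele rather than to the non-reference allele.

```shell
# Hypothetical helper: keep sites whose minor allele frequency reaches a cutoff.
# Usage: filter_maf <in.vcf> <min_maf> <out.vcf>
filter_maf() {
    # -q FLOAT:minor : minimum frequency, evaluated on the minor allele
    # -Ov            : write uncompressed VCF
    bcftools view -q "$2:minor" -Ov -o "$3" "$1"
}
```

For example, filter_maf a.vcf 0.05 a.maf05.vcf would write the sites passing a 0.05 cutoff to a.maf05.vcf. Which cutoff is sensible depends on how many individuals were sequenced.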


For the other filtering that you mentioned earlier: if you have BCFtools, you should also have a Perl script called vcfutils.pl, which adds a lot of extra functionality on top of mpileup and call:

/Programs/bcftools-1.3.1/vcfutils.pl

Usage:   vcfutils.pl <command> [<arguments>]

Command: subsam       get a subset of samples
         listsam      list the samples
         fillac       fill the allele count field
         qstats       SNP stats stratified by QUAL

         hapmap2vcf   convert the hapmap format to VCF
         ucscsnp2vcf  convert UCSC SNP SQL dump to VCF

         varFilter    filtering short variants (*)
         vcf2fq       VCF->fastq (**)

Notes: Commands with description endting with (*) may need bcftools
       specific annotations.
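
A rough sketch of how the varFilter command from that list could be called for a depth filter. It assumes vcfutils.pl is on the PATH; the helper name, file names and depth limits are placeholders, and as the note above says, varFilter may need annotations that a given VCF does not contain.

```shell
# Hypothetical helper around vcfutils.pl varFilter.
# Usage: depth_filter <in.vcf> <min_depth> <max_depth> <out.vcf>
depth_filter() {
    # -d INT : minimum read depth
    # -D INT : maximum read depth
    vcfutils.pl varFilter -d "$2" -D "$3" "$1" > "$4"
}
```

Called as depth_filter a.vcf 10 200 a.flt.vcf, this would keep variants with a read depth between 10 and 200. An upper limit is often used because unusually deep sites tend to sit in repeats.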


but still I can't understand the difference between specificity and sensitivity

There are two types of filtering: (1) quality filtering and (2) filtering for variants of interest.

Unfortunately, not all variants in your VCF file are true variants. Errors introduced during library prep, sequencing and alignment lead to false-positive variants. If you have a lot of variants, it is useful to first try to eliminate those false positives before looking for variants of interest.

Sensitivity and specificity are terms that describe how reliable your dataset is. Sensitivity answers the question: how many of the true variants in my sample am I able to detect? Specificity answers the question: how many variants besides the true ones will I detect? The fewer of those, the higher the specificity.
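
To put numbers on those two terms, here is a small worked example using the usual textbook definitions. All counts are invented for illustration, and the helper name is hypothetical. Because almost every position in a genome is a non-variant site, specificity in the strict sense stays close to 1, so the share of false calls among the reported variants is often the more telling number.

```shell
# Toy confusion-matrix arithmetic; all counts are invented for illustration.
# Usage: call_rates <TP> <FN> <FP> <TN>
call_rates() {
    awk -v tp="$1" -v fn="$2" -v fp="$3" -v tn="$4" 'BEGIN {
        printf "sensitivity=%.4f\n", tp / (tp + fn)   # true variants that were found
        printf "specificity=%.4f\n", tn / (tn + fp)   # non-variant sites correctly not called
        printf "false_calls=%.4f\n", fp / (tp + fp)   # share of reported variants that are wrong
    }'
}

# 1000 true variants of which 950 are called, plus 200 false calls
# among 1,000,000 non-variant sites.
call_rates 950 50 200 999800
```

With these made-up counts, 95% of the true variants are found, yet about 17% of the reported variants are false positives.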

Usually the sensitivity of an NGS analysis is quite high, which means you will detect most of the true variants. Because of the errors mentioned above, the specificity may be lower, since a certain number of your calls are false positives.

The goal of quality filtering is to increase the specificity by removing those false positives. Depending on the filter criteria, you will also remove some true positives, which decreases the sensitivity. So whenever you do quality filtering, you have to ask yourself what is more important: being sure to have all true variants, or having a clean dataset in which you can be confident that the variants are true but some are missing.
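
A filter that leans towards specificity might look like the sketch below. This is only an illustration: the helper name is hypothetical, every threshold is a placeholder to tune for your own data, and F_MISSING needs a reasonably recent bcftools. Per-genotype depth is not covered here; as far as I know the mpileup command shown earlier does not write FORMAT/DP unless it is requested with -a.

```shell
# Hypothetical helper: a fairly strict site-level filter (favoring specificity).
# Usage: strict_filter <in.vcf> <out.vcf.gz>
# All thresholds are placeholders, not recommendations from this thread.
strict_filter() {
    # -m2 -M2 -v snps : keep biallelic SNPs only (drops indels)
    # QUAL>=30        : minimum site quality
    # INFO/DP>=10     : minimum total read depth summed over all samples
    # F_MISSING<=0.2  : at most 20% of samples with a missing genotype
    bcftools view -m2 -M2 -v snps \
        -i 'QUAL>=30 && INFO/DP>=10 && F_MISSING<=0.2' \
        -Oz -o "$2" "$1"
}
```

Called as strict_filter a.vcf a.filtered.vcf.gz. Loosening the thresholds moves the balance back towards sensitivity.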


Thank you very much for your very helpful explanation. So I think specificity is more important to me, and I will try to keep the final file clean.

Regards,