Biostar Test Site

This is site is used for testing only. Visit: https://www.biostars.org to ask a question.

Difficulty filtering VCF file with vcftools
3
0
Entering edit mode
4.7 years ago
rc16955 ▴ 70

Hi all,

I am trying to filter a VCF file by read depth (min 10) and mapping quality (min 30) using vcftools. I have found it quite difficult to find sample lines of anything like this, and being very new to bioinformatics and command lines in general, I'm fairly certain I'm doing something wrong. This is what I have tried so far:

vcftools --vcf myfile.vcf --minDP 10 --minQ 30 --out myfile_filtered

This generates

    VCFtools - v0.1.12b
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
        --vcf ED132_55.vcf
        --minDP 10
        --minQ 30
        --out ED132_55_filtered

After filtering, kept 1 out of 1 Individuals
After filtering, kept 87041 out of a possible 103802 Sites
Run Time = 0.00 seconds

And an output file entitled "myfile_filtered.log", which contains nothing but the exact information printed above. There is no new vcf file, so I can only assume something is going wrong.

I would be very grateful if anyone could help me fix this code, or offer alternatives that might work better. I should note that as I am not using my own computer, I can't easily install new software packages and am restricted to bcftools and vcftools. Any python or perl scripts would be wonderful

SNP • 2.6k views
ADD COMMENT
1
Entering edit mode
4.7 years ago
colindaven ★ 2.9k

You need the recode option from memory to create a new filtered file, just add --recode to your command lines

ADD COMMENT
0
Entering edit mode

Hi Colin, thank you for your answer; I am now able to generate VCF files. I'm still not convinced that the line is working correctly, however. Upon examining the files I can see that the there are some SNPs present in the unfiltered file that are missing from the filtered file even though they satisfy the criteria, while there are some SNPs in the filtered file that do not satisfy the criteria. The command has definitely removed some SNPs but I'm not sure one what basis it's done it.

Once again, many thanks for taking the time to respond.

ADD REPLY
0
Entering edit mode
13 days ago
kanika.151 ▴ 120

I was also not convinced with the VCFtools filtration process so I switched to bcftools.

ADD COMMENT
0
Entering edit mode
12 days ago
sbstevenlee ▴ 10

For those who may be interested (and also familiar with Python), check out the fuc.pyvcf submodule I wrote:

Below is a simple example that (I believe) achieves what the original post was asking.

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['G', 'T', 'A', 'C', 'C'],
...     'ALT': ['A', 'C', 'T', 'A', 'T'],
...     'QUAL': ['.', 24, 39, 15, 30],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT:DP', 'GT:DP', 'GT:DP', 'GT:DP', 'GT:DP'],
...     'Steven': ['0/0:11', '1/1:8', '0/1:15', '0/1:17', '1/1:3'],
...     'Rachel': ['0/1:8', '0/0:23', '0/0:7', '0/1:15', '1/1:12'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT  Steven  Rachel
0  chr1  100  .   G   A    .      .    .  GT:DP  0/0:11   0/1:8
1  chr1  101  .   T   C   24      .    .  GT:DP   1/1:8  0/0:23
2  chr1  102  .   A   T   39      .    .  GT:DP  0/1:15   0/0:7
3  chr1  103  .   C   A   15      .    .  GT:DP  0/1:17  0/1:15
4  chr1  104  .   C   T   30      .    .  GT:DP   1/1:3  1/1:12

We first select rows with QUAL >= 30:

>>> filtered_vf = vf.filter_qual(30)
>>> filtered_vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT  Steven  Rachel
0  chr1  102  .   A   T   39      .    .  GT:DP  0/1:15   0/0:7
1  chr1  104  .   C   T   30      .    .  GT:DP   1/1:3  1/1:12

Next, we mark genotypes with DP < 10 as missing:

>>> filtered_vf = filtered_vf.markmiss_dp(10, full=True)
>>> filtered_vf.df
  CHROM  POS ID REF ALT  QUAL FILTER INFO FORMAT  Steven  Rachel
0  chr1  102  .   A   T    39      .    .  GT:DP  0/1:15   ./.:.
1  chr1  104  .   C   T    30      .    .  GT:DP   ./.:.  1/1:12
ADD COMMENT

Login before adding your answer.

Traffic: 238 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6