Biostar Beta. Not for public use.
Remove a list of positions from a VCF file
18 months ago
shinken123 • 80
México

Hi

I have a list of chromosomes and positions that looks like this:

1   10045
1   93056
1   109272
1   127711
1   127822
.
.
.

I would like to use this list to remove those positions from my VCF file. Do you know how to do this?

SNP vcf filter • 1.6k views
9 months ago
Germany

bcftools can do this:

$ bcftools view -T ^list_snp_exclude.txt input.vcf > output.vcf

The ^ before the name of the file with the coordinates tells bcftools to exclude these positions instead of keeping them.

fin swimmer

19 months ago
Santiago de Compostela, Spain

a simple grep would do:

grep -vf list.txt file.vcf

Though this was posted a while ago, I just have to say that grep with only the -vf flags will remove the positions in list.txt from file.vcf, but it will also remove any other position that has more digits and still contains the digit sequence of a listed position. For example, you may want to remove position 10045, but if the VCF contains the positions 100450, 1004511, 100453489 etc., these will be removed as well.

To avoid this, the -w flag should be added to the command above. It makes grep match whole words only, that is, a pattern matches only when it is not directly preceded or followed by a letter, digit, or underscore.
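The substring problem described above can be reproduced on a toy example. This is only a sketch: the file names (demo_list.txt, demo.vcf and so on) and the records are made up for illustration, and the columns are assumed to be tab-separated.

```shell
# Toy inputs (hypothetical names and records), tab-separated like a VCF body.
printf '1\t10045\n' > demo_list.txt
printf '1\t10045\t.\tA\tG\n1\t100450\t.\tC\tT\n1\t20000\t.\tG\tA\n' > demo.vcf

# Without -w: position 100450 is dropped as well, because "1<TAB>10045"
# is a substring of "1<TAB>100450".
grep -vf demo_list.txt demo.vcf > demo_loose.vcf

# With -w (plus -F for fixed strings): only the exact position 10045 is dropped.
grep -Fwvf demo_list.txt demo.vcf > demo_strict.vcf

# Show the positions that survive each filter.
cut -f2 demo_loose.vcf
cut -f2 demo_strict.vcf
```

Note that even with -w the pattern is not anchored to the first two columns, so it could in principle match the same text elsewhere on a line.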


Thank you very much. The only problem with grep for me was that it was very slow and memory-consuming, so I used this link

So I transformed my file into a BED file like this:

1   6405767 6405767
1   8108895 8108895
1   8623336 8623336
.
.
.

Maybe it is not the most elegant way to do it, but it works for me.
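One possible way to produce that three-column layout from a two-column chromosome/position list is a short awk command. This is a sketch, not necessarily what was actually run: the file names are hypothetical and the input is assumed to be whitespace-separated.

```shell
# Hypothetical two-column input: chromosome, position.
printf '1\t6405767\n1\t8108895\n' > demo_positions.txt

# Repeat the position as both start and end, tab-separated,
# which reproduces the layout shown above.
awk 'BEGIN { OFS = "\t" } { print $1, $2, $2 }' demo_positions.txt > demo_positions.bed

cat demo_positions.bed
```

Strict BED uses a 0-based start, so a tool that expects real BED coordinates would need `$2 - 1` as the start column instead.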

22 months ago
yoce_pf • 40
Universidad Nacional Autónoma de México…

Hi!!

If someone has the same question, this command solved the problem for me:

grep -Fwvf list_snp_exclude file.vcf > new_filter.vcf

list_snp_exclude is a tab-separated list in the format Chromosome_name<TAB>Position:

Chrom_177   4393715
Chrom_177   4394618
Chrom_177   4395751
Chrom_215   4395751
Chrom_215   4396373
. . .

How is this different from Jorge's answer above?


My answer was very simple. This one adds more grep functionality: the -F option looks for fixed strings rather than regular expressions, and the -w option matches whole words rather than any occurrence of the pattern. I don't know how -F works in conjunction with -w, but it looks like an overall faster option. If performance matters, a more tightly anchored regex (which needs the -P option) might be even faster:

sed 's/^/^/; s/$/\\t/' list.txt | grep -vPf - file.vcf
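For comparison, the same exact-match idea can be sketched without regular expressions by swapping grep for awk and looking up chromosome plus position in a table. This is an alternative technique, not one of the commands posted above. The file names and records are hypothetical, and both files are assumed to be tab-separated.

```shell
# Hypothetical inputs: one position to exclude, and a tiny VCF with a header.
printf 'Chrom_177\t4393715\n' > demo_exclude.txt
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nChrom_177\t4393715\t.\tA\tG\nChrom_177\t43937150\t.\tC\tT\n' > demo_in.vcf

awk -F'\t' '
    NR == FNR { skip[$1 FS $2] = 1; next }   # first file: remember chrom+pos pairs
    /^#/      { print; next }                # always keep VCF header lines
    !(($1 FS $2) in skip)                    # keep records whose chrom+pos is not listed
' demo_exclude.txt demo_in.vcf > demo_out.vcf

cat demo_out.vcf
```

Because the comparison uses the first two columns as a whole, position 43937150 is kept even though it starts with the digits of the excluded position 4393715.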