Question

How to do SNP selection in large VCF/BCF/GenotypeTables in R like you can using BCFTools,CYVCF2 or Excel

1

Entering edit mode

6.1 years ago

William ★ 5.3k

How to do SNP selection in large VCF/BCF/GenotypeTables in R like you can using BCFTools,CYVCF2 (fast python VCF/BCF parser) or Excel?

Support for at least the following type of filter criteria is required:

Filtering on Region of Interest
Filtering on Samples of interest
Filtering on Variant Quality (QUAL)
Filtering on Genotype call rate (INFO/AN / (number of samples*2))
Number of alternative alleles (INFO/AC)
Filtering on alternative allele frequency (INFO/AF)
Samples required to HOM_REF,HET,HOM_ALT
Samples to have contrasting genotypes

Preferred supported filtering criteria:

Filtering on sample specific or minimal/maximum/average genotype depth
Filtering on sample specific or minimal/maximum/average genotype quality
Distance to nearest neighbor variants upstream and downstream

Depending on filter support for genotype depth and genotype quality these VCF/BCF/R specific format files can be quite big.

So compact (small data size) R specific format and multi-threaded filtering might also be useful/required when loading everything to memory on a big machine.

Or streaming parsing of the VCF/BCF file (like BCFtools and CYVCF2 do) .

What would be the best way and library to do SNP selection using the above filter criteria in R?

Is there maybe something like CYVCF2 (python wrapper around HTSLIB VCF parser) for R?

Or maybe a BCFTools like something in R?

Or some other good R library for doing this?

The targeted downstream users are R(studio) users who would like to have this functionality in the environment that they are used to and already use for related tasks.

The VCF can optionally be (pre) converted to TAB delimited format with bcftools query. Or to any specific R format.

A high level / coarse filter on region of interest and samples of interest filter might already have been done. It's often (but not always) more about iterative / dynamic fine grained filtering final selection of SNPs.

R VCF SNP filtering • 3.3k views

ADD COMMENT • link updated 6.1 years ago by Sean Davis 26k • written 6.1 years ago by William ★ 5.3k

1

Entering edit mode

Why do you need to use R? I would never recommend anyone to do VCF filtering in R, or Excel.

ADD REPLY • link 6.1 years ago by Kevin Blighe 87k

0

Entering edit mode

Targeted downstream users are R(studio) users who would like to have this functionality in the environment that they are used to and already use for related tasks.

ADD REPLY • link 6.1 years ago by William ★ 5.3k

0

Entering edit mode

Okay, I don't know of any tools because it's not something about which I have thought. Maybe others have ideas.

From my perspective, you could use system calls from within R, i.e., system calls that call BCFtools, etc., and then wrap these as functions for your users. This assumes that they use R Server or R on a linux cluster.

I'm sure that Pierre has some ideas here, when he next checks in

ADD REPLY • link 6.1 years ago by Kevin Blighe 87k

0

Entering edit mode

Also remember that a VCF is essentiall just a tab-separated file, apart from the header. Parsing it should be easy in R, but I just worry about speed.

ADD REPLY • link 6.1 years ago by Kevin Blighe 87k

0

Entering edit mode

the VCF can optionally be converted to TAB delimited format with bcftools query. Or to any specific R format.

ADD REPLY • link 6.1 years ago by William ★ 5.3k

0

Entering edit mode

Should that not suffice, in that case? You may look up the MAF variant format, in that case, just to stick to convention. The MAF format was developed in light of the TCGA deluge of data, as far as I know.

I would always add a unique identifier for each variant, too, consisting of sample:chr:pos:ref:alt (i.e. used as rownames in a R data-frame)

ADD REPLY • link 6.1 years ago by Kevin Blighe 87k

0

Entering edit mode

Just load a TAB file as a data frame is certainly an option. I was hoping there would be a R library that would support:

working on larger genotype tabels than can be loaded as a data-frame
be more user friendly, i.e. offer pre-defined domain specific filters like for example bcftools does (instead of just filtering on strings and integers)

ADD REPLY • link 6.1 years ago by William ★ 5.3k

0

Entering edit mode

Sometimes the input VCF already has been filtered for regions and samples of interest. So it's now always very big (i.e. could be opened in Excel if you really wanted to) , but sometimes it still is very big (more suited for streaming processing with bcftools/cyvcf2).

ADD REPLY • link 6.1 years ago by William ★ 5.3k

0

Entering edit mode

Well, I worked with a colleague in California who always opened my VCFs in Excel, but therefore I had to limit the variants to < ~1,000,000 due to limits in Excel. Although frowned upon, it's not always possible to keep everything within the coding environment, nor is it practical.

ADD REPLY • link 6.1 years ago by Kevin Blighe 87k

0

Entering edit mode

see the previous question on biostars and the examples for my tool vcffilterjdk : http://lindenb.github.io/jvarkit/VcfFilterJdk.html

ADD REPLY • link 6.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

I was looking for something R specific Pierre. VcfFilterJdk is as R specific as writing an CYVCF2 python script and then executing that via R system2 call to the CYVCF2 script.

ADD REPLY • link 6.1 years ago by William ★ 5.3k

score 2 · Accepted Answer · 2018-03-19

Have a look at the VariantAnnotation package in Bioconductor.

https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html

And the vignette on filtering VCF files.

https://bioconductor.org/packages/release/bioc/vignettes/VariantAnnotation/inst/doc/filterVcf.pdf

Note that this may not be the fastest option due to the parsing into VCF object in R, but it can be a useful adjunct to "system" calls (remember, you can wrap command-line utilities in R functions to make them "appear" to R users as R functions).