Access data from very big vcf files in R
2.2 years ago
bisansamara • 10

Hi, I have a very big vcf file (11.8 GB), the header and first row look like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       13372   .       G       C       608.91  PASS    "AC=3;AC_AFR=0;AC_AMR=0

How can I access the #CHROM and POS columns?

Note that I cannot view it in Excel because it is too big. I have also tried the following, but neither worked:

#1
> library(VariantAnnotation)
> vcfFile = system.file(package="VariantAnnotation", "extdata", "ExAC.r1.sites.vep.vcf.gz")
> scanVcfHeader(vcfFile)
Error in .io_check_exists(path(con)) : file(s) do not exist:
  ''

#2
> vcf<-readVcf("ExAC.r1.sites.vep.vcf.gz","hg19")
Error: cannot allocate vector of size 54 Kb

Any help is highly appreciated


I would do this task on the Linux command line, as discussed below, but if you really need to read it in R you can use fread from library(data.table).

awk 'BEGIN{OFS="\t"} !/^#/ {print $1,$2}' <(gzip -dc yourfile.gz) | gzip > output.txt.gz
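If you would rather skip the intermediate file, fread can read the two columns straight from the compressed VCF. A minimal sketch, assuming data.table >= 1.11.8 for the cmd= argument (the file name is a placeholder, as in the awk line above):

```r
library(data.table)

# Decompress on the fly and keep only columns 1-2 (CHROM and POS);
# skip = "#CHROM" fast-forwards past the ## meta-information lines,
# so that line is used as the header.
dt <- fread(cmd = "gzip -dc yourfile.gz",
            skip = "#CHROM", select = 1:2)
head(dt)
```

Because select drops the other columns during parsing, memory use stays proportional to two columns, not the whole file.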

14 months ago
Ginsea Chen • 120
Chinese Academy of Tropical Agricultura…

You can extract your target information with the following Linux shell command:

zcat ExAC.r1.sites.vep.vcf.gz | tail -n +x | awk '{print $1"\t"$2}' > target.bed

Here x is the line number of the first data (non-header) line, and target.bed is your result file.

This is a simple operation; you can contact me (cginsea@gmail.com) if you need any help with this question.

14 months ago
d-cameron ♦ 2.0k
Australia

You have insufficient memory to load the entire VCF at once. The readVcf() function has an optional param argument which lets you specify not only the genomic region you wish to load, but also which VCF fields to load. By restricting the regions and fields to the minimum you actually need, you can greatly reduce the memory footprint of the loaded VCF.
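A minimal sketch of that approach with VariantAnnotation (the file name is from the question; the region and the "hg19" build are placeholder assumptions, and region queries require a bgzip-compressed, tabix-indexed file):

```r
library(VariantAnnotation)  # also attaches Rsamtools and GenomicRanges

## Region queries need a .tbi index; if missing, create one with
## Rsamtools::bgzip() followed by indexTabix().
tab <- TabixFile("ExAC.r1.sites.vep.vcf.gz")

param <- ScanVcfParam(
  info  = NA,                              # load no INFO fields
  geno  = NA,                              # load no genotype fields
  which = GRanges("1", IRanges(1, 1e6))    # only region 1:1-1,000,000
)

vcf <- readVcf(tab, "hg19", param = param)
rowRanges(vcf)   # CHROM and POS live here, as a GRanges object
```

With info and geno set to NA, only the fixed columns for the requested window are parsed, which is exactly the #CHROM/POS information the question asks for.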

If it's still too big to load, you could shrink your problem by only considering a subset of the data at any point in time (e.g. performing your analysis per chromosome).
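A hedged sketch of that per-chromosome pattern (seqnamesTabix() lists the contigs recorded in the tabix index; the upper bound 5e8 is an arbitrary value larger than any human chromosome, and the file name and "hg19" build are assumptions carried over from the question):

```r
library(VariantAnnotation)  # also attaches Rsamtools and GenomicRanges

tab <- TabixFile("ExAC.r1.sites.vep.vcf.gz")

for (chr in seqnamesTabix(tab)) {
  param <- ScanVcfParam(info = NA, geno = NA,
                        which = GRanges(chr, IRanges(1, 5e8)))
  vcf <- readVcf(tab, "hg19", param = param)
  ## ... per-chromosome analysis here; vcf is replaced on the
  ## next iteration, so only one chromosome is in memory at a time
}
```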

Alternatively, you can use a computer with more memory.
