Hello,
I'm trying to run BEDOPS vcf2bed on a huge (27 GB) VCF file. I'd like to extract a BED of the deletions in the VCF.
My usage:
vcf2bed --max-mem=4G --deletions <file.vcf >file.bed
I'm running it on a cluster and I tried giving the job 8G and 16G of RAM, but it maxes out all available RAM within 100-150 seconds and the job quits. I know splitting the VCF into per-chromosome chunks might solve this, but is there any other option I could use?
I am open to using other tools to extract the BED as well, given that they ensure end - start equals length(REF).
EDIT:
I combined a couple of ideas from the recommendations below to get to my solution; YMMV, and not all of these modifications may be necessary. I piped the file in using GNU parallel's --pipe and also used --do-not-sort to get unsorted BED files.
cat huge_file.vcf | parallel --pipe vcf2bed --do-not-sort >huge_file.unsorted.bed
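If you need a sorted BED downstream, a separate sort-bed pass with a memory cap should work; this is just a sketch I haven't benchmarked at this scale, so adjust --max-mem to whatever your node allows:
sort-bed --max-mem 4G huge_file.unsorted.bed >huge_file.bed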
--
Ram
Sounds like giving it 72G of RAM might work, Ram.
...Sorry, couldn't pass it up.
Seriously though, the vcf2bed and convert2bed commands don't have a ton of options. Alex could probably give you a solution though; dude's a wizard. Alternatively, you could filter on the variant type in the VCF to create a file of only the INDELs, which I'm sure would cut your file down to probably only a few gigs. vcf2bed should be able to handle that file easily then.
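For example, assuming you have bcftools available (just one way to do the filtering; an awk test comparing REF/ALT lengths would also work):
bcftools view --types indels file.vcf >file.indels.vcf
vcf2bed --max-mem=4G --deletions <file.indels.vcf >file.deletions.bed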