I'm extracting BED from a 77G VCF file using bedops vcf2bed, and it produces a bunch of core dumps. I'm running the following on a compute node to extract the BED:
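Roughly this, with placeholder filenames (the exact flags may have differed slightly):

    # Decompress and stream the VCF in chunks through vcf2bed (a sketch, not the verbatim command)
    pigz -dc in.vcf.gz | parallel --pipe 'vcf2bed --do-not-sort' > out.bed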
I'm guessing the multiple core dumps come from the parallel --pipe. When I ran this on a login node without parallel --pipe, backgrounded with &, I could see the process running, but jobs showed a core dump there as well. The BED file grows and looks fine, but the core dump happens nonetheless.
Am I missing something? This worked fine just two days ago.
I have not used parallel with vcf2bed before, so I'm unsure what it is doing to parallelize the work. What happens if you do not use parallel? Is vcf2bed using version 2.4.29 of convert2bed? You could make sure you are running the desired binary by replacing vcf2bed with /path/to/2.4.29/convert2bed-megarow --input=vcf --do-not-sort --snvs < in.vcf > out.bed.
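For instance, something like this would confirm which binary the shell actually resolves and what version it reports (assuming the binaries are reachable; convert2bed supports a --version flag):

    # Check which vcf2bed/convert2bed the shell finds, and the reported version
    which vcf2bed convert2bed
    convert2bed --version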
It's on a compute node, and the bedops binaries are not added to the PATH by default. That's why I'm adding the 2.4.29 precompiled binaries explicitly.
I'm just trying to figure out a way to isolate the problem to as few variables as possible. Is it possible to run /path/to/bedops/2.4.29/convert2bed-megarow --input=vcf --do-not-sort --snvs < in.vcf > out.bed directly on the compute node, without using parallel or specifying PATH?
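If the cluster runs a scheduler, a minimal one-core job would do it - a sketch assuming SLURM (adjust paths and scheduler syntax to your site):

    # Run the megarow binary directly: no parallel, no PATH changes
    sbatch --nodes=1 --ntasks=1 --mem=16G \
      --wrap="/path/to/bedops/2.4.29/convert2bed-megarow --input=vcf --do-not-sort --snvs < in.vcf > out.bed"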
OK, I'll try that now.

Still a segmentation fault. STDERR reads:

There are a bunch of core.XXXX files, ~30M in size, and empty BED files. The config I'm requesting is 1 node (1 CPU) with 16G RAM.

Thanks. How are these binaries installed? Are these downloaded from the GitHub package, or did you compile them?

These are direct binaries that I downloaded off GitHub. I'm trying another route now - I extracted the vcf.gz files using pigz -dc, and now I'm trying gunzip -c. It probably won't make a difference, but it eliminates pigz-induced corruption as a factor. I have compiled binaries as well; I can try them if required.
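One way to settle the pigz question directly is to compare the two decompression outputs (a quick sketch; filenames are placeholders):

    # Identical checksums would rule out pigz-induced corruption
    pigz -dc in.vcf.gz > in.pigz.vcf
    gunzip -c in.vcf.gz > in.gunzip.vcf
    md5sum in.pigz.vcf in.gunzip.vcf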
Self-compiled binaries would be a useful test, as it would remove differences in Linux setup as a potential source of problems.
If you can build and run self-compiled binaries, please use make all to build them, so that you get a megarow build of convert2bed. If you instead use make, please be sure to edit the exponent constants so that the binaries support longer VCF records.
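A minimal build sketch, assuming a clone from GitHub (the install target may differ between releases):

    # Build all variants so the megarow build of convert2bed is included
    git clone https://github.com/bedops/bedops.git
    cd bedops
    make all
    make install    # binaries are copied into ./bin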
Also, ideally, if you could write the first 1M lines of your VCF to an archive and post it somewhere I can download it, I could try converting it on my end and watch what's going on inside convert2bed during conversion, to see what might be causing the segfault.
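Assuming the VCF is plain text (or already decompressed), something as simple as this would produce the sample:

    # First 1M lines, header included, compressed for upload
    head -n 1000000 in.vcf | gzip > sample.vcf.gz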
If this works for your 77G file, I would be curious to know some details about your Linux setup (what your HPC runs, whatever comes out of ldd --version on a compute node, etc.). Those details could help me make the GitHub packages more compatible with what people are running.
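For example, from a compute node (the os-release file may not exist on older distributions, hence the fallback):

    # Basic environment details that would help
    cat /etc/os-release 2>/dev/null || cat /etc/redhat-release
    uname -r
    ldd --version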
Sure thing, I'll check tomorrow and let you know. We recently had a firmware and software upgrade, so I can also ask our HPC team for input.
It ran to completion without any problems. DM me on Twitter with any questions you have!
➜ ldd --version
ldd (GNU libc) 2.12
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
That's an older version of glibc than the one I compile the precompiled binaries against. Because glibc is backwards- but not forwards-compatible, a binary built against a newer glibc generally won't run on an older one, and that's probably why you ran into problems with the precompiled binaries. This is useful info, thanks!
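As a quick compatibility check, you can list the glibc symbol versions a prebuilt binary was linked against and compare them with what your node provides (objdump assumed available):

    # Symbol versions above your node's glibc (2.12 here) indicate an incompatible binary
    objdump -T /path/to/convert2bed-megarow | grep -o 'GLIBC_[0-9.]*' | sort -u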
Without parallel, by the way. I wanted to test my build and didn't want to add anything else to the run. Once it started running well, I didn't interrupt it, so the current run isn't using parallel.
I'm sure that if I do use parallel --pipe now, it'll work fine. The cause was the mismatch between the precompiled binary and my cluster, not parallel.