Recompile bedops with increased TOKENS_MAX_LENGTH
6.4 years ago
Ram 43k

Hello,

This follows up on my previous post, where I extracted mammoth BED files from mammoth VCF files and had to skip sorting. Now I want to sort those BED files, but some lines are extremely long (wc -c reports ~50,000 characters).

bedops asks me to recompile binaries after increasing TOKENS_MAX_LENGTH. How can I increase this value - do I edit ./interfaces/general-headers/suite/BEDOPS.Constants.hpp? And what is the maximum value I can give this constant?

I Googled and this hasn't been addressed before. I'm not a C++ guy so I thought we might benefit from a documented solution to this question :-)


EDIT


I'm reading that the "rest" of the BED file is given 2**15 characters as the MAX LENGTH. I'm changing the exponentiation to 17 and trying to proceed. I'll update this post once I have a result.
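As a rough way to pick the exponent, here is a minimal sketch that measures the longest line in a BED file and finds the smallest power of two that exceeds it. The helper names longest_line and min_exponent are made up for illustration and are not part of BEDOPS. Measuring the whole line is a conservative stand-in for the "rest" portion, and I am assuming the limit must be strictly greater than the length.

```shell
# Hypothetical helpers, not part of BEDOPS.

# Print the length of the longest line in a file (newline excluded).
# LC_ALL=C makes awk count bytes, which matches what wc -c reports.
longest_line() {
  LC_ALL=C awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }' "$1"
}

# Print the smallest n such that 2^n is strictly greater than the given length.
# The strict inequality is an assumption, to leave room for a terminator.
min_exponent() {
  len=$1
  n=0
  while [ $((1 << n)) -le "$len" ]; do
    n=$((n + 1))
  done
  echo "$n"
}
```

For a 50,000-character line, min_exponent 50000 prints 16, so going up to 17 leaves one extra doubling of headroom.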

6.4 years ago

Some major changes and speed enhancements were added in version 2.4.27 of BEDOPS. These changes are detailed in the revision history for this version:

http://bedops.readthedocs.io/en/latest/content/revision-history.html#v2-4-27

This version (and versions after 2.4.27) includes the packaging of two versions of each of the BEDOPS binaries, one suffixed -typical and another suffixed -megarow.

So there is bedops-typical and bedops-megarow, and the same for bedmap, etc.

We set up symbolic links so that you can keep your pipelines written as they are. The typical binaries are the default selection.

The typical binaries are compiled with a shorter maximum token length, so as to reduce memory usage and maximize speed improvements. Most people can use typical binaries without having to think about this or worry about it.

However, you are running into an issue where the BED line length is very long, too long for the typical binaries.

So we have included megarow binaries, which allow longer token lengths based off the values in these lines of the parent BEDOPS Makefile:

https://github.com/bedops/bedops/blob/master/Makefile#L9-L11

You could try using the megarow binaries to see if this helps with your set operations.

If you have installed the two-build version of BEDOPS (I'm not sure what Homebrew does, now), you can use the convenience script switch-BEDOPS-binary-type to switch between typical and megarow builds of BEDOPS.

For example, to switch the binary set to -megarow suffixed binaries:

$ switch-BEDOPS-binary-type --megarow

This changes the binaries that the symbolic links point to.
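If you want to confirm which build a command currently resolves to, a small sketch like the following can help. The function name link_target is hypothetical, and it assumes the command on your PATH (for example bedops) is itself a symbolic link to the -typical or -megarow binary, as in the two-build install described above.

```shell
# Hypothetical helper, not part of BEDOPS.
# Print what the symbolic link for a command on PATH points to.
# Returns non-zero if the command is not found or is not a symlink.
link_target() {
  path=$(command -v "$1") || return 1
  readlink "$path"
}

# Example usage (requires a two-build BEDOPS install):
#   link_target bedops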

You'll see that the binary version and help statements change, e.g.:

$ bedops --version
bedops
  citation: http://bioinformatics.oxfordjournals.org/content/28/14/1919.abstract
  version:  2.4.29 (megarow)
  authors:  Shane Neph & Scott Kuehn

See the version key.

You can switch back to typical with the same script:

$ switch-BEDOPS-binary-type --typical

If the megarow binaries do not work, you can edit the parameters in the lines of the parent Makefile (https://github.com/bedops/bedops/blob/master/Makefile#L9-L11) and recompile with make all (not make alone).

This target builds both sets of binaries (typical and megarow) and applies any changes you make to the token lengths in the -megarow suffixed binaries. For example, edit MASSIVE_ID_EXP and MASSIVE_REST_EXP to lengthen the ID and non-ID parts of a BED4+ line, respectively.

You could also edit the header BEDOPS.Constants.hpp directly, to choose the desired exponent for the maximum token length, and then run make (not make all). This approach would make one build of binaries with custom line length parameters.


I recall seeing the twin binaries and the symlinks, but did not put two and two together. This is a super convenient method! Thank you!

6.4 years ago
Ram 43k

Alright, I found the solution.

This is a nice piece of programming - the stored constants are the power to which 2 is raised to get the maximum allowed length. Each segment (chromosome, position and the rest of the line) is given its own capacity parameter.

Once I changed

#ifndef REST_EXPONENT
#define REST_EXPONENT 15
#endif

to

#ifndef REST_EXPONENT
#define REST_EXPONENT 17
#endif

to allow for lines longer than 2**15=32768 characters, then compiled following the instructions (make + make install), everything worked fine.
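For anyone who prefers to script the edit, here is a minimal sketch using sed. The function name set_rest_exponent is made up for illustration, and it assumes the header contains a line of the form "#define REST_EXPONENT <number>" as shown above. It keeps a .bak copy of the original header.

```shell
# Hypothetical helper, not part of BEDOPS.
# Rewrite the numeric value on the "#define REST_EXPONENT" line of a header.
# Usage: set_rest_exponent <path-to-header> <new-exponent>
set_rest_exponent() {
  header=$1
  exponent=$2
  # -i.bak edits in place and keeps a backup; this form works with GNU and BSD sed.
  sed -i.bak -E \
    "s/^([[:space:]]*#define[[:space:]]+REST_EXPONENT[[:space:]]+)[0-9]+/\1${exponent}/" \
    "$header"
}

# Example usage, run from the top of the BEDOPS source tree:
#   set_rest_exponent ./interfaces/general-headers/suite/BEDOPS.Constants.hpp 17
```

After the edit, compile as before (make + make install).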
