Question

C/C++ Libraries For Bioinformatics?

35

13.8 years ago

User 59 13k

This question has come via a former colleague:

"I've done my work in bioinformatics in Python, but for data crunching it's really slow (one of the applications is running for almost a week), so I decided to switch to C. Does anybody know about something like Biopython in C (it seems frustrating to rewrite translation, transcription, database entries parsing from scratch)"

A cursory glance at Google shows there are a lot of unmaintained attempts to create an O|B|F-style set of C libraries for computational biology. Rather than focusing on why Python isn't working for this particular case, does anyone know of actively maintained C/C++ libraries as described?

c c • 30k views

ADD COMMENT • link updated 5.0 years ago by chernouhov sergey ▴ 50 • written 13.8 years ago by User 59 13k

4

For eliminating bottlenecks you might want to look into scipy/weave, since it enables you to embed C/C++ code directly into your python scripts. See http://www.scipy.org/PerformancePython for details.

ADD REPLY • link 13.8 years ago by Michael Schubert ★ 7.1k

3

I'm assuming you know your stuff and have tried this, but whenever performance comes up, it's worth mentioning that profiling to identify bottlenecks can be very useful. If there is an obvious rate-limiting step, you may be able to extend python and just write a C function for that one small piece of the puzzle, rather than switching wholesale to C.

ADD REPLY • link 13.8 years ago by Chris Miller 22k

0

If you need to parse some structured data, lex/yacc is your friend. For XML use libxml2; for ASN.1 use the NCBI asntool.

ADD REPLY • link 13.8 years ago by Pierre Lindenbaum 161k

0

Thanks Chris, the originator of this question has been pointed to this thread - I'm sure they appreciate the comments about identifying bottlenecks as well.

ADD REPLY • link 13.8 years ago by User 59 13k

score 29 · Answer 1 · 2010-06-22

29

13.8 years ago

Phis ★ 1.1k

The SeqAn C++ library is quite nice.

ADD COMMENT • link 13.8 years ago by Phis ★ 1.1k

0

I think this might have been the one that was at the back of my mind - I had an inkling this was out there.

ADD REPLY • link 13.8 years ago by User 59 13k

0

SeqAn is really good and complete. Bowtie and other tools use it.

ADD REPLY • link 12.8 years ago by Bioquant ▴ 160

Ram · Answer 2 · 2010-06-22

15

13.8 years ago

Pierre Lindenbaum 161k

The only stable library I know is the NCBI toolkit

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.8 years ago by Pierre Lindenbaum 161k

4

But use the C++ version, as the C one is a PITA to compile and to use. Documentation is horrible (or at least was) for the C version.

ADD REPLY • link 13.8 years ago by Paulo Nuin ★ 3.7k

0

Thanks Pierre, I hadn't thought about the NCBI toolkit

ADD REPLY • link 13.8 years ago by User 59 13k

Ram · Answer 3 · 2010-06-22

14

13.8 years ago

brentp 24k

The genometools library is actively maintained, well-tested, has visualization tools, parsers, indexes, and a bunch of other tools.

It's also worth having a look at Jim Kent's stuff.

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.8 years ago by brentp 24k

Ram · Answer 4 · 2010-07-06

9

13.8 years ago

Haibao Tang 3.0k

I would like to mention Bio++, mostly rich for phylogenetic stuff (substitution models etc.).

"Bio++ is a set of C++ libraries for Bioinformatics, including sequence analysis, phylogenetics, molecular evolution and population genetics."

Then there is also bppSuite, built on top of that.

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.8 years ago by Haibao Tang 3.0k

score 6 · Answer 5 · 2010-06-22

I recommend easel (the library behind HMMER; you can get it with any source code release of HMMER3). I've read (though never used) large portions of its source code. The algorithms are documented, the code is lucid, and it is ANSI C AFAIK, but without knowing what you need I cannot tell whether it is in easel.

Ram · Answer 6 · 2010-10-29

6

13.5 years ago

Rm 8.3k

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.5 years ago by Rm 8.3k

Ram · Answer 7 · 2010-06-22

4

13.8 years ago

Fiamh ▴ 220

Can I pitch Curtis Huttenhower's Sleipnir library? (Tutorial here)

-- Oliver

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.8 years ago by Fiamh ▴ 220

Ram · Answer 8 · 2010-10-28

4

13.5 years ago

Ketil 4.1k

One of my "selling points" for using Haskell for bioinformatics is that it combines a high level of abstraction with high performance. This makes it very quick to write applications, especially when chores like parsing file formats are already solved in a library, and yet the resulting applications compile to native code and run at speeds comparable (typically within a factor of two) to C.

Perhaps Haskell isn't for everybody, but I think there is a clear need for a language that optimizes programmer time (which is, let's face it, a far more scarce resource than CPU time) without sacrificing too much CPU time.

ADD COMMENT • link 13.5 years ago by Ketil 4.1k

0

Is there a Haskell bioinformatics library?

ADD REPLY • link 13.5 years ago by Erik Garrison ★ 2.4k

0

Ah, there is: http://blog.malde.org/index.php/the-haskell-bioinformatics-library/ :)

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 13.5 years ago by Erik Garrison ★ 2.4k

0

Yes there is that :-) There are actually several now, I'm working with the author of some of the others (dealing with RNA structure etc) to integrate them.

ADD REPLY • link 13.5 years ago by Ketil 4.1k

0

There are actually several now, and I'm working with the author of some of the other stuff (dealing with RNA structure) to integrate them. See the "bioinformatics" section on HackageDB

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 13.5 years ago by Ketil 4.1k

Ram · Answer 9 · 2010-10-28

Kevin Thornton's libsequence is

a C++ library designed to aid writing applications for genomics and evolutionary genetics. A large amount of the library is dedicated to the analysis of "single nucleotide polymorphism", or SNP data. The library is intended to be viewed as a "BioC++" akin to the bioperl project, although ... libsequence tries not to re-invent the wheel. Rather, the focus is on biological computation, such as the analysis of SNP data and sequence divergence, and the analysis of data generated from coalescent simulation.

EDIT 16 July 2011:

GeCo++ (Genomic computation C++ library) is "a C++ class library to the purpose of making easier and faster the efficient implementation of algorithms for sequence analysis when functional annotations and genomic variations need to be considered."

Ram · Answer 10 · 2010-10-29

4

13.5 years ago

lh3 33k

It really depends on what you need, which you have not clarified. For writing algorithms, seqan is the best, but it is not for format parsing. For that purpose, there are no good C/C++ libraries. Java is usually better than scripts when speed is a concern.

EDIT: Maybe I can advertise my own works:

kseq.h: The most efficient and versatile fasta/fastq parser.
ksw: SSE2 Smith-Waterman. Probably only SWPS3 can achieve a similar speed.
knhx: Light-weight New Hampshire parser (not thoroughly tested though)
khmm: Basic HMM library.

These are all single-file or two-file libraries. If you need a component, copying one or two source files is enough. No worry about dependencies. These are also the most efficient among libraries having similar functionalities (e.g. kseq.h is 10X faster than the fastx_toolkit parser; ksw is 30X faster than a non-SSE2 implementation).
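For context on the speed comparison above, a plain, non-SSE2 Smith-Waterman score can be sketched in a few lines of C++. This is not ksw's code or API: the function name `smith_waterman_score`, the linear gap model, and the default scoring values are assumptions chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Plain (non-SIMD) Smith-Waterman local alignment score with a linear gap
// penalty. Scoring values are illustrative defaults, not taken from ksw.
int smith_waterman_score(const std::string& a, const std::string& b,
                         int match = 2, int mismatch = -1, int gap = -2) {
    assert(match > 0 && mismatch <= 0 && gap <= 0);
    // Only the score is needed, so two rolling rows of the DP matrix suffice.
    std::vector<int> prev(b.size() + 1, 0), curr(b.size() + 1, 0);
    int best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = 0;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            const int up = prev[j] + gap;        // gap in b
            const int left = curr[j - 1] + gap;  // gap in a
            // Local alignment: a cell never drops below zero.
            curr[j] = std::max({0, diag, up, left});
            best = std::max(best, curr[j]);
        }
        std::swap(prev, curr);
    }
    return best;
}
```

Every cell depends on its left, upper, and diagonal neighbours, which is the dependency that SSE2 implementations restructure in order to fill several cells per instruction.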

As to the speed of parsing: in another post, we have discussed that many bio* components are very inefficient for the sake of completeness. Writing your own can be far faster. The key reason we do not write parsers in C is that for typical parsing, C is not much faster than a script. Maybe twice as fast, but who cares?
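To illustrate how little code a hand-written parser needs, here is a minimal FASTA reader in C++. It is a sketch, not kseq.h: `FastaRecord`, `read_fasta`, and `read_fasta_string` are hypothetical names, FASTQ is not handled, and no attempt is made to match kseq.h's speed.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct FastaRecord {
    std::string name;      // text after '>' up to the first whitespace
    std::string sequence;  // all sequence lines of the record, concatenated
};

// Reads every record from a FASTA stream into memory.
std::vector<FastaRecord> read_fasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (line.empty()) continue;                                 // skip blank lines
        if (line[0] == '>') {
            // Header line: start a new record, keeping only the first token.
            const std::size_t end = line.find_first_of(" \t", 1);
            const std::size_t count =
                (end == std::string::npos) ? std::string::npos : end - 1;
            records.push_back({line.substr(1, count), ""});
        } else if (!records.empty()) {
            records.back().sequence += line;
        }
        // Sequence lines that appear before the first header are ignored.
    }
    return records;
}

// Convenience wrapper for parsing FASTA text that is already in memory.
std::vector<FastaRecord> read_fasta_string(const std::string& text) {
    std::istringstream in(text);
    return read_fasta(in);
}
```

For a file, passing a `std::ifstream` to `read_fasta` works the same way. Loading everything into a vector keeps the sketch short; a streaming parser such as kseq.h yields one record at a time instead.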

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.5 years ago by lh3 33k

0

"Java is usually better when speed is a concern."

Do you find Java to be faster than C?

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 13.1 years ago by Aaronquinlan 12k

0

I actually meant that Java is faster than scripting languages. Sorry for the confusion.

ADD REPLY • link 13.1 years ago by lh3 33k

0

In addition to SWPS3, there is also diagonalsw, which achieves a similar speed. A drawback of SWPS3 is that its algorithm is buggy: it sometimes gives the wrong result. For details see here

ADD REPLY • link updated 5.6 years ago by Ram 43k • written 12.1 years ago by Erik Sjölund • 0

0

Yes, SWPS3 is buggy, but I cannot recommend diagonalsw. I wasted half an hour and yet still could not compile it. Its dependency is really unnecessary.

ADD REPLY • link 12.1 years ago by lh3 33k

score 1 · Answer 11 · 2011-07-18

1

12.8 years ago

Fede ▴ 10

GeCo++, as suggested by Casey, can do the job. And not just because I'm currently working on it (at the medea institute) ;)

ADD COMMENT • link 12.8 years ago by Fede ▴ 10

0

Hi Fede, welcome to BioStar. Comments like this should be added using the "add comment" function underneath an answer/question. If you can please make this edit, then I can delete this "answer".

ADD REPLY • link 12.8 years ago by Casey Bergman 18k

Ram · Answer 12 · 2012-02-09

As well as being a very useful set of tools, EMBOSS is also a C-based framework for developing sequence analysis applications; see the EMBOSS Developers Guide.

Another option is BioLib, which aims to provide access to C/C++ libraries from BioPerl, BioRuby and BioPython. It may be quicker to use BioLib to improve the performance of parts of the Python code, instead of reimplementing in C/C++.

Ram · Answer 13 · 2012-03-15

If you are working with XML data files that are defined by an XML schema, you could use either one of the open source software products

CodeSynthesis XSD/e

and

CodeSynthesis XSD

to generate parsing functions that give you a data object model to work with. You also get serialization functions for creating XML files. In addition to that, these two software products create binary formats that you could use if you need efficient storage of your XML files. Parsing and serialization of the binary format is also faster than the XML format.

The parsing and serialization of the XML files can be done in a streaming mode which is nice if the XML files are big.

One example of what you could do with these two software products is to read Uniprot XML files. I tried out CodeSynthesis XSD/e in streaming mode together with the Uniprot XML Schema.

It worked great and the parsing was really fast.

CodeSynthesis XSD is available in Ubuntu Linux. To install it, run

root@ubuntu-linux:~# apt-get install xsdcxx

score 0 · Answer 14 · 2019-04-13

On CBioInfCpp.h as a C++ lib containing some functions for bioinformatics

Dear Sirs.

Though I am not a professional programmer, bioinformatics is a very interesting interdisciplinary field for me.

As I see it, Python is the "standard language" in this field.

But when I solved problems at rosalind.info, I used C++. As a result, a "lib of some functions" was born.

The lib contains 3 groups of functions. The first group is input/output (to read and write vectors, matrices, and graphs from and to a file with a single command, as in Python).

The second group is "Working with strings". It contains functions ranging from computing GC content and edit distance to finding all mutated strings in a given one.
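Two of the string operations named above, GC content and edit distance, can be sketched as follows. These are generic textbook versions, not the CBioInfCpp.h implementations: the names `gc_content` and `edit_distance` and their signatures are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Fraction of G and C bases in a sequence (case-insensitive); 0.0 if empty.
double gc_content(const std::string& seq) {
    if (seq.empty()) return 0.0;
    std::size_t gc = 0;
    for (char c : seq) {
        if (c == 'G' || c == 'C' || c == 'g' || c == 'c') ++gc;
    }
    return static_cast<double>(gc) / static_cast<double>(seq.size());
}

// Levenshtein edit distance with unit costs for substitution, insertion,
// and deletion, computed with two rolling rows of the DP table.
std::size_t edit_distance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;  // "" -> b[0..j)
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;  // a[0..i) -> ""
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            curr[j] = std::min({sub, prev[j] + 1, curr[j - 1] + 1});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

Keeping only two rows is enough when just the distance is wanted; recovering an alignment as well requires the full table or a traceback structure.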

The third group is "Working with graphs". A data structure called "Adjacency vector" is suggested. In the general case, vertices may have negative integers assigned, and graphs may have multiple loops and edges. Functions such as Eulerian cycle and path finding, topological sorting, etc. are implemented.
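As a rough illustration of topological sorting on such a graph, the sketch below uses Kahn's algorithm. The post does not spell out the layout of the "Adjacency vector", so the flat `std::vector<int>` of consecutive (from, to) pairs used here is purely an assumption, as is the function name `topological_sort`; the library's own representation and API may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <vector>

// Kahn's algorithm on a directed graph given as a flat list of edges:
// edges = {from0, to0, from1, to1, ...} (an assumed layout).
// Vertex labels may be negative and edges may repeat.
// Returns an empty vector if the graph contains a cycle (or has no vertices).
std::vector<int> topological_sort(const std::vector<int>& edges) {
    assert(edges.size() % 2 == 0);
    std::map<int, std::vector<int>> out;  // outgoing neighbours per vertex
    std::map<int, int> indegree;          // incoming edge count per vertex
    for (std::size_t i = 0; i + 1 < edges.size(); i += 2) {
        const int from = edges[i];
        const int to = edges[i + 1];
        out[from].push_back(to);
        indegree[from];  // make sure the source vertex is registered
        ++indegree[to];
    }
    std::queue<int> ready;  // vertices whose predecessors are all placed
    for (const auto& entry : indegree) {
        if (entry.second == 0) ready.push(entry.first);
    }
    std::vector<int> order;
    while (!ready.empty()) {
        const int v = ready.front();
        ready.pop();
        order.push_back(v);
        for (int w : out[v]) {
            if (--indegree[w] == 0) ready.push(w);
        }
    }
    // Vertices left unplaced lie on (or behind) a cycle, so no ordering exists.
    if (order.size() != indegree.size()) order.clear();
    return order;
}
```

Using `std::map` rather than array indexing is what allows negative vertex labels, and counting each repeated edge separately keeps multi-edges consistent.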

Might it be useful for some tasks?

I understand that this lib lacks the great majority of features. For example, it cannot yet work with bioinformatics databases, and I cannot implement that by myself alone.

Freely distributed source code and info are here: https://drive.google.com/open?id=1FQwsQm2kG_nTO45ab0yj52xtp6_B4IB2

and here: https://github.com/chernouhov/CBioInfCpp-0-

My profile at Rosalind info http://rosalind.info/users/chernouhov/

Best regards, Chernouhov Sergey

23/06/2019 update:

Group of functions "FindIn" has been updated.
Functions PairVectorCout and PairVectorFout have been updated.
Groups of functions "GraphCout" and "GraphFout" have been added, so one can now "cout/fout" a graph given as an Adjacency vector to the screen or to a file line by line, one edge per line.
Function "StrToCircular" has been added for finding the circular string of minimal length of the given one.
Group of functions "MaxFlowGraph" has been added to help find the maximal flow, the paths of the maximal flow network, and the max-flow min-cut in a graph.
A data structure "Adjacency map" (a modification of the "Adjacency vector" graph data structure) has been added. An Adjacency map allows quicker access to an edge's weight, but it can't represent multiple edges.
Functions AdjVectorToAdjMap and AdjMapToAdjVector, for converting an Adjacency vector to an Adjacency map and back, have been added. Note that multiple edges will be joined together.
Function TandemRepeatsFinding has been added. It is intended for finding tandem repeats in a given string, which may be useful for problems related to microsatellite instability etc.

14.07.2019 update:

Function CIGAR1 has been added.
Groups of functions "GraphCout" and "GraphFout" have been updated (so one can now "cout/fout" a graph given as either an Adjacency vector or an Adjacency map to the screen or to a file line by line, one edge per line).
Function EditDistA, an extended version of the function EditDist, has been added (it returns not only the edit distance between 2 strings but also one possible version of the alignment itself).

09.08.2019 update:

Group of functions "NBPaths" (for finding maximal branching paths in a graph, weighted or not, directed or not) has been added.
Functions ConsStringQ1 and ConsStringQ2, for building a consensus string from a given collection of strings according to their quality, have been added. Note that because little test data was available, errors may remain here (please report any you find).

31.08.2019 update:

Function GenRandomUWGraph, which generates a random unweighted graph (as its "Adjacency vector"), has been added.
Group of functions for finding the vertices of each strongly connected component of a directed graph, and of each connected component of an undirected graph, has been added.
Group of functions for counting edge multiplicity in a graph given as an Adjacency vector has been added.

19.10.2019:

Added group of functions AdjVectorToAdjMegaMap and AdjMegaMapToAdjVector to convert an Adjacency vector to/from an Adjacency mega-map (i.e. an extended version of the Adjacency map that can hold graphs with different multiple edges).
Updated groups of functions GraphCout and GraphFout to deal with mega-maps.

03.11.2019

Group of functions Num updated.
Function ScoreStringMatrix, which computes the score (i.e. total number of mismatches) of a vector of strings, added.
Function GPPM, which generates a position probability matrix (PPM), added. Note that pseudocounts may be used (the formula (Ns+z)/(N+2*z) is implemented).
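The pseudocount formula mentioned for GPPM can be illustrated with a small sketch. This is not the library's GPPM function: the name `position_probability_matrix`, the `Ppm` type, and the A, C, G, T column order are assumptions. Here Ns is the count of symbol s at a position, N is the number of strings, and z is the pseudocount. The formula is kept exactly as stated in the post; note that with a four-letter alphabet a denominator of N + 4z would be needed for each row to sum to one when z is nonzero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One row per motif position; columns are ordered A, C, G, T (an assumption).
using Ppm = std::vector<std::array<double, 4>>;

// Builds a position probability matrix from equal-length DNA strings using
// the formula quoted above: P(s) = (Ns + z) / (N + 2*z).
Ppm position_probability_matrix(const std::vector<std::string>& motifs,
                                double z = 0.0) {
    Ppm ppm;
    if (motifs.empty()) return ppm;
    const std::size_t length = motifs[0].size();
    const double n = static_cast<double>(motifs.size());
    ppm.assign(length, std::array<double, 4>{{0.0, 0.0, 0.0, 0.0}});
    const std::string alphabet = "ACGT";
    for (const std::string& motif : motifs) {
        assert(motif.size() == length);  // all strings must have equal length
        for (std::size_t pos = 0; pos < length; ++pos) {
            const std::size_t idx = alphabet.find(motif[pos]);
            if (idx != std::string::npos) ppm[pos][idx] += 1.0;  // count Ns
        }
    }
    // Turn raw counts into probabilities, applying the pseudocount z.
    for (auto& row : ppm) {
        for (double& p : row) p = (p + z) / (n + 2.0 * z);
    }
    return ppm;
}
```

With z = 0 the entries are plain column frequencies; a positive z keeps symbols that never occur at a position from getting probability zero.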

26.11.2019

For further updates please see here: A: CBioInfCpp.h as a C++ lib containing some functions for bioinformatics