Benchmarking bioinformatics tools
8.7 years ago
h.mon 35k

I want to benchmark bioinformatics tools in general (CPU time, memory usage, disk usage), but specifically to do so accurately for pipelines, where a "master" script makes lots of calls to other programs. Currently I am using GNU time, but I do not know how reliable it is for these cases.
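
For reference, what I am doing now is essentially the following (pipeline.sh stands in for my master script):

# -v prints everything GNU time records, including "Maximum resident set size".
$ /usr/bin/time -v ./pipeline.sh 2> time.log

# Or a compact custom format: wall time (s), peak RSS (kB), CPU usage.
$ /usr/bin/time -f "%e %M %P" ./pipeline.sh

My specific worry: as far as I understand, GNU time takes its numbers from the wait4() rusage, so CPU time is summed over all child processes, but the peak memory is that of the single largest process, not the combined footprint of children running concurrently.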

Are there suggestions for other tools or procedures?

P.S.: I know this is not strictly speaking a bioinformatics question, but I believe it is of general interest to the community. If moderators disagree, close the question without mercy.

edit: removed "profiling" from the title and text; now that I have learned what profiling software actually means, it is not what I want.

Valgrind massif can analyze heap and stack memory usage:

$ valgrind --tool=massif --stacks=yes some-binary ...
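
Massif writes its measurements to a file named massif.out.<pid> in the working directory; ms_print turns that into a text graph plus a table of snapshots (the pid suffix below is just an example):

$ ms_print massif.out.12345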

It is also useful for debugging memory leaks in C applications.

It can slow down the application runtime considerably, however, so you would not want to run this in conjunction with time benchmark tests.

I plan to perform the benchmarks on a regular basis, so Valgrind isn't a good choice due to the slowdown. In addition, it seems very complex; I just need some quick, reasonable benchmarks.

I actually learned (thanks to a great response which, however, is no longer showing here) that I probably misused "profiling"; I just intend to benchmark.

I use memtime to capture memory and time information. It appears to correctly capture time even when local subprocesses are launched, but memory appears to not be correctly captured in that case - so a shellscript launcher just shows negligible memory usage. Also, with Java programs, it's really hard to figure out how much memory they actually need except by trial and error, since they use however much you tell the JVM to use. For that reason I've modified some of my programs to print how much memory the JVM thinks it's using, but that's still not quite correct, and hardly valid in a benchmark anyway.
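
One thing that helps a little, if a JDK is available: rather than modifying the program, you can sample the garbage-collected heap of a running JVM with jstat (the interval and sample count below are arbitrary):

# Print GC heap statistics for JVM process <pid> every 10 s, 30 samples.
$ jstat -gc <pid> 10s 30

That still tells you what the JVM has reserved and used internally, though, not what the program strictly needs.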

And disk I/O is really difficult, since for example all I really care about are reads/writes to the network FS, but most tools I've seen that try to track I/O lump that together with cached reads, local disk, and even with pipes (e.g. cat x | gzip > y would measure at least double the I/O that it should).
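
On Linux, /proc/<pid>/io at least separates the two cases, though I do not know how network filesystems are accounted for there:

# rchar/wchar: bytes passed through all read()/write() calls
# (pipes and page-cache hits included).
# read_bytes/write_bytes: bytes that actually hit the storage layer.
$ cat /proc/<pid>/io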

So if anyone has a good universal solution, I'm interested as well.

If you test anything involving disk I/O, clear caches in between tests. On Linux, for instance:

http://unix.stackexchange.com/a/87909
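
The short version, which, if I recall correctly, is what that answer describes (run as root):

# Flush dirty pages to disk, then drop the page cache, dentries, and inodes.
$ sync && echo 3 > /proc/sys/vm/drop_caches

# Under sudo the redirection needs a root shell, e.g.:
$ echo 3 | sudo tee /proc/sys/vm/drop_caches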

Failing to clear caches is a common way for developers to claim (dishonestly) that their binaries run faster, when in fact they don't.

Which memtime do you use? I found three: this very old one; this Git repository, which seems to be a continuation of the first link; and this Git repository, a Perl script.

Yes, benchmarking Java applications is a pain. I recently benchmarked ECC (using GNU time) with -Xmx4g, 8, 16, 32, and -Xmx64g. In all cases the benchmark reported that all memory given to the JVM was used (at least at some point), but running times were very similar. I wonder: will the JVM release some memory in case a competing process needs it? In this particular case it seems 4 GB are enough for ECC and more is just waste, so would the JVM be kind and use just what is necessary?

We use the version at the first link, http://www.update.uu.se/~johanb/memtime/.

As for ECC... it's a bit unusual. Most Java programs will immediately grab virtual memory as specified by -Xmx but only use physical memory as needed. But ECC stores kmer counts in a count-min sketch, which is inefficient to resize dynamically, so that program allocates all possible physical memory immediately (or if using a prefilter, eventually) even if the input is only 1 read. ECC will never run out of memory, rather the accuracy is reduced as the structures become increasingly full, so a higher value of -Xmx allows increased accuracy. As a result, I recommend running it with all available memory. But for a bacterial isolate, -Xmx1g is probably sufficient for good accuracy. High accuracy is important for error-correction, but not very important for normalization, which the tool also does.

Tadpole, on the other hand, also does error-correction. It stores kmer counts exactly, and as a result can run out of memory if the input is too big and -Xmx is too low. But it uses resizable hash tables, so its physical memory consumption grows over time, only as needed.

8.7 years ago
kloetzl ★ 1.1k

I also use GNU time with just a small wrapper script for convenience.

#!/bin/bash
# Run a command N times under GNU time and report mean ± standard
# deviation of wall time (s), peak RSS (kB), and CPU usage (%).

N=10
if [ "$1" = "-n" ]; then
    N=$2
    shift 2
fi

for ((a = 1; a <= N; a++)); do
    (/usr/bin/time -f "%e %M %P" "$@" > /dev/null) 2>&1 | tail -n 1
done | awk '
{ at += $1; t[NR] = $1; as += $2; s[NR] = $2; ap += $3; p[NR] = $3 }
END {
    at /= NR; as /= NR; ap /= NR
    for (i = 1; i <= NR; i++) {
        b = t[i] - at; sdt += b * b
        b = s[i] - as; sds += b * b
        b = p[i] - ap; sdp += b * b
    }
    sdt /= (NR - 1); sds /= (NR - 1); sdp /= (NR - 1)
    print "time (seconds):", at, "±", sqrt(sdt)
    print "mem (kbyte):", as, "±", sqrt(sds)
    print "cpu usage:", ap, "±", sqrt(sdp)
}'
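
Saved as, say, bench.sh (the name and the example command below are arbitrary), it is used like:

$ ./bench.sh -n 5 gzip -k large_file.fastq

which runs the command five times and prints mean ± standard deviation for each metric.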
8.7 years ago
Ying W ★ 4.2k

If you are able to dedicate an entire node to the task, run Nmon for CPU and memory logging.
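
For example, to capture snapshots to a spreadsheet-friendly .nmon file in the current directory (interval and count below are arbitrary):

# -f: write to a file; -s 30: sample every 30 s; -c 240: take 240 samples (2 h).
$ nmon -f -s 30 -c 240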

Disk I/O is very difficult to log; see: https://helgeklein.com/blog/2013/03/the-impossibility-of-measuring-iops-correctly/

8.7 years ago
h.mon 35k

Thanks for all the answers. In addition to the alternatives pointed out above, I found pyrunlim, which also seems a good option. It took me a while to get it to work, as it uses an old psutil API. The psutil API changes are well documented and very simple, though, so I was able to get it working (of course, using an old psutil, e.g. 2.1.1, with the original pyrunlim script will also work).

I am currently testing the alternatives to see which one fits my needs better, and possibly to validate the results against each other.
