Trying To Understand A Simple Perl Vs Python Vs Sed Benchmark
12.0 years ago
Travis ★ 2.8k

Hi,

I ran a simple routine (read a file, replace spaces with forward slashes) in Perl, Python and sed to get an idea of which is fastest.

EDIT: The relevance to bioinformatics is that I want to write a quick file parser that will process paired-end read files from an Illumina MiSeq and replace the spaces between read names and read numbers with forward slashes for downstream compatibility with seqtk, i.e. @readA 1 and @readA 2 would become @readA/1 and @readA/2.
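To make the intended transformation concrete, here is a minimal Python sketch. It assumes every FASTQ record spans exactly four lines (no wrapped sequences) and that only the first space in the header line should change. The names fix_header and convert_fastq are made up for illustration and are not part of any tool mentioned in this thread.

```python
def fix_header(line):
    """Replace only the first space in a FASTQ header line with '/'."""
    return line.replace(" ", "/", 1)


def convert_fastq(lines):
    """Yield FASTQ lines, rewriting the header (every 4th line, starting at 0)."""
    for i, line in enumerate(lines):
        # Assumption: strict 4-line records, so headers sit at index 0, 4, 8, ...
        yield fix_header(line) if i % 4 == 0 else line


# Hypothetical single-record input.
record = ["@readA 1\n", "ACGT\n", "+\n", "IIII\n"]
print("".join(convert_fastq(record)), end="")
```

Restricting the change to header lines is a design choice: quality lines can legitimately start with @, so keying on position rather than on the first character avoids touching them.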

I would have expected Python and Perl to perform similarly, but Python performed much worse and I am trying to understand why.

The sed version was simply sed -e 's/ /\//' in.txt > out.txt and took approximately 3 seconds for 3 million lines.

The perl version was perl -e 'open(INFILE, "./in.txt") or die; while(<INFILE>) { $_=~s/ /\//;print $_;}' > out.txt and took approximately 5 seconds for the same number of lines.

The Python version was:

python3.2 -c 'import re
for line in open("./in.txt"):
    print(re.sub(r" ", "/", line), end="")' > out.txt

and took 67 seconds to process the same file.

I am new to Python but would have assumed the Python approach was pretty efficient - can anyone suggest what the issue is?
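One detail worth checking when comparing the three commands: the sed and Perl substitutions above have no g flag, so they replace only the first space on each line, while re.sub without a count replaces every space. For read names with a single space the output is the same, but the work done is not strictly identical. A small sketch of the difference, using a made-up sample line:

```python
import re

line = "@readA 1 extra field\n"  # hypothetical sample line

# Default behaviour: every space is replaced.
all_spaces = re.sub(r" ", "/", line)

# count=1 mirrors a substitution without the g flag: only the first space changes.
first_only_re = re.sub(r" ", "/", line, count=1)

# The same first-only replacement without the regex engine.
first_only_str = line.replace(" ", "/", 1)

print(all_spaces, end="")      # @readA/1/extra/field
print(first_only_re, end="")   # @readA/1 extra field
print(first_only_str, end="")  # @readA/1 extra field
```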

perl python • 12k views

Perl is better for regular expressions than python. Can you try line.replace(" ", "/") instead?


That brought the Python time down to 11 seconds! I had thought about using a native Python substitution command, but every tutorial I looked at used the re module!


You could possibly speed it up more by doing changedText = open("./in.txt").read().replace(" ", "/"); print(changedText, end="", file=open("out.txt", "w")), although I'm not sure about the python 3.* syntax.


9 seconds. I think I'll avoid Python in this case!


Perl has a faster regex engine. Even when you optimize both Perl and Python, Perl is usually faster, sometimes by a lot. BTW, the better way to use Perl is perl -pe 's/ /\//;' in.txt.


Thanks Heng. Are there particular problems that Python performs better on?


That is a different question and should not be discussed in the comments, look at my question: http://www.biostars.org/post/show/13972/when-to-choose-python-over-perl-and-vice-versa/#13984 and if that is not enough info, ask a new question.


Python is faster than Perl on most other tasks according to a couple of benchmarks, though only by a thin margin overall.


A suggestion: do not use just this to make a choice. Sure, there are plenty of good reasons to choose Perl (or whatever language), but a difference that is within the same order of magnitude is probably not one of them.


That depends on how you feel about time. To me, one week is unbearable if there are not-so-complex solutions to finish the task in one day.


If something is taking a week in python that takes a day in perl, there is probably something wrong with the python script.


In the reply to tiagoantao, I just intended to say that a factor of 5 in speed is a huge difference. I am not arguing Perl vs. Python at all.


For speed with Python, this might speed things up (if you could test it and report back, that would be cool):

lines = open("./in.txt").readlines() #Note: readlineS not readline
print lines.replace(" ", "/"),

lines is a list which doesn't have a replace method


Sorry

lines = open("./in.txt").readlines()

print "".join(map(lambda x:x.replace(" ", "/"),lines)),

This does a single read IO. One wonders if most of the time is spent there. Trading memory for time...


I'm running on a virtual machine with limited memory, so this particular approach ground the machine to a halt!


This is a basic programming question more suited to stackexchange.com. Please state the relevance to a bioinformatics research problem.


Question edited to state relevance


Not that it matters, but you can simplify your Perl one-liner: perl -e 'while(<>) { $_=~s/ /\//;print $_;}' in.text > out.text. The <> operator reads from the files named in @ARGV, or from stdin if none are given. In fact, you can reduce the loop body to { s/ /\//; print; }, since both the substitution and print operate on the $_ variable when given no explicit argument.
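For comparison, the closest Python counterpart to Perl's <> is probably the standard fileinput module. The sketch below is only an illustration of that idea: the function name substitute and the temporary file contents are invented here, and the demonstration passes the file list explicitly so that it does not depend on the command line.

```python
import fileinput
import os
import tempfile


def substitute(paths):
    """Yield lines from the given files with the first space replaced by '/'."""
    # fileinput.input(files=...) chains the named files into one line stream.
    # Called with no arguments it falls back to sys.argv[1:] or standard input,
    # which is the behaviour closest to Perl's <>.
    with fileinput.input(files=paths) as stream:
        for line in stream:
            yield line.replace(" ", "/", 1)


# Demonstration on a throwaway file (hypothetical content).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("@readA 1\n@readA 2\n")

result = list(substitute([tmp.name]))
os.remove(tmp.name)
print("".join(result), end="")
```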


More importantly, if this tool is going to be doing anything more complicated than a regex replace, you should do it in the language you're comfortable with so that you can debug, extend, etc more easily, especially since you're now down to 5 vs 11 seconds.

12.0 years ago
Niek De Klein ★ 2.6k

So you can accept the answer (if it was satisfactory enough):

Perl is better for regular expressions than python. line.replace(" ", "/") is faster than re.sub.

You could possibly speed it up more by doing changedText = open("./in.txt").read().replace(" ", "/"); print(changedText, end="", file=open("out.txt", "w")), although I'm not sure about the python 3.* syntax.


You can slightly improve the Python script's performance by compiling the regex object before the loop:

import re
space_re = re.compile(" ")
for line in open("./in.txt"):
    print(space_re.sub("/", line), end="")

12.0 years ago

You can use the replace method of a string to speed things up considerably:

$ python -m timeit -s "import re" "re.sub(' ', '/', 'some string goes here')"
100000 loops, best of 3: 2.94 usec per loop

$ python -m timeit -s "import re; p=re.compile(' ')" "p.sub('/', 'some string goes here')"
1000000 loops, best of 3: 1.66 usec per loop

$ python -m timeit -s "import re"  "'some string goes here'.replace(' ', '/')"
1000000 loops, best of 3: 0.33 usec per loop
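The same comparison can be scripted with the timeit module, which makes it easy to confirm that the three variants produce identical output before timing them. This is a rough sketch rather than a rigorous benchmark: the loop count and repeat count are arbitrary choices, and the numbers will differ by machine and Python version.

```python
import re
import timeit

text = "some string goes here"
pattern = re.compile(" ")

candidates = {
    "re.sub": lambda: re.sub(" ", "/", text),
    "compiled pattern": lambda: pattern.sub("/", text),
    "str.replace": lambda: text.replace(" ", "/"),
}

# All three must agree before their timings are worth comparing.
outputs = {name: func() for name, func in candidates.items()}
assert len(set(outputs.values())) == 1

for name, func in candidates.items():
    # Best of 3 repeats of 100000 calls, reported per call in microseconds.
    best = min(timeit.repeat(func, number=100000, repeat=3))
    print(f"{name:17s} {best / 100000 * 1e6:.2f} usec per call")
```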
12.0 years ago
Marvin ▴ 890

tr ' ' /

You're dragging in three enormous infrastructures for a simple problem, then ask why one of them has lower overhead for the simple problem. Moreover, most of the time is probably spent in I/O, not in actual processing. What's the point?
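For readers who want the tr behaviour from inside Python, a byte-level translation table is the nearest equivalent. The sketch below is an illustration under assumptions, not a measured improvement: the function name translate_stream and the chunk size are invented here, and like tr it replaces every space, not just the first one per line.

```python
import io

# Map the space byte to '/' and leave every other byte unchanged.
SPACE_TO_SLASH = bytes.maketrans(b" ", b"/")


def translate_stream(src, dst, chunk_size=1 << 16):
    """Copy src to dst in fixed-size chunks, translating spaces to slashes."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty bytes object means end of input
            break
        dst.write(chunk.translate(SPACE_TO_SLASH))


# Demonstration with in-memory streams standing in for real files.
source = io.BytesIO(b"@readA 1\nACGT\n+\nIIII\n")
target = io.BytesIO()
translate_stream(source, target)
print(target.getvalue().decode("ascii"), end="")
```

With real files, src and dst would be opened in binary mode ("rb" and "wb"), which also keeps memory use bounded by the chunk size.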


Amusingly (and unexpectedly), your solution is about twice as slow as the best performer so far ... goes to show that one can almost never tell where a program spends its time:

ialbert@porthos ~/work/test
$ time tr ' ' / < one.fq > tmp

real    0m6.437s
user    0m4.925s
sys     0m0.311s

ialbert@porthos ~/work/test
$ time sed -e 's/ /\//' one.fq > tmp

real    0m5.808s
user    0m0.911s
sys     0m0.331s

ialbert@porthos ~/work/test
$ time perl -e 'open(INFILE, "./one.fq") or die; while(<INFILE>) { $_=~s/ /\//;print $_;}' > tmp

real    0m3.305s
user    0m1.550s
sys     0m0.315s

ialbert@porthos ~/work/test
$ time python -c 'for line in file("one.fq"): print line.replace(" ", "/")' > tmp

real    0m3.815s
user    0m3.284s
sys     0m0.317s

Here are the numbers on my machine (tr is the fastest):

tr ":" " " -- 1.5s
sed "s,:, ,g" -- 5.7s
perl -pe 'tr/:/ /' -- 4.1s
perl -pe 's/:/ /g' -- 5.7s
python -c 'for line in file("1.fq"): print line.replace(":", " ")' -- 6.0s

You may try to change "LC_ALL", though this has no effect on my side.


I ran my benchmarks on Mac OS X, and even after rerunning them today it seems that tr runs a lot slower on that platform than on Linux. On Linux I can reproduce your observation.


Yes, I was running on Linux. Perhaps Mac comes with a crappy "tr".


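You can probably improve the times in all languages by not reading the file line by line:

import sys

# Read in fixed-size chunks rather than line by line.
fh = open('t.fq')
while True:
    chunk = fh.read(16384)
    if not chunk:  # an empty string means end of file
        break
    sys.stdout.write(chunk.replace(" ", "/"))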

Firstly there is more to the problem than I stated in my original question. Secondly, the solution needs to be useable by people with no bioinformatics training.
