Poll: How many people have PyPy?
2
15 months ago
John 12k
Germany

Hello all :)

Background: So I am just at the end of writing a program that runs through a BAM file gathering the statistics the user requested via command line parameters. Since the adoption of pypy in this community will directly affect how I write this program, it would be great to get some feedback on how many people have pypy installed :)

Thank you so much!

I will delete the two answers below once the poll ends to reset my rep counter, because that's a little unfair, and I will post the results below this text.

The results are in! As of 12th March 2016, it appears:

People with    PyPy:  5 
People without PyPy: 11

I'm sure the 'without' count is significantly under-represented since, as Ram very correctly pointed out, some people may not know if they have it or not, and don't want to spend time trying to find out. I would guesstimate that access to pypy is around 30-40%, while usage of pypy is probably lower still. This isn't helped by the fact that there aren't many tools which state "works with pypy!", and perhaps going forward python developers should consider helping this along by using pypy-compatible imports like htspython over pysam, or functions in Numpy that are known to work in Numpypy (a significantly faster Numpy re-written for pypy).

Thank you all for participating - perhaps we can take another snap-shot in a year or two and see if these numbers have changed :)

BUMP for 2017 :) If you answered previously it would be good to know if you changed to/away from PyPy too :)

1

Maybe change the title of the post to match the question being asked. Some people may have PyPy installed but never use it.

0

Good point - changed :)

1

Some thoughts --

  1. How much time does the exec() trick save? Is the function call overhead really significant versus the pure-Python computation being done inside the stats function(s)?

  2. The PyPy 5.0 release notes suggest you can embed PyPy within a C program, if you like. In any case it looks like there are options for effectively bundling PyPy with the code you distribute.

  3. Alternatively, it's possible Numba would let you JIT the stats-collecting function that needs to be fast, without much disruption to the rest of your code.

  4. There might be something in pytoolz or cytoolz that will let you use partial evaluation more elegantly here. I'm not sure exactly how to do that besides your exec() trick, though.
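For readers who haven't seen the exec() trick mentioned in points 1 and 4, here is a minimal sketch of the general idea: the loop body is assembled as a string from only the stats the user asked for, then compiled once with exec(), so there is no per-read function call for each stat. This is an illustration under assumptions, not the code from the program being discussed; `STAT_SNIPPETS`, `build_collector` and the dict-shaped reads are all made up for the example.

```python
# Hypothetical per-read stat expressions; `read` is assumed to be a plain dict
# here, not a pysam or htspython object.
STAT_SNIPPETS = {
    "flag": "read['flag']",
    "rname": "read['rname']",
    "tlen": "read['tlen']",
}


def build_collector(requested):
    """Generate a collector whose loop body has the requested stats inlined."""
    key_expr = ", ".join(STAT_SNIPPETS[name] for name in requested)
    source = (
        "def collect(reads):\n"
        "    counts = {}\n"
        "    for read in reads:\n"
        f"        key = ({key_expr},)\n"
        "        counts[key] = counts.get(key, 0) + 1\n"
        "    return counts\n"
    )
    namespace = {}
    exec(source, namespace)  # compile the generated source into a real function
    return namespace["collect"]


# Made-up reads standing in for records from a BAM file.
reads = [
    {"flag": 99, "rname": "chr1", "tlen": 200},
    {"flag": 99, "rname": "chr1", "tlen": 200},
    {"flag": 147, "rname": "chr2", "tlen": -180},
]
collect = build_collector(["flag", "rname"])
print(collect(reads))
```

Whether this beats ordinary function calls by a useful margin is exactly what question 1 asks, so it is worth timing both versions on real data before committing to generated code.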

1

So I ran this stats program on a 1Gb BAM file, calculating FLAG, RNAME, TLEN and GC% for every read in the file, and sticking it in a dictionary of counts.
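To make "a dictionary of counts" concrete, here is a rough pure-Python sketch of that kind of tally. It is not the actual program: the reads are made-up tuples rather than records from pysam or htspython, and `gc_percent` and `count_stats` are hypothetical names chosen for the example.

```python
from collections import Counter


def gc_percent(seq):
    """Return the GC content of a sequence as a whole-number percentage."""
    if not seq:
        return 0
    gc = sum(1 for base in seq.upper() if base in "GC")
    return round(100 * gc / len(seq))


def count_stats(reads):
    """Count how often each (FLAG, RNAME, TLEN, GC%) combination occurs."""
    counts = Counter()
    for flag, rname, tlen, seq in reads:
        counts[(flag, rname, tlen, gc_percent(seq))] += 1
    return counts


# Made-up records standing in for reads from a BAM file.
reads = [
    (99, "chr1", 200, "ACGT"),
    (99, "chr1", 200, "TGCA"),
    (147, "chr2", -180, "GGCC"),
]
print(count_stats(reads))
```

Only the GC% step involves any arithmetic; the rest is attribute access and dictionary updates, which gives a feel for how little of the per-read work is number crunching.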

The results below are the times with/without pypy, with/without exec(), and with pysam/htspython. Because htspython is written for CFFI (which is a project closely affiliated with pypy), it works in pypy while pysam does not (and probably never will). Both pysam and htspython work in regular CPython though.

[Figure: run times with/without pypy, with/without exec(), and with pysam/htspython. Image not available.]

Even though pypy is clearly faster, this is actually a pretty bad use-case for pypy, because apart from calculating GC%, we're not doing a whole bunch of math that pypy can optimise. Most of these speed benefits are just from interfacing with the C library that reads the BAM file (hts) in a more efficient way. It also looks like htspython - which is still an extremely young project by brentp - could prove to be even faster going forwards, based on the CPython times. If I added more stats to gather for this test, or increased the file size, pypy times would have dropped like a stone relative to the CPython version. However - this should actually be very interesting to anyone who uses python for reading BAM files, because whatever your code is right now under pysam, you can get at least a 2x speedup in reading the file alone using pypy & htspython. And to show I'm not pulling numbers out of the air, here's a screen shot of whatever.

To answer your other questions:

  1. I can't write in C, although I think the future of efficiency in python will be putting your main loop in C. 99% of my code is checking user input, calculating stat dependencies, ordering the stats so they are calculated in the most efficient manner, etc etc. I would hate to do all that fluff in C. But it seems like in the future we'll be able to have our cake and eat it.

  2. I've tried Numba in a variety of different projects and it never works for me. Honestly, I just don't think it works. If pypy is particular about what it can work with, Numba is just impossible. I've tried other similar 'decorate and forget' python things, but they never come close to what they promise. Cython did perhaps, but everything else comes nowhere near Java speeds.

  3. I've never heard of either pytoolz or cytoolz, so I will definitely check them out right now! Thanks :D

Tomorrow morning I will delete the three poll answers below and collect the final scores from this snap-shot survey. Although we had a few pypy users at the end of the day, <50% adoption means that right now, if I write my code for htspython without exec (which is the best bang for your buck because functions as strings are horrible), it will severely discriminate against the users without pypy. So I will probably go with exec(). -sigh-

Thank you everyone that voted so far! It really really helped :)

1

If htspython needs cffi to work, be aware that getting cffi itself installed and working can prove problematic for the average user. So don't assume that regular users will be able to pip install your_package and have it work (this is the sole reason we're not using brentp's bw-python package in deepTools). If your target user base is a bit more advanced then this isn't an issue, but keep it in mind.

0

Yeah, I kind of glossed over the fact that installing hts from inside the pypy virtualenv is the technological equivalent of washing a cat :/ However, if the resultant program takes 1 hour instead of 3 hours, then the user has 2 hours to figure it out before it becomes a waste of time :P But for deepTools, where most things are crazy fast to begin with (like compute matrix is just a few seconds), then you're right - unnecessary complexity.

1

Is washing a cat difficult?

0

Almost my whole day today has involved washing my cat -_-; http://imgur.com/a/QOcif

We tried speeding it up with a blow dryer, but he hid on my partner's head... cats are a pain in the *.

1

You need to post options for people to vote :)

1

Why would I need to have PyPy installed?

0

You don't need it installed, but for a Python tool developer it is very useful to know how many people have access to a PyPy environment. There are a few times where I have thought "oh, I could write this function to be 1 minute in CPython, and 10 seconds in PyPy, or I could write it to be 2 minutes in CPython, and 1 second in PyPy. Which should I optimise for?"

This doesn't happen often, but when it does these sorts of stats can influence one's decision.

0

In that case, can you not just say that PyPy is needed if you want to use your software, and leave it at that?

0

Well, the typical alternative is that I just make an if/else branch that checks the interpreter type. I did this in a previous tool that can use array.array (CPython), Numpy (both, but needs Numpy installed) or CFFI (PyPy) to store the array, whatever the user has installed - but the result was a real mess to maintain. I would develop the code for one branch, say the Numpy version, and then I'd have to figure out post-hoc how to replicate those changes with array.array and CFFI, and so it wasn't fun anymore. I mean, the result is you get the fastest code whatever you have installed - but it's a pain for me. So, eh, if I don't have to worry about supporting CPython because people are moving over to PyPy then that would be great, but I think it's more likely they won't install PyPy unless the tool doesn't work without it, or they already have it installed.
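For anyone curious what that kind of branching looks like, here is a rough sketch, assuming the choice is made once at start-up. It is not the code from the tool mentioned above: `pick_backend` and `make_counts` are hypothetical names, and the CFFI allocation branch is deliberately left out.

```python
import platform
from array import array


def pick_backend():
    """Return the name of the array backend to use on this interpreter."""
    if platform.python_implementation() == "PyPy":
        try:
            import cffi  # noqa: F401  (bundled with PyPy)
            return "cffi"
        except ImportError:
            pass
    try:
        import numpy  # noqa: F401
        return "numpy"
    except ImportError:
        return "array"  # standard library fallback, always available


def make_counts(size, backend):
    """Allocate a zeroed integer array of `size` for the given backend."""
    if backend == "numpy":
        import numpy
        return numpy.zeros(size, dtype="int64")
    if backend == "array":
        return array("q", [0]) * size
    # The CFFI branch is omitted from this sketch.
    raise NotImplementedError(backend)


print("Backend chosen:", pick_backend())
print(make_counts(3, "array"))
```

Every function that touches the array then needs one variant per backend, which is where the maintenance cost comes from.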

TL;DR I'll continue to make software with 1 pathway but it's 66% optimized for regular Python :P

1
12 weeks ago
RamRS 21k
Houston, TX

I do not know if I have pypy installed.

EDIT: Solution: run python -c "import __pypy__" and see if you do or do not see an error.

EDIT2: You might also have it as pypy from bash. If you are put into an interpreter with ">>>>", you have it.
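Note that importing `__pypy__` only tells you whether the `python` command is itself PyPy, not whether PyPy is installed somewhere else on the machine. A small script along these lines could check both; this is a sketch that assumes the executables are named `pypy` or `pypy3` and are on PATH, and the function names are made up.

```python
import platform
import shutil


def running_on_pypy():
    """True if the interpreter executing this script is PyPy."""
    return platform.python_implementation() == "PyPy"


def find_pypy_executables(names=("pypy", "pypy3")):
    """Return the paths of any PyPy executables found on PATH."""
    found = {}
    for name in names:
        path = shutil.which(name)
        if path:
            found[name] = path
    return found


print("Running on PyPy:", running_on_pypy())
print("PyPy on PATH:", find_pypy_executables() or "none found")
```

A PyPy installed under a different name or outside PATH (for example inside a virtualenv that isn't activated) would not be found this way.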

0

Eek - I didn't think of that - thank you!

1

You may want to add something for those of us who use many different resources ('all of the above' or 'it depends'?). To get around it I basically up-voted the two main options. I have pypy installed locally, but a few python builds on our cluster may or may not have it installed. Installing it in user-space is discouraged unless one knows what they are doing.

0

Hm. That's a good question. If you have it installed in some location that isn't where the data is - does that even count? I guess it's useful information for me because people could test the software with/without pypy and decide if installing it on the cluster is worth it. They would, in my mind, count as having pypy.

0

Does python not provide a way to install all dependencies if not already installed?

0

Not pypy, since it replaces python altogether :(

0

Ah. I should make my answer "I do not know what pypy is" then :)

11
15 months ago
John 12k
Germany

I do not have pypy installed

2

Not only do I not have it (nor is it on any of the servers at the institute), I wouldn't be surprised if a lot of the python extensions I write these days wouldn't work with pypy (for the same reasons as pysam not working with pypy).

Anyway, if performance is that important, why not just write parts in C or something else with a python front end?

1

Yeah, that's the downside of pypy, there aren't many bioinformatics programs yet that are compatible with it - however it speeds things up by 3-4x (particularly if you swap out pysam for htspython) so maybe we should be thinking about providing compatibility going forward. Looking at the poll numbers so far, I think forcing people to use pypy is out of the question - I have to support both. I will tidy up this poll in an hour or so, then post some stats about PyPy vs CPython times, and the way the modules would look with-and-without dynamic inlining via exec(). But it's ok. I'm glad I asked first :)

1

python -c "import __pypy__" gives an error "No module named __pypy__"

1

...But I used to have it. Initially I got excited by the idea of python code with the speed of C. In practice I found the speed advantage was either not relevant or not that much. Since this came at the cost of having packages misbehaving and managing both python and pypy (quoting from memory, I can't be more precise, sorry), I'm not using it any more. But I'm happy to be proved wrong!

P.S. I didn't know the string + exec() trick!

6
15 months ago
John 12k
Germany

I do have pypy installed
