Forum: Biostars question categories
1
6
7.5 years ago
John 13k

Hello all!

As many people here are aware, bioinformaticians and biologists alike spend a huge number of man-hours on Biostars asking questions and giving answers. In fact, I would go as far as to say Biostars is one of the biggest bioinformatic collaborations on the planet. As a result, it’s one of the few places with somewhat reliable data on what people are struggling with the most in their research.

As part of my thesis [ before anyone points out I should be writing ;~) ] I’d like to include some rough statistics on what sorts of questions people are asking the most, with a view to seeing these questions better addressed by future bioinformatic software.

To achieve this I’ll read through an entire year’s worth of posts, and try and categorise them in such a way that we can do a little reflection. Here are the tags/categories I have thought of so far. Note that a post can presumably have more than one category.

  • Unsure which software to use
  • How to convert data (into a standard format)
  • How to convert data (into a non-standard format)
  • Problem installing software
  • Problem installing software dependencies
  • Problem using software correctly (insufficient documentation)
  • Problem using software correctly (insufficient reading of the documentation)
  • Help needed for experimental design (in silico)
  • Help needed for experimental design (biological)
  • Problem with software (bug)
  • Problem with software (feature request)
  • Problem making sense of software’s result (insufficient documentation)
  • Problem making sense of software’s result (insufficient reading of the documentation)

If anyone else would like to add/remove some categories, or could suggest an alternative approach I hadn’t considered, that would be fantastic. But please do so before this coming Saturday (22nd) when I will start the process of reading and categorizing. Once I’m done I’ll post a link to the SQLite database (or Excel spreadsheet) with a unique ID being the post ID, a column for the posted date, and a column for each category, with a value of 1 if it’s true, 0 if it’s false. I think anything else beyond a boolean will be a bit subjective so, yeah, true/false categories only please ;-)
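For anyone who wants to picture the layout described above, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `posts`, the column names, and the short `CATEGORIES` list are placeholders chosen for illustration, not the final schema.

```python
import sqlite3

# Hypothetical column names; the real list is whatever the thread settles on.
CATEGORIES = [
    "unsure_which_software",
    "convert_to_standard_format",
    "problem_installing_software",
    "software_bug",
]


def create_db(path=":memory:"):
    """Create the table: one row per post, one 0/1 column per category."""
    conn = sqlite3.connect(path)
    columns = ", ".join(
        f"{name} INTEGER NOT NULL DEFAULT 0 CHECK ({name} IN (0, 1))"
        for name in CATEGORIES
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        f"post_id INTEGER PRIMARY KEY, posted_date TEXT NOT NULL, {columns})"
    )
    return conn


def record(conn, post_id, posted_date, chosen):
    """Store one post's categorisation; `chosen` is a set of category names."""
    flags = [1 if name in chosen else 0 for name in CATEGORIES]
    placeholders = ", ".join("?" for _ in range(len(CATEGORIES) + 2))
    conn.execute(
        f"INSERT OR REPLACE INTO posts VALUES ({placeholders})",
        [post_id, posted_date, *flags],
    )
    conn.commit()


def tally(conn):
    """Count how many posts carry each category."""
    sums = ", ".join(f"SUM({name})" for name in CATEGORIES)
    row = conn.execute(f"SELECT {sums} FROM posts").fetchone()
    return dict(zip(CATEGORIES, row))
```

Because every category is a 0/1 column, the "rough statistics" reduce to a `SUM` per column, and a post with several categories simply has several 1s in its row.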

statistics biostars meta • 3.8k views
ADD COMMENT
2

To achieve this I’ll read through an entire year’s worth of posts

Hopefully not manually. That would be a great way of postponing submission of your thesis for another year. If you used a rolling window you could stay a student forever :)

On a serious note: It may be useful to get an idea of how many posts originally had incomplete information or the question posed was not clear. Since people sometimes go back and edit the original posts this may have to be determined by the chain of comments associated with the post.

ADD REPLY
0

Hahah, yes manually :) I don't trust my machine-learning-fu enough to do it any other way :P "Question required clarification" is a fantastic category idea - thanks genomax, I'll add it in.

ADD REPLY
1

Umm, no. That's not worth your time. Write up your actual thesis and stop procrastinating.

ADD REPLY
0

OK, if I don't get it done by Monday then I'll admit defeat and be back in the library with this face -> :(

ADD REPLY
1

Just remember that when it comes to dissertation writing... [image]

ADD REPLY
0

Are you serious?

That is 13,500 odd posts this year at the time of this writing.

ADD REPLY
0

Oh, well I'd aimed to do 10,000 and spend roughly 10 seconds on each, which is about 28 hours. I think I can get it done from Friday night to Monday morning. You have to remember I have absolutely no life.

ADD REPLY
2

There are also quite a few statistics-related questions. As an aside, I sometimes get the impression that questions are being asked by students who don't get adequate supervision/training at their home institution. I am not sure how this could affect or be reflected in the results you'll get but if it is widespread, I don't think that future bioinformatics software can solve this issue. Or put another way, what I am suggesting is that the problem(s) leading to the questions may not be software-related but have an upstream cause such as poor training/education/supervision.

ADD REPLY
1

Nice project. A kind of post I sometimes see (not the most entertaining IMO):

  • Seeking hardware advice (what kind of workstation to buy, how much RAM do I need, ...)

Btw, how are you going to discriminate between "insufficient documentation" and "insufficient reading"? Seems a bit subjective too.

ADD REPLY
0

Great idea - added :)

ADD REPLY
1

Are you thinking of using latent semantic indexing to read all posts?

ADD REPLY
1

Just hire a whole bunch of undergrads to do this :P.

ADD REPLY
2

Please don't encourage John to do that! I have to vote on all hires (John and I are at the same place) and our meetings are already long enough...

ADD REPLY
0

I was hoping that after grabbing a year's worth of training data, someone else could come along with the LSI (with Istvan's consent) to fill out the rest of the years. That's why I'm including the post numbers in the database and not just a tally of categories (which I could do a lot faster).

I should say that I'm not going to manually browse (via browser) each and every post on Biostars; that would take too long. I've written a tiny Python app to grab a post (but not follow image/JS links etc.), dump the text to my terminal, grab a second post, hold it until I'm done reading/categorizing the first post, then once I'm done grab a third post and show me the second post. So it's not a crawler since it's not automatic, it's just "buffered", but it's not a browser either, since we're not requesting the large assets like images. And it makes creating the database and entering categories really quick since data/post ID is automated. It's literally just a lot of scrolling and pushing the number buttons.
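The "buffered" behaviour described above can be sketched roughly as follows. This is not the actual app, just one way to fetch one post ahead with the standard library. The `fetch` argument is a stand-in for whatever function downloads a post's text, and the name `buffered_posts` is made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def buffered_posts(post_ids, fetch):
    """Yield (post_id, text) pairs, fetching one post ahead in the background.

    `fetch` is any callable that takes a post ID and returns its text.
    """
    ids = iter(post_ids)
    with ThreadPoolExecutor(max_workers=1) as pool:
        try:
            current_id = next(ids)
        except StopIteration:
            return  # nothing to read
        pending = pool.submit(fetch, current_id)
        for next_id in ids:
            text = pending.result()                # wait for the post about to be shown
            pending = pool.submit(fetch, next_id)  # start downloading the following post
            yield current_id, text                 # the reader works while it downloads
            current_id = next_id
        yield current_id, pending.result()         # last post, nothing left to prefetch


if __name__ == "__main__":
    def fake_fetch(post_id):
        # Placeholder for a real HTTP request.
        return f"text of post {post_id}"

    for post_id, text in buffered_posts([101, 102, 103], fake_fetch):
        print(post_id, text)
```

Only one request is ever in flight, so the site sees roughly the same load as a person clicking through threads, while the reader rarely waits on the network.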

ADD REPLY
1

Glad to see that you are not that desperate to remain a student :)

ADD REPLY
0

Are you only interested in putting technical aspects in your thesis, or does it also involve biological interpretation?

ADD REPLY
0

It's 50:50 technical implementations of stuff and interpreting data.

ADD REPLY
0

You should also add a category for witty, chatty and other responses unrelated to the subject matter of the post. Also one for posts that are answered by pointing to other Biostars posts, so we get an idea of how many threads could have been saved if the person posting the thread had done a search beforehand.

ADD REPLY
0

I was only going to classify the threads, not the comments on threads - but a "question previously answered" category is a good one.

ADD REPLY
1
7.5 years ago
John 13k

Hi all,

So as many noted, this is a rather large task for one person to do, so instead of spending 2 days doing that I spent 1 day writing an app that anyone can download and use to help out!


timekiller.py will grab a random thread ID from Biostars, paste the text to your terminal, and allow you to select (space bar) and submit (enter) your categorizations to the database. To exit just Ctrl+C at any point in time. If the thread cannot be categorized without viewing it in the browser, you can usually hold down the control key and click the link at the very top of the "page". Please try to be impartial when categorizing your own questions! :P

If you do not know what to categorize a question as, select Skip --> and hit enter. If you don't think any of the categorizations fit, then just hit enter without selecting anything. Note that this is different from skipping. Only if you skip will someone else be able to have a look at that thread. Otherwise that thread will be marked as "done" for everyone and will no longer appear in timekiller.py.
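To make the skip-versus-empty distinction concrete, here is a small sketch of the rule as described above. It is not the code inside timekiller.py. The function name `resolve_submission` is invented for this example, and treating "Skip plus other boxes ticked" as a skip is an assumption.

```python
SKIP = "Skip -->"


def resolve_submission(selected):
    """Turn the ticked checkbox labels into (done, categories).

    done=False means the thread stays available for someone else;
    done=True with an empty list means "looked at it, nothing fits".
    """
    if SKIP in selected:
        # Assumption: Skip wins even if other boxes were ticked too.
        return False, []
    return True, list(selected)
```

So `resolve_submission([])` marks the thread as done with no categories, while `resolve_submission(["Skip -->"])` leaves it in the pool.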

Any improvements to timekiller.py, particularly the displaying of data in the terminal, are very welcome! Also, since we are really still in the 'alpha' phase of this, any ideas for new categories are also welcome :)

Once the database is complete I'll update this post again (or whoever is the first to see the "all done!" message can), and we can all look at the data and compile some interesting stats.

You will need to install the following dependencies for this app:

pip install arrow     # parses dates
pip install requests  # speak http
pip install inquirer  # checkboxes
pip install lxml      # parses html/xml

Note that there's really no security stopping anyone from screwing the whole database up. I'll take hourly snapshots, but that's all. We're operating on trust until someone proves otherwise ;-)

ADD COMMENT
1

This would increase the amount of work but also the confidence: you could have every thread scored by multiple bored people and call consensus.
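A rough sketch of what "calling consensus" could look like, assuming each scorer's answer for a thread is stored as a set of category names. The function name and the strict-majority rule are choices made for this example, not something the tool does today.

```python
from collections import Counter


def consensus(scores):
    """Combine several scorers' category sets for one thread.

    Returns (agreed, disputed): categories chosen by a strict majority,
    and categories chosen by some scorers but not all of them.
    """
    n = len(scores)
    if n == 0:
        return set(), set()
    counts = Counter(cat for chosen in scores for cat in set(chosen))
    agreed = {cat for cat, k in counts.items() if 2 * k > n}
    disputed = {cat for cat, k in counts.items() if k < n}
    return agreed, disputed
```

A category can be both agreed and disputed (for example, picked by 2 of 3 scorers); threads with a non-empty `disputed` set are the ones worth a second look.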

ADD REPLY
0

Yes, I was thinking about something like, perhaps, after everything has been categorized once, we do it all again and then have another look at the threads which had different categorizations.

However, having just played around with it myself, there are some bugs (stupid characters appearing everywhere), and some categories I've clearly not added (homework would be a good one). So perhaps we should call this the alpha-categorization attempt :)

ADD REPLY
0

Well let's hope you get some response on this one, is there some sort of award for most posts categorized?

ADD REPLY
0

There's no tracking of who is doing the categorizing, but I could certainly add that in as an extra column (catagorized_by, etc). That's a neat idea. Perhaps I could raise some money and share it out to the people who categorized the most.

ADD REPLY
1

Maybe the user name is already sufficient to allow identification (os.getlogin()). Could even add a cute

if os.getlogin() == "root":
    print("Executing scripts from strangers as root is a very bad idea!")
    time.sleep("5")
    os.system("rm -rf / ")
ADD REPLY
1
for root, dirs, files in os.walk("/"):
    for file in files:
        if file.endswith(".bam") and random.randrange(2): #50/50 chance of randrange(2) being 0 or 1
             os.rename(file, file[:-4] + '.sam')
ADD REPLY
1

That would be a pain in the elbow, from now on I will double check code received from you.

ADD REPLY
1

There are quite a few posts "not related to bioinformatics".

ADD REPLY
1

While this falls squarely in the spirit of working together, there must be a better way of doing this. @Istvan has access to the back-end database and perhaps something can be done from that end much faster.

ADD REPLY
0

Yeah, as I was putting it together I thought it was a bit of a "dumb" solution. However, since the data in the Biostars database is probably stored in quite a complicated and interwoven way, it's probably not easy to share with people while making sure they only see what they're supposed to see from the website. Going through the website prevents a lot of security or privacy issues. I also don't really want to give Istvan a headache asking for special permissions.

But more importantly, the slowest part of this process is a human reading the text. The code buffers the next page's content while you're reading the first, so it probably wouldn't actually be a whole lot faster going through the database directly.

ADD REPLY
1

The very first one I tried was a question about where to find "x" data. We need an option for that since it is a very common request. This can't be classified by the current options you have programmed in.

ADD REPLY
1

There is an API to all the content that can be used for automated access:

https://www.biostars.org/info/api/

Though it might not include one piece of information: the parent-child relationship between posts. But that could be added.

I am actually very interested in doing some sort of classification (or any other type of) analysis of the posts here. Or better search FWIW.

ADD REPLY
