Biomart Very Slow
13.9 years ago

I'm using the biomaRt package in R to pull down Ensembl annotations for a set of Affy probe IDs. The features seem nice, but the service is very slow, taking 10 minutes or so for a single query (of no more than a dozen probes), and frequently timing out completely:

library('biomaRt')

# connect to Ensembl BioMart and select the human gene dataset
ensembl <- useMart("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)

# probeList is a character vector of Affy HuEx probe IDs
probePos <- getBM(attributes = c("affy_huex_1_0_st_v2",
                                 "hgnc_symbol",
                                 "chromosome_name",
                                 "start_position",
                                 "end_position",
                                 "strand"),
                  filters = "affy_huex_1_0_st_v2",
                  values = probeList,
                  mart = ensembl)

The slow responses have persisted for a week or two, and don't seem to depend on the time of day or anything like that. If I add a few more attributes, I get timeouts:

Request to BioMart web service failed. Verify if you are still connected to the
internet.  Alternatively the BioMart web service is temporarily down.

So, my questions are:

  • Is this a common problem for others?
  • Is it likely that the server is overloaded, or are my queries just too big? If the former, are there any mirrors available?
  • If the latter, the next step is setting up a local server, I guess. Anyone have experience (good or bad) with that? Is it worth the hassle?
Comment:

Argh, just in case anyone is reading today: I'm getting serious timeouts as well! Yesterday it seemed really flaky, too. It's a shame, because it's such a good resource!

Comment:

As I posted below, I sort of solved the problem by breaking my queries up into very small chunks, especially on the fields that return a lot of data. It's a pain to set up the first time, but once you automate it, doing it piecemeal isn't so bad.
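
For reference, a minimal sketch of that chunking approach, assuming probeList is a character vector of probe IDs and ensembl is the mart object from the question (the chunk size of 100 is arbitrary):

# split the probe IDs into chunks, query each chunk separately,
# and combine the per-chunk results into one data frame
chunkSize <- 100
chunks <- split(probeList, ceiling(seq_along(probeList) / chunkSize))
probePos <- do.call(rbind, lapply(chunks, function(chunk) {
    getBM(attributes = c("affy_huex_1_0_st_v2", "hgnc_symbol",
                         "chromosome_name", "start_position",
                         "end_position", "strand"),
          filters = "affy_huex_1_0_st_v2",
          values = chunk,
          mart = ensembl)
}))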

Answer • 13.9 years ago • Neilfws 49k
  1. I'd say it is an occasional problem, not a common one. I've experienced it a couple of times in the several months that I've been using biomaRt regularly.
  2. Queries of a dozen probes or so are unlikely to be too big. I recently retrieved Affymetrix exon probesets for every HGNC symbol (~30,000 queries) in well under 10 minutes; see the sketch after this list.
  3. I'm unsure whether there are mirrors for the latest release. I know that it is possible to specify different servers as arguments to useMart(). For example, to use the NCBI36 build, according to this mailing list thread:

    mart <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl",
        host = "may2009.archive.ensembl.org", path = "/biomart/martservice",
        archive = FALSE)
    
  4. I have no experience with setting up a local server, and I don't know anyone else who does. It involves many, many Perl modules and, I suspect, is rather difficult.
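
Purely for illustration, a hedged sketch of the kind of bulk query described in point 2, assuming the ensembl mart object from the question and a hypothetical character vector hgnc_symbols holding the ~30,000 symbols:

# hgnc_symbols: hypothetical character vector of ~30,000 HGNC symbols
allProbesets <- getBM(attributes = c("hgnc_symbol", "affy_huex_1_0_st_v2"),
                      filters = "hgnc_symbol",
                      values = hgnc_symbols,
                      mart = ensembl)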

Comment:

Hrmmm, thanks for the info. After some more mucking around this afternoon, I'm finding that many short queries seem to work better than a single large query. This is completely contrary to what I'd expect, but I can make it work for now.

Answer • 13.9 years ago • Chris Fields ★ 2.2k

I remember something about very large queries taking a long time, so you may run into timeouts on the server. The workaround, which seems reasonable, is simply to install the mart locally if possible (faster because it is local, no timeout issues, etc.). Ensembl, UniProt, and others make their mart data freely available; see the link above.

This is of course assuming biomaRt works with a local database, but I would be very surprised if it doesn't.
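
For what it's worth, biomaRt can in principle be pointed at a non-default server, since older versions of useMart() accept host, path, and port arguments. A purely hypothetical sketch, with placeholder host, port, and mart name that would depend entirely on the local installation:

# hypothetical: query a locally hosted BioMart instance instead of the
# public server; host, port, and mart name are placeholders
localMart <- useMart("ENSEMBL_MART_ENSEMBL",
                     dataset = "hsapiens_gene_ensembl",
                     host = "localhost",
                     port = 9000,
                     path = "/biomart/martservice")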

Answer • 5.4 years ago • dvitsios ▴ 30

The biomaRt R package is indeed really slow, especially for large queries.

Try using the Ensembl REST API instead, which is much faster (relative to biomaRt, at least) and more robust:

https://rest.ensembl.org/

You can batch your requests into chunks of data, or send individual IDs one at a time.

For example, to retrieve the full annotation for a list of GO IDs using Python 3, you can iterate through each GO ID:

import requests

server = "https://rest.ensembl.org"
ext_prefix = "/ontology/id/"

def get_go_term_by_id(go_id):
    # fetch the full ontology record for a single GO ID as JSON
    ext = ext_prefix + go_id + "?content-type=application/json"
    r = requests.get(server + ext, headers={"Content-Type": "application/json"})
    r.raise_for_status()  # raises an exception on any HTTP error status

    decoded = r.json()
    print(repr(decoded))

go_ids = ['GO:0006958', 'GO:0031902', 'GO:0050776']
for go_id in go_ids:
    get_go_term_by_id(go_id)

Each individual call is served in under a second, or at most a couple of seconds, so you won't run into any timeout issues.

You can also wrap each call in a try/except block, so that if an individual call fails the script continues processing the rest as normal.

Simple multi-threading with map and Pool in Python 3

For real speed, you can submit multiple calls to the REST API using multiple threads in Python (e.g. with 50 threads):

import requests
from multiprocessing.dummy import Pool as ThreadPool  # thread-based pool

num_threads = 50
pool = ThreadPool(num_threads)

server = "https://rest.ensembl.org"
ext_prefix = "/ontology/id/"

go_id_terms_dict = {}

def get_go_term_by_id(go_id):
    # fetch the term name for one GO ID; on any failure, record an empty
    # string and keep going so one bad call doesn't kill the whole batch
    try:
        ext = ext_prefix + go_id + "?content-type=application/json"
        r = requests.get(server + ext,
                         headers={"Content-Type": "application/json"})
        r.raise_for_status()
        go_term = r.json()['name']
    except Exception:
        go_term = ''
        print('[Warning] Could not fetch GO term for ID:', go_id)

    go_id_terms_dict[go_id] = go_term
    print(go_id + '||' + go_term)

# all_human_go_ids: your list with GO IDs
pool.map(get_go_term_by_id, all_human_go_ids)
pool.close()
pool.join()