Question

Describe Your Architecture: Uniprot

11.1 years ago

Pierre Lindenbaum 161k

This question was inspired by a thread on Stack Overflow regarding BerkeleyDB-JE:

Inserting data into BerkeleyDB-JE is getting slower and slower

and a conversation on twitter with Jerven@uniprot:

https://twitter.com/jervenbolleman/status/306041881517772800

I'm currently inserting a large amount of data (dbSNP) into a BDB and I'd like to deploy it on a web server (GlassFish) to implement some web services. The structure of UniProt has been briefly described in:

Infrastructure for the life sciences: design and implementation of the UniProt website

I think Biostar would be a nice place to ask Jerven some specific questions about the implementation of UniProt. Here are my questions :-)

Web container: which one do you use? Tomcat?
Do you use a specific framework to wrap BerkeleyDB (something like JPA)?
How do you manage the BDB Environment in the web container? I would open it in a ContextListener and put it in the web context, but is that safe for multi-threaded access?
Do you use a 'high-availability' environment?
In the web context: is your BDB Environment read-only? Do you use transactions? Do you use a copy of your database for web access, or do people at UniProt use the same BDB Environment for working/inserting/updating?
How many databases does your BDB Environment contain? Do you open all of those databases on startup?
Do you handle your data like a SQL database (column-oriented?) or do you put complex structures in the DatabaseEntry (like a whole XML record)? UML doesn't work for NoSQL, so how do you write the documentation about the schema for those complex objects?
Did you write specific JSP tags for BDB, or did you use a known JSP library for BDB?
REST services: did you use Jersey (JAX-RS) or did you implement your own web services?
In the paper you said SOAP was a bad idea ... " necessitated introducing limitations such as the maximum number of entries that could be retrieved in one go"... What's the difference with a REST solution? I mean, you can stream the result with REST but, as far as I know, you could also do the same with SOAP:
Entry bindings: do you use things like @Annotation for BDB-JE or did you write your own EntryBinding<T>?
Does BDB sometimes break? Do you sometimes have to rebuild the whole BDB? How long does it take? How much space do you need?
Any regrets about your current architecture?
....

Thank you :-)

Pierre

uniprot java database • 4.3k views

ADD COMMENT • link updated 6.5 years ago by navela78 ▴ 70 • written 11.1 years ago by Pierre Lindenbaum 161k

score 11 · Answer 1 · 2013-02-25

Some background before I start answering the questions.

UniProt is a consortium database maintained by three partners: SIB, EBI and PIR. All consortium partners want to host part of the website. The UniProt data is mostly read-only on the public website because of the 4-weekly release cycle. Only job data (BLAST etc.) is read-write.

Currently all partners use Linux (CentOS or Red Hat) to run their servers, but a few years ago we also had to deal with Solaris. EBI runs a proprietary load-balancing solution while SIB and PIR use Apache 2+ with mod_proxy. Behind this front end we run a total of 6 Tomcat servers: 4 at EBI (2 per datacentre, per policy more than need) and 1 each at SIB and PIR, all on the latest security release of Java. We use DNS round robin to cope with a datacentre disappearing, and load balancing behind it (you need to have both). At PIR and SIB data is stored on ext3 partitions, while at EBI it's accessed via NFS on Isilon nodes.

BDB/je is accessed directly; we just have a thin wrapper around it. We use EntryBindings to write our in-memory Java objects directly to Berkeley DB key-value pairs, using fixed-length keys whenever possible. We also gzip-compress the records, which more than doubles the effective cache size for BDB/je and is great for IO.
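To make the fixed-length key and gzip-compressed record idea concrete, here is a minimal sketch using only the Java standard library. It is not UniProt's code: the class name GzipRecordCodec, the choice of a long id as the key, and the use of a plain String as the record are all assumptions for illustration. In a real setup this logic would sit inside a BDB/je entry binding rather than in a standalone helper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper (not from the post): produces the byte arrays that a
// key-value store would hold as key and value.
final class GzipRecordCodec {

    // Fixed-length key: every id becomes exactly 8 big-endian bytes.
    static byte[] encodeKey(long id) {
        return ByteBuffer.allocate(Long.BYTES).putLong(id).array();
    }

    static long decodeKey(byte[] key) {
        return ByteBuffer.wrap(key).getLong();
    }

    // Value: the record text as UTF-8, gzip-compressed before it is stored.
    static byte[] compress(String record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(record.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reverse of compress(): gunzip the stored bytes and decode them as UTF-8.
    static String decompress(byte[] stored) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(stored))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the compressed bytes are what ends up in the cache, a smaller value means more records fit in the same amount of memory, at the cost of a decompression step on every read.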

Search is provided by the great Lucene library, which gives us full-text search capabilities.

All state is injected from an XML file into the different Struts actions using the Spring dependency injection framework, which in practice is injection via the web context. This is fine, as BDB/je is good for concurrent access, as is our search engine Lucene.

We do not use the high-availability options for BDB/je; we just have 7-8 independent copies of the data, one per machine datacentre. Job data such as BLAST results is shared on demand via HTTP requests between mirrors, which is OK as users rarely access data older than one hour.

We do not access BDB/je in the JSPs; we have a wrapper object, ResultList, which combines a Lucene query result with a BDB/je result iterator.
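A wrapper of that kind could be sketched roughly as follows. This is only an illustration of the idea, not the actual ResultList: the class name ResultListSketch is made up, a plain List of keys stands in for the Lucene hits, and a Function stands in for the key-value lookup, so that the example needs nothing beyond the Java standard library.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: pairs an ordered list of search hits (keys) with a
// record lookup, and fetches each record only when the caller reaches it.
final class ResultListSketch<K, V> implements Iterable<V> {
    private final List<K> hitKeys;               // stand-in for search engine hits
    private final Function<K, V> recordLookup;   // stand-in for a key-value store get

    ResultListSketch(List<K> hitKeys, Function<K, V> recordLookup) {
        this.hitKeys = hitKeys;
        this.recordLookup = recordLookup;
    }

    // Number of hits, available without loading any record.
    int size() {
        return hitKeys.size();
    }

    @Override
    public Iterator<V> iterator() {
        Iterator<K> keys = hitKeys.iterator();
        return new Iterator<V>() {
            @Override
            public boolean hasNext() {
                return keys.hasNext();
            }

            @Override
            public V next() {
                // The lookup happens here, one record per call, in hit order.
                return recordLookup.apply(keys.next());
            }
        };
    }
}
```

A view layer such as a JSP can then loop over the wrapper without knowing anything about the search index or the store behind it.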

We implemented our own web services, mostly because of the overloading of such services for many formats, i.e. UniProt flat file, XML, RDF, GFF, FASTA, or just IDs or columns.

SOAP or REST is more of a complexity trade-off. And to be correct, the choice really is between a simple HTTP interface and SOAP. Most developers in bioinformatics prefer simple HTTP access over SOAP methods. (Also, I did not write that paper.)

We had some issues with BDB in the 3.0 time frame years ago, but no, we don't have issues with BDB/je now, as long as the storage array behaves.

Storage needs for release 2013_02 are almost 80 GB for BDB/je and 145 GB for the Lucene indexes (version 2.9).

No, I am very happy with the architecture today. It has been with us with minimal issues for more than 5 years now and is still very performant. We need to upgrade our Lucene code for some improved performance, but otherwise we are very happy with how it has kept up with the data explosion over time. Also, we have a measured uptime of more than 99.9%, which is not trivial with so many datacentres and users. Beyond that, our Struts code could use an update as well.

The website as it stands is only suitable for finding entries; if you want to do analytical queries, I recommend using our SPARQL endpoint. This is an interface that exposes the UniProt data via a standard query language that is very suitable for deep queries over our data model.
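For readers who have not used a SPARQL endpoint before, the sketch below shows how a query can be turned into an HTTP GET request in Java. Several things here are assumptions rather than facts from this answer: the endpoint address, the example query and the class name SparqlRequestSketch are illustrative only, so check the current UniProt documentation for the real address and data model. The one fixed point is the SPARQL protocol convention of passing the query text, URL-encoded, in a parameter named query.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: builds the GET URI for a SPARQL query without sending it.
final class SparqlRequestSketch {

    // Assumed endpoint address; it is not given in the post, so verify it before use.
    static final String ASSUMED_ENDPOINT = "https://sparql.uniprot.org/sparql";

    // Illustrative query: list a handful of resources typed as up:Protein.
    static final String EXAMPLE_QUERY =
        "PREFIX up: <http://purl.uniprot.org/core/>\n"
      + "SELECT ?protein WHERE { ?protein a up:Protein } LIMIT 5";

    // SPARQL-over-HTTP GET form: the URL-encoded query goes in the "query" parameter.
    static URI buildQueryUri(String endpoint, String sparql) {
        return URI.create(endpoint + "?query=" + URLEncoder.encode(sparql, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Prints the request URI; any HTTP client can then fetch it.
        System.out.println(buildQueryUri(ASSUMED_ENDPOINT, EXAMPLE_QUERY));
    }
}
```

The resulting URI can be fetched with any HTTP client, typically with an Accept header that asks for the result format you want.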

score 1 · Answer 2 · 2017-10-09

6.5 years ago

navela78 ▴ 70

Hello, I am wondering what the UniProt architecture looks like now, after 5 years of further data explosion. Are you still using Berkeley DB and Lucene?

Sorry, I am new to this area... I am also wondering how you store the sequence data: do you load it into Berkeley DB or maintain it as flat files? How do you provide the sequence when a user wants to download it?

Do you think any other new technologies will help in sequence data management?

ADD COMMENT • link 6.5 years ago by navela78 ▴ 70

Navella, I can't edit my answer. But maybe ask it as a new question? I.e. take the original question, edit it and ask it again. Then I can answer it for 2017.

ADD REPLY • link 6.5 years ago by me ▴ 750

Hello, I have posted a new question:

Describe Your Architecture: Uniprot (2017)

ADD REPLY • link 6.5 years ago by navela78 ▴ 70