Question

Automated method to translate DrugBank Biological Entity (BE) drug target IDs to Entrez ID format?

0

6.6 years ago

EverInEarnest ▴ 40

I have a set of drug target "Biological Entity" (BE) IDs from DrugBank in the following format:

BE0000048 BE0000767 BE0001529 ...

I am having difficulty identifying an automated way to translate these IDs to Entrez ID format. Each BE identifier is associated with a drug target on the DrugBank site, and each has an associated UniProt ID listed on the DrugBank site (e.g. https://www.drugbank.ca/biodb/bio_entities/BE0000048 ), which is very helpful, as I can subsequently translate from UniProt to Entrez format (e.g. https://support.bioconductor.org/p/71702/ ). However, so far, I have not located any automated way to translate these BE identifiers to UniProt format, so that they can subsequently be translated to the ultimate desired Entrez format.

I have reviewed a number of resources and publications on tools that convert between ID formats, but those tools seem to convert only one form of gene ID to another form of gene ID. My desired conversion from BE format to UniProt ID is not supported on any of the platforms I have reviewed. For example, with the UniProt site's tool ( http://www.uniprot.org/uploadlists/ ), when I specify DrugBank as the input field, I think a "DB"-formatted input is expected, as my "BE" inputs yield the message that no results were found.

One possibility is that I might need to do web scraping to extract the UniProt ID from each DrugBank page corresponding to each BE identifier, but if there is an existing platform to do this conversion so that web scraping isn't necessary, that would be very helpful.

I will greatly appreciate any advice about how I can automate the process of converting from each BE identifier to UniProt format. Thanks in advance.

conversion • 2.7k views

ADD COMMENT • link updated 6.6 years ago by Sparrow_kop ▴ 260 • written 6.6 years ago by EverInEarnest ▴ 40

score 1 · Answer 1 · 2017-09-05

1

6.6 years ago

Sparrow_kop ▴ 260

Hi. If you have access to full_database.xml from DrugBank, it contains all the info you need, including the "BE identifier" and the "UniProt" ID. So you can parse the XML and match the target tags to get the info you want, instead of web scraping.

ADD COMMENT • link 6.6 years ago by Sparrow_kop ▴ 260

0

Thanks for your response, Sparrow_kop. I will note that I am interested in the drug target data from the DrugBank database, and as far as I've seen, the only ID provided for each target is in the XML path <targets> --> <target> --> <id>, which only provides the BE-form identifier, and no UniProt ID... So, I remain with the problem that I have only the BE-form IDs for my targets, and ultimately want to convert to Entrez format...

Another interesting observation is that, despite the extensive list of descriptions provided by DrugBank for the source of each field ( https://www.drugbank.ca/documentation#drug-cards ), I do not see the Biological Entity (BE) format mentioned anywhere... This documentation lists UniProt, GenBank, and PDB ID formats as being available for Target info, but in reality, after I have extracted the <targets> --> <target> --> <id> content for each target, this info seems only to ever be in BE format.

ADD REPLY • link 6.6 years ago by EverInEarnest ▴ 40

1

Well, in my own experience with the DrugBank database, the drug's DB id and its target gene symbol have been enough for me, so I have never used the BE id. That said, although the BE id is not mentioned at your URL, it does exist in the XML file, and it is followed by the UniProt ID. For example, in full_database.xml you can see:

    <target position="3">
      <id>BE0000717</id>
      <name>Urokinase plasminogen activator surface receptor</name>
      ....
      ....
      <polypeptide id="Q03405" source="Swiss-Prot">

The "<id>BE0000717</id>" and "<polypeptide id="Q03405" source="Swiss-Prot">" elements are what you need. So if you really want to extract each BE id and its mapped UniProt ID, you can parse the XML; all the info you need is in it. Try it!

ADD REPLY • link 6.6 years ago by Sparrow_kop ▴ 260

0

Thank you, Sparrow_kop! That is extremely useful. I am going to revise my code based on your clarifications and provide updates when available.

ADD REPLY • link 6.6 years ago by EverInEarnest ▴ 40

0

Update: I had also previously inquired with the DrugBank staff to resolve my questions, and I was provided with this informative response:

"Each target in the XML file contains 1 or more Polypeptide entries. Each of these entries contains an ID which is the UniProt ID. A target can contain more than one polypeptide, which is why we use the concept of a BioEntity, which is used by DrugBank only.

"Keep in mind a target in DrugBank is not always a polypeptide/UniProt entry. It can be a small molecule (anti-toxins for example), DNA, etc. If you want just the UniProt IDs I would consider looking at the data exports here: https://www.drugbank.ca/releases/latest#protein-identifiers.

"Additionally, I would not go down the web-scraping path, we are about to release a new version of the website with a completely different HTML layout."
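For anyone following along, here is a minimal Python sketch of the remaining step, going from BE id to Entrez via UniProt. It takes the response above into account: a BE id may map to several UniProt IDs or to none. It assumes you already have a dict from BE id to a list of UniProt accessions, however you built it. It also assumes a tab-separated UniProt-to-Entrez table with three columns (accession, ID type, ID) in which Entrez rows carry the type "GeneID". I believe that is how UniProt's downloadable idmapping.dat is laid out, but please verify against your own file. The function names are hypothetical.

```python
def load_uniprot_to_entrez(lines, id_type="GeneID"):
    """Read 'accession<TAB>type<TAB>id' rows and keep those of `id_type`.

    `lines` is any iterable of text lines, e.g. an open file.
    The three-column layout and the "GeneID" label are assumptions.
    """
    table = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 3 and fields[1] == id_type:
            table.setdefault(fields[0], []).append(fields[2])
    return table


def be_to_entrez(be_map, uniprot_to_entrez):
    """Join BE id -> UniProt accessions with UniProt -> Entrez ids.

    Every BE id is kept. Non-protein targets, or proteins without an
    Entrez row, end up with an empty list instead of being dropped.
    """
    result = {}
    for be_id, accessions in be_map.items():
        genes = []
        for acc in accessions:
            for gene in uniprot_to_entrez.get(acc, []):
                if gene not in genes:  # keep order, skip duplicates
                    genes.append(gene)
        result[be_id] = genes
    return result
```

Keeping the empty lists makes it easy to list the BE ids that could not be converted, which matters because not every DrugBank target is a polypeptide.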

ADD REPLY • link 6.6 years ago by EverInEarnest ▴ 40