Merging Data From Pfam, PDB And UniProt
10
6
12.8 years ago
Aurobhima ▴ 100

Hi,

Does anyone know of a method to map between Pfam, PDB and UniProt? I have very specific criteria I want to select data with, and this requires a combination of these three databases.

I have been working on a solution for some time now on my own, but would like to know if anyone else has been doing something like this and if they'd be interested in discussing this with me.

Thanks

pdb uniprot • 8.7k views
8
12.8 years ago

Residue-level cross reference data based on PDB is available via SIFTS annotations.

Please check the following files at SIFTS Quick Access:

pdb_chain_uniprot.lst - A summary of the PDBe to UniProt residue level mapping, showing the start and end residues of the mapping using SEQRES, PDB sequence and UniProt numbering.

pdb_chain_pfam.lst - A summary of the Pfam domain identifier(s) (derived via the UniProt mapping) for each PDB chain that has been processed.

You can combine the two files, using an identifier they share (for example the PDB chain or the UniProt accession) to map to the others. This is the best cross-reference for PDB-UniProt-Pfam I could find. I am using this in my analysis.
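As a rough illustration of combining the two SIFTS summaries, the sketch below merges them on PDB ID and chain using only the Python standard library. The column names (PDB, CHAIN, SP_PRIMARY, PFAM_ID), the tab delimiter and the sample rows are assumptions made for this example; check the header of the files you actually download, since the layout may differ.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for the two SIFTS summary files. The column names and
# tab-separated layout are assumptions; check the real file headers.
PDB_CHAIN_UNIPROT = (
    "PDB\tCHAIN\tSP_PRIMARY\tSP_BEG\tSP_END\n"
    "1abc\tA\tP00001\t1\t150\n"
    "1abc\tB\tP00002\t10\t90\n"
    "2xyz\tA\tP00001\t20\t140\n"
)
PDB_CHAIN_PFAM = (
    "PDB\tCHAIN\tSP_PRIMARY\tPFAM_ID\n"
    "1abc\tA\tP00001\tPF00001\n"
    "1abc\tA\tP00001\tPF00002\n"
    "2xyz\tA\tP00001\tPF00001\n"
)


def read_rows(handle):
    """Return a DictReader over the data rows, skipping '#' comment lines."""
    lines = (line for line in handle if not line.startswith("#"))
    return csv.DictReader(lines, delimiter="\t")


def merge_sifts(uniprot_handle, pfam_handle):
    """Map (pdb_id, chain) -> {'uniprot': set, 'pfam': set}."""
    merged = defaultdict(lambda: {"uniprot": set(), "pfam": set()})
    for row in read_rows(uniprot_handle):
        merged[(row["PDB"], row["CHAIN"])]["uniprot"].add(row["SP_PRIMARY"])
    for row in read_rows(pfam_handle):
        merged[(row["PDB"], row["CHAIN"])]["pfam"].add(row["PFAM_ID"])
    return dict(merged)


mapping = merge_sifts(io.StringIO(PDB_CHAIN_UNIPROT), io.StringIO(PDB_CHAIN_PFAM))
for (pdb_id, chain), refs in sorted(mapping.items()):
    print(pdb_id, chain, sorted(refs["uniprot"]), sorted(refs["pfam"]))
```

Keying on the (PDB ID, chain) pair rather than the PDB ID alone matters because different chains of one structure can belong to different UniProt entries.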

2

What kind of issues?

0

Thanks. We did try it before and found that there are some issues with it, which is why we went our own way. Still, it is the closest I've seen to what I'm looking for.

4
1

Bio2RDF UniProt data is unfortunately very much out of date :(

2
12.8 years ago

It's all there on UniProt in "Cross-references", e.g. see this entry for NMB1681. The data is also available in the export formats, e.g. text format.
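For the text export, the cross-references sit on `DR` lines, so a few lines of Python can pull out the Pfam and PDB identifiers per entry. The sketch below assumes the usual flat-file layout (`AC` lines for accessions, `DR   <database>; <identifier>; ...` lines for cross-references, `//` closing each entry); the sample record is invented for illustration.

```python
from collections import defaultdict

# Invented record in the UniProt flat-file layout (an assumption for this sketch).
SAMPLE = """\
ID   EXAMPLE_TEST            Reviewed;         100 AA.
AC   P00001; Q00002;
DR   PDB; 1ABC; X-ray; 2.00 A; A=1-100.
DR   Pfam; PF00001; Example_dom; 1.
//
"""


def parse_cross_refs(lines, wanted=("PDB", "Pfam")):
    """Map primary accession -> {database: [identifiers]} for selected databases."""
    result = {}
    accessions, refs = [], defaultdict(list)
    for line in lines:
        # The two-letter line code is separated from the content by three spaces.
        code, _, body = line.rstrip("\n").partition("   ")
        if code == "AC":
            accessions += [a.strip() for a in body.split(";") if a.strip()]
        elif code == "DR":
            fields = [f.strip() for f in body.split(";")]
            if len(fields) >= 2 and fields[0] in wanted:
                refs[fields[0]].append(fields[1])
        elif code == "//":
            # End of entry: store it under the first (primary) accession.
            if accessions:
                result[accessions[0]] = dict(refs)
            accessions, refs = [], defaultdict(list)
    return result


cross_refs = parse_cross_refs(SAMPLE.splitlines())
print(cross_refs)
```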

4

Example of an inconsistency?

0

I have these data, but there are inconsistencies in the cross-references between the three databases. I wish it were that straightforward.

1
12.8 years ago

You might want to have a look at our BridgeDB, which was developed to help you solve questions like this. See: http://www.bridgedb.org

1

Thanks, I'll have a look into it. It could be useful.

0
12.8 years ago

The file ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz seems to contain all the IDs.

curl  -s "ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz" |\
gunzip -c |\
grep -Ei '(accession|pfam|pdb)'

  (...)
  <accession>P0C9E9</accession>
  <accession>P0C9K3</accession>
  <dbReference id="PF01639" key="10" type="Pfam">
  <accession>P0C9I4</accession>
  <dbReference id="PF01639" key="10" type="Pfam">
  (...)
  <property type="PDB accession" value="1KMH"/>
  (...)
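
If grepping gets unwieldy, the same identifiers can be grouped per entry with a streaming XML parser. This is only a sketch: it assumes each `<entry>` holds `<accession>` elements plus `<dbReference type="Pfam">` and `<dbReference type="PDB">` elements carrying the identifier in `id`, and the embedded sample document is made up. For the real file, a `gzip.open(...)` handle could be passed instead of the `BytesIO` object.

```python
import io
import xml.etree.ElementTree as ET

# Made-up miniature of the UniProt XML layout (structure assumed for this sketch).
SAMPLE_XML = b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P00001</accession>
    <accession>Q00002</accession>
    <dbReference type="Pfam" id="PF00001"/>
    <dbReference type="PDB" id="1ABC"/>
  </entry>
  <entry>
    <accession>P00003</accession>
    <dbReference type="Pfam" id="PF00001"/>
  </entry>
</uniprot>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_cross_refs(handle):
    """Yield (primary_accession, pfam_ids, pdb_ids) for each <entry>."""
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if local_name(elem.tag) != "entry":
            continue
        accessions, pfam, pdb = [], [], []
        for child in elem:
            name = local_name(child.tag)
            if name == "accession":
                accessions.append(child.text)
            elif name == "dbReference" and child.get("type") == "Pfam":
                pfam.append(child.get("id"))
            elif name == "dbReference" and child.get("type") == "PDB":
                pdb.append(child.get("id"))
        if accessions:
            yield accessions[0], pfam, pdb
        elem.clear()  # drop the processed entry; the full file is large


records = list(iter_cross_refs(io.BytesIO(SAMPLE_XML)))
for accession, pfam_ids, pdb_ids in records:
    print(accession, pfam_ids, pdb_ids)
```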
0

There is also similar data in Pfam, and the UniProt ID appears in the header of PDB files, but they don't play nicely with each other.

0
12.8 years ago
Nabellaleen ▴ 10

There are also modern solutions using data-crossing software (e.g. http://www.isoft.fr/bio/biopack_data_en.htm). It definitely fills me with despair to see people "reinvent the wheel" for the main, but only the first, step of their work: data access and mining...

0

Thanks, I'll have a look. I'm not sure I'm reinventing the wheel, though: I have yet to find something that comes close to what I'm trying to do. I need to apply very specific selection criteria, e.g. all Pfam domains that are present only in non-membrane mitochondrial proteins, or which protein structures are found exclusively in the extracellular space in eukaryotes. If I am reinventing the wheel, I'd be really happy to use the existing one. :-)
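
Once the three sources are merged into one record per protein, criteria like these reduce to set operations. The sketch below is hypothetical throughout: the record fields (`location`, `membrane`, `pfam`) and the toy proteins are invented, and real location annotation would have to come from UniProt or a similar source.

```python
# Toy merged records; field names and values are invented for illustration.
PROTEINS = [
    {"acc": "A1", "location": "mitochondrion", "membrane": False, "pfam": {"PF1", "PF2"}},
    {"acc": "A2", "location": "mitochondrion", "membrane": True, "pfam": {"PF2", "PF3"}},
    {"acc": "A3", "location": "cytoplasm", "membrane": False, "pfam": {"PF4"}},
    {"acc": "A4", "location": "mitochondrion", "membrane": False, "pfam": {"PF1", "PF5"}},
]


def domains_exclusive_to(proteins, predicate):
    """Return Pfam IDs found in proteins matching predicate and in no others."""
    inside, outside = set(), set()
    for protein in proteins:
        target = inside if predicate(protein) else outside
        target.update(protein["pfam"])
    return inside - outside


def is_soluble_mitochondrial(protein):
    """Selection criterion: mitochondrial and not membrane-bound."""
    return protein["location"] == "mitochondrion" and not protein["membrane"]


exclusive = domains_exclusive_to(PROTEINS, is_soluble_mitochondrial)
print(sorted(exclusive))
```

The "only present in" part of the criterion is the set difference: a domain seen in any protein outside the selection is dropped, even if it also occurs inside it.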

0

It sounds like you should be able to build a query to answer that using SRS. Or at most a couple of queries!

0

In fact, software exists that lets you easily import, read, parse, filter and cross data with total control over all parameters. With this type of software you can build a pipeline for your needs, or for many other needs, in a few days. And when I say "reinvent the wheel", I don't mean your specific analysis but redesigning a script each time with only minor changes, at a large cost in time :)

0
12.8 years ago
Jerven ▴ 660

On uniprot.org, use "customize display" in the UniProt entry view.

Or use the mapping service: http://www.uniprot.org/uniprot/?tab=mapping.

If you want to discuss the way UniProt maps to PDBe (not as straightforward as you might think), contact help@uniprot.org. Pfam comes directly out of the InterPro results, and there should not be much skew between these databases.

0
12.8 years ago
Iain ▴ 260

You could try using the SRS service at the EBI.

http://srs.ebi.ac.uk/

This service links many databases with each other.

There is a tutorial available: http://www.embl.de/~seqanal/courses/srscourse/srstut.html

An example taken directly from this tutorial: the query "enzyme < pdb" gives all the enzyme database entries for which the 3D structure is known!

0
12.8 years ago

I am planning to use a hash/checksum of the protein sequences to cross-link UniProt to the other databases.

SEquence Globally Unique IDentifier (SEGUID) is a hashing standard (based on SHA1) - it was specifically developed for uniquely identifying protein sequences.

See also: our PostgreSQL sequence-to-seguid implementation http://dba.stackexchange.com/questions/66/biological-sequences-of-uniprot-in-postgresql
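
A minimal sketch of the checksum idea, assuming the usual SEGUID definition: the SHA-1 digest of the upper-cased sequence, Base64-encoded with the trailing `=` padding removed. Biopython offers an equivalent in `Bio.SeqUtils.CheckSum.seguid`; the version below uses only the standard library, and the `index_by_seguid` helper and the sample identifiers and sequences are hypothetical.

```python
import base64
import hashlib
from collections import defaultdict


def seguid(sequence):
    """SHA-1 of the upper-cased sequence, Base64-encoded without '=' padding."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def index_by_seguid(records):
    """Group (database, identifier, sequence) records that share a sequence."""
    index = defaultdict(list)
    for database, identifier, sequence in records:
        index[seguid(sequence)].append((database, identifier))
    return dict(index)


# Invented identifiers and sequences, for illustration only.
records = [
    ("UniProt", "P00001", "MKTAYIAKQR"),
    ("PDB", "1ABC_A", "mktayiakqr"),   # same sequence, different case
    ("PDB", "2XYZ_A", "MKTAYIAKQK"),   # differs in the last residue
]
index = index_by_seguid(records)
for key, members in index.items():
    print(key, members)
```

Exact hashing only links identical sequences, so a PDB chain that covers part of a UniProt entry, or carries a mutation or tag, would not match and needs an alignment-based step instead.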

0

The sequence cross-referencing tool at the EBI might save you some time: http://www.ebi.ac.uk/Tools/picr/

0

Get in touch and let's see if we can merge my approach with yours. I think your idea has real potential.

0
8.0 years ago

This is a very old thread, but I still have not found a good way to do PDB-to-UniProt residue mapping that doesn't rely on a web server that may not be up to date. SIFTS may be the way to go, but it has a complicated data structure. Below is a simple, self-contained Biopython function that relies on an on-the-fly sequence alignment to determine the residue mapping. There may be more elegant ways to script this, but the following works.

from Bio.PDB import PPBuilder
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.PDBList import PDBList
from Bio import pairwise2
from Bio import SeqIO

def resmap(chain, uniprot_sequence):
    """Return a dictionary mapping PDB residue numbers to UniProt positions."""

    # Concatenate the sequence of every polypeptide fragment in the chain.
    ppb = PPBuilder()
    polypeptides = ppb.build_peptides(chain)
    pdb_sequence = ""
    for polypeptide in polypeptides:
        pdb_sequence = pdb_sequence + str(polypeptide.get_sequence())
    # Residue numbers of the non-hetero residues, in ascending order. This
    # assumes they line up one-to-one with the residues PPBuilder returned.
    pdb_res_nums = sorted(res.id[1] for res in chain if res.id[0] == " ")

    # Global alignment: match 2, mismatch -1, gap open -0.5, gap extend -0.1.
    alignments = pairwise2.align.globalms(str(uniprot_sequence), pdb_sequence, 2, -1, -.5, -.1)
    uniprot_align = str(alignments[0][0])
    pdb_align     = str(alignments[0][1])

    uniprot_map = []
    count = 0
    for residue in uniprot_align:
        if residue != "-":
            count += 1
            uniprot_map.append(count)
        else:
            uniprot_map.append(-1)

    pdb_map = []
    count = -1
    for residue in pdb_align:
        if residue != "-":
            count += 1
            pdb_map.append(pdb_res_nums[count])
        else:
            pdb_map.append(-1)

    matches = []
    for index, residue in enumerate(uniprot_map):
        if uniprot_align[index] == pdb_align[index] and uniprot_align[index] != "-" and pdb_align[index] != "-":
            matches.append(True)
        else:
            matches.append(False)

    mapping = {}
    for index, match in enumerate(matches):
        if match:
            mapping[pdb_map[index]] = uniprot_map[index]

    return mapping