How To Search PDB For Sequence Regions That Are Resolved? (Edited)
2
9.4 years ago
Lyco ♦ 2.3k
@Lyco1881

PDB offers a downloadable FASTA file containing one sequence per individual chain. This file is very useful, and I use it routinely to check whether there is an available structure related to my sequences of interest.

Every now and then, a sequence search finds a decent hit, but when I download the structure, I notice that the sequence region matching my query is not present in the file. I am not much of a structure expert, so I can't give a definitive explanation for this. I assume that the downloadable FASTA sequence is the construct the scientists used in their experiments and that the missing parts are the bits that were not resolved. Other explanations may also apply. What I am interested in is a FASTA file corresponding to the part of PDB for which there is an actual structure (with coordinates and everything).

Edit: Here is one example: the PDB file 2H9E (chain C) is reported to have the sequence:

KATMQCGENEKYDSCGSKECDKKCKYDGVEEEDDEEPNVPCLVRVCHQDCVCEEGFYRNK
DDKCVSAEDCELDNMDFIYPGTRN


The SIFTS annotation, as proposed by Khader, maps this to UniProt entry Q16938, residues 8-91 (which results in the same sequence as that shown above). So far, so good. However, when I look at the structure (or even try to use it for homology modeling), I notice that the structure starts at CGE... (without the first 5 residues) and ends with ...PGTR (without the last residue). Moreover, some residues are missing in the middle of the structure. I completely understand _WHY_ they are missing, but it would be nice if there were a way to extract the _usable_ sequence, ideally on a database-wide scale, because this would allow me to run searches and look for the optimal template (maybe there is a better-resolved structure). Failing that, it would be nice to extract the usable sequence from a given PDB entry. It is possible that the 'joy' package recommended by Khader does just that, but it may be used only by people in academia. No joy for me.

While we're at it: is anybody aware of a simple tool for renumbering the residues in PDB coordinate files? The authors don't appear to be consistent in their residue numbering: sometimes the numbers start at 1, sometimes they refer to the whole protein (and not just the part present in the structure). These two questions are related in a way, as I would like to work with coordinates that agree with the FASTA file of the sequence.

pdb structure database sequence • 3.2k views
0

Do I understand you correctly that you have a sequence and you would like to have the structure of a similar sequence? Why don't you BLAST against PDB?

3
9.6 years ago

Protein structures are solved using X-ray crystallography or NMR-based techniques. Due to experimental limitations, it may not be possible to clone or crystallize the entire protein as represented in UniProt; this is one of the reasons for the discrepancy between the sequence derived from the structure and the actual protein sequence. Another reason could be that the experimental group is only interested in conserved domains or regions of the protein. Finally, even for the part that was crystallized, there is a good chance that coordinates cannot be assigned to all residues; this is the primary reason for missing residues in the structures.
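
To see how large this discrepancy is for a given entry, one option is to compare the construct sequence declared in the SEQRES records with the residues that actually have ATOM coordinates. The following is only a minimal sketch: the function names are made up for illustration, it assumes the legacy fixed-column PDB file format, and it counts modified residues stored as HETATM (for example selenomethionine) as unresolved.

```python
# Hypothetical helper names; assumes the legacy fixed-column PDB format.
THREE_TO_ONE = dict(zip(
    "ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL".split(),
    "ARNDCQEGHILKMFPSTWYV"))


def seqres_sequences(pdb_text):
    """Return {chain: sequence} as declared in the SEQRES records."""
    seqs = {}
    for line in pdb_text.splitlines():
        if line.startswith("SEQRES"):
            chain = line[11]                   # column 12: chain ID
            for res in line[19:].split():      # columns 20 onward: residue names
                seqs.setdefault(chain, []).append(THREE_TO_ONE.get(res, "X"))
    return {c: "".join(s) for c, s in seqs.items()}


def observed_counts(pdb_text):
    """Return {chain: number of residues with at least one ATOM record}."""
    seen = {}
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break                              # first model only (NMR entries)
        if line.startswith("ATOM"):
            # residue number plus insertion code identifies a residue
            seen.setdefault(line[21], set()).add(line[22:27])
    return {c: len(r) for c, r in seen.items()}


def unresolved_counts(pdb_text):
    """Return {chain: SEQRES length minus number of observed residues}."""
    observed = observed_counts(pdb_text)
    return {c: len(s) - observed.get(c, 0)
            for c, s in seqres_sequences(pdb_text).items()}
```

Calling `unresolved_counts(open("entry.pdb").read())` on a downloaded file (the file name is a placeholder) gives a quick per-chain estimate of how many residues of the construct have no coordinates.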

I am not sure if this data is readily available. But there are several ways to generate it.

1. You can use the SIFTS-based mapping. This is not a FASTA file, but the curated SIFTS annotation provides a convenient mapping of PDBe residues to UniProt residues. Download the SIFTS annotation file here.

2. If you are only interested in a couple of PDB files, you may download them and extract the sequence yourself using custom scripts or programs like _atm2ali_ (part of the JOY package, download here). Also see a similar post on extracting the sequence from structural coordinates.

3. If you are looking for a BLAST-compatible database, you can use the pdbaa.fasta file maintained at NCBI (download the file: pdbaa.tar.gz; binary format). I remember that NCBI used to maintain a raw FASTA version of this file, but I am not able to locate it at the moment. I am not sure if this is a direct copy of the FASTA file maintained by RCSB; you would need to check this with the PDB/NCBI helpdesk.

I would go with option 2 if you are looking for this data to perform in-depth sequence-structure analysis. That makes sure the sequence data really comes from the structure and contains no residues that are present in the UniProt FASTA but missing in the actual structure.
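
As a rough illustration of what such a custom script for option 2 could look like, here is a minimal sketch that builds the sequence of each chain from the ATOM records only. It is not atm2ali and makes several assumptions: legacy fixed-column PDB format, first model only, non-standard residues written as X, and hypothetical function names.

```python
# Hypothetical helper names; assumes the legacy fixed-column PDB format.
THREE_TO_ONE = dict(zip(
    "ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL".split(),
    "ARNDCQEGHILKMFPSTWYV"))


def observed_sequences(pdb_text):
    """Return {chain: sequence} using only residues that have ATOM coordinates."""
    residues = {}
    seen = set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break                              # first model only (NMR entries)
        if not line.startswith("ATOM"):
            continue
        chain = line[21]                       # column 22: chain ID
        key = (chain, line[22:27])             # residue number + insertion code
        if key in seen:
            continue                           # further atoms of a residue already counted
        seen.add(key)
        name = line[17:20].strip()             # columns 18-20: residue name
        residues.setdefault(chain, []).append(THREE_TO_ONE.get(name, "X"))
    return {c: "".join(r) for c, r in residues.items()}


def to_fasta(entry_id, sequences):
    """Format {chain: sequence} as FASTA text with headers like >2H9E_C."""
    return "".join(">{}_{}\n{}\n".format(entry_id, chain, seq)
                   for chain, seq in sorted(sequences.items()))
```

Running this over a local copy of the PDB files and concatenating the `to_fasta` output would give a searchable FASTA file of resolved residues only. Note that internal gaps are simply closed up in this sketch, so a hit may span a break in the chain.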

Regarding re-numbering: I have explained my take on this in a previous question here.
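
For readers who just want to see what renumbering involves at the file level, below is a minimal sketch, not the approach from the linked question. It assumes the legacy fixed-column PDB format, uses a hypothetical function name, renumbers one chain consecutively, and drops insertion codes. Records that refer to residue numbers elsewhere in the file (for example HELIX, SHEET, SSBOND or REMARK 465) are not updated.

```python
RECORDS = ("ATOM", "HETATM", "ANISOU", "TER")


def renumber_chain(pdb_text, chain_id, start=1):
    """Renumber the residues of one chain consecutively, beginning at `start`.

    Hypothetical helper: gaps in the original numbering are closed up and
    insertion codes are discarded.
    """
    mapping = {}                                   # old number + iCode -> new number
    out = []
    for line in pdb_text.splitlines():
        if line.startswith(RECORDS) and len(line) >= 26 and line[21] == chain_id:
            key = line[22:27].ljust(5)             # columns 23-27, padded for short TER lines
            if key not in mapping:
                mapping[key] = start + len(mapping)
            # write the new number into columns 23-26 and blank the insertion code
            line = line[:22] + "{:4d} ".format(mapping[key]) + line[27:]
        out.append(line)
    return "\n".join(out)
```

With `start=1` the numbering matches a FASTA file of the observed residues; passing a different `start` keeps the first residue at its position in the full-length protein instead.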

EDIT: I would like to add the pdbaa service that is part of the PISCES server. Dunbrack's group provides pre-compiled data as pdbaa / pdbaaent / pdbaanr. You can download the data here. I have no experience with this data, but it could be useful for your purpose. I would recommend checking with the authors about how they deal with missing residues.

1

Numbering might also start at positions > 1 when a propeptide is cleaved before the structure was determined.

0

Yes Michael that's an important point. You can see such examples in the SIFTS mapping file.

0

Dear Khader, great suggestions! Unfortunately, they don't solve my problem. pdbaa is what I am working with anyway. SIFTS is a nice resource that is news to me, but it doesn't help (see the example in the edited question). It is possible that atm2ali from the JOY package is the solution, but I am not in academia, so I cannot use it. :-(