How to extract all Pfam sequences that have a solved PDB structure?
7.0 years ago
twinstar2 • 0

I downloaded and extracted the Pfam-A.full.ncbi.gz from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam31.0/

I split the huge file into smaller files, one per family, like this: https://www.dropbox.com/sh/t0b173oa8odvsne/AABZLhiaq-jZtE5PrULvBm9na?dl=0

Now I am trying to extract all sequences from the alignments that have an experimentally determined structure in the PDB. Luckily, Pfam provides the pdbmap file, which links PDB IDs to Pfam IDs. I can extract all Pfam families that have structures in the PDB, but I cannot extract the corresponding sequences. Any help on how to accomplish this is much appreciated.

PS: I currently use Python 2.7 with BioPython, but it can't handle the files.

Python pfam pdb • 2.4k views
6.9 years ago
twinstar2 • 0

Okay, for anyone with the same problem: I found the answer!

  1. Extract all UniProtKB IDs (column 5) from the pdbmap file into a list (you may have to split the list into smaller parts due to server limitations)
  2. Upload the list here: http://www.uniprot.org/mapping/ and select "UniProtKB AC/ID" as input and "EMBL/GenBank/DDBJ CDS" as output
  3. Rejoin the mapped results into a single target_id list
  4. Iterate over all Pfam files and discard all sequences that are not in your target_id list
  5. Let it run for 2 days
  6. ????
  7. profit
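
Steps 1 and 3 can be sketched roughly as below. This is only a minimal sketch, not the exact script I used: it assumes that pdbmap is semicolon-separated with tab padding and that the UniProtKB accession is the fifth field, so check a few lines of your own copy first. The function names (`read_uniprot_accessions`, `chunks`) and the example lines are made up for illustration, and the chunk size is arbitrary. Written for Python 3.

```python
def read_uniprot_accessions(lines, column=4):
    """Collect the unique values found in one field of pdbmap-style lines.

    Assumption: fields are separated by ';' and padded with tabs/spaces,
    and the UniProtKB accession sits in the fifth field (index 4).
    """
    accessions = set()
    for line in lines:
        fields = [field.strip() for field in line.split(";")]
        if len(fields) > column and fields[column]:
            accessions.add(fields[column])
    return sorted(accessions)


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    # Invented lines in the assumed layout, not real pdbmap entries.
    example = [
        "1ABC;\tA;\tFam_one;\tPF00001;\tP11111;\t1-10;",
        "2DEF;\tB;\tFam_one;\tPF00001;\tP11111;\t5-20;",
        "3GHI;\tA;\tFam_two;\tPF00002;\tQ22222;\t1-50;",
    ]
    ids = read_uniprot_accessions(example)
    # Arbitrary chunk size; pick whatever the mapping service accepts.
    for part in chunks(ids, 1):
        print("\n".join(part))
        print("---")
```

For the real file, pass an open file handle instead of the example list, since the function only iterates over lines.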

This will get you ~70,000 sequences.
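
Step 4 does not need the whole alignment in memory. Here is a hedged sketch of a line-by-line filter, assuming the Pfam files are Stockholm format with one row per sequence (not interleaved), that row names look like `ID/start-end`, and that the mapped IDs match the part before the `/` exactly. If your IDs differ (for example by a version suffix), adjust the comparison. `filter_stockholm` and the file names in the usage part are hypothetical. Written for Python 3.

```python
import gzip


def filter_stockholm(lines, target_ids):
    """Yield (name, aligned_sequence) for rows whose ID is in target_ids.

    Markup lines (starting with '#') and the '//' terminator are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#") or line.startswith("//"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        name, aligned = parts
        seq_id = name.split("/", 1)[0]  # assumed "ID/start-end" naming
        if seq_id in target_ids:
            yield name, aligned.strip()


if __name__ == "__main__":
    # Hypothetical file names; replace with your own.
    with open("target_ids.txt") as handle:
        targets = {line.strip() for line in handle if line.strip()}
    # latin-1 can decode any byte, so stray non-UTF-8 characters do not raise.
    with gzip.open("Pfam-A.full.ncbi.gz", "rt", encoding="latin-1") as handle:
        for name, aligned in filter_stockholm(handle, targets):
            # Write gapped rows as FASTA, dropping the gap characters.
            print(">" + name)
            print(aligned.replace("-", "").replace(".", ""))
```

Keeping the target IDs in a `set` rather than a list makes each lookup constant time, which matters a lot when every row of the full alignment has to be checked.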
