Hello everyone,
I've used Biopython on and off for the past decade. The Biopython developers have recommended Biostars as a wiki.
I am working with a large collection of PDB files. Each PDB file has only one type of protein in it. However, some of these files generate a Bio.PDB.PDBExceptions.PDBConstructionWarning when I call PDBParser().get_structure(). In every case in my data set, the warning is about a "discontinuity" in the protein. Looking more closely at those files, I determined that this occurs when there are multiple Chain objects within a Model (refer to the SMCRA hierarchy of PDB files). All of my single-Chain PDB files are processed without warnings.
I have also determined that none of my multi-Chain files are heterogeneous. They are always repeats of the same protein. Either these proteins crystallized naturally as dimers, or they occur as homo-oligomers in vivo.
Now, here's the problem. I want to extract secondary structure information using DSSP. I have a local copy of the dssp package installed on my machine, which is called by Biopython's Bio.PDB.DSSP.DSSP. It works. But you have to provide Biopython's DSSP two arguments: a Model object, and a path to the PDB file. If I have a multi-chain PDB, DSSP returns a single, flat-file result with all the individual Chains concatenated. The files that contain single proteins are OK. The multimers are a problem. I'm pretty sure I don't have any proteins 20,000 amino acids long.
I tried making a new Bio.PDB.Model and adding a single Chain to it. That doesn't appear to be reliably compatible with the DSSP file. This may have to do with the order that Bio.PDB.Model.get_chains() returns chains, versus the order that DSSP wants to see them.
I may have to brute-force my way through this process. I may just DSSP everything, and then truncate the DSSP result at the number of rows corresponding to the length of one protein. But I would like a more efficient and elegant approach.
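To make the brute-force idea concrete, here is a rough sketch of the post-processing step. Rather than truncating at a row count, it groups the per-residue records by chain ID. It assumes the result is keyed by (chain_id, residue_id) tuples, which is how Bio.PDB.DSSP.DSSP keys its entries. The function name split_by_chain is my own and hypothetical, and I have only tried it on plain dictionaries, not on my full data set.

```python
def split_by_chain(dssp_records):
    """Group per-residue records by chain ID, preserving their order.

    Assumption: dssp_records supports .keys() and [] lookup, and each key
    is a (chain_id, residue_id) tuple, as with Bio.PDB.DSSP.DSSP objects.
    A plain dict with the same key shape also works.
    """
    per_chain = {}
    for key in dssp_records.keys():
        chain_id, residue_id = key
        # Collect (residue_id, record) pairs under their chain.
        per_chain.setdefault(chain_id, []).append((residue_id, dssp_records[key]))
    return per_chain
```

With something like this, per_chain["A"] would hold only the rows for chain A, so the concatenated result could be cut down to one copy of the protein without counting rows.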
I am inspecting Bio.PDB.Model objects in the Python interpreter. I don't see any method to REMOVE Chains from Models in the documentation, only a method to add them. In fact, I am not sure how child objects are stored inside Bio.PDB objects; they are well hidden. There appear to be list-like and dictionary-like views of the children. Here's an example of a Model with two Chains:
In [15]: model.child_list
Out[15]: [<Chain id=A>, <Chain id=B>]
In [16]: model.child_dict
Out[16]: {'B': <Chain id=B>, 'A': <Chain id=A>}
Comments and advice are appreciated, thanks!