Question

Biopython DSSP treats multiple Chains as one?

0

Entering edit mode

7.4 years ago

jladasky • 0

Hello everyone,

I've used Biopython on and off for the past decade. The Biopython developers have recommended Biostars as a wiki.

I am working with a large collection of PDB files. Each PDB file has only one type of protein in it. However, A Bio.PDB.PDBExceptions.PDBConstructionWarning is generated by some of these files when I call PDBParser().get_structure(). With every example in my data set, the warning is about a "discontinuity" in the protein. Looking more closely at those files, I determined that this occurs when there are multiple Chain objects within a Model (refer to the SMCRA hierarchy of PDB files). All of my single-Chain PDB files are being processed without warnings.

I have also determined that none of my multi-Chain files are heterogeneous. They are always repeats of the same protein. Either these proteins crystallized naturally as dimers, or they occur as homo-oligomers in vivo.

Now, here's the problem. I want to extract secondary structure information using DSSP. I have a local copy of the dssp package installed on my machine, which is called by Biopython's Bio.PDB.DSSP.DSSP. It works. But you have to provide Biopython's DSSP two arguments: a Model object, and a path to the PDB file. If I have a multi-chain PDB, DSSP returns a single, flat-file result with all the individual Chains concatenated. The files that contain single proteins are OK. The multimers are a problem. I'm pretty sure I don't have any proteins 20,000 amino acids long.

I tried making a new Bio.PDB.Model and adding a single Chain to it. That doesn't appear to be reliably compatible with the DSSP file. This may have to do with the order that Bio.PDB.Model.get_chains() returns chains, versus the order that DSSP wants to see them.

I may have to brute-force my way through this process. I may just DSSP everything, and then truncate the DSSP result at the number of rows corresponding to the length of one protein. But I would like a more efficient and elegant approach.

I am inspecting Bio.PDB.Model objects in the Python interpreter. I don't see any method to REMOVE Chains from Models in the documentation, only a method to add them. In fact, I am not sure how child objects are stored inside Bio.PDB objects, they are well hidden. There appear to list-like and dictionary-like views of the children, here's an example of a Model with two Chains:

In [15]: model.child_list
Out[15]: [<Chain id=A>, <Chain id=B>]

In [16]: model.child_dict
Out[16]: {'B': <Chain id=B>, 'A': <Chain id=A>}

Comments and advice are appreciated, thanks!

Biopython DSSP • 3.3k views

ADD COMMENT • link 7.4 years ago by jladasky • 0

score 0 · Answer 1 · 2016-12-03

Well, I have a solution. It may not be the most elegant solution, but it works. Maybe someone else who needs to get DSSP information out of Biopython can make use of it.

# Procedure to extract DSSP information from PDB files with multiple
# Chains, using Python 3.5, Biopython 1.66, and DSSP (mkdssp) 2.2.1
# 2016-12-02
# John J. Ladasky Jr., Ph.D.

from Bio.PDB.PDBParser import PDBParser
from Bio.PDB import DSSP 
from Bio.PDB.PDBExceptions import PDBConstructionWarning
import warnings

multiple_chains = False
with open(path, "r") as f:
    with warnings.catch_warnings():
        warnings.simplefilter("error")      # warnings become exceptions
        try:
            structure = _parser.get_structure("tmp", path)
        except PDBConstructionWarning:
            # The Model in the PDB file consists of multiple Chains.
            # Calling DSSP with the current Model will concatenate 
            # all Chains, and this is unavoidable because it's 
            # external to Python.  The DSSP result will need to be
            # truncated after the fact.
            warnings.simplefilter("ignore")
            structure = PDBParser().get_structure("tmp", path)
            multiple_chains = True
    model = next(structure.get_models())    # I always work with Model 1
    output = DSSP(model, path)              # DSSP expects a Model, why?
    if multiple_chains:
        # A DSSP object is dictionary-like, but with ordered keys.
        # The keys are nested tuples which are structured like PDB 
        # Residue identifiers.  The first element of the tuple is
        # the Chain name (typically, "A").
        chain_a = output.keys()[0][0]
        subset_keys = [k for k in output.keys() if k[0] == chain_a]
        # Now use the keys in chain "A" to truncate output.  The 
        # output object need not be a DSSP object, it just needs to
        # yield the right values when iterated below.
        output = [output[k] for k in subset_keys]
    num, amino, dssp = [], [], []
    for (n, a, d, *unused) in output:
        num.append(n)
        amino.append(a)
        dssp.append(d)

You supply a path to the PDB file, in the string named "path". After running this code, the amino acid numbers are in num, the residues are in amino, and the DSSP assignments are in dssp. There is other information in a DSSP object that I'm not using, which I pack into the tuple named "unused."