Dear All,
I have a list of UniProt IDs. Based on these IDs, I would like to parse the ordering of amino acid residues in the ATOM field of PDB structures. However, the residue numbers in the ATOM field do not always match the order of residues in the corresponding ResSeq numbers. After searching Biostars, I found a post about the SIFTS database.
But the residue number information in the SIFTS database is stored in xml.gz files. I really don't know how to read these files using either R or Python.
I tried some solutions from Biostars itself, but they don't work in my case. I would like to give UniProt IDs (or PDB IDs) one by one and parse the XML files to get the residue numbers in the PDB and the corresponding residue numbers in the ResSeq field.
I would appreciate suggestions from both R and Python experts, because I would like to know both approaches.
Link to the SIFTS database: https://www.ebi.ac.uk/pdbe/docs/sifts/quick.html
Following is the xml file repository: ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/
Thank you in advance
Give us an example, please.
"I tried some solutions from Biostars itself. But they don't work in my case.": what have you tried?
This is the R code, but it gives an error.
As for Python, I don't know which module is suitable for me. I have never parsed XML files. I found Beautiful Soup and first tried its examples to learn it. But first of all, I don't know how to read these files one by one. Any help with that would be great.
For example, I have two structures, 1GK9 and 1CRN, in my file (in fact I have around 1200 structures).
In the SIFTS database they can be found as follows:
Individual PDB entry data can either be found in a path like this:
ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/xml/1xyz.xml.gz - where 1xyz is the PDB code, or in a path like this (so here 1xyz can be 1crn, 1gk9, etc.):
ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/xy/1xyz.xml.gz - where 'xy' are the second and third characters of the PDB code and 1xyz is the PDB code itself (here xy will be cr, gk, etc.).
Hope this is what you asked for.
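The path rule described above can be sketched in Python roughly as follows. This is only a minimal sketch: the function names `sifts_url` and `fetch_sifts_xml` are made up for illustration, the base address is the split_xml one quoted in the post, and whether the server still answers at that address is not checked here. It uses only the standard library (`urllib.request` and `gzip`).

```python
import gzip
import urllib.request

# Base address as quoted in the post (assumed to still be valid).
BASE = "ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml"


def sifts_url(pdb_id):
    """Build the split_xml address for one four-character PDB code."""
    pdb_id = pdb_id.strip().lower()
    if len(pdb_id) != 4:
        raise ValueError(f"expected a four-character PDB code, got {pdb_id!r}")
    # The sub-directory is the second and third characters of the code.
    return f"{BASE}/{pdb_id[1:3]}/{pdb_id}.xml.gz"


def fetch_sifts_xml(pdb_id):
    """Download one entry and return the decompressed XML as bytes."""
    with urllib.request.urlopen(sifts_url(pdb_id)) as response:
        return gzip.decompress(response.read())


# Hypothetical usage, one entry at a time:
# for pdb_id in ["1gk9", "1crn"]:
#     xml_bytes = fetch_sifts_xml(pdb_id)
#     with open(f"{pdb_id}.xml", "wb") as out:
#         out.write(xml_bytes)
```

For around 1200 structures it is probably kinder to the server to pause between requests and to skip files that are already on disk.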
I meant: what would be the expected output for "I would like to parse the ordering of amino acid residues in the ATOM field of PDB structures." for ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/xy/1xyz.xml.gz?
The XML file contains information like the following:
Expected output: I would like to have them in tab- or comma-separated format (columns).
The row name shows the SegId (this will be the same for a single chain in a structure), followed by dbResnum, dbResname, etc. (all entities in the crossRefDb fields) as columns, in order. If any of these does not exist, then "NA".
I think I have now answered your question.
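One possible way to get such a table in Python, using only the standard library (`xml.etree.ElementTree`, `gzip`, `csv`), is sketched below. It rests on assumptions that should be checked against a real file: that residues sit in `residue` elements inside `segment` elements carrying a `segId` attribute, that each `crossRefDb` child has `dbSource`, `dbResNum` and `dbResName` attributes (spelled that way), and that only the PDB and UniProt cross-references are wanted. The names `sifts_rows`, `write_tsv`, `convert` and the tiny `SAMPLE` document are made up for illustration; `SAMPLE` is hand-written, not copied from SIFTS.

```python
import csv
import gzip
import io
import xml.etree.ElementTree as ET

# Cross-reference databases to report, in column order (an assumption).
SOURCES = ("PDB", "UniProt")
FIELDS = ("dbResNum", "dbResName")
COLUMNS = ["segId"] + [f"{s}_{f}" for s in SOURCES for f in FIELDS]

# Hand-written illustration of the assumed layout, not real SIFTS data.
SAMPLE = """<entry xmlns="http://example.org/hypothetical-namespace">
  <entity entityId="A">
    <segment segId="1crn_A_1_46">
      <listResidue>
        <residue dbResNum="1" dbResName="THR">
          <crossRefDb dbSource="PDB" dbResNum="1" dbResName="THR"/>
          <crossRefDb dbSource="UniProt" dbResNum="1" dbResName="T"/>
        </residue>
        <residue dbResNum="2" dbResName="THR">
          <crossRefDb dbSource="PDB" dbResNum="2" dbResName="THR"/>
        </residue>
      </listResidue>
    </segment>
  </entity>
</entry>"""


def local(tag):
    """Drop an XML namespace prefix: '{ns}residue' becomes 'residue'."""
    return tag.rsplit("}", 1)[-1]


def sifts_rows(fileobj):
    """Yield one dict per residue: segId plus the crossRefDb attributes."""
    root = ET.parse(fileobj).getroot()
    for segment in root.iter():
        if local(segment.tag) != "segment":
            continue
        seg_id = segment.get("segId", "NA")
        for residue in segment.iter():
            if local(residue.tag) != "residue":
                continue
            row = dict.fromkeys(COLUMNS, "NA")  # missing values stay "NA"
            row["segId"] = seg_id
            for ref in residue:
                source = ref.get("dbSource")
                if local(ref.tag) == "crossRefDb" and source in SOURCES:
                    for field in FIELDS:
                        row[f"{source}_{field}"] = ref.get(field, "NA")
            yield row


def write_tsv(rows, out):
    """Write the rows as tab-separated text with a header line."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def convert(path_gz, path_tsv):
    """Read one local .xml.gz file and write the table next to it."""
    with gzip.open(path_gz, "rb") as fh, open(path_tsv, "w", newline="") as out:
        write_tsv(sifts_rows(fh), out)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_tsv(sifts_rows(io.BytesIO(SAMPLE.encode())), buffer)
    print(buffer.getvalue(), end="")
```

With a downloaded file, something like `convert("1crn.xml.gz", "1crn.tsv")` would produce one table per structure; switching `delimiter` to `","` gives comma-separated output instead. Matching on the local tag name avoids hard-coding the XML namespace, at the cost of being less strict.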