The underlying problem was the same in both cases; the authors had decided that their ligand was actually composed of several 'residues' and had given the different parts different names. The PDB had not taken this craziness into account and had split the ligand into separate ones based on the HETATM residue names.
So, a little bit of extra legwork for the first case (1pph). I searched for the three ligands in the large SDF file, copied and pasted them into a new SDF, and joined the ligands into one entry with "babel --join". Finally I opened it in Avogadro and drew the missing bonds between the 'residues'.
Fine.
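For anyone who would rather script the copy-and-paste step, here is a minimal sketch in plain Python. It assumes that each record in the ligand SDF file ends with a "$$$$" line and that the title (first) line of each record contains something searchable, such as the PDB ID. The titles in the toy data are made up for illustration and are not the PDB's actual naming scheme. Joining the selected records into one molecule and drawing the missing bonds still has to be done afterwards, as described above.

```python
def split_sdf(text):
    """Split SDF text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        current.append(line)
        if line.strip() == "$$$$":
            records.append("\n".join(current) + "\n")
            current = []
    return records


def select_records(text, wanted):
    """Return the records whose title line contains any of the wanted strings."""
    return [
        record
        for record in split_sdf(text)
        if any(w in record.splitlines()[0] for w in wanted)
    ]


# Toy data: hypothetical titles and no real atom blocks (assumption for the demo).
demo = "".join(
    f"{title}\n  toy record, not a real molfile\n\n$$$$\n"
    for title in ["1PPH_TOS", "1PPH_APM", "1PPH_PIP", "OTHER_LIG"]
)

picked = select_records(demo, ["1PPH"])
new_sdf = "".join(picked)  # text of the smaller SDF file, ready to be written out
```

The resulting text can be written to a new file and handed to the join step.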
The second case was worse (1ppc). Only 3 of the 4 residues made it to the PDB's ligand SDF file; the fourth residue was a 'glycine' and was not labelled as HETATM in the PDB file - as a result it was not included in the ligand SDF file.
Time to get the sledgehammer. I don't use the PDB file format in Open Babel very often, so I wasn't sure if the following would work:
babel 1PPC.pdb --separate split.sdf
The second molecule in split.sdf is the complete ligand. Nice.
Now you might be wondering why I didn't just do this in the first place for all of the ligands. Well, the PDB file format doesn't contain bond order information, and so I was hoping to take advantage of the work the PDB had put into preparing its own ligand SDF files (which I think was one of the issues addressed during the PDB remediation process a few years ago).
If 2 out of my 36 ligands had this same problem, I wonder how common it is in the PDB data? Maybe I should make a list of cases and send it to the PDB.
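One way to start such a list would be a rough heuristic: flag any chain in which two or more HETATM residues have consecutive residue numbers, since that is the pattern seen in 1pph (TOS 1, APM 2, PIP 3). The sketch below is only that, a heuristic under stated assumptions. It reads the standard fixed columns of the PDB format, ignores waters, and would miss a case like 1ppc where one part is stored as ATOM records. The helper `het_line` and the toy records are made up for the demonstration.

```python
from collections import defaultdict


def hetero_residues_by_chain(pdb_text):
    """Map chain ID -> list of (residue name, residue number) seen in HETATM records."""
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        key = (line[17:20].strip(), int(line[22:26]))  # resName, resSeq columns
        if key not in chains[line[21]]:                # line[21] is the chain ID
            chains[line[21]].append(key)
    return chains


def candidate_split_ligands(pdb_text, ignore=("HOH",)):
    """Return (chain, [residue names]) for runs of consecutively numbered HETATM residues."""
    hits = []
    for chain, residues in sorted(hetero_residues_by_chain(pdb_text).items()):
        ordered = sorted((num, name) for name, num in residues if name not in ignore)
        run = []
        for num, name in ordered:
            if run and num != run[-1][0] + 1:
                if len(run) > 1:
                    hits.append((chain, [n for _, n in run]))
                run = []
            run.append((num, name))
        if len(run) > 1:
            hits.append((chain, [n for _, n in run]))
    return hits


def het_line(serial, atom, res, chain, num):
    """Hypothetical helper: build a HETATM line with correct fixed columns."""
    return (f"HETATM{serial:5d} {atom:<4s} {res:>3s} {chain}{num:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


# Toy records: residue names follow the 1PPH remark, everything else is invented.
toy_pdb = "\n".join([
    het_line(1, "S1", "TOS", "I", 1),
    het_line(2, "N1", "APM", "I", 2),
    het_line(3, "N1", "PIP", "I", 3),
    het_line(4, "S", "SO4", "E", 500),
])

flagged = candidate_split_ligands(toy_pdb)
```

Anything this flags would still need checking by eye, since adjacent but genuinely separate hetero groups would trigger it as well.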
Really really nice!
Tomorrow I'll try to find more ambiguities...
how to do it?
If you have a look at the PDB file of 1PPH entry, you can read this:
REMARK 600 3-TAPAP, A SYNTHETIC THROMBIN INHIBITOR, IS NONCOVALENTLY
REMARK 600 BOUND TO THE ACTIVE SITE.
REMARK 600
REMARK 600 A CALCIUM CAL 480 AND A SULFATE SO4 ARE ADDITIONALLY
REMARK 600 PRESENT.
REMARK 600
REMARK 600 3-TAPAP "RESIDUES" (NOMENCLATURE SEE PAPER) ARE TOS I 1,
REMARK 600 APM I 2, PIP I 3.
REMARK 610
REMARK 610 MISSING HETEROATOM
REMARK 610 THE FOLLOWING RESIDUES HAVE MISSING ATOMS (M=MODEL NUMBER;
REMARK 610 RES=RESIDUE NAME; C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER;
REMARK 610 I=INSERTION CODE):
REMARK 610 M RES C SSEQI
REMARK 610 APM E 2
maybe this is a common remark... let's see.
not so easy to find other examples :(
I saw that remark, but I don't know if the statement about the missing heteroatom is even correct. The structure I see in the PDB corresponds to the one in the paper.
As for labeling parts of molecules as amino acids: for peptidomimetics it is frequently useful to have things labelled that way, or to break up an inhibitor into parts that correspond to residues. It makes it much easier to understand the drug, even if it makes cheminformatics more difficult.
That said, they usually label the amino acids as HETATM records, not ATOM records. (You can look up any amino acid and see this, e.g. http://www.pdb.org/pdb/ligand/ligandsummary.do?hetId=LEU says "Free Ligand in 57 structures"; any of these will have a HETATM LEU.)
I just wanted to say that while it looks silly in these cases, there are a ton of places in the PDB where having things this way is extremely useful. A blanket conversion of these types of cases into single residues would lose a lot of information.
Understood. If this is common practice, then the problem is that the PDB should *know* that this is a single ligand, and should output it and display it accordingly.
ExtractFromPDBFiles from MayaChemTools should work fine.
1. Extract all HETATM
2. Remove all of the artifacts (sulfate ions, Na, Cl, etc).
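The two steps above can be sketched in plain Python as follows. This is not how ExtractFromPDBFiles does it, just an illustration of the idea. The artifact list is an assumption and would need extending for real data, and `het_line` is a hypothetical helper used only to build correctly aligned toy records.

```python
# Assumed residue names for common crystallisation artifacts; extend as needed.
ARTIFACTS = {"HOH", "SO4", "NA", "CL", "CA"}


def ligand_hetatm_lines(pdb_text, artifacts=ARTIFACTS):
    """Step 1: keep only HETATM records. Step 2: drop residues on the artifact list."""
    return [
        line
        for line in pdb_text.splitlines()
        if line.startswith("HETATM") and line[17:20].strip() not in artifacts
    ]


def het_line(serial, atom, res, chain, num):
    """Hypothetical helper: build a HETATM line with correct fixed columns."""
    return (f"HETATM{serial:5d} {atom:<4s} {res:>3s} {chain}{num:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


# Toy records: residue names follow the 1PPH remark, everything else is invented.
toy_pdb = "\n".join([
    het_line(1, "S1", "TOS", "I", 1),
    het_line(2, "N1", "APM", "I", 2),
    het_line(3, "N1", "PIP", "I", 3),
    het_line(4, "S", "SO4", "E", 500),
    het_line(5, "CA", "CA", "E", 480),
])

kept = ligand_hetatm_lines(toy_pdb)
kept_names = [line[17:20].strip() for line in kept]
```

Note that a HETATM-only filter like this would still miss the 1ppc case from the post, where the glycine part of the ligand is stored as ATOM records.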