Tuesday, 10 July 2007

Pybel - Hack that SD file

At some point, most cheminformaticians need to hack at SD files. Maybe they need to extract the values of particular data fields, or add in some new ones. Unfortunately, our friends 'awk', 'grep' and 'Excel' (!) are not very useful for this so we need to use a scripting language.

The examples shown here use the Python module Pybel, included with the Open Babel distribution. Although it's a Python module, all of the hard work is done by the underlying C++ library.

Convert an SD file to a spreadsheet

Given an SD file with calculated descriptor values in the fields, I want to write out a Tab-separated file containing a SMILES string in the first column, the title in the second column, and then the descriptor values in the remaining columns.

import pybel

inputfile = pybel.readfile("sdf", "ace_ligands.sdf")
# We need to read the first molecule to find
# the descriptor names

mol = inputfile.next()
print "\t".join(["SMILES", "Title"] + mol.data.keys())

# Print out information on the first molecule
print "\t".join([mol.write("smi").split()[0], mol.title]
+ mol.data.values())
# Print out information on the remaining molecules
for mol in inputfile:
print "\t".join([mol.write("smi").split()[0], mol.title]
+ mol.data.values())
Notes:
  • The data fields in an SD file are in the 'data' attribute of a molecule, which behaves like a dictionary with keys() and values().
  • The "\t".join(mylist) concatenates a list of strings around Tab characters.
  • If you convert a molecule to a SMILES string, the result is a string containing the SMILES string, a Tab, the molecule title, and a carraige return. To simply access the SMILES string, you need to split() it and access the first element.

Add Rule-of-5 descriptor values to an SD file

Although Lipinksi's Rule-of-5 (or fives) may be both a Sacred Cow and an Evil Metric (according to the Great Molecular Crapshoot), we still love it. Here's how to add RO5 descriptor values to an SD file (requires Open Babel 2.1.1):

import pybel

HBD = pybel.Smarts("[!#6;!H0]")
HBA = pybel.Smarts("[$([$([#8,#16]);!$(*=N~O);" +
"!$(*~N=O);X1,X2]),$([#7;v3;" +
"!$([nH]);!$(*(-a)-a)])]")

def moredesc(mol):
ans = {}
ans['RotBonds'] = mol.OBMol.NumRotors()
ans['HBD'] = len(HBD.findall(mol))
ans['HBA'] = len(HBA.findall(mol))
ans['molwt'] = mol.molwt
return ans

output = pybel.Outputfile("sdf", "LipinskiRulesOK.sdf")

for mol in pybel.readfile("sdf", "NoRules.sdf"):
mol.data.update(mol.calcdesc())
mol.data.update(moredesc(mol))
output.write(mol)

output.close()
Notes:
  • The 'calcdesc' method of a Molecule returns a dictionary containing descriptor values for LogP, Molecular Refractivity (MR), and Polar Surface Area (PSA).
  • The 'update' method of a dictionary adds the contents of one dictionary to another.

12 comments:

Georg-Martin Krapper said...

Thanks for the reference to GMC!

I think the SMARTS for the Lipinski rule of 5 should be:

HBD [#7,#8;!H0]
HBA [#7,#8]

Lipinski defines hydrogen bond acceptors as nitrogen or oxygen and donors as nitrogen or oxygen with one more hydrogens (section 2.6 of paper). There appears to be a typo in section 3.1 of his paper but he does appear to count heteroatoms rather than hydrogens for donors.

Your HBD SMARTS will match thiols which Lipinski does not count as donors. One can criticise the Lipinski definitions but they are what were used for his analysis and should be used for anything claiming to be a Lipinski Ro5 descriptor. Hope this doesn't come across as overly pendantic.

Noel O'Boyle said...

How about Lipinksi-like? Well, in any case, the beauty of open source is that you can see exactly what definition I used for the HBA and HBD and agree or disagree as you see fit.

The definitions I used are from:
Lines 57 and 58 of smartsdescriptor.cpp in the OpenBabel development code. This is Chris Morley's code, but it takes the definitions from JOElib, which has references for HBA and HBD.

Apart from not being Lipinski-compliant, if you can see any other problems with these definitions (you seem to read SMARTS and SMILES as a native language), future generations of OpenBabel users will thank you.

Georg-Martin Krapper said...

A real quick comment because I'm going to be out of circulation for the next month or so. Your point about the open nature of the SMARTS is very important because it allows us to have our debate.

The acceptor definitions look weird to me. I would normally treat all nitrogens as acceptors unless they are hypervalent, cationic of adjacent to sp2 or sp carbons or nitrogens. Sulfonamide nitrogens are typically pyramidal but the electronic pull of the sulfonyl group is likely to zap any acceptor ability. Oxygen atoms are all likely to have some acceptor ability (expect in oxonium cations) but aromatic ethers, 2-connected ester O & furans will be very weak.

The definitions you have appear to allow sulfur without eliminating hypervalent sulfur and thioethers. The definitions appear to go to some length to eliminate nitro oxygen but will accept aromatic oxygen.

I think the v3 that qualifies the #7 will (correctly) eliminate hypervalent nitrogen and I would agree that pyrrole-like nitrogen [nH] should not be counted as an acceptor. However any 3-connected aromatic nitrogen [nX3] should be included in that category with [nH]. As mentioned above a case can be made for treating all 3-connected nitrogen next to sp2 or sp carbon ( [NX3][c,C&X3,C&X2] ) as non-hydrogen accepting. Amide N will be the most frequently encountered example of this type.

Finally the !$(*(-a)-a) looks fishy. It eliminates nitrogens singly bonded to two aromatic atoms. Although correct, it's an odd thing to code and it'd be a good idea to find out why it's there. I wondered if the single bonds were meant to be aromatic bonds. If this were the case, it would incorrectly eliminate pyridine nitrogen.

Hope this helps a bit. Maybe the original creators of the SMARTS can shed more light.

Peter Fedichev (Quantum CTO) said...

Noel,
thank you for the useful post. In fact I learnt about pybel from your blog first and only came to pybel wiki and other docs some time after.

The script calculating HBD and HDA appears to fail on a large sdf file. My view is that HBD.findall leads to a memory leak (at least what I see is that the process consumes all the memory available on my Ubuntu box and quits with:
terminate called after throwing an instance of 'std::bad_alloc'
what(): St9bad_alloc
Aborted
Please let me know if you know a remedy.
Regards,
Peter

Noel O'Boyle said...

I'll check this up, and get back to you...

Noel O'Boyle said...

The problem is in calcdesc(), and will be fixed in the next release of Pybel. Until then, here's a workaround:

import pybel
import openbabel as ob

HBD = pybel.Smarts("[!#6;!H0]")
HBA = pybel.Smarts("[$([$([#8,#16]);!$(*=N~O);" +
"!$(*~N=O);X1,X2]),$([#7;v3;" +
"!$([nH]);!$(*(-a)-a)])]")

def moredesc(mol):
ans = {}
ans['RotBonds'] = mol.OBMol.NumRotors()
ans['HBD'] = len(HBD.findall(mol))
ans['HBA'] = len(HBA.findall(mol))
ans['molwt'] = mol.molwt
return ans

descs = {'LogP': ob.OBLogP(), 'PSA': ob.OBPSA(), 'MR': ob.OBMR()}
def newcalcdesc(mol):
return dict([(x,y.Predict(mol.OBMol)) for x,y in descs.iteritems()])

output = pybel.Outputfile("sdf", "LipinskiRulesOK.sdf")

for mol in pybel.readfile("sdf", "ace_ligands.sdf"):
mol.data.update(newcalcdesc(mol))
mol.data.update(moredesc(mol))
output.write(mol)

output.close()

Unknown said...

I've just started using pybel and came across your interesting blog.

However, I'm curious about the definition of HBD used here. Lipinski defines this as the sum of OHs and NHs which I interpret differently to the other posters - to my mind this means that NH2 is counted twice.

Indeed, in his original Table 1, Atenolol has HBD value of 4 (similarly Aciclovir) which arises from counting NH2 as having HBD value of 2.

Could someone comment on this? I'm confused!

Noel O'Boyle said...

@Dast: I think there is only one way to find out for certain what Lipinksi's intention was - ask him! If you Google "Chris Lipinski" his email address is contained in the first hit. If you find out the answer, I would very be interested to hear.

Unknown said...

I can ask Chris Lipinski but unfortunately (to my mind), whatever his original intention was the term HBD has already come to be interpreted in different ways by different groups.

In my work we routinely use the Accord for Excel (Accelrys) add-in which counts NH2 twice and I would prefer to stick with this interpretation. Could you please suggest what SMARTS could give such a result?
Regards

Noel O'Boyle said...

I am afraid that I'm not very familiar with SMARTS. Perhaps you should email the CCL.net list.

Unknown said...

It's quite interesting. My question here is whether PYBEL is useful for converting SMILES to SDF format or not ?

Noel O'Boyle said...

Since Open Babel can do that, it's also possible to do that conversion with Pybel.