Tuesday, 23 January 2018

Which came first, the atom or the bond?

Let's suppose you need to iterate over the neighbours of an atom, and you want to know both the corresponding bond order and the atomic number of the neighbour. Given that toolkits typically provide iterators over the attached bonds or the attached atoms, but not both simultaneously, how exactly should you do this?

Should you:
  • (a) iterate over the neighbouring atoms, and then request the bond joining the two atoms?
  •     for nbr in ob.OBAtomAtomIter(atom):
            bond = atom.GetBond(nbr)
    
  • or (b) iterate over the attached bonds, and ask for the atom at the other end?
  •     for bond in ob.OBAtomBondIter(atom):
            nbr = bond.GetNbrAtom(atom)
    

Obviously, either way you get your atom and bond. But which is more efficient? Clearly, the answer to this depends on the internals of the toolkit. But if you assume that each atom knows its attached atoms and bonds, then it's only the second step that determines the relative efficiency. That is:
  • (a) given two atoms find the bond that joins them, versus
  • (b) given an atom and a bond find the atom at the other end

Since the implementation of (a) will probably involve the same test that (b) is doing plus additional work, it follows that (b) must be more efficient. I never really thought about this before until I was writing the kekulization code for Open Babel. It's the sort of thing that's useful to work out once and then apply in future without thinking. Sure, the speed difference may be minimal but given that you have to choose, you might as well write it the more efficient way.

Thursday, 18 January 2018

Implementing the Sayle tautomer hash with Open Babel

One of the consequences of last year's overhaul of handling of kekulization, aromaticity and implicit hydrogens is the ability to easily calculate structure hashes such as Roger Sayle's tautomer hash, which I wrote up on the NextMove blog a while ago.

I routinely use this hash to handle tautomers, particularly when dealing with R groups. It doesn't solve all tautomer issues (e.g. ones that involve carbon) but it can quickly bring you from having no support for handling tautomers to getting you 95% of the way. In fact, I've been thinking about adding this (and some of the other hashes that Roger has come up with) as cansmi output options. Anyhoo, here's an implementation in Python:
import pybel
ob = pybel.ob

def tautomerhash(smi):
    """Take a SMILES and return the Sayle tautomer hash:
    https://nextmovesoftware.com/blog/2016/06/22/fishing-for-matched-series-in-a-sea-of-structure-representations/
    """
    mol = pybel.readstring("smi", smi).OBMol
    mol.DeleteHydrogens() # just in case
    formalcharges = 0
    hcount = 0
    for atom in ob.OBMolAtomIter(mol):
        formalcharges += atom.GetFormalCharge()
        atom.SetFormalCharge(0)
        if atom.GetAtomicNum() != 6: # non-carbon
            hcount += atom.GetImplicitHCount()
        atom.SetImplicitHCount(0)
        atom.UnsetAromatic()
    for bond in ob.OBMolBondIter(mol):
        bond.SetBondOrder(1)
        bond.UnsetAromatic()
    mol.SetAromaticPerceived() # no point triggering perception
    return "%s_%d" % (pybel.Molecule(mol).write("can").rstrip(), hcount-formalcharges)

if __name__ == "__main__":
    smis = ["*c1c(c(C(=N)O)cc2nc([nH]c12)C(=O)[O-])N(=O)=O",
            "*c1c(c(C(=O)N)cc2[nH]c(nc12)C(=O)O)[N+](=O)[O-]"]
    for smi in smis:
        print(tautomerhash(smi))
The two SMILES in the example code above are those from the original blog post. Here they give the following two identical hashes (note to self: 'fix' the bracketed asterisk):
[O][C]([C]1[N][C]2[C]([N]1)[C]([*])[C]([C]([C]2)[C]([O])[N])N([O])[O])[O]_4
[O][C]([C]1[N][C]2[C]([N]1)[C]([*])[C]([C]([C]2)[C]([O])[N])N([O])[O])[O]_4