Thursday 18 January 2018

Implementing the Sayle tautomer hash with Open Babel

One of the consequences of last year's overhaul of handling of kekulization, aromaticity and implicit hydrogens is the ability to easily calculate structure hashes such as Roger Sayle's tautomer hash, which I wrote up on the NextMove blog a while ago.

I routinely use this hash to handle tautomers, particularly when dealing with R groups. It doesn't solve all tautomer issues (e.g. ones that involve carbon) but it can quickly bring you from having no support for handling tautomers to getting you 95% of the way. In fact, I've been thinking about adding this (and some of the other hashes that Roger has come up with) as cansmi output options. Anyhoo, here's an implementation in Python:
import pybel
ob = pybel.ob

def tautomerhash(smi):
    """Take a SMILES and return the Sayle tautomer hash:
    https://nextmovesoftware.com/blog/2016/06/22/fishing-for-matched-series-in-a-sea-of-structure-representations/
    """
    mol = pybel.readstring("smi", smi).OBMol
    mol.DeleteHydrogens() # just in case
    formalcharges = 0
    hcount = 0
    for atom in ob.OBMolAtomIter(mol):
        formalcharges += atom.GetFormalCharge()
        atom.SetFormalCharge(0)
        if atom.GetAtomicNum() != 6: # non-carbon
            hcount += atom.GetImplicitHCount()
            atom.SetImplicitHCount(0)
        atom.UnsetAromatic()
    for bond in ob.OBMolBondIter(mol):
        bond.SetBondOrder(1)
        bond.UnsetAromatic()
    mol.SetAromaticPerceived() # no point triggering perception
    return "%s_%d" % (pybel.Molecule(mol).write("can").rstrip(), hcount-formalcharges)

if __name__ == "__main__":
    smis = ["*c1c(c(C(=N)O)cc2nc([nH]c12)C(=O)[O-])N(=O)=O",
            "*c1c(c(C(=O)N)cc2[nH]c(nc12)C(=O)O)[N+](=O)[O-]"]
    for smi in smis:
        print(tautomerhash(smi))
The two SMILES in the example code above are those from the original blog post. Here they give the following two identical hashes (note to self: 'fix' the bracketed asterisk):
[O][C]([C]1[N][C]2[C]([N]1)[C](*)[C]([C]([CH]2)[C]([O])[N])N([O])[O])[O]_4
[O][C]([C]1[N][C]2[C]([N]1)[C](*)[C]([C]([CH]2)[C]([O])[N])N([O])[O])[O]_4

9 comments:

SteveR said...

I can think of two cases where this might be troublesome, depending on exactly what behavior you expect.

1 - Chiral centres. Using the above, N1CC[C@H](N)C1, N1CC[C@@H](N)C1 and N1CCC(N)C1 all collapse to the same hash. That can be fixed by not removing explicit H if the centre is a chiral centre

2 - Double bond geometry - C\C=C\C, C\C=C/C, C/C=C\C, C/C=C/C, CC=CC all collapse to the same hash. It is less obvious how to fix this. More concerningly, CCCC also appears to give the same hash?

Noel O'Boyle said...

Great points. Yes for 1 and mostly for 2, but CCCC has a different H count so cannot give the same hash as the version with a double bond (unless I misunderstand your point).

If truth be told, I am often working in a space where I am simultaneously wiping the stereo. More care is required if this must be preserved, and some thought applied to whether the stereo is "real" if an interconversion between two tautomers might cause rotation. As always, the details are everything - the code as presented is a starting point and must be tailored to the specific application.

Noel O'Boyle said...

Oh yes - I see your point now. Yup, that looks like an error.

Noel O'Boyle said...

I've corrected that by indenting "atom.SetImplicitHCount(0)". Thanks for pointing it out.

SteveR said...

Thanks Noel - not sure that fixes it though, as the final step of setting all bond orders to one, at least in another toolkit, results in the hashes being [CH3][CH2][CH2][CH3]_0 in all those cases, I think. If you dedent and remove the if statement, then you do get different hashes:

C\C=C\C -> [C][C][C][C]_8
C\C=C/C -> [C][C][C][C]_8
CCCC -> [C][C][C][C]_10

Steve

Noel O'Boyle said...

Here are the results for Open Babel (dev version). It appears to work fine:

CCCC_0
C/[CH][CH]\C_0

I just looked into RDKit, and there must be something else needed to get it to work as expected. Here's something I noticed in the past: https://github.com/rdkit/rdkit/issues/2339, but even reassigning radicals doesn't fix this case.

SteveR said...

Thanks Noel - I dont think RDKit is ever going to write out the / and \ once the double bond has gone. I just tried only partially processing in RDKit with a view to then crunching the SMILES afterwards - unfortunately, the result then is that the SMILES atom order is not the same for two related tautomers, e.g

C=CC(=O)C --> [C][C](=[O])[CH]=[CH2]_0
C=CC(O)=C --> [CH2]=[CH][C](=[CH2])[O]_1

I think there might be a solution based on fully processing to get the SMILES atom order and then reordering the atoms to that order on a fresh copy of the molecule and then either partially processing / string-chewing or using a custom SMILES writing function... neither sounds entirely trivial!

SteveR said...

Update - nope, the String-chewing method doesnt work, because the exact order of atoms in the SMILES after atom re-ordering is not the same as the order of atoms, which leaves the custom SMILES-writer I think.

SteveR said...

Actually, the renumbering (i.e. reordering) approach does work, but there are some unfortunate consequences of only counting the H's on the non-C atoms:

C=CC(=O)N, C=CC(O)=N, C=CC(=N)O, C=CC(N)=O --> [C][C][C]([N])[O]_2 (Correctly, all tautomers)

but:

C=CC(=O)C --> [C][C][C]([C])[O]_0
C=CC(O)=C --> [C][C][C]([C])[O]_1

i.e. not tautomers, because the 'mobile' H has not been counted in the 1st tautomer as it is on a C atom, but has in the second as it is on an O atom.

Unfortunately, the 'technical tautomer' pair:

c1cc(F)ccc1C1CCCCC1, C1CC(F)CCC1c1ccccc1 --> [F][C]1[C][C][C]([C]2[C][C][C][C][C]2)[C][C]1_0

are tautomers, as all of the 6 'mobile' H's are on C atoms and thus not counted...