Monday 25 July 2022

Post-hoc correction of generated representations

The performance of generative models based on SMILES is much improved compared to the early days when only 5-20% of the SMILES generated would be considered valid. In the previous post, an RNN was able to achieve ~96% valid SMILES. Is it possible to increase this still further via post-hoc correction, and if so, is this a good idea?

As a strawman, let's consider a simple approach to ensure that a formerly invalid SMILES becomes valid, which is to trim back the SMILES from the right until only a valid SMILES remains (code at end). Let's look at the results for the failures reported in the previous blog post:

Aromatic system cannot be kekulized
Orig:    Cc1c(NS(=O)(=O)c2cccc3cccnc23)c(=O)n(-c2ccccc2Cl)c1C#N
Fixed!:  C

Unmatched close parenthesis
Orig:    c1cn(-c2ccc3nc(-c4ccc5c(c4)OCO5)nc4ccccc34)[nH]c2=O)cn1
Fixed!:

1 branch has not been closed
Orig:    O=C(NC(C(=O)Nc1cc(Cc2n[nH]c(=O)c3ccccc23)ccc1F)C1CC1
Fixed!:  O=C

2 ring openings have not been closed
Orig:    OCC1CCC2(CC1)CC(Oc1ccccc1)C(C#N)[N+](=O1)[O-]
Fixed!:  OCC

1 ring opening has not been closed
Orig:    CN1C(=O)NC2(CC2(Cl)Br)c1ccccc1
Fixed!:  CN1C(=O)NC2(CC2(Cl)Br)c1ccccc

Aromatic system cannot be kekulized
Orig:    COc1cc(OC(F)F)c(C(=O)NCc2ccn(C)c(=O)c2)c(-c2ccc(N(C)S)cc2)cc1OC
Fixed!:  CO

Aromatic system cannot be kekulized
Orig:    COc1ccc2c(c1)c(Cc1ccc(F)cc1)c(C)[n+]2[O-]
Fixed!:  CO

1 ring opening has not been closed
Orig:    COc1cccc(Oc2ccc3c(c2)C(=O)c2c4c(ccc2CCO)OCO3)c1
Fixed!:  CO

Aromatic system cannot be kekulized
Orig:    O=c1c(Cn2cc(-c3ccccc3)nn2CC2CCCO2)ccn1Cc1ccccc1F
Fixed!:  O

Aromatic system cannot be kekulized
Orig:    Cc1c(Cc2cccc(-c3cc(=O)n(O)c(=O)s3)c2)coc2cc(O)cc(O)c12
Fixed!:  C

Aromatic system cannot be kekulized
Orig:    CC(=O)c1sc2c(c1NC(=O)c1ccc(O)c(OC)c1)c(C)n2C(C)C
Fixed!:  CC(=O)c1sc2c(c1NC(=O)c1ccc(O)c(OC)c1)c(C)n2

1 ring opening has not been closed
Orig:    O=C(Cn1cc([N+](=O)[O-])cn1)N1C2CCC1CC(n1c(=O)c(C(=O)O)nc3cccc1c32)OCOC4
Fixed!:  O=C(Cn1cc([N+](=O)[O-])cn1)N1C2CCC1CC(n1c(=O)c(C(=O)O)nc3cccc1c32)OCOC

Aromatic system cannot be kekulized
Orig:    CC(C)c1cccc(-c2nc3nccc(C(F)(F)F)n3c(=O)[nH]2)c1
Fixed!:  CC

Aromatic system cannot be kekulized
Orig:    CCOc1c(S(=O)(=O)Nc2ccc(F)cc2)c(O)c2ccccc2n1C
Fixed!:  CCOc1c(S(=O)(=O)Nc2ccc(F)cc2)c(O)c2ccccc2n1

Aromatic system cannot be kekulized
Orig:    CC(C)(C)c1ccc(-c2noc(O)n2N)cc1
Fixed!:  CC

1 branch has not been closed
Orig:    Cc1ccc(CN2CCC(C3CC3C(=O)Nc3ccc(S(=O)(=O)N4CCC4)cc32)cc1-n1cnnn1
Fixed!:  C

2 ring openings have not been closed
Orig:    c1ccc(Cn2cnc3c(OCC4CCCN5)nccc32)cc1
Fixed!:

2 ring openings have not been closed
Orig:    CCc1ncc(C(=O)Nc2cc(C3C4CN(Cc5cnn(C)c5C)CCOCC53)c(O)cn2)cn1
Fixed!:  CC

Aromatic system cannot be kekulized
Orig:    C#Cc1ccc2nc3ccc(C(=O)NCCN(C)C)cc3oc2c1
Fixed!:  C#C

Aromatic system cannot be kekulized
Orig:    O=C(c1nc2cc3ccccc3sc2c(=O)[nH]1)N1CCCc2ccccc21
Fixed!:  O=C

Unmatched close parenthesis
Orig:    CC1CN2c3c(nc4c(C(=O)O)nn(-c5cc(F)ccc5F)c4F)n3CC12)OCCO4
Fixed!:  CC

3 ring openings have not been closed
Orig:    COC1CCCc2c1nc1c(c2N)CCc2ccccc2-23
Fixed!:  COC

Uncommon valence or charge state
Orig:    CCOC(=O)C1=C(C)N=C2SC(=C[N+](=O)[O-])NC2=OC1=O
Fixed!:  CCOC(=O)C

Aromatic system cannot be kekulized
Orig:    CN(C)CCn1c(=C2C(=O)Nc3ccc(Br)cc32)c(O)c2ccccc21
Fixed!:  CN(C)CC

Aromatic system cannot be kekulized
Orig:    Oc1ccc(-c2ccc3c(O)c(OC4OCCCO4)c(Cl)c3c2)cc1
Fixed!:  O

Cannot have a bond opening and closing on the same atom
Orig:    Cc1ccc2[nH]c(C(=O)N3CC44C=CC(CC3)c3ccccc34)cc2n1
Fixed!:  C

1 ring opening has not been closed
Orig:    O=C(c1ccccc1)c1cc2c3c(c(OC4OC(CO)C(O)C(O)C4O)c(=O)cc3c(=O)n12)CCCCC3
Fixed!:  O=C(c1ccccc1)c1cc2c3c(c(OC4OC(CO)C(O)C(O)C4O)c(=O)cc3c(=O)n12)CCCCC

Aromatic system cannot be kekulized
Orig:    CCc1cnc(CN(C)CCCn2c(=O)c3cc(F)ccc3n2C)n1
Fixed!:  CC

1 ring opening has not been closed
Orig:    CNC(=O)c1cc2[nH]c(=O)c3c(n2n1)CC1CC3C(C)C(O)CCC12C
Fixed!:  CNC(=O)c1cc2[nH]c(=O)c3c(n2n1)CC1CC3C(C)C(O)CCC1

Aromatic system cannot be kekulized
Orig:    O=[N+]([O-])c1ccc2c(c1)SC(c1nc(Nc3ccccc3[N+](=O)[O-])[nH]c1=O)O2
Fixed!:  O

Aromatic system cannot be kekulized
Orig:    COC1COC(c2c(NC(c3ncccc3Br)c3ccccc3)[nH]n2)C1
Fixed!:  COC

Unmatched close parenthesis
Orig:    O=C(C1CCN(c2cc(N3CCC(n4c(=O)[nH]nc(-c5ccccc5)C4)cc3)OC4)[nH]c2=O)CC1)N1CCCC1
Fixed!:  O=C

Aromatic system cannot be kekulized
Orig:    O=C1COc2c(NC3CCCC3)c(F)cc3c(=O)c(CCc4nc5ccccc5[nH]4)sc(=O)n1c23
Fixed!:  O=C

2 ring openings have not been closed
Orig:    CC(=O)C1=C(O)C=C2Oc3c4c(c(C)c(O)c3C24C(C)CC1=O)OC(C)(C)C(C(=O)N1CCOCC1)C45
Fixed!:  CC(=O)C1=C(O)C=C2Oc3c4c(c(C)c(O)c3C24C(C)CC1=O)OC(C)(C)C(C(=O)N1CCOCC1)C

Unmatched close parenthesis
Orig:    Nc1ncc(-c2ccc(N3CCOC4CCNC4)cc3)cc2)C(=O)[nH]1
Fixed!:  N

1 branch has not been closed
Orig:    CC(O)(C#Cc1ccc2c(c1)C(O)(c1cn(C3CC(O)C(CO)O3)c(=O)[nH]c1=O)CC2(C)C
Fixed!:  CC

1 ring opening has not been closed
Orig:    CNCCCn1nc(COc2ccc3c4c(c(O)c(C)cc3c2=O)C(=O)N2CCCC2)c2c(N)ncnc21
Fixed!:  CNCCC

Aromatic system cannot be kekulized
Orig:    Cc1cc(NC(=O)NC(C)c2ccccc2)n2ccsc2n1
Fixed!:  C

Aromatic system cannot be kekulized
Orig:    COc1ccc2c(c1)CCc1c-2c2c(c3c1[nH]c1ccc(O)c1c3=O)C(=O)CCC2
Fixed!:  CO

Aromatic system cannot be kekulized
Orig:    Cn1ccc(Nc2cccc3ccccc23)c(N)n1
Fixed!:  C

Aromatic system cannot be kekulized
Orig:    CN1C(=O)c2c(n[nH]c2=O)Cc2ccccc21
Fixed!:  CN

2 ring openings have not been closed
Orig:    C=C1CCC2=CC(O)C3Oc4c(O)ccc5c4C34CCN(CC1C2C)C3O5
Fixed!:  C=C

Aromatic system cannot be kekulized
Orig:    O=C(O)c1c2c(c3cc(F)ccc3nc1-c1ccccc1)n(C)c(=O)n2C
Fixed!:  O=C(O)c1c2c(c3cc(F)ccc3nc1-c1ccccc1)n(C)c(=O)n2

As can be seen, the trimming can be rather drastic, but one could imagine more subtle methods. Errors about aromatic systems would disappear if Kekulé SMILES were used. If ring openings are left unclosed, one could either close them on the last atom or ignore the opening. Ditto for branches.
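As a rough sketch of what such gentler fixes might look like, the following pure-Python helpers close any branches left open (dropping unmatched close parentheses) and ignore ring openings that are never closed. The function names are my own invention for illustration, not part of any library. The helpers only edit the string: they do not check valence or aromaticity, and dropping a ring-bond digit that follows a bond symbol would change the meaning of that bond.

```python
import re

# Tokens: a whole bracket atom, a %nn ring bond, a single digit, or any other character.
_TOKEN = re.compile(r"\[[^\]]*\]|%\d{2}|\d|.", re.DOTALL)
_RING = re.compile(r"%\d{2}|\d")


def close_branches(smi):
    """Drop unmatched ')' and append one ')' for every branch left open."""
    out, depth = [], 0
    for tok in _TOKEN.findall(smi):
        if tok == "(":
            depth += 1
        elif tok == ")":
            if depth == 0:
                continue  # unmatched close parenthesis: ignore it
            depth -= 1
        out.append(tok)
    return "".join(out) + ")" * depth


def drop_open_rings(smi):
    """Remove ring-bond digits whose ring is never closed."""
    tokens = _TOKEN.findall(smi)
    open_at = {}  # ring label -> index of the token that opened it
    for idx, tok in enumerate(tokens):
        if _RING.fullmatch(tok):
            label = int(tok.lstrip("%"))
            if label in open_at:
                del open_at[label]  # this digit closes the ring
            else:
                open_at[label] = idx
    unmatched = set(open_at.values())
    return "".join(t for i, t in enumerate(tokens) if i not in unmatched)


if __name__ == "__main__":
    # One of the "2 ring openings have not been closed" failures from above
    print(drop_open_rings("OCC1CCC2(CC1)CC(Oc1ccccc1)C(C#N)[N+](=O1)[O-]"))
```

Whether the repaired string is then accepted still needs checking with a parser such as partialsmiles.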

If we look at SELFIES, we can see some of these approaches applied. Any symbols beyond an invalid valence are ignored ("[C][=O][...]" is treated as "[C][=O]"). If a branch is indicated but fewer atoms follow than the branch calls for, or the information is missing altogether, then the branch is assumed not to exist ("[C][Branch1][O]" is treated as "[C]"). Similarly for rings.

I was curious how often these post-hoc corrections occur for a typical dataset. Given the same RNN trained on SELFIES (v1.0.4), 10K SELFIES were sampled and I calculated the percentage of cases where removing the last symbol left the resultant SMILES unchanged (i.e. where at least the last SELFIES symbol is not being interpreted). This gave 10.5%.
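The script used for this count is not shown here, but the measurement can be sketched as follows. The function names are hypothetical, and the decoder is passed in as a parameter so that the sketch does not depend on any particular SELFIES version. With the selfies package installed, passing `selfies.decoder` as `decode` should give the kind of figure quoted above.

```python
import re


def split_symbols(selfies_string):
    """Split a SELFIES string into bracketed symbols (and '.' separators)."""
    return re.findall(r"\[[^\]]*\]|\.", selfies_string)


def fraction_last_symbol_ignored(samples, decode):
    """Fraction of samples whose decoded output is unchanged when the
    final symbol is removed, i.e. where the last symbol was not interpreted."""
    unchanged = 0
    total = 0
    for s in samples:
        symbols = split_symbols(s)
        if not symbols:
            continue  # nothing to remove
        total += 1
        truncated = "".join(symbols[:-1])
        if decode(truncated) == decode(s):
            unchanged += 1
    return unchanged / total if total else 0.0
```

Note that this only detects cases where the final symbol is ignored. Symbols ignored earlier in the string would need a different test.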

Returning to the SMILES failures above, where the NN cannot learn the syntax, does it make sense to apply post-hoc corrections to ensure validity? For example, if kekulized SMILES are used, then the errors about aromaticity will go away; if there is no change to the distribution of generated aromatic ring systems then this would be a win. The trimming method above is more problematic: two very similar representations could be interpreted as molecules of very different size (causing a discontinuity in the search/latent space). Indeed, given that all characters after an invalid state are ignored, the DNN could end up being trained on noise. A potentially better approach (although still symptomatic of a failure to adequately learn the syntax) would be to handle an uncommon valence or charge state by resampling from the generative distribution in the first place (the partialsmiles library might be helpful here as it can catch such failures as early as possible).
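To make the resampling idea concrete, here is a minimal sketch that builds a string token by token and redraws any token that would make the prefix invalid. Everything here is an assumption for illustration: `propose` stands in for the generative model's sampler, `is_valid` for a prefix checker, and `END` for the model's end-of-sequence token. The comment shows how partialsmiles might be plugged in as the checker.

```python
END = "<end>"  # hypothetical end-of-sequence token


def sample_with_resampling(propose, is_valid, max_len=200, max_tries=50):
    """Grow a string one token at a time, resampling rejected tokens.

    propose(prefix) -> next token, or END to stop (stand-in for the model).
    is_valid(s, complete) -> True if s is acceptable as a partial string
    (complete=False) or as a finished string (complete=True).
    Returns the finished string, or None if sampling gets stuck.
    """
    prefix = ""
    while len(prefix) < max_len:
        for _ in range(max_tries):
            token = propose(prefix)
            finished = token == END
            candidate = prefix if finished else prefix + token
            if is_valid(candidate, finished):
                break  # accept this token
        else:
            return None  # no acceptable token found within max_tries
        if finished:
            return prefix
        prefix = candidate
    return None


# Hypothetical adapter for partialsmiles (not executed here):
# import partialsmiles as ps
# def is_valid(smi, complete):
#     try:
#         ps.ParseSmiles(smi, partial=not complete)
#         return True
#     except (ps.SMILESSyntaxError, ps.ValenceError, ps.KekulizationFailure):
#         return False
```

Rejecting tokens in this way changes the distribution being sampled from, so the caveat above about masking a failure to learn the syntax still applies.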

Code

import partialsmiles as ps

if __name__ == "__main__":
    fname = "Regular_SMILES_1K.smi"

    failures = []
    with open(fname) as inp:
        for line in inp:
            smi = line.rstrip()

            ok = False
            msg = None
            try:
                mol = ps.ParseSmiles(smi, partial=False)
                ok = True
            except (ps.SMILESSyntaxError, ps.ValenceError,
                    ps.KekulizationFailure) as e:
                # Record the reason the full SMILES was rejected
                msg = e.message
            if not ok:
                failures.append( (smi, msg) )

    for smi, msg in failures:
        i = len(smi)
        ok = False
        while not ok and i >= 1:
            try:
                mol = ps.ParseSmiles(smi[:i], partial=False)
                ok = True
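            except (ps.SMILESSyntaxError, ps.ValenceError,
                    ps.KekulizationFailure):
                # Still invalid: trim one more character from the right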
                i -= 1
        # Invariant: Either ok or i == 0

        print(msg)
        print("Orig:   ", smi)
        print("Fixed!: ", smi[:i])
        print()

Credits

This blog post originates in a conversation with Janosch Menke, Michael Blakey and John Mayfield that started at the ICCS. Thanks also to Morgan Thomas for data and feedback. Image via Flickr by epictop10.com (licensed CC-BY 2.0).