The performance of generative models based on SMILES is much improved compared to the early days when only 5-20% of the SMILES generated would be considered valid. In the previous post, an RNN was able to achieve ~96% valid SMILES. Is it possible to increase this still further via post-hoc correction, and if so, is this a good idea?
As a strawman, let's consider a simple approach to ensure that a formerly invalid SMILES becomes valid, which is to trim back the SMILES from the right until only a valid SMILES remains (code at end). Let's look at the results for the failures reported in the previous blog post:
Aromatic system cannot be kekulized
Orig: Cc1c(NS(=O)(=O)c2cccc3cccnc23)c(=O)n(-c2ccccc2Cl)c1C#N
Fixed!: C

Unmatched close parenthesis
Orig: c1cn(-c2ccc3nc(-c4ccc5c(c4)OCO5)nc4ccccc34)[nH]c2=O)cn1
Fixed!:

1 branch has not been closed
Orig: O=C(NC(C(=O)Nc1cc(Cc2n[nH]c(=O)c3ccccc23)ccc1F)C1CC1
Fixed!: O=C

2 ring openings have not been closed
Orig: OCC1CCC2(CC1)CC(Oc1ccccc1)C(C#N)[N+](=O1)[O-]
Fixed!: OCC

1 ring opening has not been closed
Orig: CN1C(=O)NC2(CC2(Cl)Br)c1ccccc1
Fixed!: CN1C(=O)NC2(CC2(Cl)Br)c1ccccc

Aromatic system cannot be kekulized
Orig: COc1cc(OC(F)F)c(C(=O)NCc2ccn(C)c(=O)c2)c(-c2ccc(N(C)S)cc2)cc1OC
Fixed!: CO

Aromatic system cannot be kekulized
Orig: COc1ccc2c(c1)c(Cc1ccc(F)cc1)c(C)[n+]2[O-]
Fixed!: CO

1 ring opening has not been closed
Orig: COc1cccc(Oc2ccc3c(c2)C(=O)c2c4c(ccc2CCO)OCO3)c1
Fixed!: CO

Aromatic system cannot be kekulized
Orig: O=c1c(Cn2cc(-c3ccccc3)nn2CC2CCCO2)ccn1Cc1ccccc1F
Fixed!: O

Aromatic system cannot be kekulized
Orig: Cc1c(Cc2cccc(-c3cc(=O)n(O)c(=O)s3)c2)coc2cc(O)cc(O)c12
Fixed!: C

Aromatic system cannot be kekulized
Orig: CC(=O)c1sc2c(c1NC(=O)c1ccc(O)c(OC)c1)c(C)n2C(C)C
Fixed!: CC(=O)c1sc2c(c1NC(=O)c1ccc(O)c(OC)c1)c(C)n2

1 ring opening has not been closed
Orig: O=C(Cn1cc([N+](=O)[O-])cn1)N1C2CCC1CC(n1c(=O)c(C(=O)O)nc3cccc1c32)OCOC4
Fixed!: O=C(Cn1cc([N+](=O)[O-])cn1)N1C2CCC1CC(n1c(=O)c(C(=O)O)nc3cccc1c32)OCOC

Aromatic system cannot be kekulized
Orig: CC(C)c1cccc(-c2nc3nccc(C(F)(F)F)n3c(=O)[nH]2)c1
Fixed!: CC

Aromatic system cannot be kekulized
Orig: CCOc1c(S(=O)(=O)Nc2ccc(F)cc2)c(O)c2ccccc2n1C
Fixed!: CCOc1c(S(=O)(=O)Nc2ccc(F)cc2)c(O)c2ccccc2n1

Aromatic system cannot be kekulized
Orig: CC(C)(C)c1ccc(-c2noc(O)n2N)cc1
Fixed!: CC

1 branch has not been closed
Orig: Cc1ccc(CN2CCC(C3CC3C(=O)Nc3ccc(S(=O)(=O)N4CCC4)cc32)cc1-n1cnnn1
Fixed!: C

2 ring openings have not been closed
Orig: c1ccc(Cn2cnc3c(OCC4CCCN5)nccc32)cc1
Fixed!:

2 ring openings have not been closed
Orig: CCc1ncc(C(=O)Nc2cc(C3C4CN(Cc5cnn(C)c5C)CCOCC53)c(O)cn2)cn1
Fixed!: CC

Aromatic system cannot be kekulized
Orig: C#Cc1ccc2nc3ccc(C(=O)NCCN(C)C)cc3oc2c1
Fixed!: C#C

Aromatic system cannot be kekulized
Orig: O=C(c1nc2cc3ccccc3sc2c(=O)[nH]1)N1CCCc2ccccc21
Fixed!: O=C

Unmatched close parenthesis
Orig: CC1CN2c3c(nc4c(C(=O)O)nn(-c5cc(F)ccc5F)c4F)n3CC12)OCCO4
Fixed!: CC

3 ring openings have not been closed
Orig: COC1CCCc2c1nc1c(c2N)CCc2ccccc2-23
Fixed!: COC

Uncommon valence or charge state
Orig: CCOC(=O)C1=C(C)N=C2SC(=C[N+](=O)[O-])NC2=OC1=O
Fixed!: CCOC(=O)C

Aromatic system cannot be kekulized
Orig: CN(C)CCn1c(=C2C(=O)Nc3ccc(Br)cc32)c(O)c2ccccc21
Fixed!: CN(C)CC

Aromatic system cannot be kekulized
Orig: Oc1ccc(-c2ccc3c(O)c(OC4OCCCO4)c(Cl)c3c2)cc1
Fixed!: O

Cannot have a bond opening and closing on the same atom
Orig: Cc1ccc2[nH]c(C(=O)N3CC44C=CC(CC3)c3ccccc34)cc2n1
Fixed!: C

1 ring opening has not been closed
Orig: O=C(c1ccccc1)c1cc2c3c(c(OC4OC(CO)C(O)C(O)C4O)c(=O)cc3c(=O)n12)CCCCC3
Fixed!: O=C(c1ccccc1)c1cc2c3c(c(OC4OC(CO)C(O)C(O)C4O)c(=O)cc3c(=O)n12)CCCCC

Aromatic system cannot be kekulized
Orig: CCc1cnc(CN(C)CCCn2c(=O)c3cc(F)ccc3n2C)n1
Fixed!: CC

1 ring opening has not been closed
Orig: CNC(=O)c1cc2[nH]c(=O)c3c(n2n1)CC1CC3C(C)C(O)CCC12C
Fixed!: CNC(=O)c1cc2[nH]c(=O)c3c(n2n1)CC1CC3C(C)C(O)CCC1

Aromatic system cannot be kekulized
Orig: O=[N+]([O-])c1ccc2c(c1)SC(c1nc(Nc3ccccc3[N+](=O)[O-])[nH]c1=O)O2
Fixed!: O

Aromatic system cannot be kekulized
Orig: COC1COC(c2c(NC(c3ncccc3Br)c3ccccc3)[nH]n2)C1
Fixed!: COC

Unmatched close parenthesis
Orig: O=C(C1CCN(c2cc(N3CCC(n4c(=O)[nH]nc(-c5ccccc5)C4)cc3)OC4)[nH]c2=O)CC1)N1CCCC1
Fixed!: O=C

Aromatic system cannot be kekulized
Orig: O=C1COc2c(NC3CCCC3)c(F)cc3c(=O)c(CCc4nc5ccccc5[nH]4)sc(=O)n1c23
Fixed!: O=C

2 ring openings have not been closed
Orig: CC(=O)C1=C(O)C=C2Oc3c4c(c(C)c(O)c3C24C(C)CC1=O)OC(C)(C)C(C(=O)N1CCOCC1)C45
Fixed!: CC(=O)C1=C(O)C=C2Oc3c4c(c(C)c(O)c3C24C(C)CC1=O)OC(C)(C)C(C(=O)N1CCOCC1)C

Unmatched close parenthesis
Orig: Nc1ncc(-c2ccc(N3CCOC4CCNC4)cc3)cc2)C(=O)[nH]1
Fixed!: N

1 branch has not been closed
Orig: CC(O)(C#Cc1ccc2c(c1)C(O)(c1cn(C3CC(O)C(CO)O3)c(=O)[nH]c1=O)CC2(C)C
Fixed!: CC

1 ring opening has not been closed
Orig: CNCCCn1nc(COc2ccc3c4c(c(O)c(C)cc3c2=O)C(=O)N2CCCC2)c2c(N)ncnc21
Fixed!: CNCCC

Aromatic system cannot be kekulized
Orig: Cc1cc(NC(=O)NC(C)c2ccccc2)n2ccsc2n1
Fixed!: C

Aromatic system cannot be kekulized
Orig: COc1ccc2c(c1)CCc1c-2c2c(c3c1[nH]c1ccc(O)c1c3=O)C(=O)CCC2
Fixed!: CO

Aromatic system cannot be kekulized
Orig: Cn1ccc(Nc2cccc3ccccc23)c(N)n1
Fixed!: C

Aromatic system cannot be kekulized
Orig: CN1C(=O)c2c(n[nH]c2=O)Cc2ccccc21
Fixed!: CN

2 ring openings have not been closed
Orig: C=C1CCC2=CC(O)C3Oc4c(O)ccc5c4C34CCN(CC1C2C)C3O5
Fixed!: C=C

Aromatic system cannot be kekulized
Orig: O=C(O)c1c2c(c3cc(F)ccc3nc1-c1ccccc1)n(C)c(=O)n2C
Fixed!: O=C(O)c1c2c(c3cc(F)ccc3nc1-c1ccccc1)n(C)c(=O)n2
As can be seen, the trimming can be rather drastic. But one could imagine more subtle methods. Errors about aromatic systems would disappear if Kekulé SMILES were used. If ring openings were found not to be closed, one could either close them on the last atom or ignore the opening. Ditto for branches.
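As a rough illustration of the gentler fixes just described, here is a minimal sketch in plain Python. The function names `close_branches` and `drop_unclosed_rings` are hypothetical and not part of any library. The sketch works purely on the text: it does not check valence or aromaticity, treats "1" and "%01" as different ring labels, and does nothing about an unmatched close parenthesis, so the result is not guaranteed to be a valid SMILES.

```python
import re

# Tokens: bracket atoms, two-digit ring closures (%nn), then any single character.
_TOKEN = re.compile(r"\[[^\]]*\]|%\d\d|.")
_BONDS = set("-=#$:/\\")


def close_branches(smi):
    """Append a ')' for every branch still open at the end of the string."""
    depth = 0
    for tok in _TOKEN.findall(smi):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
    # A negative depth (unmatched close parenthesis) is left untouched here.
    return smi + ")" * max(depth, 0)


def drop_unclosed_rings(smi):
    """Ignore ring openings that are never closed by deleting their labels."""
    tokens = _TOKEN.findall(smi)
    open_at = {}  # ring label -> index of the token that opened it
    for i, tok in enumerate(tokens):
        if tok.isdigit() or tok.startswith("%"):
            if tok in open_at:
                del open_at[tok]   # second occurrence closes the ring
            else:
                open_at[tok] = i   # first occurrence opens it
    drop = set(open_at.values())
    # Also drop a bond symbol that directly precedes a removed label ("C=2").
    for i in list(drop):
        if i > 0 and tokens[i - 1] in _BONDS:
            drop.add(i - 1)
    return "".join(t for i, t in enumerate(tokens) if i not in drop)
```

Applied to one of the failures above, `drop_unclosed_rings("CN1C(=O)NC2(CC2(Cl)Br)c1ccccc1")` removes only the final, dangling ring label rather than discarding most of the molecule.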
If we look at SELFIES, we can see some of these approaches applied. Any characters beyond an invalid valence are ignored ("[C][=O][...]" is treated as "[C][=O]"). If a branch is indicated but fewer atoms follow than it specifies, or information is missing, then the branch is assumed not to exist ("[C][Branch1][O]" is treated as "[C]"). Similarly for rings.
I was curious how often these post-hoc corrections occur for a typical dataset. Given the same RNN trained on SELFIES (v1.0.4), 10K SELFIES were sampled and I calculated the % of times where removing the last symbol left the resultant SMILES unchanged (i.e. where at least the last SELFIES symbol is not being interpreted). This gave 10.5%.
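The measurement described above can be sketched roughly as follows. This is not the script that produced the 10.5% figure: the function name `fraction_last_symbol_ignored` is hypothetical, and the decoder is passed in as a parameter so that the sketch does not depend on any particular SELFIES version.

```python
import re

# One SELFIES symbol, e.g. "[C]" or "[Branch1]".
_SYMBOL = re.compile(r"\[[^\]]*\]")


def fraction_last_symbol_ignored(selfies_strings, decode):
    """Fraction of strings whose decoded form is unchanged by dropping the last symbol.

    `decode` is any callable mapping a SELFIES string to a SMILES string.
    """
    total = 0
    unchanged = 0
    for s in selfies_strings:
        symbols = _SYMBOL.findall(s)
        if not symbols:
            continue  # nothing to remove
        total += 1
        full = decode("".join(symbols))
        trimmed = decode("".join(symbols[:-1]))
        if full == trimmed:
            unchanged += 1
    return unchanged / total if total else 0.0
```

With the selfies package installed, passing its `decoder` function as `decode` should give this kind of measurement, though the exact figure will depend on the package version and on the sampled strings.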
Returning to the SMILES failures above, where the NN cannot learn the syntax, does it make sense to apply post-hoc corrections to ensure validity? For example, if kekulized SMILES are used, then the errors about aromaticity will go away; if there is no change to the distribution of generated aromatic ring systems then this would be a win. The trimming method above is more problematic: two very similar representations could be interpreted as molecules of very different size (causing a discontinuity in the search/latent space). Indeed, given that all characters after an invalid state are ignored, the DNN could end up being trained on noise. A potentially better approach (although still symptomatic of a failure to adequately learn the syntax) would be to handle an uncommon valence or charge state by resampling from the generative distribution in the first place (the partialsmiles library might be helpful here as it can catch such failures as early as possible).
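One way to picture the resampling idea is a token-by-token loop that rejects any token making the prefix invalid and draws again. The sketch below is only an assumption about how this might be wired up: `sample_valid`, `sample_token` and `check` are hypothetical names, and the generative model and the validity checker are both passed in as callables. Note that naively resampling a single token changes the distribution being sampled from.

```python
def sample_valid(sample_token, check, end_token="\n", max_tokens=200, max_retries=50):
    """Build a string token by token, resampling tokens that break validity.

    sample_token(prefix) -> next token proposed by the generative model.
    check(s, partial)    -> True if s is acceptable as a prefix (partial=True)
                            or as a complete string (partial=False).
    Returns the finished string, or None if no acceptable token was found.
    """
    smi = ""
    for _ in range(max_tokens):
        for _ in range(max_retries):
            tok = sample_token(smi)
            if tok == end_token:
                # Only stop if the string is valid as a whole.
                if check(smi, False):
                    return smi
            elif check(smi + tok, True):
                smi += tok
                break  # token accepted, move on to the next position
        else:
            return None  # every retry at this position was rejected
    return None  # length limit reached without a valid end
```

For SMILES, `check` could wrap `ps.ParseSmiles(s, partial=partial)` from partialsmiles and return False when one of the three exceptions caught in the code at the end of this post is raised.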
Code
import partialsmiles as ps

# The three failure types reported by partialsmiles.
PARSE_ERRORS = (ps.SMILESSyntaxError, ps.ValenceError, ps.KekulizationFailure)

if __name__ == "__main__":
    fname = "Regular_SMILES_1K.smi"

    # Collect every SMILES that fails to parse, along with the error message.
    failures = []
    with open(fname) as inp:
        for line in inp:
            smi = line.rstrip()
            try:
                ps.ParseSmiles(smi, partial=False)
            except PARSE_ERRORS as e:
                failures.append((smi, e.message))

    # Trim each failure from the right until the remaining prefix parses.
    for smi, msg in failures:
        i = len(smi)
        ok = False
        while not ok and i >= 1:
            try:
                ps.ParseSmiles(smi[:i], partial=False)
                ok = True
            except Exception:
                i -= 1
        # Invariant: Either ok or i == 0
        print(msg)
        print("Orig: ", smi)
        print("Fixed!: ", smi[:i])
        print()
Credits
This blog post originates in a conversation with Janosch Menke, Michael Blakey and John Mayfield that started at the ICCS. Thanks also to Morgan Thomas for data and feedback. Image via Flickr by epictop10.com (licensed CC-BY 2.0).
Would DeepSMILES be a better solution to this problem than SELFIES? Thanks
Stay tuned for the next blogpost.