Tuesday, 2 October 2018

DeepSMILES - a SMILES-like syntax for use in machine-learning

At the last ICCS, Andrew Dalke and I got talking. We had both been around the poster session, and had heard first-hand the problems of "invalid SMILES" that were affecting the generation of SMILES strings using deep neural networks. I remember Xuhan Liu in particular asking me whether I had a solution to this. Well, I had been turning over an idea in my mind for a little while, and by the time I got talking to Andrew, I had decided to do something about it. Now Andrew knows a thing or two about SMILES himself, so I shouldn't have been surprised that he had also come up with something (along with two references from 1964 to back it up). We put the two ideas together and called it DeepSMILES.

The problematic areas of SMILES syntax involve paired ring closure symbols for cycles, and parentheses for branches. These particular aspects of the syntax are difficult to reproduce correctly when generating SMILES strings using machine-learning methods, and so a certain percentage of generated SMILES tend to fail basic syntax checks. While there have been a variety of approaches aimed at improving the SMILES generation (with quite some success), it is reasonable to assume that the syntax also causes difficulties during the learning phase.

Our approach is not to use SMILES, but instead an alternative syntax that does not have these problems. Paired ring closures are replaced by a single digit, the ring size; paired parentheses are replaced by close parentheses indicating branch size. See the talk below, the preprint, or the GitHub site for more information. Feedback (positive or negative) is welcome, either here or on GitHub.

No comments: