Wednesday 25 April 2018

Cheminformatics for deep learners: Valid SMILES and valid molecules

Recent papers on generating SMILES strings via deep-learning approaches have awoken the cheminformatics pedant in me when I see references to "valid SMILES" and/or "valid molecules". Do these terms make any sense? Do they mean what the authors think they mean?

As a concrete example here's a line from Rafael Gómez-Bombarelli et al (a ground-breaking paper in the field, which appeared on Arxiv back in 2016): "This increased the accuracy of generated SMILES strings, which resulted in higher fractions of valid SMILES strings."

"Valid SMILES", eh? Here's a nice example of a valid SMILES: C#C(#C)(#C)(#C)(#C), a carbon connected via triple bonds to five other carbons. It's a syntactically valid SMILES string that is happily read by many toolkits; for example, we can use Open Babel to calculate the molecular weight of the corresponding molecule::
> obabel -:"C#C(#C)(#C)(#C)(#C)" -otxt --append mw
77.1039

An invalid SMILES might be missing a parenthesis or ring closure, or begin with a bond symbol, or contain the element Zz or any number of things. The problem here is that the terms "in/valid SMILES" are used by the authors above with some other meaning, presumably related to the likelihood of the existence of the corresponding molecule. As I hope I have demonstrated, the validity of a SMILES string has nothing to do with whether the corresponding molecule might exist or not.

What I'm really talking about here is the difference between syntax and semantics: the meaning of the SMILES string versus its symbolic construction. Elsewhere Rafael Gómez-Bombarelli et al refers to "the fragility of [SMILES] syntax (opening and closing cycles and branches, allowed valences, etc.)". They should have stopped at "branches" - the "allowed valences" (whatever this may mean - it's undefined) is nothing to do with the syntax of a SMILES string.

So maybe "valid molecule" is a better term for that? Or at least that seems to be what people think. But "valid molecule" is an even more nebulous term - what is a valid molecule? One that's not a radical? One that might exist at standard temperatures and pressures without decomposing in a millisecond? Who knows. I think that what people actually mean is that the atoms in the molecule are all in common valences and charge states, or perhaps they just mean that it is rejected by RDKit (which might be for a number of reasons unrelated to the validity of the molecule, e.g. kekulization failure). If that's what they mean, they should just say that.

So please, a bit more clarity and a bit less woolly language. Think of the children cheminformaticians.

No comments: