Monday 6 February 2023

Smarts for Dummies Part II

In Part I, I described 'Smarter SMARTS', a Python webapp to help you learn to write accurate SMARTS patterns. With Roger's permission, the code for that webapp is now on GitHub. In brief, you enter a SMARTS pattern and you see unique matches (by atomtypes) sorted to show the most unusual first (see Part I for an example).

With the release of the code, I wanted to make a public version for people to try. And so here it is. In contrast to the original, this version doesn't use Python but is entirely in JavaScript (code here).

Implementation Notes

With the move to JavaScript, things slowed down and so to keep the app responsive I turned to web workers. This was my first time using them, but it was surprisingly straightforward, although of course there is a cost in complexity.

I took the opportunity of this webapp to provide support for as many open source toolkits as I could use from JavaScript. Chen Jiang has compiled both Open Babel and Indigo to JavaScript (as part of cheminfo-to-web). There was an additional class I needed for OB, and a tweak required for Indigo, but he sorted it out. Similar work has been done for RDKit by Paolo Tosco but the classes/methods I need are not yet available. OpenChemLib (the library behind DataWarrior) has been available in JavaScript for some time, and I'm hopeful the classes I need will soon become available.

There were some bumps in the Indigo implementation. I found that the equivalent of getValence() did not accurately return the valence, and so I fell back to the much slower approach of summing up the bond orders and adding the implicit Hs. On a similar note, while there is a bond.isInRing() equivalent, there is no counterpart for atoms. This is an oversight in the toolkit, as it's a key property - again, I just calculated it myself. And finally, Indigo does not support lowercase 'h' in SMARTS - this is documented; you need to use uppercase 'H'.
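For the curious, here is a rough sketch of the two workarounds: valence as the sum of bond orders plus implicit Hs, and atom ring membership inferred from ring bonds. It does not use the Indigo API. The molecule is a made-up plain object (the field names implicitH, begin, end, order and inRing are my own invention for illustration), and it assumes Kekulé-style integer bond orders rather than aromatic bonds.

```javascript
// Hypothetical plain-object molecule, NOT any toolkit's real API:
//   atoms: [{ implicitH: number }]
//   bonds: [{ begin: number, end: number, order: number, inRing: boolean }]
// Assumes integer (Kekule) bond orders; aromatic bonds would need extra handling.
function atomProperties(mol) {
  // Start each atom's valence at its implicit hydrogen count
  const props = mol.atoms.map((atom) => ({ valence: atom.implicitH, inRing: false }));
  for (const bond of mol.bonds) {
    for (const idx of [bond.begin, bond.end]) {
      // Each bond contributes its order to both of its end atoms
      props[idx].valence += bond.order;
      // An atom is in a ring if at least one of its bonds is a ring bond
      if (bond.inRing) props[idx].inRing = true;
    }
  }
  return props;
}

// Acetic acid, CC(=O)O
const aceticAcid = {
  atoms: [{ implicitH: 3 }, { implicitH: 0 }, { implicitH: 0 }, { implicitH: 1 }],
  bonds: [
    { begin: 0, end: 1, order: 1, inRing: false },
    { begin: 1, end: 2, order: 2, inRing: false },
    { begin: 1, end: 3, order: 1, inRing: false },
  ],
};
const aceticProps = atomProperties(aceticAcid);
console.log(aceticProps.map((p) => p.valence)); // [ 4, 4, 2, 2 ]
```

Both properties fall out of a single pass over the bonds, so the cost per molecule is linear in the number of bonds.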

The depiction is done by John Mayfield's public CDK Depict instance. Thanks John! To distinguish the first atom of the SMARTS match from the others in the depiction, I needed to use atom maps. The only problem is that Indigo does not seem to support atom maps; or rather I couldn't get them to work. OB does support them, but that wasn't wrapped. So I used the usual hack for this situation, which is to use isotopes instead, write out the SMILES, and then string-edit it into the atom map version. This is all fine until the molecule already contains an isotope, but hey.
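The string-editing step of that hack might look something like the sketch below. This is an illustration rather than the code in the app: the function name isotopesToAtomMaps is made up, and it assumes that the only isotopes in the SMILES are the marker values set on the matched atoms (say, 1 for the first atom and 2 for the rest) and that no bracket atom already carries an atom map.

```javascript
// Hypothetical helper: move a leading isotope in each bracket atom to the
// atom-map position, e.g. "[1CH3]" -> "[CH3:1]".
// Assumes the isotope values are markers set by us, and that no bracket
// atom already has an atom map.
function isotopesToAtomMaps(smiles) {
  // \[(\d+)    a bracket atom that starts with an isotope number
  // ([^\]]*)\] the rest of the bracket atom (symbol, chirality, H count, charge)
  return smiles.replace(/\[(\d+)([^\]]*)\]/g, '[$2:$1]');
}

const marked = '[1CH3][2C](=O)O';
const mapped = isotopesToAtomMaps(marked);
console.log(mapped); // [CH3:1][C:2](=O)O

// The known limitation: a genuine isotope is also turned into an atom map
console.log(isotopesToAtomMaps('[13CH4]')); // [CH4:13]
```

Bracket atoms without an isotope, such as [NH4+], are left alone because the pattern requires a digit immediately after the opening bracket.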

The dataset is the 100K smallest molecules in ChEMBL (as of 2018), a number chosen to return results very quickly in the Python version. 100K is not really sufficient to do this problem justice especially once the SMARTS pattern becomes larger. This is why I describe this as a tool to help learn to write SMARTS patterns, rather than a tool to help write them. Not to knock it too much, but if you were to scale up from 100K to 100M and bung in a faster SMARTS search (e.g. Arthor), I think it'd be a more capable tool for someone developing SMARTS patterns.