Sunday, 25 November 2018

SMARTS for dummies

One of Roger's observations came to mind recently when I was looking at some SMARTS patterns in Open Babel that were causing some problems: "People cannot write SMARTS correctly." Instead, he argues, they should use tools to create SMARTS for them. The problem is not so much that patterns don't match molecules of interest, but that they match additional molecules (false positives) and may miss others (false negatives). False positives are a particular problem when transformations (e.g. SMIRKS) are applied to matching molecules, as they can result in exotic valence and charge states that were unintended.

The following is a description of a tool I put together to help craft SMARTS patterns. It is based upon the idea that a SMARTS pattern is completely defined by the molecules it matches. Thus, by looking at the matches in a database of sufficient size, we can see whether there are any mismatches.

So far, so obvious. What's novel is that the tool only returns distinct matches based upon simple atom types of the matching atoms, and sorts them by probability of those atom types (most unusual first). Thus, even for common substructures the number of hits is small and information-rich. Furthermore, by using my favourite trick of sorting the reference database by size (ChEMBL here), this means that the first hit found for each distinct match is likely to be small in size.

An example will illustrate: let's suppose we want to create a SMARTS pattern to match alcohols. I know that "CO" in SMILES indicates this, and so let's start with that. Only 28 hits are returned (from a 10% subset of ChEMBL), of which the first three are:
Ok, obviously we don't want a charged oxygen, so let's try "C[O+0]" and look at the first three hits:
And the oxygen should have a hydrogen, "C[Oh1+0]":
Now, about that carbon. I only had sp3-hybridised carbons in mind, so how about "[CX4][Oh1+0]"? Here is the full list of hits at this point:
At this point the hits are what I expected to see. Does this mean I should stop here? No, but it's a reasonable start.

After using this tool for a while, you begin to appreciate the importance of nailing down exactly what you want to see: if you only want to match neutral carbons, you should say so; if you don't want the oxygen to have additional neighbours, you should say so. Once every degree of freedom is nailed down...then it's finished. You might think this is going too far, but the molecules that exist in databases don't necessarily conform to typical valence rules - they may be unusual (consider TEMPO) or downright incorrect. Your SMARTS pattern should match exactly what you had in mind and not leave any wiggle room to match anything more: "[Cv4X4+0]-[OD1h1+0]"

This tool is by no means a panacea; you still need to think for yourself, and consider for example what to do about aromaticity, or the presence of explicit hydrogens or hydrogen isotopes, or whether to reorder the pattern for speed. But one could imagine a tool that could ease the burden further and do some of this for the user, or for example allow the user to mark "match" or "don't match" for the set of hits returned to automatically refine the query.

6 comments:

Ed Griffen said...

Nice post Noel, building SMARTS is grim and always represents a problem. I think it's and example of the Qualification problem(https://en.wikipedia.org/wiki/Qualification_problem) in that as you specify the query, it throws up molecules you don't want in your matching set, so you refine your query. You have to expand your testing set and there's no definitive way of showing your SMARTS is right just progressively less wrong.

An interesting approach from Matthias Rarey's Group is here: https://pubs.acs.org/doi/10.1021/acs.jcim.5b00323, I'm sure there's more to do.

Noel O'Boyle said...

Great links. And the automatic generation of the SMARTS patterns by Rarey is very interesting.

Caitlin said...

Hi Noel,
Thanks for an interesting post. I've been working on more automatic SMARTS creation as a part of the Open Force Field Initiative (openforcefield.org). I was wondering if your tool is available for use and if you could provide a link?

Unknown said...

It's definitely important to be careful about the patterns one is writing, but I think the use case still has to be considered when writing very strict patterns e.g. do you actually need to validate that the database's valence-bond conventions meet with your expectations? Or perhaps another way of putting it, if you don't want molecules with "bad valency" one could pass input molecules through an automated valency check rather than essentially encoding valency rules into every SMARTS pattern.

What I'm getting at specifically is whether if one locks down the connectivity and atoms whether locking down the charge actually gains you much, considering the drawback that it can exclude unanticipated resonance forms. A simple example of this is the nitro group, can one write something very simple like [NX3](~[OX1])~[OX1] and tolerate structures with mistakes e.g. -[N](=O)[O-], while "correctly" handling this necessitates something like [$([NX3](=O)=O),$([NX3+](=O)[O-])]

Noel O'Boyle said...

@Caitlin: It's not currently available, but I'm happy to discuss by email if any details are not clear.

@Unknown: This argument is suspiciously similar to one from a former colleague. The first paragraph describes SMARTS that are only valid for molecules preprocessed in a certain way. And that's fine, if documented. For example, it's common to require hydrogens to be suppressed. The drawback is that such patterns will not be transferable to another database.

Regarding the second paragraph, of course there's some merit in this. And you didn't mention tautomers also. It's the old precision versus recall argument. Depending on your application you either want to cast the net wide, or you want it precisely nailed down. In the case of performing a transformation, it's the latter. I don't like arguments based on a N=1 example, but the patterns you describe actually illustrate my point. The loose pattern hits CHEMBL1741048 (ON=[N+]([O-])[O-]) but the tight ones don't.

The tautomer and resonance form problem is not well handled by an explicit valence bond representation, but isn't the solution rather to do like InChI and require a certain total number of hydrogens and bond orders applied to certain atoms and bonds? This would be an extension to SMARTS of course.


Dan2097 said...

I omitted mentioning tautomers as I agree that it is not possible to precisely handle multiple tautomeric forms without resorting to recursive SMARTS or multiple patterns. An extension along the lines you mentioned would be nice, but I expect it would be difficult to implement in such a way that didn't further increase the burden on a human attempting to write such patterns.
To be pedantic a SMARTS match for a nitro should probably require the next atom to not be nitrogen, the strict pattern will still match Dinitrogen trioxide which is also a dubious hit.