Tuesday 12 July 2011

The alpha and omega of SMILES strings Part II

[Image: smiling tree]

In the previous post I described why it's useful to be able to generate SMILES strings starting and ending with particular atoms. This post describes how to do it.

To begin with, one way to do this is using ring closure notation, as Rajarshi and Andrew pointed out in replies to a question of mine at the Blue Obelisk Q&A early this year. Andrew went on to write a Python script that allowed complete reordering of all of the atoms of a molecule using this method.

I was hoping to find a more elegant method than using ring closures. Also, wholesale use of ring closures can cause problems with stereochemistry as this is a corner case that not all toolkits handle correctly. In any case, I kept thinking about this on and off, and eventually got around to trying some ideas out. In the end, the solution was easier than I'd thought.

A SMILES string is generated from a depth-first tree traversal of a graph. Every atom (except for the root) has a parent atom, and zero or more child atoms. Setting the start atom is trivial; just make that the root. It turns out that setting the end atom requires only two rules: (1) parenthesise all of the child trees of the end atom, and all but the last child tree of every other atom (the latter should be the default in any case); (2) visit child trees that have no route (through 'unvisited' atoms) to the end atom first.
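The two rules can be sketched in a few lines of Python. This is only a toy illustration, not the actual implementation: it assumes an acyclic molecule with single bonds only (no ring closures, aromaticity or stereochemistry), and the function name `smiles_with_ends` and the adjacency-list representation are made up for the example.

```python
def smiles_with_ends(symbols, neighbours, start, end):
    """Write a SMILES-like string that starts at `start` and ends at `end`.

    symbols    -- list of atom symbols, indexed by atom number
    neighbours -- dict mapping atom number to a list of bonded atom numbers
    Toy sketch: acyclic graphs and single bonds only.
    """
    visited = set()

    def reaches_end(atom):
        # Is there a route from `atom` to the end atom through unvisited atoms?
        seen = set(visited)
        stack = [atom]
        while stack:
            current = stack.pop()
            if current == end:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(n for n in neighbours[current] if n not in seen)
        return False

    def visit(atom):
        visited.add(atom)
        children = [n for n in neighbours[atom] if n not in visited]
        # Rule 2: child trees with no route to the end atom come first
        # (False sorts before True, and the sort is stable).
        children.sort(key=reaches_end)
        out = symbols[atom]
        for i, child in enumerate(children):
            subtree = visit(child)
            is_last = i == len(children) - 1
            # Rule 1: parenthesise every child tree of the end atom,
            # and all but the last child tree of any other atom.
            if atom == end or not is_last:
                out += "(" + subtree + ")"
            else:
                out += subtree
        return out

    return visit(start)


# Hypothetical test molecule: C0-C1(-O2)-C3-N4
symbols = ["C", "C", "O", "C", "N"]
neighbours = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3]}

print(smiles_with_ends(symbols, neighbours, 0, 2))  # CC(CN)O
print(smiles_with_ends(symbols, neighbours, 0, 1))  # CC(O)(CN)
```

In the second call the end atom is in the middle of the molecule, so all of its child trees are wrapped in parentheses; anything appended to the string then bonds to that atom.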

That's it... except for the corner cases. Click Comp Chem involves replacing an implicit H of the start and end atoms; if there is an explicit bracketed H present, then it needs to be removed to free up the valence. For example, if the end atom is [nH] or [C@@H] then the H needs to go. Otherwise you get oddities like the following 5-coordinate C:
C[C@@H](Br)(Cl) + I gives C[C@@H](Br)(Cl)I
Note however that this may result in a SMILES string that does not accurately represent the original structure (but that's not the point of the exercise) e.g.
c1c[nH]cc1 -> c1c[n](cc1)
What about the stereochemistry? Well, I toss that out too at the moment; an alternative would be to allow the user to specify the resulting stereochemistry at the start and end atoms.
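As a rough illustration of the clean-up described above, here is a sketch that strips the explicit H count and the chirality mark from a single bracket-atom token. It is not the code actually used: the function name `strip_h_and_stereo` is hypothetical, and the regular expression is a deliberate simplification that only understands @ and @@ chirality and leaves everything else in the bracket (charge, isotope) alone.

```python
import re

# Simplified bracket-atom pattern: isotope, symbol, @/@@, H count, the rest.
_BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d*)"
    r"(?P<symbol>[A-Z][a-z]?|[a-z]{1,2})"
    r"(?P<chiral>@{0,2})"
    r"(?P<hcount>H\d*)?"
    r"(?P<rest>[^\]]*)\]"
)


def strip_h_and_stereo(token):
    """Drop the explicit H count and @/@@ mark from a bracket atom token.

    Tokens that are not bracket atoms (e.g. "C") are returned unchanged.
    """
    match = _BRACKET_ATOM.fullmatch(token)
    if match is None:
        return token
    return "[" + match["isotope"] + match["symbol"] + match["rest"] + "]"


print(strip_h_and_stereo("[nH]"))    # [n]
print(strip_h_and_stereo("[C@@H]"))  # [C]
```

Applied to the end atom only, this gives the [n] seen in the pyrrole example above, and it turns the 5-coordinate C[C@@H](Br)(Cl)I into C[C](Br)(Cl)I.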

Image credit: crashoverreason
