The key point (and difficulty) when dealing with ring bonds on such double bonds is that, since the ring bond appears twice in the SMILES string (at both the opening and closing), the stereo symbol can appear at either occurrence or indeed both. (I think this was a mistake in the SMILES specification, but there you go.) When writing a SMILES string, the preferred syntax shows the stereo symbol only at the end of the ring bond that sits on the double bond (Open Babel will only output this syntax).
The following structure will be used as an example:
So, using the preferred syntax, a SMILES string for the example structure would be:
(a) C/C=C\1/NC1
In other words, from carbon-3 it's down to the C of the ring closure, and up to the N of the ring, where up and down are relative to carbon-1.
There's no need to specify the stereochemistry of both groups on the right-hand side, of course. The following SMILES is equivalent to (a) although not so clear:
(b) C/C=C1/NC1
Coming back to SMILES (a), we could have written the stereo symbol at the ring closure, or indeed at both ends:
(c) C/C=C\1/NC/1
(d) C/C=C1/NC/1
Note that the symbol used for the ring opening is the opposite of that for the ring closure. The rationale for this is that from the point of view of carbon-4, carbon-3 is up (hence C/1), whereas from the point of view of carbon-3, carbon-4 is down (hence C\1). Whatever... just stick to form (a) and you won't need to think about this; the other forms are just a source of errors. (It would have been simpler for everyone if Daylight had only allowed the stereo symbol at the carbon on the double bond.)
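The opposite-symbol rule can be checked mechanically. Here's a minimal Python sketch that pairs each ring opening with its closure and reports the stereo symbol written at each end. It is plain string scanning, not how Open Babel does it; the names `ring_bond_symbols` and `symbols_agree` are made up for illustration, and it assumes every digit outside square brackets is a ring-closure number.

```python
import re

# Either a bracket atom (skipped, so digits inside [...] are ignored) or a
# ring-closure number with an optional / or \ written directly before it.
TOKEN = re.compile(r"\[[^\]]*\]|([/\\]?)(%\d\d|\d)")


def ring_bond_symbols(smiles):
    """Return a list of (ring number, opening symbol, closing symbol)."""
    open_rings = {}  # ring number -> symbol seen at the opening (or None)
    pairs = []
    for m in TOKEN.finditer(smiles):
        if m.group(2) is None:  # bracket atom, nothing to record
            continue
        symbol = m.group(1) or None
        number = int(m.group(2).lstrip("%"))
        if number in open_rings:
            pairs.append((number, open_rings.pop(number), symbol))
        else:
            open_rings[number] = symbol
    return pairs


def symbols_agree(opening, closing):
    """A missing symbol cannot disagree; two symbols agree only if opposite."""
    return opening is None or closing is None or opening != closing


# Raw strings, because "\1" is an escape sequence in a normal Python string.
for smi in (r"C/C=C\1/NC1", r"C/C=C\1/NC/1", r"C/C=C1/NC/1"):
    for number, opening, closing in ring_bond_symbols(smi):
        print(smi, number, opening, closing, symbols_agree(opening, closing))
```

Forms (a), (c) and (d) all pass this check, since wherever both ends carry a symbol the two are opposite.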
So much for valid SMILES. How should invalid SMILES be handled? Consider the following:
(e) C/C=C\1\NC1
Both the ring closure and the N are down...? I don't think so. This should be treated as undefined stereochemistry.
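To spot this kind of contradiction in a string, one rough approach is to look at the symbols written directly after a single atom: its ring-closure bonds and the bond to the next atom. If the same symbol turns up twice, two neighbours are being placed on the same side. The Python sketch below does just that. It is only an illustration (the name `same_side_conflicts` is hypothetical), it ignores branches and the bond written before the atom, and it doesn't check that the atom is actually on a double bond.

```python
import re

# A run of ring closures (each with an optional / or \) followed by an
# optional / or \ on the bond to the next atom.
AFTER_ATOM = re.compile(r"((?:[/\\]?(?:%\d\d|\d))+)([/\\]?)")


def same_side_conflicts(smiles):
    """Return the substrings where one atom has a repeated stereo symbol."""
    # Blank out bracket atoms so digits inside [...] are not misread.
    plain = re.sub(r"\[[^\]]*\]", "*", smiles)
    conflicts = []
    for m in AFTER_ATOM.finditer(plain):
        symbols = re.findall(r"[/\\]", m.group(0))
        if len(symbols) > len(set(symbols)):  # same symbol used twice
            conflicts.append(m.group(0))
    return conflicts


# Raw strings, because "\1" is an escape sequence in a normal Python string.
print(same_side_conflicts(r"C/C=C\1\NC1"))  # SMILES (e)
print(same_side_conflicts(r"C/C=C\1/NC1"))  # SMILES (a)
```

For SMILES (e) this flags the `\1\` after carbon-3; for (a) it finds nothing.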
How about the case where the two ring bonds have stereo symbols which are not in agreement?
(f) C/C=C\1NC\1
(g) C/C=C\1/NC\1
In both of these cases the stereochemistry for the ring bonds should be considered undefined. In Open Babel, I've chosen to handle these as follows:
(f) C/C=C\1NC\1 --> C/C=C1NC1 (ignore ring bond stereo)
--> CC=C1NC1 (undefined stereochemistry)
(g) C/C=C\1/NC\1 --> C/C=C1/NC1 (ignore ring bond stereo)
--> C/C=C\1/NC1 (defined stereochemistry)
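The first step above ("ignore ring bond stereo") can be mimicked at the string level. This is a minimal Python sketch of the idea, not Open Babel's actual code: the function name `drop_conflicting_ring_stereo` is made up, and it assumes every digit outside square brackets is a ring-closure number. It removes the stereo symbols from a ring bond only when the opening and closing carry the same symbol, i.e. when they disagree.

```python
import re

# Either a bracket atom (skipped) or a ring-closure number with an optional
# / or \ written directly before it.
TOKEN = re.compile(r"\[[^\]]*\]|([/\\]?)(%\d\d|\d)")


def drop_conflicting_ring_stereo(smiles):
    """Strip / and \\ from ring bonds whose two ends carry the same symbol."""
    open_rings = {}    # ring number -> regex match at the ring opening
    to_delete = set()  # string indices of the symbols to remove
    for m in TOKEN.finditer(smiles):
        if m.group(2) is None:  # bracket atom
            continue
        number = int(m.group(2).lstrip("%"))
        if number not in open_rings:
            open_rings[number] = m
            continue
        opening = open_rings.pop(number)
        # Valid redundant stereo uses opposite symbols, so equal means conflict.
        if opening.group(1) and opening.group(1) == m.group(1):
            to_delete.update((opening.start(1), m.start(1)))
    return "".join(ch for i, ch in enumerate(smiles) if i not in to_delete)


# Raw strings, because "\1" is an escape sequence in a normal Python string.
print(drop_conflicting_ring_stereo(r"C/C=C\1NC\1"))   # SMILES (f)
print(drop_conflicting_ring_stereo(r"C/C=C\1/NC\1"))  # SMILES (g)
```

This only covers the intermediate strings; deciding whether what's left still defines the double bond (as in (g)) or not (as in (f)) needs a real SMILES parser.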
Image credit: Kim+5
4 comments:
Those are some neat examples. As a bit of fun I've made my SMILES parser detect whether, in cases like these, the redundant stereochemistry is consistent or inconsistent, and accept or reject the SMILES accordingly.
Another case which you have not explicitly mentioned in the category of clearly bad SMILES could be something like C/C=C1\/NC1. (OpenBabel 2.3 picks / as being the answer.)
@Daniel: That's another valid option. Another option again, is to provide an option to enable that option. :-)
Thanks for the example of the bad SMILES. I must admit that until now I've focussed on parsing correct SMILES correctly - handling incorrect SMILES was way down my priority list. I'll follow this up - if you have any ideas on additional ones, leave more comments...
I should probably have been clearer. For OpenBabel, where one of the use cases is being confronted with huge lists of SMILES of unknown quality, your solution sounds like the better one.
Ideally, I guess an option (or options) for how strict SMILES parsing is would be best, so power users can curate their lists of SMILES.
Looking through my list of test cases from
https://bitbucket.org/dan2097/opsin/src/tip/core/src/test/java/uk/ac/cam/ch/wwmm/opsin/SMILESFragmentBuilderTest.java
I came across CC1=C/F.O\1 which appears to produce the wrong result in OpenBabel 2.3. Admittedly it sounds quite plausible that you may have fixed this case in your recent work.
You're right - the test case you provided is now working correctly. I took a look at your unit tests, and will see if there's anything obvious we're missing. And BTW, I am honoured to have some named after me :-)