Monday, 17 September 2012

A bit of a SMILES - Canonical fragments

A well-hidden feature of OB's SMILES writer is support for writing SMILES strings that represent fragments of a molecule. For example, if we read the SMILES string "CC(=O)Cl" but on writing specify the fragment containing the first two atoms, we get just "CC".

In OB 2.3.2 (coming soon), this can be done with the "F" SMILES output option:
obabel -:"CC(=O)Cl" -osmi -xF "1 2"
...but is a bit more awkward with OB 2.3.1:
obabel -:"CC(=O)Cl" -osmi --property SMILES_Fragment "[ 1 2 ]"
If you specify atoms that are not connected, you get a dot-disconnected representation:
> obabel -:"CC(=O)Cl" -osmi -xF "1 4"
C.Cl
So far that's pretty much as expected. But now, let's push it a bit. How about fragments that involve an "aromatic" atom?
> obabel -:"c1ccccc1F" -osmi -xF "6 7"
cF
Hmm...interesting. Clearly this isn't a valid SMILES string. In fact, none of these "fragment SMILES" are proper SMILES strings. Some of them, like "CC", happen to be valid SMILES, but read as SMILES they mean something different: "CC" is ethane, not the two-carbon portion of acetyl chloride. In short, the SMILES format does not support fragments.

So what's the point of these? Well, let's consider the canonicalised version, e.g.
> obabel -:"O=C(Cl)C" --property SMILES_Fragment "[ 1 2 ]" -ocan
C=O
Now imagine that you want to create a fragment-based fingerprint; all you need to do is generate the corresponding canonical fragment SMILES and hash them. Job done.
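To make the "hash them" step concrete, here is a minimal sketch in Python. It assumes you already have the canonical fragment SMILES as plain strings; the fragment lists below are made up for illustration and are not obabel output. The names fragment_fingerprint, tanimoto and n_bits are hypothetical, and the choice of SHA-256 folded into 1024 bits is just one reasonable option, not how any Open Babel fingerprint works.

```python
import hashlib


def fragment_fingerprint(fragment_smiles, n_bits=1024):
    """Fold canonical fragment SMILES into an n_bits-wide fingerprint.

    The fingerprint is returned as a Python int used as a bit set.
    """
    bits = 0
    for smi in fragment_smiles:
        # A stable hash (unlike the built-in hash()) gives the same bit
        # for the same fragment string across runs and machines.
        digest = hashlib.sha256(smi.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_bits
        bits |= 1 << index
    return bits


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two bit-set fingerprints."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return bin(fp_a & fp_b).count("1") / union


# Illustrative fragment lists (assumed input, not generated by obabel).
frags_a = ["CC", "C=O", "CCl"]
frags_b = ["CC", "C=O", "CF"]

fp_a = fragment_fingerprint(frags_a)
fp_b = fragment_fingerprint(frags_b)
print(tanimoto(fp_a, fp_b))
```

This only works because the fragment SMILES are canonical: the same fragment always yields the same string, and so the same bit, whatever the atom order in the input molecule.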

Another potential use would be to...oh oh...dinner time...you'll have to use your imagination. Before I go, just to note that credit for this feature, and most of the SMILES writer indeed, goes to Craig James.

5 comments:

timvdm said...

Although the example with the aromatic atom is not a valid SMARTS, this is actually a useful feature. I used this to enumerate the unique subgraphs in the PubChem database.

timvdm said...

SMARTS should be SMILES...

baoilleach said...

Ah! Is that how you did it? For others, here's the link to your post.

Andrew Dalke said...

Tim pointed that out to me, but in my enumeration for the MCS code I needed the ability to select which bonds to include as well. RDKit (now) supports that feature, but OB doesn't.

baoilleach said...

One-upped again! :-) I might look into that. If you specify the bonds, are the corresponding atoms also included?