Monday, 29 June 2026

The wrong type of SMILES

When I wrote the blog post on ANNalog, there was an instant flurry of interest. ANNalog is a generative model that suggests medchem analogs given an input SMILES string. I was confident in the work that Wei Dai had done, but you can't help feeling nervous that there might be something you've missed. And sure enough, an issue was filed on GitHub the next day and instantly my heart sank:

I had a shot at ANNalog by submitting losartan as query:

annalog-generate -i losartan.smi -n 50 -m beam -o losartan_50_beam.tsv --device cuda

Frankly I find the results unexpected to say the least (see below). The top-ranked generated molecule looks very odd IMO. I did a substructure search in ChEMBL on the 4-membered ring with 3 nitrogens (C1=NN=N1) and (as expected) get zero hits. 

This was filed by Evert Homan on a Saturday morning. I was away from my computer, but pasting a few SMILES strings into CDK Depict instantly showed that, if anything, he was being generous.

But how could this be? I had never seen any results this bad, especially in beam search which typically stays very close to the input string. Then I realised something...

The SMILES string provided was "CCCCC1=NC(=C(N1CC2=CC=C(C=C2)C3=CC=CC=C3C4=NNN=N4)CO)Cl". Do you notice anything? There are no aromatic atom symbols (e.g. lowercase 'c') despite losartan having multiple aromatic rings - this is a Kekulé SMILES string. And therein lay the problem. ANNalog was trained on aromatic SMILES strings and had never seen aromatic rings written as conjugated bonds. When presented with this, there were no high probability predictions, only bad choices.
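
For illustration, here is a rough way to spot this kind of input before it reaches a model. It is only a sketch, not anything taken from ANNalog: the function name has_aromatic_atoms is made up for this example, and it only checks whether any atom is written in lowercase (aromatic) form. A molecule with no aromatic rings at all would also return False, so treat a False as "worth a closer look" rather than proof of a Kekulé string.

```python
import re

# Atoms that SMILES allows in lowercase (aromatic) form outside brackets.
_AROMATIC_BARE = set("bcnops")
# Inside brackets: optional isotope digits, then the element symbol.
_BRACKET_SYMBOL = re.compile(r"\[\d*([A-Za-z][a-z]?)")


def has_aromatic_atoms(smiles: str) -> bool:
    """Return True if any atom in the SMILES string is written in aromatic form.

    Hypothetical helper for this post; it does not validate the SMILES.
    """
    # Bracket atoms are aromatic if the element symbol starts in lowercase,
    # e.g. [nH] or [se], but not [Na+] or [NH4+].
    for symbol in _BRACKET_SYMBOL.findall(smiles):
        if symbol[0].islower():
            return True
    # Remove bracket atoms, then the halogens whose second letter is lowercase,
    # so that the 'l' in Cl and the 'r' in Br cannot be mistaken for atoms.
    bare = re.sub(r"\[[^\]]*\]", "", smiles)
    bare = bare.replace("Cl", "").replace("Br", "")
    return any(ch in _AROMATIC_BARE for ch in bare)


kekule = "CCCCC1=NC(=C(N1CC2=CC=C(C=C2)C3=CC=CC=C3C4=NNN=N4)CO)Cl"
print(has_aromatic_atoms(kekule))      # the string from the GitHub issue
print(has_aromatic_atoms("c1ccccc1"))  # benzene written with aromatic symbols
```

A check like this is cheap enough to run on every input, but it only detects the symptom. Converting the input to the form used in training is the more robust approach.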

The potential fix was simple. Before making the prediction, we just needed to convert to an aromatic SMILES string. So I ran it through 'obabel' to see if this worked:

$ obabel -:"CCCCC1=NC(=C(N1CC2=CC=C(C=C2)C3=CC=CC=C3C4=NNN=N4)CO)Cl" -osmi
CCCCc1nc(c(n1Cc1ccc(cc1)c1ccccc1c1n[nH]nn1)CO)Cl

Using this string as input brought the results back into the realm of reality.

But something was still a bit off. Almost all of the top ten results have fewer rings than the input. Also, the probabilities of these structures (given the input) are still in the same range (the highest score is -7.2 this time versus -7.4 last time).

This is perhaps a good moment to explain that ANNalog is trained on RDKit SMILES, after randomly shuffling the atoms (as we want to handle SMILES strings written in different ways). Here's the corresponding RDKit SMILES string:

CCCCc1nc(Cl)c(CO)n1Cc1ccc(-c2ccccc2-c2nn[nH]n2)cc1

Comparing to the Open Babel SMILES above there are various atom order shuffles, but if you look beyond that, you will notice two hyphens where OB had none. RDKit adds hyphens to indicate a single bond between aromatic rings that is not itself aromatic, whereas Open Babel only does this if the bond is itself in a ring. This doesn't matter in a semantic sense (*), but ANNalog has learnt to expect to see a hyphen separating aromatic rings, and so gives poor results when the provided SMILES string does not match this pattern. Using the RDKit one, the results are finally as boring as expected from beam search (and the highest score is now -3.8).

The lesson here is that input needs to be normalised to the same form used in training, and that subtle differences in SMILES strings can have an outsized effect. It's also a useful reminder not to release on a Friday. But ultimately, users of ANNalog can now provide SMILES in whatever format they like, and it should just work as expected.
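
As a minimal sketch of that kind of normalisation, the snippet below round-trips a SMILES string through RDKit so that whatever the user typed comes out in RDKit's own aromatic form. This is an assumption about what such a step could look like, not ANNalog's actual code, and the function name to_rdkit_smiles is made up for the example. It writes the canonical atom order, which is just one of the many orderings the model saw in training.

```python
from rdkit import Chem


def to_rdkit_smiles(smiles: str) -> str:
    """Re-write a SMILES string the way RDKit writes it.

    Hypothetical helper for this post. Parsing perceives aromaticity, and
    writing produces aromatic atom symbols plus RDKit's explicit '-' for
    single bonds joining aromatic atoms.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None rather than raising when it cannot parse the input.
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)


# The Kekulé string from the issue and the Open Babel output from above.
kekule = "CCCCC1=NC(=C(N1CC2=CC=C(C=C2)C3=CC=CC=C3C4=NNN=N4)CO)Cl"
openbabel = "CCCCc1nc(c(n1Cc1ccc(cc1)c1ccccc1c1n[nH]nn1)CO)Cl"

print(to_rdkit_smiles(kekule))
print(to_rdkit_smiles(openbabel))
```

Both inputs describe the same molecule, so both calls should print the same string, hyphens included.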
 
* When reading a SMILES string, an implicit bond joining two aromatic atoms is considered a single bond except where it is in a ring itself, in which case it is considered an aromatic bond. Hence why you *have* to use a single bond symbol when it is in a ring and joins two aromatic atoms, but it is optional otherwise.
