When I wrote the blog post on ANNalog, there was an instant flurry of interest. ANNalog is a generative model that suggests medchem analogs given an input SMILES string. I was confident in the work that Wei Dai had done, but you can't help feeling nervous that there might be something you've missed. And sure enough, an issue was filed on GitHub the next day and instantly my heart sank:
I had a shot at ANNalog by submitting losartan as query:
annalog-generate -i losartan.smi -n 50 -m beam -o losartan_50_beam.tsv --device cuda
Frankly I find the results unexpected to say the least (see below). The top-ranked generated molecule looks very odd IMO. I did a substructure search in ChEMBL on the 4-membered ring with 3 nitrogens (C1=NN=N1) and (as expected) get zero hits.
This was filed by Evert Homan on a Saturday morning. I was away from my computer, but pasting a few SMILES strings into CDK Depict instantly showed that, if anything, he was being generous:
But how could this be? I had never seen any results this bad, especially in beam search which typically stays very close to the input string. Then I realised something...
The SMILES string provided was "CCCCC1=NC(=C(N1CC2=CC=C(C=C2)C3=CC=CC=C3C4=NNN=N4)CO)Cl". Do you notice anything? There are no aromatic atom symbols (e.g. lowercase 'c') despite losartan having multiple aromatic rings - this is a Kekulé SMILES string. And therein lay the problem. ANNalog was trained on aromatic SMILES strings and had never seen aromatic rings written as conjugated bonds. When presented with this, there were no high probability predictions, only bad choices.
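A cheap guard against this kind of input can be sketched in plain Python. This is only a heuristic and not part of ANNalog: the function name `has_aromatic_atoms` is hypothetical, and a `False` result merely means "no lowercase aromatic atoms were written", which is also true of genuinely aliphatic molecules. It is a hint that the string may be Kekulé, not proof.

```python
import re

# Atom tokens: bracket atoms first, then the two-letter halogens (so the
# 'l' of Cl is never seen on its own), then single-letter organic-subset atoms.
_ATOM = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Inside brackets: optional isotope digits, then a lowercase (aromatic) symbol.
_AROMATIC_BRACKET = re.compile(r"\[\d*(se|as|te|[bcnops])")


def has_aromatic_atoms(smiles: str) -> bool:
    """Return True if any atom in the SMILES string is written as aromatic."""
    for token in _ATOM.findall(smiles):
        if token.startswith("["):
            if _AROMATIC_BRACKET.match(token):
                return True
        elif token.islower():
            return True
    return False


KEKULE = "CCCCC1=NC(=C(N1CC2=CC=C(C=C2)C3=CC=CC=C3C4=NNN=N4)CO)Cl"
AROMATIC = "CCCCc1nc(c(n1Cc1ccc(cc1)c1ccccc1c1n[nH]nn1)CO)Cl"

print(has_aromatic_atoms(KEKULE))    # False: every ring is written with C=C bonds
print(has_aromatic_atoms(AROMATIC))  # True
```

A check like this could be used to warn the user before generation, but it does not fix the input; that needs a cheminformatics toolkit to perceive aromaticity.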
The potential fix was simple. Before making the prediction, we just needed to convert to an aromatic SMILES string. So I ran it through 'obabel' to see if this worked:
$ obabel -:"CCCCC1=NC(=C(N1CC2=CC=C(C=C2)C3=CC=CC=C3C4=NNN=N4)CO)Cl" -osmi
CCCCc1nc(c(n1Cc1ccc(cc1)c1ccccc1c1n[nH]nn1)CO)Cl
Using this string as input brought the results back into the realm of reality:
But something was still a bit off. Almost all of the top ten results have fewer rings than the input. Also, the probabilities of these structures (given the input) are still in the same range (the highest score is -7.2 this time versus -7.4 last time).
This is perhaps a good moment to explain that ANNalog is trained on RDKit SMILES, after randomly shuffling the atoms (as we want to handle SMILES strings written in different ways). Here's the corresponding RDKit SMILES string:
CCCCc1nc(Cl)c(CO)n1Cc1ccc(-c2ccccc2-c2nn[nH]n2)cc1
Comparing to the Open Babel SMILES above there are various atom order shuffles, but if you look beyond that, you will notice two hyphens where OB had none. RDKit adds hyphens to indicate a single bond between aromatic rings that is not itself aromatic, whereas Open Babel only does this if the bond is itself in a ring. This doesn't matter in a semantic sense (*), but ANNalog has learnt to expect to see a hyphen separating aromatic rings, and so gives poor results when the provided SMILES string does not match this pattern. Using the RDKit one, the results are finally as boring as expected from beam search (and the highest score is now -3.8):
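For anyone wanting to normalise their own input in the same way, a round trip through RDKit can be sketched as follows. This is a minimal sketch rather than ANNalog's actual preprocessing: the helper name `to_rdkit_smiles` is hypothetical, and it assumes RDKit is installed and that its default aromaticity model matches the one used to prepare the training data.

```python
from rdkit import Chem


def to_rdkit_smiles(smiles: str) -> str:
    """Parse any valid SMILES string and rewrite it as RDKit canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)  # perceives aromaticity on parsing
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol)  # aromatic form, hyphens between aromatic rings


# The Kekulé string from the GitHub issue and the Open Babel aromatic string.
kekule = "CCCCC1=NC(=C(N1CC2=CC=C(C=C2)C3=CC=CC=C3C4=NNN=N4)CO)Cl"
openbabel = "CCCCc1nc(c(n1Cc1ccc(cc1)c1ccccc1c1n[nH]nn1)CO)Cl"

# Both inputs describe the same molecule, so both normalise to one string.
print(to_rdkit_smiles(kekule))
print(to_rdkit_smiles(openbabel))
```

Both calls should print the same string, written with lowercase aromatic atoms and with the explicit single bonds between the aromatic rings described above.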

