Friday, 17 April 2026

ANNalog, a generative model for MedChem analogs

I'm delighted to announce that Wei Dai's work on ANNalog has just been published in the Journal of Cheminformatics (currently early access). This is a Python application that takes a molecule represented by a SMILES string and generates MedChem analogs using a deep neural network trained on pairs of molecules from the same ChEMBL assay. This work comes from Wei's PhD with Arianna Fornili at QMUL, with Nxera as industry partner (Jon Tyzack), where I continue to act as co-supervisor.

The code is available on GitHub. I won't recapitulate the README in the repo, but I'll mention a few points that are not fully covered there.

To begin with, in my case I need to use 'uv' instead of 'conda' to install (due to conda's licensing conditions). Here's how I do it:

$ uv venv annalog_env --python=3.12 
$ source annalog_env/bin/activate
(annalog_env) $ uv pip install numpy==2.4.3 pandas==3.0.1 tqdm==4.67.3 torch==2.10.0 torchvision==0.25.0 rdkit==2025.9.6 scikit-learn==1.8.0 annalog

Once installed, whether with conda or uv, here's a basic usage example that generates the 10 most probable analogs given a single SMILES as input (a SMILES file is also accepted):

$ annalog-generate -i "CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12" -n 10
input_smiles	rank	generated_smiles	score
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 1 CCCc1nn(C)c2c(=O)[nH]c(-c3ccccc3OCC)nc12 -4.036181999828045
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 2 CCCc1nn(C)c2c(=O)[nH]c(-c3cc(CN4CCN(C)CC4)ccc3OCC)nc12 -5.025642307042602
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 3 CCCc1nn(C)c2c(=O)[nH]c(-c3cc(N4CCN(C)CC4)ccc3OCC)nc12 -5.148663511712925
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 4 CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N)ccc3OCC)nc12 -5.308277614323515
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 5 CCCc1nn(C)c2c(=O)[nH]c(-c3cc(NS(=O)(=O)C)ccc3OCC)nc12 -5.468033235034966
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 6 CCCc1nn(C)c2c(=O)[nH]c(-c3cc(CN4CCOCC4)ccc3OCC)nc12 -5.6292664217390325
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 7 CCCc1nn(C)c2c(=O)[nH]c(-c3c(OC)cccc3)nc12 -5.676750207183716
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 8 CCCc1nn(C)c2c(=O)[nH]c(-c3ccccc3)nc12 -5.692219721810034
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 9 CCCc1nn(C)c2c(=O)[nH]c(-c3cc(Cl)ccc3OCC)nc12 -5.694125330995178
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 10 CCCc1nn(C)c2c(=O)[nH]c(-c3ccc(N4CCN(C)CC4)cc3OCC)nc12 -5.8864854959632 

The example above uses (classic) beam search. In the course of implementing the library, we realised that the term "beam search" seems to mean different things to different people, and typically you need to look at the code to see what they actually meant. ANNalog implements two variants of beam search. With classic beam search (--method beam), at each token position, all beam_width candidates are expanded simultaneously as a batch, then pruned back to beam_width. This makes it fast but greedy, and it can miss the globally optimal sequence by pruning it too early. With best-first beam search (--method BF-beam), a priority queue is used to always expand the single highest-probability partial sequence first, regardless of length. This is slower and more memory-intensive, but more likely to find the globally best sequences.
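
To make the difference concrete, here is a toy sketch of the two strategies in Python. It is not the ANNalog code: the three-token vocabulary, the hard-coded probabilities in next_token_logprobs and the function names are all made up for illustration, and the best-first version places no limit on the queue size, which a real implementation would likely need.

```python
import heapq
import math

EOS = "$"  # end-of-sequence token in this toy vocabulary


def next_token_logprobs(prefix):
    """Toy stand-in for a trained model: log P(token | prefix)."""
    if prefix == "":
        probs = {"a": 0.6, "b": 0.4}
    elif prefix == "a":
        probs = {"a": 0.34, "b": 0.33, EOS: 0.33}
    elif prefix == "b":
        probs = {EOS: 0.9, "a": 0.05, "b": 0.05}
    else:
        probs = {EOS: 1.0}  # every longer prefix must terminate
    return {tok: math.log(p) for tok, p in probs.items()}


def classic_beam(beam_width, n):
    """Expand all live beams in lockstep, then prune back to beam_width."""
    beams = [(0.0, "")]  # (cumulative log-probability, partial sequence)
    finished = []
    while beams:
        candidates = []
        for score, seq in beams:
            for tok, lp in next_token_logprobs(seq).items():
                candidates.append((score + lp, seq + tok))
        candidates.sort(reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:
            # Completed sequences leave the beam; the rest are expanded next round.
            (finished if seq.endswith(EOS) else beams).append((score, seq))
    return sorted(finished, reverse=True)[:n]


def best_first(n):
    """Always extend the single most probable partial sequence seen so far."""
    heap = [(0.0, "")]  # heapq is a min-heap, so store negated log-probabilities
    results = []
    while heap and len(results) < n:
        neg_score, seq = heapq.heappop(heap)
        if seq.endswith(EOS):
            results.append((-neg_score, seq))
            continue
        for tok, lp in next_token_logprobs(seq).items():
            heapq.heappush(heap, (neg_score - lp, seq + tok))
    return results


print(classic_beam(beam_width=1, n=1))
print(best_first(n=1))
```

With a beam width of 1, the classic version commits to "a" at the first position and ends up with "aa$" (probability 0.204), whereas best-first returns "b$" (probability 0.36), the most probable sequence overall. Widening the beam to 2 is enough for the classic version to recover "b$" in this toy case.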

As a general rule, modifications tend to occur on the right-hand side of the SMILES string. By rewriting the SMILES string in a particular way, this can be used to direct modifications to a certain part of the molecule; this can be enforced more formally by requiring the start of the string, the prefix, to be fixed (see "--prefix"). Conversely, if you want to spread modifications evenly across the molecule, you may wish to pass in multiple SMILES variants or have the script do this for you (see "--exploration-method variants" and "--variant-number"). This is shown by the following example that samples from the distribution:

$ annalog-generate -i "CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12" -n 10 --method sample --exploration-method variants

When sampling, the default temperature is 1.2. Increasing this too far may not be a good idea as it increases the chance of unlikely tokens being sampled. Instead, if you wish to explore the search space further, it might make sense to reduce the temperature a bit (e.g. 1.1) and feed the output of one run back in again. The "--exploration-method recursive" option does this automatically, but if you want to combine this with variants, the easiest way is to write the output to a file, pull out the generated SMILES, and feed them back in as a file.
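
To give a feel for what the temperature does, here is a minimal sketch that assumes it is applied in the usual way, by dividing the model's logits by the temperature before the softmax. This is the standard technique rather than a quote of the ANNalog code, and the three logits are invented for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Invented logits for three candidate tokens: likely, plausible, unlikely.
logits = [2.0, 1.0, -1.0]

for temperature in (1.0, 1.1, 1.2, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

The probability of the least likely token rises steadily as the temperature goes up, while that of the most likely token falls, so the distribution flattens out.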

By default, invalid SMILES are filtered. This uses the partialsmiles library (which I've previously described here) to avoid selecting tokens during the generation process that would result in semantically or syntactically invalid SMILES. Obviously, if the model understood SMILES perfectly, this would not be necessary, but at least this approach is more efficient than filtering after the fact. You can turn off this filter if you wish ("--keep-invalid") to see how much the results change.
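
The general idea of filtering during generation can be sketched as follows. This is a toy: is_valid_prefix only checks that parentheses are not closed before they are opened, standing in for the much richer checks that partialsmiles performs, and the token probabilities are invented. Renormalising the remaining probabilities is also an assumption on my part, not something taken from the ANNalog code.

```python
def is_valid_prefix(s):
    """Toy validity check: can this partial string still be completed?

    Only branch parentheses are considered here; real SMILES validity
    (valence, ring closures, aromaticity) is far richer.
    """
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closed a branch that was never opened
    return True


def allowed_tokens(prefix, token_probs):
    """Drop tokens that would make the prefix invalid, then renormalise."""
    kept = {t: p for t, p in token_probs.items() if is_valid_prefix(prefix + t)}
    total = sum(kept.values())
    if total == 0:
        return {}  # no valid continuation from this prefix
    return {t: p / total for t, p in kept.items()}


# Invented model output for the next token after the prefix "CC(C)".
probs = {"C": 0.5, ")": 0.3, "(": 0.2}
print(allowed_tokens("CC(C)", probs))
```

Here ")" is removed before a token is chosen, because there is no open branch left to close, so no effort is spent extending a string that could never become a valid SMILES.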

Similarly, not every molecule generated is gold - there's a certain (I would say low) percentage of dubious structures generated. Structures are checked automatically by default using Eloy Félix's chembl_gen_check, which runs a set of tests against the generated structures (including code adapted from Wim Dehaen's LACAN). Depending on your starting structure and use case, you may wish to use these as hard filters or to rank prior to visual inspection.

We have tried to make sure that this tool is of practical use in a drug discovery setting. Let us know how you get on.