Friday, 17 April 2026

ANNalog, a generative model for MedChem analogs

I'm delighted to announce that Wei Dai's work on ANNalog has just been published in the Journal of Cheminformatics (currently early access). This is a Python application that takes a molecule represented by a SMILES string and generates MedChem analogs using a deep neural network trained on pairs of molecules from the same ChEMBL assay. This work comes from Wei's PhD with Arianna Fornili at QMUL and Nxera as industry partner (Jon Tyzack), where I continue to act as co-supervisor.

The code is available on GitHub. I won't recapitulate the README in the repo, but I'll mention a few points which are not fully covered there.

To begin with, in my case I need to use 'uv' instead of 'conda' to install (due to conda's licensing conditions). Here's how I do it:

$ uv venv annalog_env --python=3.12 
$ source annalog_env/bin/activate
(annalog_env) $ uv pip install numpy==2.4.3 pandas==3.0.1 tqdm==4.67.3 torch==2.10.0 torchvision==0.25.0 rdkit==2025.9.6 scikit-learn==1.8.0 annalog

Once installed, whether with conda or uv, here's a basic example of use that generates the 10 most probable analogs given a single SMILES as input (a SMILES file is also accepted):

$ annalog-generate -i "CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12" -n 10
input_smiles	rank	generated_smiles	score
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 1 CCCc1nn(C)c2c(=O)[nH]c(-c3ccccc3OCC)nc12 -4.036181999828045
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 2 CCCc1nn(C)c2c(=O)[nH]c(-c3cc(CN4CCN(C)CC4)ccc3OCC)nc12 -5.025642307042602
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 3 CCCc1nn(C)c2c(=O)[nH]c(-c3cc(N4CCN(C)CC4)ccc3OCC)nc12 -5.148663511712925
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 4 CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N)ccc3OCC)nc12 -5.308277614323515
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 5 CCCc1nn(C)c2c(=O)[nH]c(-c3cc(NS(=O)(=O)C)ccc3OCC)nc12 -5.468033235034966
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 6 CCCc1nn(C)c2c(=O)[nH]c(-c3cc(CN4CCOCC4)ccc3OCC)nc12 -5.6292664217390325
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 7 CCCc1nn(C)c2c(=O)[nH]c(-c3c(OC)cccc3)nc12 -5.676750207183716
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 8 CCCc1nn(C)c2c(=O)[nH]c(-c3ccccc3)nc12 -5.692219721810034
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 9 CCCc1nn(C)c2c(=O)[nH]c(-c3cc(Cl)ccc3OCC)nc12 -5.694125330995178
CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12 10 CCCc1nn(C)c2c(=O)[nH]c(-c3ccc(N4CCN(C)CC4)cc3OCC)nc12 -5.8864854959632 

The example above uses (classic) beam search. In the course of implementing the library, we realised that the term "beam search" seems to mean different things to different people, and typically you need to look at the code to see what they actually meant. ANNalog implements two variants of beam search. With classic beam search (--method beam), at each token position, all beam_width candidates are expanded simultaneously as a batch, then pruned back to beam_width. This makes it fast but greedy, and it can miss the globally optimal sequence by pruning it too early. With best-first beam search (--method BF-beam), a priority queue is used to always expand the single highest-probability partial sequence first, regardless of length. This is slower and more memory-intensive, but more likely to find the globally best sequences.
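To make the distinction concrete, here is a toy sketch of the best-first variant. This is not ANNalog's actual code: the "model" is a hypothetical stand-in for the neural network, and the point is purely the search logic, which expands whichever partial sequence currently has the highest cumulative probability, regardless of its length.

```python
import heapq
import math

# Toy "model": given a prefix (tuple of tokens), return the next-token
# distribution. "$" is the end-of-sequence token. This is a stand-in for
# the neural network; the search logic is the point here.
def toy_model(prefix):
    if len(prefix) >= 3:
        return {"$": 1.0}
    return {"a": 0.5, "b": 0.3, "$": 0.2}

def best_first_search(model, n_best):
    """Pop and expand the highest-probability partial sequence first
    (priority queue), regardless of its length, until n_best complete
    sequences have been found."""
    # Heap entries: (negative cumulative log-probability, sequence)
    heap = [(0.0, ())]
    results = []
    while heap and len(results) < n_best:
        neg_logp, seq = heapq.heappop(heap)
        if seq and seq[-1] == "$":          # complete sequence
            results.append((seq, -neg_logp))
            continue
        for token, p in model(seq).items():
            heapq.heappush(heap, (neg_logp - math.log(p), seq + (token,)))
    return results

for seq, logp in best_first_search(toy_model, 3):
    print("".join(seq), round(logp, 3))
```

Because completed sequences pop off the queue in order of probability, the first n_best completions are guaranteed to be the global top-n, at the cost of keeping many more partial sequences alive than classic beam search does.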

As a general rule, modifications tend to occur on the right-hand side of the SMILES string. By rewriting the SMILES string in a particular way, this can be used to direct modifications to a certain part of the molecule; this can be enforced more formally by requiring the start of the string (the prefix) to be fixed (see "--prefix"). Conversely, if you want to spread modifications evenly across the molecule, you may wish to pass in multiple SMILES variants or have the script do this for you (see "--exploration-method variants" and "--variant-number"). This is shown by the following example that samples from the distribution:

annalog-generate -i "CCCc1nn(C)c2c(=O)[nH]c(-c3cc(S(=O)(=O)N4CCN(C)CC4)ccc3OCC)nc12" -n 10 --method sample --exploration-method variants

When sampling, the default temperature is 1.2. Increasing this too far may not be a good idea as it increases the chance of unlikely tokens being sampled. Instead, if you wish to explore the search space further, it might make sense to reduce the temperature a bit (e.g. 1.1) and feed the output of one run back in again. The "--exploration-method recursive" option does this automatically, but if you want to combine this with variants the easiest way is to write the output to a file, pull out the generated SMILES, and feed them back in as a file.
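For intuition on what the temperature is doing, here is a minimal sketch (not ANNalog code, just the standard softmax-with-temperature calculation) applied to a made-up next-token distribution:

```python
import math

def apply_temperature(logits, temperature):
    """Rescale logits by 1/temperature and renormalise with softmax.
    T > 1 flattens the distribution (more diversity, but more chance of
    sampling unlikely tokens); T < 1 sharpens it towards the top tokens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# A toy next-token distribution: one likely token, two unlikely ones.
logits = [math.log(p) for p in (0.90, 0.07, 0.03)]

for t in (0.8, 1.0, 1.2, 2.0):
    probs = apply_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running this shows the tail tokens steadily gaining probability mass as T rises, which is exactly why pushing the temperature well above the default trades quality for diversity.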

By default, invalid SMILES are filtered. This uses the partialsmiles library (which I've previously described here) to avoid selecting tokens during the generation process that would result in semantically or syntactically invalid SMILES. Obviously, if the model had the ability to perfectly understand SMILES, this would not be necessary but at least this approach is more efficient than filtering after-the-fact. You can turn off this filter if you wish ("--keep-invalid") to see how much the results change.
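The general idea can be sketched as follows. To be clear, this is not the partialsmiles API: `is_valid_prefix` here is a crude hypothetical stand-in that only checks branch-bracket balance, whereas partialsmiles does full syntactic and semantic checks. The sketch just shows where the check slots into generation, namely masking candidate tokens before sampling rather than discarding whole strings afterwards:

```python
# Schematic of filtering during generation: before sampling each token,
# mask out tokens whose addition would make the partial SMILES invalid.
# `is_valid_prefix` is a hypothetical, deliberately crude stand-in for a
# real partial-SMILES validator (it only enforces parenthesis balance).
def is_valid_prefix(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # more closes than opens: can never recover
                return False
    return True

def allowed_tokens(prefix, vocabulary):
    """Return only the tokens that keep the prefix valid; the sampler
    would then renormalise the model's probabilities over this subset."""
    return [t for t in vocabulary if is_valid_prefix(prefix + t)]

vocab = ["C", "O", "(", ")"]
print(allowed_tokens("CC(C", vocab))   # closing ")" is fine here
print(allowed_tokens("CC", vocab))     # ")" would be invalid, so it is masked
```

Filtering at the token level like this means no generation budget is wasted completing strings that are already doomed, which is the efficiency argument made above.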

Similarly, not every molecule generated is gold - there's a certain (I would say low) percentage of dubious structures. Structures are checked automatically by default using Eloy Félix's chembl_gen_check, which runs a set of tests against the generated structures (including code adapted from Wim Dehaen's LACAN). Depending on your starting structure and use case, you may wish to use these as hard filters or to rank prior to visual inspection.

We have tried to make sure that this tool is of practical use in a drug discovery setting. Let us know how you get on.

Saturday, 17 January 2026

Improvement in reasoning performance of LLMs over time

If you tried using ChatGPT when it first came out and concluded that it wasn't much use for a scientific reasoning task, it might be time to try it again. For the task I semi-described in the previous post, it worked exceptionally well. Here I've tried to gather some information on what's changed between the original release and today, and how you can keep track of improvements in performance.

Background

The LLM developed by OpenAI is called GPT ("Generative Pre-trained Transformer"). It was released as a chatbot publicly in Nov 2022 and it changed the world (for better or worse). This release was already version 3.5. Since then, LLMs have been developed by several companies/groups, often with new versions multiple times a year.

For scientific problems, a particular landmark was the development of LLMs with 'reasoning' capability; the ability to perform multi-step logical problem-solving. OpenAI's o1 model (Sept 2024) was the poster child for this. Currently OpenAI's GPT-5.2 represents the state-of-the-art, though the upcoming Gemini 3 Pro (currently in preview) may have the edge on it. But how do we know this?

The benchmark

LLM benchmarks exist for coding tasks, solving maths problems, medical exams and so forth, but for reasoning, a single benchmark - the GPQA benchmark - has established itself as the gold standard, in particular the "Diamond" subset thereof. GPQA stands for "Google-Proof Q&A". In Nov 2023, Rein et al (New York Uni) published a preprint describing how they employed incentivised PhD-level contractors to put together a set of scientific multiple-choice questions (4 choices) covering biology, chemistry and physics that could not be answered by Google searches and required domain knowledge. Random answering scores 25%; PhD-level non-scientists (note: my interpretation) with access to the internet scored 34%; PhD-level scientists scored 69.7% (note: I don't see this figure in the preprint, but it is widely quoted).

Just a note about the 69.7%. The exact figure doesn't matter, but there's some idea that beating this figure means superhuman results. Um, no. An expert in biology is not expected to answer the physics questions very well. Are coding LLMs superhuman if they are better than a Python expert at answering Java questions? If they beat an expert at their own area of expertise, now that would be impressive. Given that it is expected that three experts sitting together would get 100%, it follows that the benchmark is not capable of measuring superhuman performance (this is actually mentioned at the end of the preprint). A more interesting number perhaps is 55% which roughly gives a measure of how well a scientist does when answering questions outside of their domain (I estimated this value by subtracting the 33.3% of correct answers inside the expert's domain from 69.7% and scaling it up, i.e. (69.7 - 100/3) / 2 * 3).
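The back-of-envelope estimate above can be checked directly. The assumption (mine, as stated) is that an expert answers ~100% inside their own domain, which makes up one third of the benchmark:

```python
# Checking the estimate in the text: if an expert answers ~100% of the
# third of questions inside their own domain, and scores 69.7% overall,
# the out-of-domain score is whatever remains, rescaled to the other
# two thirds of the questions.
overall = 69.7
in_domain_contribution = 100 / 3          # one third answered perfectly
out_of_domain = (overall - in_domain_contribution) / 2 * 3
print(round(out_of_domain, 1))
```

This gives 54.6%, i.e. roughly the 55% quoted above.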

The results

The key point is that this is a set of difficult scientific questions that require reasoning, and that we can track the performance of LLMs over time against this benchmark. Here are the results for OpenAI models taken from an analysis by Epoch AI. Each individual model was tested 16 times, with the typical stderr around 2%:

  • GPT-3.5 (Nov 2022): 28% (taken from here)
  • GPT-4 (Mar 2023): 36%
  • GPT-4 Turbo (Nov 2023): 42%
  • GPT-4o (Aug 2024): 49%
  • o1-mini (Sep 2024): 62% (at "high" reasoning setting)
  • o1 (Dec 2024): 76% (medium)
  • o3-mini (Jan 2025): 77% (high)
  • GPT-4.1 (Apr 2025): 67% (note: the 4.1s are a non-reasoning model)
    • GPT-4.1 mini: 66% (this is a smaller/faster/cheaper version of GPT-4.1)
    • GPT-4.1 nano: 49%
  • GPT-5 (Aug 2025): 85% (medium)
    • GPT-5-mini: 72%
    • GPT-5-nano: 67%
  • GPT-5.1 (Nov 2025): 85% (medium) 88% (high)
  • GPT-5.2 (Dec 2025): 88% (medium) 88% (high) 91% (xhigh)

Conclusion

In summary, there have been major improvements over the last three years in terms of reasoning. Nor has it reached a peak. Gemini 3 Pro Preview (Nov 2025) is reported as 93%. In August, a new kid on the block, Autopoiesis, reported 92.4% for their Aristotle X1 Verify model (but this is not publicly available nor independently verified). It feels like we will soon get to the point where the field will need a more difficult benchmark.

Finally, it should be noted that pure performance is not the only relevant criterion. Price, energy usage, and ability to run locally will also be important considerations. Let's see what 2026 brings.

Image credit: Epoch AI, CC BY, https://epoch.ai/benchmarks/gpqa-diamond.

Saturday, 3 January 2026

Classifying PubMed Abstracts with LLMs

I've just spent Christmas playing with the OpenAI API, and I am impressed at what is now possible. There are 38M PubMed abstracts (*); through judicious use of keyword filters (bringing it down to 250K) followed by a pass through a cheap model (bringing it down to 3.5K), I have essentially categorised all of the abstracts of interest with GPT-5.2 for less than $20 (**). Even leaving out the cheaper model and going directly to GPT-5.2 would still only cost less than $200.
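For anyone budgeting a similar exercise, the per-abstract arithmetic implied by the figures in the footnotes works out as follows (a back-of-envelope sanity check using the numbers from this post, not an official pricing calculation):

```python
# Per-abstract costs implied by the figures quoted in the footnotes.
nano_cost, nano_n = 17.07, 250_000     # GPT5-Nano pass over 250K abstracts
gpt52_cost, gpt52_n = 1.82, 3_500      # GPT-5.2 pass over 3.5K abstracts

per_nano = nano_cost / nano_n
per_gpt52 = gpt52_cost / gpt52_n
print(f"GPT5-Nano: ${per_nano:.5f} per abstract")
print(f"GPT-5.2:   ${per_gpt52:.5f} per abstract")

# What skipping the cheap pass and sending all 250K straight to GPT-5.2
# would cost at that per-abstract rate:
print(f"GPT-5.2 only: ${per_gpt52 * nano_n:.0f}")
```

The final figure comes out at around $130, consistent with the "less than $200" estimate above.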

(If you read this before Jan 11 2026, and think that this topic sounds interesting, I encourage you to apply for openings in my group!)

This is what's known as a reasoning task. Here is the general prompt that I used, except that I have hidden some of the details (the definition section was suggested by ChatGPT as an improvement but I don't know that it helped).

Based on this article title and abstract, does the associated paper describe X?

Definitions:
- Y
- Z

Instructions:
- Reply with exactly one of: Yes, No, or Maybe.
- If the answer is No, reply with only: No
- If the answer is Yes or Maybe, briefly justify your answer by quoting specific words or phrases from the title or abstract as evidence.
- Do not infer beyond what is explicitly stated in the text.
When running it through GPT-5.2, I used slightly more complicated instructions in order to identify primary research data:
Instructions:
- Reply with exactly one of: Yes (primary), Yes (secondary), No, or Maybe.
  - Yes (primary) indicates that the paper is a primary reference for X, rather than being a review article or mentioning it as background
- If the answer is No, reply with only: No
- Otherwise, briefly justify your answer by quoting specific words or phrases from the title or abstract as evidence.
- Do not infer beyond what is explicitly stated in the text.

The cheaper/faster/less accurate model I used was GPT5-Nano. As a validation, I put all of the 2023 and 2024 abstracts through GPT-5.2 and only found a single additional "Yes" (on top of the 53 found by GPT5-Nano), and I'm not convinced about that particular case. GPT-5.2 was definitely required though as a final pass to reduce false positives; I passed all of the Yeses and Maybes from the Nano through to GPT-5.2 and it mostly trimmed down the Yeses and didn't promote the Maybes.

Run-time is somewhat dependent on how many batches you need to use, as there is an overhead associated with each batch. When you sign up for API access, you start on Tier 1 and need to use lots of batches, which must be run serially to avoid queue limits. In one case, I ran 75 batches of 3000 with GPT5-Nano; each one took 20 mins on average, which was fine, but the API SLA only guarantees an answer within 24h, which would be a bit of a problem. After 7 days, I was bumped up to Tier 3 and it's possible that I could do the same thing in a single batch now. For example, on Tier 3, I ran 4 batches of 10000 with GPT-5.2, each one taking 1h.

Given that the prompt above lends itself to many questions, you can use the details I've provided to get an idea of what might be possible for your questions of interest and for what budget. The trickiest part right now is dealing with the batch API, queue limits, and JSON, which can all be a bit tedious until you have something in place. Indeed, it feels like the sort of thing where there's a business opportunity to provide a biologist-friendly interface to abstract (***) this away. In any case, it's easy to predict that in 2026 we will be seeing less of BioBERT and similar models, and more adoption of LLMs.

Footnotes:

* in the 2025 baseline file - soon to be updated.

** $17.07 for GPT5-Nano over 250K abstracts and $1.82 for GPT-5.2 over 3.5K abstracts.

*** for once no pun intended!

Monday, 15 December 2025

Openings for Scientific Developers in the Chemical Biology Services team

As part of a collaboration with Open Targets on the development of a Side Effect Resource to guide target selection, we have two positions spanning NLP engineer/scientific developer/data engineer (i.e. your expertise might lie in any of these). Cheminformatics and bioinformatics would both be within scope.

We also have a more senior role, the position of Technical Lead in our team. This could be someone from either a scientific or developer background, but with appropriate experience dealing with technical infrastructure (Kubernetes, etc.). From previous experience, we know that this is a difficult role to fill so I would very much encourage suitable candidates to apply - if you are unsure, then reach out (oboyle@ebi.ac.uk) and we can discuss.

And as I always point out, due to EMBL-EBI's special status, the quoted salaries are tax-free and so are equivalent to the net salary from another job. Benefits, especially for non-residents, are generous also. Both positions have 11th Jan as the closing date, so apply now.

Monday, 28 July 2025

A new job, a postdoc opportunity, an open biological curator role, and a user group meeting


Almost exactly 6 months ago, I took over the leadership of the Chemical Biology Services team at EMBL-EBI, Hinxton, UK. This is the team that looks after ChEMBL, ChEBI, SureChEMBL, UniChem and OPSIN (check out our webpage, blog and follow us on LinkedIn, Bluesky). I could say a lot more about the honour of being appointed to this role, the responsibility I feel to the community, how this is the dream job for a cheminformatician, but let's keep it short!

And you can join this team too! If you want to be part of improving these services for the community, then please get in touch regarding the ARISE2 postdoc call described on our blog. There are a lot of cool things that we could do.

And that's not the only opportunity. We have just opened a role for a biological data curator. This is a key role in terms of maintaining the quality of our data. Currently this work is not very visible to the community, and we want to change that, which is one of the reasons we want to build closer links with everyone via....

...the ChEMBL UGM (!), a user group meeting around our services. This will take place on Jun 10-11 2026. You will be hearing more about this on our blog, likely in September. Apart from everything else you might expect from a UGM, this particular meeting at the start of my 9-year tenure will play a key role in influencing future development of our services. If you want to be informed when more information is available and registration opens, please email chembl-ugm@ebi.ac.uk.

Saturday, 12 July 2025

A 16-year time loop

Next week I will be attending the Computer Aided Drug Design Gordon Research Conference (CADD GRC) in the US (Maine).

The last (and only) time I attended this meeting was 16 years ago. At that time I was a postdoc at the CCDC working on the GOLD docking software, and presented a poster "Why multiple scoring functions can improve docking performance". Right after my 5-minute flash presentation (see program), some random English guy presented his poster about a database with a funny name that was about to be released.

This was July 2009; three months later ChEMBL 01 would be released.

Even at the time, I recognised that this was a big deal. In academia, research was difficult if not impossible because of the lack of data. Meanwhile industry published research that used their internal data and couldn't be challenged, compared, or built upon. To quote a recent conversation with an industry figure, showing results on ChEMBL keeps everyone honest.

But as well as learning about ChEMBL, I got to know John Overington, who subsequently invited me to present on protein ligand docking and cheminformatics at various training courses he organised at the EBI, even after I moved back to Ireland. For me, this was a really great connection to have as it increased my profile, and I got to meet many of the leading figures in the field. In return, John got the occasional bug report emailed directly to his inbox which I'm sure was exactly what he wanted. :-) We've kept in touch over the years, and he has been a great help when I've needed it.

And so, I am returning to where it all started. Only this time I am the random English (well, Irish) guy with the poster about ChEMBL. And who knows, maybe I, in my turn, will meet a future custodian of ChEMBL...?

Sunday, 9 February 2025

MMPA: The case of the missing group Part II

I was reading over my recent blogpost on MMPA when I started wondering whether the approaches I described actually were the best ways of finding the correspondences with the 'missing groups'. Just looking at the diagrams, in each case there's another approach that jumps out...

In the original post, I described taking the third case and manufacturing matches to the first two. What if, instead, we take the first two and look for matches to the third? That is, chop off the asterisks (where attached to a ring), adjust the hydrogens and then see whether there exists a match to a full molecule:

This approach may well be more efficient, since the number of implicit hydrogens in a typical molecule is significant.

In the original blogpost, I again described taking the third case and manufacturing a match to the first two. An alternative would be to take the constant parts (i.e. the singly-attached groups) of all potential series, join the two together and then look for a match to a full molecule.

In this case, it's not so clear-cut that this would be more efficient than the original approach described. But if I get around to implementing these alternative approaches, I'll report back...