Noel O'Blog

Monday, 28 July 2025

A new job, a postdoc opportunity, an open biological curator role, and a user group meeting

Almost exactly 6 months ago, I took over the leadership of the Chemical Biology Services team at EMBL-EBI, Hinxton, UK. This is the team that looks after ChEMBL, ChEBI, SureChEMBL, UniChem and OPSIN (check out our webpage, blog and follow us on LinkedIn, Bluesky). I could say a lot more about the honour of being appointed to this role, the responsibility I feel to the community, how this is the dream job for a cheminformatician, but let's keep it short!

And you can join this team too! If you want to be part of improving these services for the community, then please get in touch regarding the ARISE2 postdoc call described on our blog. There are a lot of cool things that we could do.

And that's not the only opportunity. We have just opened a role for a biological data curator. This is a key role in terms of maintaining the quality of our data. Currently this work is not very visible to the community, and we want to change that, which is one of the reasons we want to build closer links with everyone via....

...the ChEMBL UGM (!), a user group meeting around our services. We now have a provisional date, 05-06 Mar 2025 (edit: ~~24-25 Feb 2026~~). You will be hearing more about this on our blog, likely in September. Apart from everything else you might expect from a UGM, this particular meeting at the start of my 9-year tenure will play a key role in influencing future development of our services. If you want to be informed when more information is available and registration opens, please email chembl-ugm@ebi.ac.uk.

Saturday, 12 July 2025

A 16-year time loop

Next week I wlil be attending the Computer Aided Drug Design Gordon Research Conference (CADD GRC) in the US (Maine).

The last (and only) time I attended this meeting was 16 years ago. At that time I was a postdoc at the CCDC working on the GOLD docking software, and presented a poster "Why multiple scoring functions can improve docking performance". Right after my 5-minute flash presentation (see program), some random English guy presented his poster about a database with a funny name that was about to be released.

This was July 2009; three months later ChEMBL 01 would be released.

Even at the time, I recognised that this was a big deal. In academia, research was difficult if not impossible because of the lack of data. Meanwhile industry published research that used their internal data and couldn't be challenged, compared, or built upon. To quote a recent conversation with an industry figure, showing results on ChEMBL keeps everyone honest.

But as well as learning about ChEMBL, I got to know John Overington, who subsequently invited me to present on protein ligand docking and cheminformatics at various training courses he organised at the EBI, even after I moved back to Ireland. For me, this was a really great connection to have as it increased my profile, and I got to meet many of the leading figures in the field. In return, John got the occassional bug report emailed directly to his inbox which I'm sure was exactly what he wanted. :-) We've kept in touch over the years, and he has been a great help when I've needed it.

And so, I am returning to where it all started. Only this time I am the random ~~English~~ Irish guy with the poster about ChEMBL. And who knows, maybe I, in my turn, will meet a future custodian of ChEMBL...?

Sunday, 9 February 2025

MMPA: The case of the missing group Part II

I was reading over my recent blogpost on MMPA when I started wondering whether the approaches I described actually were the best ways of finding the correspondences with the 'missing groups'. Just looking at the diagrams, in each case there's another approach that jumps out...

In the original post, I described taking the third case and manufacturing matches to the first two. What if, instead, we take the first two and look for matches to the third? That is, chop off the asterisks (where attached to a ring), adjust the hydrogens and then see whether there exists a match to a full molecule:

This approach may well be more efficient as the number of implicit hydrogens in a typical molecule is significant.

In the original blogpost, I again described taking the third case and manufacturing a match to the first two. An alternative would be to take the constant parts (i.e. the singly-attached groups) of all potential series, join the two together and then look for a match to a full molecule.

In this case, it's not so clear-cut that this would be more efficient than the original approach described. But if I get around to implementing these alternative approaches, I'll report back...

Saturday, 1 February 2025

SmiZip 2.0 released

SmiZip is a compressor for SMILES strings that uses the unused characters from the extended ASCII set. So, for example, while the character "C" is already used for Carbon, part of Chlorine, etc., the character 'J' could be used to indicate "C(=O)".

While most of the v2.0 changes (listed below) are tidy-ups or minor improvements, there is one in particular that simplifies a particular use case, to encode only into ASCII printable characters. This means that the compressed string can be stored in places where the full 256 characters are not supported, e.g. a JavaScript file (you can encode such characters with Unicode but that would increase the file size).

As an example, I trained on a dataset derived from ChEMBL and was able to compress to 30.8% on a hold-out test set using all available characters (202 multigrams). But even sticking to the printable ASCII character set, I was able to compress to 43.7% despite only having 43 characters available for multigrams. We benefit here from the fact that there are diminishing returns with each additional multigram - the majority of the benefit is derived from the initial multigrams.

As a specific example, here's CHEMBL505931 which is a polysaccharide attached to a sterol and compresses from 322 characters to 138 (42.9%):

[C@]12(O[C@@H](C[C@H]1C)C(=O)CC)[C@]1(CCC3=C(CC[C@H]4[C@@]([C@H](CC[C@]34C)O[C@@H]3O[C@@H]([C@H]([C@@H]([C@H]3O)O)O)CO[C@@H]3OC[C@@H]([C@@H]([C@H]3O[C@@H]3O[C@@H]([C@H]([C@@H]([C@H]3O[C@@H]3O[C@H]([C@@H]([C@H]([C@H]3O)O)O)C)O[C@@H]3OC[C@@H]([C@@H]([C@H]3O[C@@H]3OC[C@H]([C@@H]([C@H]3O)O)O)O)O)O)CO)O)O)(C)CO)[C@@]1(CC2)C)C
compresses to
]12(O>,D1C$ *):]1,*3=~*D4:@](D,C:]34CV>3O>(D(>(D3OVV$O>3OC>(>(D3O>3O>(D(>(D3O>3OD(>(D(D3OVV$V>3OC>(>(D3O>3OCD(>(D3OVVVVV$OVV),$O):@]1,C2$$

Changes: v2.0 (2025-01)

find_best_ngrams.py: new option --non-printable to facilitate encoding into printable ASCII charactersr
find_best_ngrams.py: --chars is now required (help text provides a reasonable starting point) to force the user to consider the list
find_best_ngrams.py: if the end of the training .SMI file is reached, the script wraps around to the start
find_best_ngrams.py: --cr corrected to --lf.
compress.py: A better error message is generated if an attempt is made to encode a character not present in the JSON file
compress.py: support added for .SMI files without titles.

Thanks to Adriano Rutz (@adafede) and Charles Tapley Hoyt (@cthoyt) for feedback.

Friday, 24 January 2025

MMPA: The case of the missing group

An efficient approach to Matched Molecular Pair Analysis (MMPA) consists of fragmenting molecules at a subset of single bonds and then collating the results to find cases where there's the same scaffold (or linker) but a different R group (or groups). If my memory serves me correctly this algorithm was first described by Wagener and Lommerse, but it was the publication by Hussain and Rea that really popularised it.

What I want to discuss here is the importance of handling the 'missing group'. To simplify things, let's fragment only at non-ring single bonds that are attached to rings.

Single cut

Consider the fragmentation of the following three molecules:

Molecule A:

Molecule B:

Molecule C:

For molecules A and B, the algorithm will break the bond to the F and Cl, replacing it by a dummy atom; the two scaffolds are identical and so A and B will be identified as a matched pair.

We also want C to be identified as part of the same matched series, but because the hydrogens are implicit, there's nothing to break off that would yield the matching scaffold. You could convert all implicit Hs to explicit which, well, would work; however you would have to hand in your cheminformatics badge at the desk due to the performance hit :-). Instead, we handle this empty group separately by iterating over the ring atoms that have an implicit H and generating the same scaffold fragments that would have been generated in the explicit case, each time associating it with the *[H] R group:

This is shown here for Molecule C, but the same procedure would apply to A and B. In fact, a significant proportion of the run time will just be dealing with this case as hydrogens on rings are not uncommon.

Double cut

The typical use of double cuts is to find molecules that are the same except for replacement of a linker between two rings. After a first cut, a second cut is done on the results if possible.

Molecule D

Molecule E

Molecule F

Molecules D and E share the same scaffolds, and so their linkers are identified as a matched pair. This isn't the case for Molecule F as it only gets as far as a single cut after which there is nothing left to cut for the next step.

The solution is to treat all single cut results as an additional outcome of the double cut process if (and only if) both ends are attached to a ring. After renumbering one of the locants, we associate the resulting scaffold pair with the empty linker: *-*. This procedure doesn't change anything for D and E, but gives the necessary result for F:

Are you finished?

Yes. Fortunately, these corner cases don't affect higher numbers of cuts. The smallest group that can have three or more cuts is a single atom, and this will be handled automatically.

ReplaceR - common R group and linker replacements

Introducing ReplaceR, a webapp that suggests R group or linker replacements that a medicinal chemist might consider making after the query, as well as those typically made before the query. This builds upon the ideas I described in a previous post.

In the course of a medicinal chemistry project, molecules progress through a series of changes that often involve replacement of R groups (or linkers). A key insight is that there is a certain order to these changes; in general, R groups that are simpler (in size, or in ease of synthesis, or just cheaper) will be made before those are more complex. It would be unusual to make the bromide without having already tried chloride; if you see a tetrazole, then a carboxylic acid was likely already made.

Given true time series data from projects, we could work out this information. However, even without this, we can infer similar information from co-occurrence data in matched molecular series across publications and patents in ChEMBL (see this 2019 talk and blogpost of mine for more details). To give a sense of the method, consider that butyl co-occurs with propyl in matched series in 801 papers, but given that propyl occurs in 2765 series versus butyl's 1905 we infer that butyl is typically made after propyl (if at all). Groups in the 'after' panel are sorted by co-occurence with the query. Groups in the 'before' panel are instead sorted by co-occurrence/frequency - if we didn't do this, hydrogen or methyl would always be top of the list!

Note that no biological activity data or physicochemical property data (e.g. MWs) is used in this analysis. The suggestions are not based on bioisosteres, SAR transfer or QSAR models, but are simply a reflection of historical choices made by medicinal chemists across a large number of publications. I note that the 2021 papers by Mahendra Awale et al and Kosuke Takeuchi et al touch on many of the same issues.

This webapp may be useful to medicinal chemists. But in particular, computational chemists and cheminformaticians will find the app and the data behind it useful to support automatic enumeration and virtual screening of likely next steps. Note that the webapp just lists the 'nearest' 12 groups before and after. The full dataset is available on Zenodo. You can find me at baoilleach@gmail.com or file an issue on GitHub.

Cheminformatics Credits: EMBL-EBI Chemical Biology Services Team for ChEMBL data, John Mayfield et al for Chemistry Development Kit and CDKDepict, Geoff Hutchison et al for Open Babel, Greg Landrum et al for RDKit, Chen Jiang for openbabel.js, Michel Moreau for rdkit.js, Peter Ertl and Bruno Bienfait for JSME, Noel O'Boyle (me) for ReplaceR.

Tuesday, 14 January 2025

Release of partialsmiles v2.0 - A validating parser for partial SMILES

A while back, I wrote a library called partialsmiles to parse and validate prefixes of SMILES strings. I've just updated it with a minor enhancement (it now returns the full state of the parser, not just the molecule) and thought I would show an example of typical use in the context of generative models.

For a change, let's not use neural nets. Using molecules from ChEMBL converted to random SMILES, I looked at all runs of 8 tokens and counted up the frequency of occurrence of the following token. For example, following Cc1ccccc, '1' was observed 340389 times, '2' 582 times (etc.); following COc1ccc(, 'c' was observed 33288 times, then 'C', 'N', 'O', 'Cl', etc. To handle the start, I padded with 8 start tokens.

Once these frequencies are in hand, we can use this to power a generative model where we sample from the distribution of frequencies at each step. Just like with a neural net, we can apply a temperature factor; this determines to what extent we exaggerate or downplay the difference between frequencies. A value of 1.0 means use the frequencies/probabilities as provided, while an extreme value of 10 would mean treat all (non-zero probability) tokens as equally likely. The other extreme of 0.1 would mean only ever pick the most likely token.

The rationale for partialsmiles is that we can perform checks for validity while generating the tokens, rather than waiting until we complete the SMILES string (i.e. when we generate an END token). If, at some point, the partial SMILES string turns out to be invalid we could just discard the string; this would speed things up but not change the result. An alternative, which I show below, is to avoid sampling tokens that will lead to an invalid SMILES; this will increase the number of valid SMILES thus making the process more efficient. Here I do this by setting the frequencies of those tokens to 0.

If we use the code below to generate 1000 SMILES strings and use a temperature factor of 1.0, the number of valid strings increases from 230 to 327 when partialsmiles is used. A smaller increase is observed for a temperature factor of 1.2 (205 to 247) and larger for 0.8 (263 to 434).

I note in passing that it's possible to force the model to avoid adding an END token unless it results in a valid full SMILES string (see PREVENT_FULL_INVALID_STRING below). I'm afraid that doesn't work out very well. Some really long SMILES string are generated that just keep going until the monkeys typing it eventually close the ring or parenthesis or whatever is the problem. Some things you just gotta let go, or patch up afterwards.

Oh, you want to see some of the generated molecules? No, you don't; you really don't :-) There's a reason neural nets are popular. :-) You can find the associated code on GitHub.

import pickle
import tqdm
import numpy as np
import partialsmiles as ps

np.random.seed(1)

PREVENT_FULL_INVALID_STRING = False

SOS, EOS = range(2)

TOKENS = [
  '^', '$', # i.e. SOS, EOS
  'c', 'C', '(', ')', 'O', '1', '2', '=', 'N', 'n', '3', '[C@H]',
  '[C@@H]', '4', 'F', '-', 'S', '/', 'Cl', '[nH]', 's', 'o', '5', '[C@]', '#',
  '[C@@]', '\\', '[O-]', '[N+]', 'Br', '6', 'P', '7', '8', '9']

TOKEN_IDXS = list(range(len(TOKENS)))

def find_disallowed_tokens(seq, freqs):
    disallowed = set()
    smi = "".join(TOKENS[x] for x in seq)
    for i, x in enumerate(TOKENS[2:], 2):
        if freqs[i] > 0:
            try:
                ps.ParseSmiles(smi + x, partial=True)
            except ps.Error:
                disallowed.add(i)

    if PREVENT_FULL_INVALID_STRING:
        if freqs[EOS] > 0:
            try:
                ps.ParseSmiles(smi, partial=False)
            except:
                disallowed.add(EOS)
    return disallowed

def generate(all_freqs, prefix_length, temperature, avoid_invalid):
    seq = [SOS] * prefix_length
    i = 0
    while True:
        idx = (seq[i]<<6) + (seq[i+1]<<12) + (seq[i+2]<<18) + (seq[i+3]<<24) + (seq[i+4]<<30) + (seq[i+5]<<36) + (seq[i+6]<<42) + (seq[i+7]<<48)
        freqs = [all_freqs.get(idx+j, 0) for j in TOKEN_IDXS]
        if avoid_invalid:
            disallowed_tokens = find_disallowed_tokens(seq[prefix_length:], freqs)
            freqs = [all_freqs.get(idx+j, 0) if j not in disallowed_tokens else 0 for j in TOKEN_IDXS]
            if all(freq==0 for freq in freqs):
                return ""
        probs = freqs/np.sum(freqs)
        p = np.power(probs, 1.0/temperature)
        q = p / np.sum(p)
        chosen = np.random.choice(TOKEN_IDXS, 1, p=q)
        seq.append(chosen[0])
        if chosen == 1: # EOS
            break
        i += 1
    return seq[PREFIX_LENGTH:-1]

if __name__ == "__main__":
    PREFIX_LENGTH = 8
    TEMPERATURE = 0.8
    AVOID_INVALID = True

    with open(f"../nbu/probabilities.{PREFIX_LENGTH}.2.pkl", "rb") as inp:
        all_freqs = pickle.load(inp)
        print("Loaded...")

    ofile = f"../nbu/out.{PREFIX_LENGTH}_{TEMPERATURE}_{AVOID_INVALID}.smi"
    with open(ofile, "w") as out:
        for i in tqdm.tqdm(range(1000)):
            tokens = generate(all_freqs, PREFIX_LENGTH, TEMPERATURE, AVOID_INVALID)
            smi = "".join(TOKENS[x] for x in tokens)
            out.write(f"{smi}\n")