Friday, 24 January 2025

MMPA: The case of the missing group

An efficient approach to Matched Molecular Pair Analysis (MMPA) consists of fragmenting molecules at a subset of single bonds and then collating the results to find cases with the same scaffold but a different R group (or, for double cuts, the same pair of scaffolds but a different linker). If my memory serves me correctly, this algorithm was first described by Wagener and Lommerse, but it was the publication by Hussain and Rea that really popularised it.

What I want to discuss here is the importance of handling the 'missing group'. To simplify things, let's fragment only at non-ring single bonds that are attached to rings.

Single cut

Consider the fragmentation of the following three molecules:

Molecule A:

Molecule B:

Molecule C:

For molecules A and B, the algorithm will break the bond to the F and Cl, replacing it by a dummy atom; the two scaffolds are identical and so A and B will be identified as a matched pair.

We also want C to be identified as part of the same matched series, but because the hydrogens are implicit, there's nothing to break off that would yield the matching scaffold. You could convert all implicit Hs to explicit ones which, well, would work; however, you would have to hand in your cheminformatics badge at the desk due to the performance hit :-). Instead, we handle this empty group separately by iterating over the ring atoms that have an implicit H and generating the same scaffold fragments that would have been generated in the explicit case, each time associating the fragment with the *[H] R group:

This is shown here for Molecule C, but the same procedure would apply to A and B. In fact, a significant proportion of the run time will be spent just dealing with this case, as hydrogens on rings are not uncommon.
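To make this concrete, here is a minimal sketch of the empty-group handling using RDKit; empty_group_scaffolds is a hypothetical helper name and this is illustrative only, not necessarily how the real implementation works:

from rdkit import Chem

def empty_group_scaffolds(mol):
    """For each ring atom with an implicit H, yield the scaffold that would have
    been generated had that H been explicit, paired with the *[H] R group."""
    for atom in mol.GetAtoms():
        if atom.IsInRing() and atom.GetNumImplicitHs() > 0:
            rw = Chem.RWMol(mol)
            dummy = rw.AddAtom(Chem.Atom(0))  # atomic number 0 is the '*' attachment point
            rw.AddBond(atom.GetIdx(), dummy, Chem.BondType.SINGLE)
            Chem.SanitizeMol(rw)  # recomputes implicit Hs, so one H is 'replaced' by the dummy
            yield Chem.MolToSmiles(rw), "*[H]"

For Molecule C, this places an attachment point in the same position as breaking off the F in A or the Cl in B would, so all three molecules can be collated into the same matched series.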

Double cut

The typical use of double cuts is to find molecules that are the same except for replacement of a linker between two rings. After a first cut, a second cut is done on the results if possible.

Molecule D

Molecule E

Molecule F

Molecules D and E share the same scaffolds, and so their linkers are identified as a matched pair. This isn't the case for Molecule F, as it only gets as far as a single cut, after which there is nothing left to cut in the next step.

The solution is to treat all single cut results as an additional outcome of the double cut process if (and only if) both ends are attached to a ring. After renumbering one of the locants, we associate the resulting scaffold pair with the empty linker: *-*. This procedure doesn't change anything for D and E, but gives the necessary result for F:
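A sketch of this empty-linker handling, again assuming RDKit (empty_linker_records is a hypothetical helper, shown for illustration only):

from rdkit import Chem

def empty_linker_records(mol):
    """For each non-ring single bond that directly joins two ring atoms, yield the
    pair of scaffolds (attachment points renumbered 1 and 2) plus the empty linker."""
    for bond in mol.GetBonds():
        if bond.IsInRing() or bond.GetBondType() != Chem.BondType.SINGLE:
            continue
        if not (bond.GetBeginAtom().IsInRing() and bond.GetEndAtom().IsInRing()):
            continue
        pieces = Chem.FragmentOnBonds(mol, [bond.GetIdx()],
                                      dummyLabels=[(1, 2)])  # renumber: one end gets 1, the other 2
        scaffolds = [Chem.MolToSmiles(frag)
                     for frag in Chem.GetMolFrags(pieces, asMols=True)]
        yield scaffolds, "*-*"  # the empty linker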

Are you finished?

Yes. Fortunately, these corner cases don't affect higher numbers of cuts. The smallest group that can have three or more cuts is a single atom, and this will be handled automatically.

ReplaceR - common R group and linker replacements

Introducing ReplaceR, a webapp that suggests R group or linker replacements that a medicinal chemist might consider making after the query, as well as those typically made before the query. This builds upon the ideas I described in a previous post.

In the course of a medicinal chemistry project, molecules progress through a series of changes that often involve replacement of R groups (or linkers). A key insight is that there is a certain order to these changes; in general, R groups that are simpler (in size, or in ease of synthesis, or just cheaper) will be made before those that are more complex. It would be unusual to make the bromide without having already tried the chloride; if you see a tetrazole, then a carboxylic acid was likely already made.

Given true time series data from projects, we could work out this information. However, even without this, we can infer similar information from co-occurrence data in matched molecular series across publications and patents in ChEMBL (see this 2019 talk and blogpost of mine for more details). To give a sense of the method, consider that butyl co-occurs with propyl in matched series in 801 papers; given that propyl occurs in 2765 series versus butyl's 1905, we infer that butyl is typically made after propyl (if at all). Groups in the 'after' panel are sorted by co-occurrence with the query. Groups in the 'before' panel are instead sorted by co-occurrence/frequency - if we didn't do this, hydrogen or methyl would always be top of the list!
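As a toy sketch of this ordering and ranking (the dictionaries and numbers below are made up for illustration, and this is my reading of the description rather than the actual ReplaceR code):

series_count = {"propyl": 2765, "butyl": 1905, "methyl": 20000}      # series in which each group occurs (illustrative)
cooccurrence = {("butyl", "propyl"): 801, ("butyl", "methyl"): 900}   # series in which both occur (illustrative)

def cooccur(a, b):
    return cooccurrence.get((a, b), 0) + cooccurrence.get((b, a), 0)

def after_candidates(query, groups):
    """Groups typically made after the query (i.e. less frequent overall),
    sorted by raw co-occurrence with the query."""
    cands = [g for g in groups if series_count[g] < series_count[query]]
    return sorted(cands, key=lambda g: cooccur(query, g), reverse=True)

def before_candidates(query, groups):
    """Groups typically made before the query (i.e. more frequent overall), sorted by
    co-occurrence divided by the candidate's frequency so that H or methyl doesn't always win."""
    cands = [g for g in groups if series_count[g] > series_count[query]]
    return sorted(cands, key=lambda g: cooccur(query, g) / series_count[g], reverse=True)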

Note that no biological activity data or physicochemical property data (e.g. MWs) is used in this analysis. The suggestions are not based on bioisosteres, SAR transfer or QSAR models, but are simply a reflection of historical choices made by medicinal chemists across a large number of publications. I note that the 2021 paper by Mahendra Awale et al touches on many of the same issues.

This webapp may be useful to medicinal chemists. But in particular, computational chemists and cheminformaticians will find the app and the data behind it useful to support automatic enumeration and virtual screening of likely next steps. Note that the webapp just lists the 'nearest' 12 groups before and after. If you are interested in the full dataset, or just want to give feedback, drop me a line at baoilleach@gmail.com or file an issue on GitHub.

Cheminformatics Credits: EMBL-EBI Chemical Biology Services Team for ChEMBL data, John Mayfield et al for Chemistry Development Kit and CDKDepict, Geoff Hutchison et al for Open Babel, Greg Landrum et al for RDKit, Chen Jiang for openbabel.js, Michel Moreau for rdkit.js, Peter Ertl and Bruno Bienfait for JSME, Noel O'Boyle (me) for ReplaceR.

Tuesday, 14 January 2025

Release of partialsmiles v2.0 - A validating parser for partial SMILES

A while back, I wrote a library called partialsmiles to parse and validate prefixes of SMILES strings. I've just updated it with a minor enhancement (it now returns the full state of the parser, not just the molecule) and thought I would show an example of typical use in the context of generative models.

For a change, let's not use neural nets. Using molecules from ChEMBL converted to random SMILES, I looked at all runs of 8 tokens and counted up the frequency of occurrence of the following token. For example, following Cc1ccccc, '1' was observed 340389 times, '2' 582 times (etc.); following COc1ccc(, 'c' was observed 33288 times, then 'C', 'N', 'O', 'Cl', etc. To handle the start, I padded with 8 start tokens.
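For reference, the frequency table can be built along the following lines (a sketch only - the actual counting code is on GitHub; tokenize is a hypothetical helper that splits a SMILES string into indices into the TOKENS list used in the code below, with SOS and EOS as defined there):

from collections import defaultdict

def count_followers(smiles_list, prefix_length=8):
    all_freqs = defaultdict(int)
    for smi in smiles_list:
        seq = [SOS] * prefix_length + tokenize(smi) + [EOS]
        for i in range(len(seq) - prefix_length):
            # pack the run of 8 tokens into an integer key, leaving the low 6 bits
            # free for the index of the token that follows the run
            idx = sum(tok << (6 * (k + 1)) for k, tok in enumerate(seq[i:i + prefix_length]))
            all_freqs[idx + seq[i + prefix_length]] += 1
    return all_freqs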

Once these frequencies are in hand, we can use them to power a generative model where we sample from the distribution of frequencies at each step. Just like with a neural net, we can apply a temperature factor; this determines to what extent we exaggerate or downplay the differences between frequencies. A value of 1.0 means use the frequencies/probabilities as provided, an extreme value of 10 would mean treating all (non-zero probability) tokens as roughly equally likely, while the other extreme of 0.1 would mean almost always picking the most likely token.
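As a quick illustration, applying different temperatures to the counts quoted above for the tokens following Cc1ccccc:

import numpy as np

freqs = np.array([340389.0, 582.0])   # counts of '1' and '2' following "Cc1ccccc"
for temperature in (0.1, 1.0, 10.0):
    p = np.power(freqs / freqs.sum(), 1.0 / temperature)
    print(temperature, p / p.sum())   # 0.1 is near-greedy, 10.0 is much flatter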

The rationale for partialsmiles is that we can perform checks for validity while generating the tokens, rather than waiting until we complete the SMILES string (i.e. when we generate an END token). If, at some point, the partial SMILES string turns out to be invalid, we could just discard the string; this would speed things up but not change the result. An alternative, which I show below, is to avoid sampling tokens that will lead to an invalid SMILES; this will increase the number of valid SMILES, thus making the process more efficient. Here I do this by setting the frequencies of those tokens to 0.

If we use the code below to generate 1000 SMILES strings and use a temperature factor of 1.0, the number of valid strings increases from 230 to 327 when partialsmiles is used. A smaller increase is observed for a temperature factor of 1.2 (205 to 247) and larger for 0.8 (263 to 434).

I note in passing that it's possible to force the model to avoid adding an END token unless it results in a valid full SMILES string (see PREVENT_FULL_INVALID_STRING below). I'm afraid that doesn't work out very well. Some really long SMILES strings are generated that just keep going until the monkeys typing them eventually close the ring or parenthesis or whatever is the problem. Some things you just gotta let go, or patch up afterwards.

Oh, you want to see some of the generated molecules? No, you don't; you really don't :-) There's a reason neural nets are popular. :-) You can find the associated code on GitHub.

import pickle
import tqdm
import numpy as np
import partialsmiles as ps

np.random.seed(1)

PREVENT_FULL_INVALID_STRING = False

SOS, EOS = range(2)

TOKENS = [
  '^', '$', # i.e. SOS, EOS
  'c', 'C', '(', ')', 'O', '1', '2', '=', 'N', 'n', '3', '[C@H]',
  '[C@@H]', '4', 'F', '-', 'S', '/', 'Cl', '[nH]', 's', 'o', '5', '[C@]', '#',
  '[C@@]', '\\', '[O-]', '[N+]', 'Br', '6', 'P', '7', '8', '9']

TOKEN_IDXS = list(range(len(TOKENS)))

def find_disallowed_tokens(seq, freqs):
    disallowed = set()
    smi = "".join(TOKENS[x] for x in seq)
    for i, x in enumerate(TOKENS[2:], 2):
        if freqs[i] > 0:
            try:
                ps.ParseSmiles(smi + x, partial=True)
            except ps.Error:
                disallowed.add(i)

    if PREVENT_FULL_INVALID_STRING:
        if freqs[EOS] > 0:
            try:
                ps.ParseSmiles(smi, partial=False)
            except:
                disallowed.add(EOS)
    return disallowed

def generate(all_freqs, prefix_length, temperature, avoid_invalid):
    seq = [SOS] * prefix_length
    i = 0
    while True:
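        # look up the counts for the current window of 8 tokens: the window is packed
        # into an integer key whose low 6 bits are left free for the next-token index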
        idx = ((seq[i]<<6) + (seq[i+1]<<12) + (seq[i+2]<<18) + (seq[i+3]<<24)
               + (seq[i+4]<<30) + (seq[i+5]<<36) + (seq[i+6]<<42) + (seq[i+7]<<48))
        freqs = [all_freqs.get(idx+j, 0) for j in TOKEN_IDXS]
        if avoid_invalid:
            disallowed_tokens = find_disallowed_tokens(seq[prefix_length:], freqs)
            freqs = [all_freqs.get(idx+j, 0) if j not in disallowed_tokens else 0 for j in TOKEN_IDXS]
            if all(freq==0 for freq in freqs):
                return ""
        probs = freqs/np.sum(freqs)
        p = np.power(probs, 1.0/temperature)
        q = p / np.sum(p)
        chosen = np.random.choice(TOKEN_IDXS, 1, p=q)
        seq.append(chosen[0])
        if chosen[0] == EOS:
            break
        i += 1
    return seq[prefix_length:-1]   # drop the SOS padding and the trailing EOS

if __name__ == "__main__":
    PREFIX_LENGTH = 8
    TEMPERATURE = 0.8
    AVOID_INVALID = True

    with open(f"../nbu/probabilities.{PREFIX_LENGTH}.2.pkl", "rb") as inp:
        all_freqs = pickle.load(inp)
        print("Loaded...")

    ofile = f"../nbu/out.{PREFIX_LENGTH}_{TEMPERATURE}_{AVOID_INVALID}.smi"
    with open(ofile, "w") as out:
        for i in tqdm.tqdm(range(1000)):
            tokens = generate(all_freqs, PREFIX_LENGTH, TEMPERATURE, AVOID_INVALID)
            smi = "".join(TOKENS[x] for x in tokens)
            out.write(f"{smi}\n")

Tuesday, 31 December 2024

push and pop directories

Here's a tool that I've been using daily since 2005 at least. I had a write-up on my old website, but with its recent disappearance it seems like a good time to update it and publish it here.

If you add the code below to your .bashrc, then you can type "push here" in one terminal window to record the current directory under the name 'here', and then "pop here" in another terminal window to change to the same directory.

In other words, it's a simple bookmark system for directories backed by a persistent on-disk file-based database (i.e. a text file :-) ). You may find this useful to support sshing into multiple machines that have a shared home folder, or to synchronise windows in a screen session or tabs in a terminal emulator.

See you all next year...🎉

popnamelist="$HOME/.popnames.txt"

# Supporting code for 'pop'
function popname () {

  if [ -z "$1" ]; then
      echo "  Syntax: popname TAG"
      return 1
  fi

  if [ -f "$popnamelist" ]; then
      grep "^$1 " "$popnamelist" > "$popnamelist.tmp"
      if [ ! -s "$popnamelist.tmp" ]; then
          echo "No such tag"
      else
          awk '{print $2}' "$popnamelist.tmp"
      fi
  else
      echo "  There isn't anything to pop!"
      return 1
  fi
}

# Pushes the current directory into a 'memory slot' indexed by a tag
# See also 'pop'
function push() {

  if [ -z "$1" ]; then
      echo "  Syntax: push TAG"
      return 1
  fi

  touch "$popnamelist"
  grep -v "^$1 " "$popnamelist" > "$popnamelist.tmp"

  echo "$1 $(pwd)" >> "$popnamelist.tmp"
  sort "$popnamelist.tmp" > "$popnamelist"
  rm "$popnamelist.tmp"
}

# Pops the directory associated with 'tag' and makes it the current
# directory.
# Then you can use: pop mytag
# or echo `popname mytag`, or cat `popname mytag`/readme.txt

function pop() {
  if [ -z "$1" ]; then
    echo "  Syntax: pop TAG"
    echo
    cat "$popnamelist"
  else
    cd "$(popname "$1")"
  fi
}

Saturday, 28 December 2024

The effect of ties on evaluation of virtual screens

For a virtual screening method where the associated score is prone to ties (e.g. a fingerprint comparison), care must be taken to handle these appropriately or the results will be biased, making either the actives or the inactives appear to perform better than they really do.

Let's work through a specific example, starting with some setup code assigning scores to a set of actives and separately to a set of inactives:

import random
random.seed(42)

INACTIVE, ACTIVE = range(2)

def mean_rank_of_actives(data):
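    # 0-based positions of the actives in the (already sorted) list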
    ranks = [i for (i, x) in enumerate(data) if x[1]==ACTIVE]
    return sum(ranks)/len(ranks)

if __name__ == "__main__":
    # Generate scores for actives and inactives
    actives = [1.0]*10 + [0.5]*10
    random.shuffle(actives)
    inactives = [0.5]*10 + [0.0]*70
    random.shuffle(inactives)

I've baked in a large number of ties. Half of the actives have a score of 0.5, a value that is shared by an equal number of inactives. We will use the mean rank of the actives as a proxy here for ROC AUC; one is a linear function of the other.
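As an aside, here is a small sketch of that relationship (not part of the evaluation code below): with 0-based, best-first ranks, the ROC AUC can be written as a linear function of the mean rank of the actives.

def auc_from_mean_rank(mean_rank, n_actives, n_inactives):
    n_total = n_actives + n_inactives
    return (n_total - mean_rank - (n_actives + 1) / 2.0) / n_inactives

# e.g. if all 20 actives rank ahead of the 80 inactives, the mean rank is 9.5 and the AUC is 1.0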

Our first attempt at evaluating performance is as follows:

def first(actives, inactives):
    everything = [(x, ACTIVE) for x in actives] + [(y, INACTIVE) for y in inactives]
    everything.sort(reverse=True) # rank the entries - highest score first
    return mean_rank_of_actives(everything)

It turns out that all of the actives are sorted ahead of the inactives, giving an overoptimistic mean rank of 9.5. Why? Because the ties are resolved based on the second item in the tuples, and ACTIVE (i.e. 1) is greater than INACTIVE (i.e. 0). In fact, swapping the values of these flags changes the results to the overpessimistic value of 14.5. Using a piece of text, e.g. "active" or "inactive", has the same problem.

No worries - when we sort, let's make sure we sort using the score values only:

def second(actives, inactives):
    everything = [(x, ACTIVE) for x in actives] + [(y, INACTIVE) for y in inactives]
    everything.sort(reverse=True, key=lambda x:x[0])
    return mean_rank_of_actives(everything)

Unfortunately, the result is still 9.5. Python's sort is a stable sort, which means that items retain their original order if there are ties. Since all of the actives come first in the original list, the tied actives still come before the tied inactives after sorting. To get rid of the influence of the original order, we need to shuffle the list into a random order:

def third(actives, inactives):
    everything = [(x, ACTIVE) for x in actives] + [(y, INACTIVE) for y in inactives]
    random.shuffle(everything)
    everything.sort(reverse=True, key=lambda x:x[0])
    return mean_rank_of_actives(everything)

This gives 12.0 +/- 0.67 (from 1000 repetitions), a more accurate assessment of the performance.

I think that this is an interesting little problem because it's rather subtle; even when you think you've solved it (as in the second solution above), some other issue rears its head. The only way to be sure is to test with tied data.

Note: If we are interested specifically in handling ties correctly in ranks, a common approach is to assign the same rank to all members of the tie. That's not really the point I want to make here. In practice, the evaluation metric might be something quite different, such as enrichment at 1%, which would not be amenable to such a solution.
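For completeness, here is a sketch of that average-rank approach (assuming scipy, which isn't otherwise used in this post); it reproduces the 12.0 above without any shuffling:

from scipy.stats import rankdata

def mean_tied_rank_of_actives(actives, inactives):
    scores = actives + inactives
    # rankdata ranks ascending from 1, so negate the scores to put the best first,
    # and subtract 1 to match the 0-based ranks used above
    ranks = rankdata([-s for s in scores], method="average") - 1
    return sum(ranks[:len(actives)]) / len(actives)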

Friday, 20 December 2024

Engagement on X vs Mastodon vs BlueSky

Egon posted recently on the situation with scientists leaving X and moving to another microblogging platform. He pointed out some pros and cons of BlueSky vs Mastodon and how it may be possible to bridge between the two.

This post doesn't really speak to the points that Egon raises, but rather gives an n=1 measure of the current level of community engagement on different platforms. I posted a link to my previous blog post on the three platforms mentioned in the title, and here's what I found:
  • X - Since Sept 2011 - I have 1190 followers - 270 views (34 link clicks), 4 comments, 11 additional likes (including one from the poteen brewery I often get confused with :-) )
  • Mastodon - Since Dec 2022 - I have 147 followers - 2 comments, 1 additional like
  • BlueSky - Since 15 Dec 2024, i.e. one day before the blog post - I have 28 followers - 2 comments, 11 additional likes/reposts (including one from "Low Quality Facts" - make of that what you will)
  • My blog - Since April 2007 - I no longer have visibility of the number of followers, but there are 57 people who use the follow.it link to follow by email - 428 views, 3 comments

First of all, the numbers confirm that X is no longer where the community is; despite my being on BlueSky for only 5 days and having just a handful of followers, the engagement there just about matches X. Mastodon is an interesting platform, and great to have as an alternative should BlueSky not pan out, but the numbers and engagement have not taken off; it feels quiet over there, in a way that BlueSky doesn't. In each case, to actually find out the news, the user had to click through to the blog, so the figure of 428 there should give the overall total across everything (including LinkedIn, where I also posted).

From the little time I have spent on BlueSky, it has the feel of the old Twitter. The Android app needs a bit of work (scrolling is very jerky on the main feed) but I'm sure it will be sorted. I moved early to Mastodon as I would have liked it to take off, but that hasn't happened. I note that BlueSky has plans to monetise somehow and time will tell whether this turns it into something closer to the current X. In the meanwhile, I guess I'll see you over there.

@baoilleach.bsky.social/@baoilleach@mstdn.science/@baoilleach

Monday, 16 December 2024

Moving to pastures new, but still in the same field Part IV

Following on from my previous post on the topic, I will shortly be leaving Nxera Pharma (Sosei Heptares as was).

The last five years at Nxera have passed quickly. I've gone from "what is this GPCR you speak of?" to providing tools and resources to support structure-based GPCR drug discovery, and indeed doing it myself. I'll miss working with my colleagues in CompChem and throughout Nxera, but I'm leaving in the knowledge that the Cheminformatics line is in very capable hands.

Which brings me to my next move...

I am honoured and super excited to take up the role of Chemical Biology Resources Team Leader at the EBI, leading the team responsible for ChEMBL, SureChEMBL, UniChem and ChEBI. There are some big shoes to fill, but I can't wait to start working with the fantastic team behind these resources. The start date is early February, so wish me luck!