Tuesday, 14 January 2025

Release of partialsmiles v2.0 - A validating parser for partial SMILES

A while back, I wrote a library called partialsmiles to parse and validate prefixes of SMILES strings. I've just updated it with a minor enhancement (it now returns the full state of the parser, not just the molecule) and thought I would show an example of typical use in the context of generative models.

For a change, let's not use neural nets. Using molecules from ChEMBL converted to random SMILES, I looked at all runs of 8 tokens and counted up the frequency of occurrence of the following token. For example, following Cc1ccccc, '1' was observed 340389 times, '2' 582 times (etc.); following COc1ccc(, 'c' was observed 33288 times, then 'C', 'N', 'O', 'Cl', etc. To handle the start, I padded with 8 start tokens.
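In outline, the counting step looks something like the following - a minimal sketch that assumes the SMILES strings have already been split into tokens (the `count_followers` helper is illustrative, not the actual code):

```python
from collections import defaultdict

def count_followers(token_seqs, n=8, sos="^", eos="$"):
    """For each run of n tokens, count how often each token follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in token_seqs:
        # Pad with n start tokens and terminate with an end token
        padded = [sos] * n + list(seq) + [eos]
        for i in range(len(padded) - n):
            context = tuple(padded[i:i + n])
            counts[context][padded[i + n]] += 1
    return counts

# e.g. toluene, pre-tokenised
counts = count_followers([["C", "c", "1", "c", "c", "c", "c", "c", "1"]])
```

Running this over random SMILES for the whole of ChEMBL gives the frequency tables described above.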

Once these frequencies are in hand, we can use them to power a generative model where we sample from the distribution of frequencies at each step. Just like with a neural net, we can apply a temperature factor, which determines to what extent we exaggerate or downplay the differences between frequencies. A value of 1.0 means use the frequencies/probabilities as provided; a high value such as 10 treats all tokens with non-zero probability as almost equally likely, while a low value such as 0.1 almost always picks the most likely token.
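Concretely, applying a temperature T means raising the probabilities to the power 1/T and renormalising. A small illustration (the two-token distribution here is made up):

```python
import numpy as np

def apply_temperature(freqs, temperature):
    """Raise probabilities to the power 1/T and renormalise.
    T > 1 flattens the distribution; T < 1 sharpens it."""
    probs = np.asarray(freqs, dtype=float) / np.sum(freqs)
    p = np.power(probs, 1.0 / temperature)
    return p / np.sum(p)

print(apply_temperature([90, 10], 1.0))   # unchanged: [0.9, 0.1]
print(apply_temperature([90, 10], 10.0))  # flattened towards uniform
print(apply_temperature([90, 10], 0.1))   # sharpened towards the most likely token
```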

The rationale for partialsmiles is that we can perform checks for validity while generating the tokens, rather than waiting until we complete the SMILES string (i.e. when we generate an END token). If, at some point, the partial SMILES string turns out to be invalid we could just discard the string; this would speed things up but not change the result. An alternative, which I show below, is to avoid sampling tokens that will lead to an invalid SMILES; this will increase the number of valid SMILES thus making the process more efficient. Here I do this by setting the frequencies of those tokens to 0.

If we use the code below to generate 1000 SMILES strings and use a temperature factor of 1.0, the number of valid strings increases from 230 to 327 when partialsmiles is used. A smaller increase is observed for a temperature factor of 1.2 (205 to 247) and larger for 0.8 (263 to 434).

I note in passing that it's possible to force the model to avoid adding an END token unless it results in a valid full SMILES string (see PREVENT_FULL_INVALID_STRING below). I'm afraid that doesn't work out very well. Some really long SMILES strings are generated that just keep going until the monkeys typing them eventually close the ring or parenthesis or whatever the problem is. Some things you just gotta let go, or patch up afterwards.

Oh, you want to see some of the generated molecules? No, you don't; you really don't :-) There's a reason neural nets are popular. :-) You can find the associated code on GitHub.

import pickle
import tqdm
import numpy as np
import partialsmiles as ps

np.random.seed(1)

PREVENT_FULL_INVALID_STRING = False

SOS, EOS = range(2)

TOKENS = [
  '^', '$', # i.e. SOS, EOS
  'c', 'C', '(', ')', 'O', '1', '2', '=', 'N', 'n', '3', '[C@H]',
  '[C@@H]', '4', 'F', '-', 'S', '/', 'Cl', '[nH]', 's', 'o', '5', '[C@]', '#',
  '[C@@]', '\\', '[O-]', '[N+]', 'Br', '6', 'P', '7', '8', '9']

TOKEN_IDXS = list(range(len(TOKENS)))

def find_disallowed_tokens(seq, freqs):
    disallowed = set()
    smi = "".join(TOKENS[x] for x in seq)
    for i, x in enumerate(TOKENS[2:], 2):
        if freqs[i] > 0:
            try:
                ps.ParseSmiles(smi + x, partial=True)
            except ps.Error:
                disallowed.add(i)

    if PREVENT_FULL_INVALID_STRING:
        if freqs[EOS] > 0:
            try:
                ps.ParseSmiles(smi, partial=False)
            except ps.Error:
                disallowed.add(EOS)
    return disallowed

def generate(all_freqs, prefix_length, temperature, avoid_invalid):
    seq = [SOS] * prefix_length
    i = 0
    while True:
        # Pack the current window of 8 tokens into a single integer key,
        # 6 bits per token, leaving the low 6 bits for the following token
        idx = (seq[i]<<6) + (seq[i+1]<<12) + (seq[i+2]<<18) + (seq[i+3]<<24) + (seq[i+4]<<30) + (seq[i+5]<<36) + (seq[i+6]<<42) + (seq[i+7]<<48)
        freqs = [all_freqs.get(idx+j, 0) for j in TOKEN_IDXS]
        if avoid_invalid:
            disallowed_tokens = find_disallowed_tokens(seq[prefix_length:], freqs)
            freqs = [0 if j in disallowed_tokens else freq for j, freq in zip(TOKEN_IDXS, freqs)]
        if all(freq == 0 for freq in freqs):
            return "" # dead end - no token was ever observed after this context
        probs = np.asarray(freqs, dtype=float) / np.sum(freqs)
        p = np.power(probs, 1.0/temperature)
        q = p / np.sum(p)
        chosen = np.random.choice(TOKEN_IDXS, p=q)
        seq.append(chosen)
        if chosen == EOS:
            break
        i += 1
    return seq[prefix_length:-1]

if __name__ == "__main__":
    PREFIX_LENGTH = 8
    TEMPERATURE = 0.8
    AVOID_INVALID = True

    with open(f"../nbu/probabilities.{PREFIX_LENGTH}.2.pkl", "rb") as inp:
        all_freqs = pickle.load(inp)
        print("Loaded...")

    ofile = f"../nbu/out.{PREFIX_LENGTH}_{TEMPERATURE}_{AVOID_INVALID}.smi"
    with open(ofile, "w") as out:
        for i in tqdm.tqdm(range(1000)):
            tokens = generate(all_freqs, PREFIX_LENGTH, TEMPERATURE, AVOID_INVALID)
            smi = "".join(TOKENS[x] for x in tokens)
            out.write(f"{smi}\n")

Tuesday, 31 December 2024

push and pop directories

Here's a tool that I've been using daily since 2005 at least. I had a write-up on my old website, but with its recent disappearance it seems like a good time to update it and publish it here.

If you add the code below to your .bashrc, then you can type "push here" in one terminal window to record the current directory under the name 'here', and then "pop here" in another terminal window to change to the same directory.

In other words, it's a simple bookmark system for directories backed by a persistent on-disk file-based database (i.e. a text file :-) ). You may find this useful to support sshing into multiple machines that have a shared home folder, or to synchronise windows in a screen session or tabs in a terminal emulator.

See you all next year...🎉

popnamelist="$HOME/.popnames.txt"

# Supporting code for 'pop'
function popname () {

  if [ -z "$1" ]; then
      echo "  Syntax: popname TAG"
      return 1
  fi

  if [ -f $popnamelist ]; then
      grep "^$1 " $popnamelist > $popnamelist.tmp
      if [ ! -s $popnamelist.tmp ]; then
          echo "No such tag"
      else
          awk '{print $2}' $popnamelist.tmp
      fi
  else
      echo "  There isn't anything to pop!"
      return 1
  fi
}

# Pushes the current directory into a 'memory slot' indexed by a tag
# See also 'pop'
function push() {

  if [ -z "$1" ]; then
      echo "  Syntax: push TAG"
      return 1
  fi

  touch $popnamelist
  grep -v "^$1 " $popnamelist > $popnamelist.tmp

  echo "$1 `pwd`" >> $popnamelist.tmp
  sort $popnamelist.tmp > $popnamelist
  rm $popnamelist.tmp
}

# Pops the directory associated with 'tag' and makes it the current
# directory.
# Then you can use: pop mytag
# or echo `popname mytag`, or cat `popname mytag`/readme.txt

function pop() {
  if [ -z "$1" ]; then
    echo "  Syntax: pop TAG"
    echo
    cat $popnamelist
  else
    cd `popname $1`
  fi
}

Saturday, 28 December 2024

The effect of ties on evaluation of virtual screens

For a virtual screening method whose score is prone to ties (e.g. a fingerprint comparison), care must be taken to handle the ties appropriately or the results will be biased in favour of either the actives or the inactives.

Let's work through a specific example, starting with some setup code assigning scores to a set of actives and separately to a set of inactives:

import random
random.seed(42)

INACTIVE, ACTIVE = range(2)

def mean_rank_of_actives(data):
    ranks = [i for (i, x) in enumerate(data) if x[1]==ACTIVE]
    return sum(ranks)/len(ranks)

if __name__ == "__main__":
    # Generate scores for actives and inactives
    actives = [1.0]*10 + [0.5]*10
    random.shuffle(actives)
    inactives = [0.5]*10 + [0.0]*70
    random.shuffle(inactives)

I've baked in a large number of ties. Half of the actives have a score of 0.5, a value that is shared by an equal number of inactives. We will use the mean rank of the actives as a proxy here for ROC AUC - if I remember correctly, one is a linear function of the other.

Our first attempt at evaluating performance is as follows:

def first(actives, inactives):
    everything = [(x, ACTIVE) for x in actives] + [(y, INACTIVE) for y in inactives]
    everything.sort(reverse=True) # rank the entries - highest score first
    return mean_rank_of_actives(everything)

It turns out that all of the actives are sorted ahead of the inactives, giving an overoptimistic mean rank of 9.5. Why? Because the ties are resolved based on the second item in the tuples, and ACTIVE (i.e. 1) is greater than INACTIVE (i.e. 0). In fact, swapping the values of these flags changes the results to the overpessimistic value of 14.5. Using a piece of text, e.g. "active" or "inactive", has the same problem.
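This is just Python's lexicographic tuple comparison at work, which a two-element example makes plain:

```python
INACTIVE, ACTIVE = range(2)

# Both entries are tied on score; the reverse sort breaks the tie using
# the second element, so the ACTIVE (=1) tuple always comes first
tied = [(0.5, INACTIVE), (0.5, ACTIVE)]
print(sorted(tied, reverse=True))  # [(0.5, 1), (0.5, 0)]
```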

No worries - when we sort, let's make sure we sort using the score values only:

def second(actives, inactives):
    everything = [(x, ACTIVE) for x in actives] + [(y, INACTIVE) for y in inactives]
    everything.sort(reverse=True, key=lambda x:x[0])
    return mean_rank_of_actives(everything)

Unfortunately, the result is still 9.5. Python's sort is a stable sort, which means that items retain their original order if there are ties. Since all of the actives come first in the original list, the tied actives still come before the tied inactives after sorting. To get rid of the influence of the original order, we need to shuffle the list into a random order:

def third(actives, inactives):
    everything = [(x, ACTIVE) for x in actives] + [(y, INACTIVE) for y in inactives]
    random.shuffle(everything)
    everything.sort(reverse=True, key=lambda x:x[0])
    return mean_rank_of_actives(everything)

This gives 12.0 +/- 0.67 (from 1000 repetitions), a more accurate assessment of the performance.

I think that this is an interesting little problem because it's rather subtle; even when you think you've solved it (as in the second solution above), some other issue rears its head. The only way to be sure is to test with tied data.
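As an aside, the link between mean rank and ROC AUC mentioned earlier can be made precise: with 0-based ranks, AUC = 1 - (meanrank - (Nact - 1)/2)/Ninact, a linear relationship. A quick sanity check (the `rank_based_auc` helper is my own, not part of the code above):

```python
import random
random.seed(1)

def rank_based_auc(labels):
    # AUC computed directly from a ranking (1=active, 0=inactive, best first):
    # the fraction of (active, inactive) pairs with the active ranked higher
    n_act = sum(labels)
    n_inact = len(labels) - n_act
    discordant = 0      # inactives ranked ahead of an active
    seen_inactives = 0
    for label in labels:
        if label == 0:
            seen_inactives += 1
        else:
            discordant += seen_inactives
    return 1 - discordant / (n_act * n_inact)

labels = [1] * 20 + [0] * 80
random.shuffle(labels)
mean_rank = sum(i for i, x in enumerate(labels) if x == 1) / 20
assert abs(rank_based_auc(labels) - (1 - (mean_rank - 9.5) / 80)) < 1e-12
```

With 20 actives and 80 inactives, this maps the mean ranks above directly onto AUCs: 9.5 gives 1.0, 12.0 gives 0.969 and 14.5 gives 0.9375.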

Note: If we are interested specifically in handling ties correctly in ranks, a common approach is to assign the same rank to all members of the tie. That's not really the point I want to make here. In practice, the evaluation metric might be something quite different, such as enrichment at 1%, which would not be amenable to such a solution.

Friday, 20 December 2024

Engagement on X vs Mastodon vs BlueSky

Egon posted recently on the situation with scientists leaving X and moving to another microblogging platform. He pointed out some pros and cons of BlueSky vs Mastodon and how it may be possible to bridge between the two.

This post doesn't really speak to the points that Egon raises, but rather gives a n=1 measure of the current level of community engagement on different platforms. I posted a link to my previous blog post on the three platforms mentioned in the title, and here's what I found:
  • X - Since Sept 2011 - I have 1190 followers - 270 views (34 link clicks), 4 comments, 11 additional likes (including one from the poteen brewery I often get confused with :-) )
  • Mastodon - Since Dec 2022 - I have 147 followers - 2 comments, 1 additional like
  • BlueSky - Since 15 Dec 2024, i.e. one day before the blog post - I have 28 followers - 2 comments, 11 additional likes/reposts (including one from "Low Quality Facts" - make of that what you will)
  • My blog - Since April 2007 - I no longer have visibility of the number of followers, but there are 57 people who use the follow.it link to follow by email - 428 views, 3 comments

First of all, the numbers confirm that X is no longer where the community is; despite my being on BlueSky for only 5 days with just a handful of followers, the engagement there just about matches X. Mastodon is an interesting platform, and great to have as an alternative should BlueSky not pan out, but the numbers and engagement have not taken off; it feels quiet over there, in a way that BlueSky doesn't. In each case, to actually find out the news, the user had to click through to the blog, so the figure of 428 there should give the overall total across everything (including LinkedIn, where I also posted).

From the little time I have spent on BlueSky, it has the feel of the old Twitter. The Android app needs a bit of work (scrolling is very jerky on the main feed) but I'm sure it will be sorted. I moved early to Mastodon as I would have liked it to take off, but that hasn't happened. I note that BlueSky has plans to monetise somehow and time will tell whether this turns it into something closer to the current X. In the meanwhile, I guess I'll see you over there.

@baoilleach.bsky.social/@baoilleach@mstdn.science/@baoilleach

Monday, 16 December 2024

Moving to pastures new, but still in the same field Part IV

Following on from my previous post on the topic, I will shortly be leaving Nxera Pharma (Sosei Heptares as was).

The last five years at Nxera have passed quickly. I've gone from "what is this GPCR you speak of?" to providing tools and resources to support structure-based GPCR drug discovery, and indeed doing it myself. I'll miss working with my colleagues in CompChem and throughout Nxera, but I'm leaving in the knowledge that the Cheminformatics line is in very capable hands.

Which brings me to my next move...

I am honoured and super excited to take up the role of Chemical Biology Resources Team Leader at the EBI, leading the team responsible for ChEMBL, SureChEMBL, UniChem and ChEBI. There are some big shoes to fill, but I can't wait to start working with the fantastic team behind these resources. The start date is early February, so wish me luck!

Thursday, 12 September 2024

Do NOT use SDFdb

It's not even a real database. All it does is:

import sdfdb
db = sdfdb.SDFdb("mysdf.sdf")
molfile = db.get_mol("mytitle")

Ok, so it's quick to index even quite a large SD file. But if Andrew Dalke ever sees the corners it cuts...! I mean, it doesn't even support Windows line endings.

And the code? Well, it speaks for itself:

import re

class SDFdb:
    def __init__(self, fname):
        self.fname = fname
        self.file = open(self.fname, "rb")
        self._create_index()

    def _create_index(self):
        patt = re.compile(rb"\$\$\$\$\n(.*)\n")
        chunksize = 100000
        self.lookup = {}
        self.start = []
        idx = 0
        position = 0
        title = self.file.readline().rstrip().decode("ascii")
        self.lookup[title] = idx
        idx += 1
        self.start.append(position)
        position = self.file.tell()
        while text := self.file.read(chunksize):
            if not text.endswith(b"\n"):
                text += self.file.readline()
            # Invariant: text ends with b"\n"
            if text.endswith(b"\n$$$$\n"):
                text += self.file.readline()
            # Invariant: text never ends with "\n$$$$\n"
            for m in patt.finditer(text):
                title = m.groups()[0].decode("ascii")
                if title in self.lookup:
                    print(f"WARNING: Duplicate title {title}")
                self.lookup[title] = idx
                offset = m.start() + 5
                idx += 1
                self.start.append(position + offset)
            position = self.file.tell()
        self.start.append(self.file.tell()) # should be EOF

    def get_mol(self, title):
        idx = self.lookup.get(title, None)
        if idx is None:
            raise KeyError(f"Title '{title}' not found")
        self.file.seek(self.start[idx])
        text = self.file.read(self.start[idx+1]-self.start[idx])
        return text.decode("ascii")

    def close(self):
        self.file.close()

...though maybe I can see it would be useful for a large number of random accesses into a large (well-behaved) SD file.

That is, if they weren't already in a database. (Or *cough* they were in a database but you thought this was simpler than writing the code to batch up queries.)

Tuesday, 4 June 2024

Every ChEMBL everywhere, all at once

The team over at ChEMBL has just announced a symposium to celebrate 15 years of ChEMBLdb and 10 of SureChEMBL. With this in mind (or perhaps for other reasons), I have attempted to extract data from all 34 releases of ChEMBLdb. Let's see how I got on...

One of the things I love about ChEMBL is that they still provide all of the download files for all previous releases. But before w'get wget, let's see if there's an easier way, namely Charles Tapley Hoyt's ChEMBL Downloader - if it can do the work for me, then that'll save a whole bunch of effort. Looking into it, however, the ability to run SQL queries relies on an SQLite database being available, which is only the case for ChEMBL 21 or later.

So old skool it is, which leads to the question of which database format to use. Well, it turns out that there's only one database format that is available for every single release of ChEMBL, and that's MySQL. One bash loop later, and I have all 34 chembl_nn_mysql.tar.gz files (well, two extra wgets for 22_1 and 24_1). A bit of unzipping later and my home folder is 300G heavier and we are ready to roll.

That is, if we can navigate the slightly differently named/arranged files in every case. Is the unzipped folder chembl_nn_mysql, or is it chembl_nn with the chembl_nn_mysql folder inside that? Are the setup instructions in INSTALL or INSTALL_mysql? Is the SQL data in a .sql file or a .dmp file? It is of course nitpicky to even mention this given the nature of what I'm getting for free from ChEMBL, but have you ever tried to install 34 ChEMBLs? :-) In the end, a carefully crafted find/grep was able to pull out the correct create command:

for d in `find . | grep INSTALL | sort`; do grep "create database" $d |sed "s/mysql>//g" ; done

This is simply 'create database chembl_nn' from version 1 until version 22, and then from 23 onwards it changes to specify the character set as UTF-8. For the actual import, I used a Python script to work out the right commands, wrote them to a shell script and then ran them. Just another 700G later, and it's all available in MySQL.

What I wanted to do was simply export a set of SMILES, InChIs and ChEMBL IDs from each of the versions. What I thought would be a problem turned out not to be at all; while the schema as a whole has grown and changed quite a bit over the years, the only change that affected me was that 'compounds' became 'compound_structures' in ChEMBL 9.

More of an issue was that the data stored has changed over the years. I had forgotten that ChEMBL 1 did not have ChEMBL IDs, nor did version 2; in fact, it wasn't until version 8 that they appeared on the scene. Before that, there was an attempt to use ChEBI ids, and at the very start there was just the molregno. Furthermore, version 1 doesn't have Standard InChI - it just has a non-standard one (why, or what the settings are, I don't know ...Added 30/06/2024: Standard InChI was not available until shortly before ChEMBL 1 - that would explain it!). This at least we can pull in from version 2 by matching on molregno and SMILES, and then calculate any missing Standard InChIs off-line.

There was a point to all this, but that'll wait for another day. What I've shown here is that despite 15 years covering 34 releases, it's possible to recreate every single release and relive those glory days (ah, ChEMBL 16, will I ever forget you?).