Friday, 20 December 2024

Engagement on X vs Mastodon vs BlueSky

Egon posted recently on the situation with scientists leaving X and moving to another microblogging platform. He pointed out some pros and cons of BlueSky vs Mastodon and how it may be possible to bridge between the two.

This post doesn't really speak to the points that Egon raises, but rather gives a n=1 measure of the current level of community engagement on different platforms. I posted a link to my previous blog post on the three platforms mentioned in the title, and here's what I found:
  • X - Since Sept 2011 - I have 1190 followers - 270 views (34 link clicks), 4 comments, 11 additional likes (including one from the poteen brewery I often get confused with :-) )
  • Mastodon - Since Dec 2022 - I have 147 followers - 2 comments, 1 additional like
  • BlueSky - Since 15 Dec 2024, i.e. one day before the blog post - I have 28 followers - 2 comments, 11 additional likes/reposts (including one from "Low Quality Facts" - make of that what you will)
  • My blog - Since April 2007 - I no longer have visibility of the number of followers, but there are 57 people who use the follow.it link to follow by email - 428 views, 3 comments

First of all, the numbers confirm that X is no longer where the community is; despite being only 5 days on BlueSky and a handful of followers, the engagement just about matches X. Mastodon is an interesting platform, and great to have as an alternative should BlueSky not pan out, but the numbers and engagement have not taken off; it feels quiet over there, in a way that BlueSky doesn't. In each case, to actually find out the news, the user had to click through to the blog, so the figure of 428 there should give the overall total across everything (including LinkedIn where I also posted).

From the little time I have spent on BlueSky, it has the feel of the old Twitter. The Android app needs a bit of work (scrolling is very jerky on the main feed) but I'm sure it will be sorted. I moved early to Mastodon as I would have liked it to take off, but that hasn't happened. I note that BlueSky has plans to monetise somehow and time will tell whether this turns it into something closer to the current X. In the meanwhile, I guess I'll see you over there.

@baoilleach.bsky.social/@baoilleach@mstdn.science/@baoilleach

Monday, 16 December 2024

Moving to pastures new, but still in the same field Part IV

Following on from my previous post on the topic, I will shortly be leaving Nxera Pharma (Sosei Heptares as was).

The last five years at Nxera have passed quickly. I've gone from "what is this GPCR you speak of?" to providing tools and resources to support structure-based GPCR drug discovery, and indeed doing it myself. I'll miss working with my colleagues in CompChem and throughout Nxera, but I'm leaving in the knowledge that the Cheminformatics line is in very capable hands.

Which brings me to my next move...

I am honoured and super excited to take up the role of Chemical Biology Resources Team Leader at the EBI, leading the team responsible for ChEMBL, SureChEMBL, UniChem and ChEBI. There are some big shoes to fill, but I can't wait to start working with the fantastic team behind these resources. The start date is early February, so wish me luck!

Thursday, 12 September 2024

Do NOT use SDFdb

It's not even a real database. All it does is:

import sdfdb
db = sdfdb.SDFdb("mysdf.sdf")
molfile = db.get_mol("mytitle")

Ok, so it's quick to index even quite a large SD file. But if Andrew Dalke ever sees the corners it cuts...! I mean, it doesn't even support Windows line endings.

And the code? Well, it speaks for itself:

import re

class SDFdb:
    def __init__(self, fname):
        self.fname = fname
        self.file = open(self.fname, "rb")
        self._create_index()

    def _create_index(self):
        patt = re.compile(b"\$\$\$\$\n(.*)\n")
        chunksize = 100000
        self.lookup = {}
        self.start = []
        idx = 0
        position = 0
        title = self.file.readline().rstrip().decode("ascii")
        self.lookup[title] = idx
        idx += 1
        self.start.append(position)
        position = self.file.tell()
        while text := self.file.read(chunksize):
            if text[-1] != '\n':
                text += self.file.readline()
            # Invariant: text ends with "\n"
            if text.endswith(b"\n$$$$\n"):
                text += self.file.readline()
            # Invariant: text never ends with "\n$$$$\n"
            for m in patt.finditer(text):
                title = m.groups()[0].decode("ascii")
                if title in self.lookup:
                    print(f"WARNING: Duplicate title {title}")
                self.lookup[title] = idx
                offset = m.start() + 5
                idx += 1
                self.start.append(position + offset)
            position = self.file.tell()
        self.start.append(self.file.tell()) # should be EOF

    def get_mol(self, title):
        idx = self.lookup.get(title, None)
        if idx is None:
            raise KeyError(f"Title '{title}' not found")
        self.file.seek(self.start[idx])
        text = self.file.read(self.start[idx+1]-self.start[idx])
        return text.decode("ascii")

    def close(self):
        self.file.close()

...though maybe I can see it would be useful for a large number of random accesses into a large (well-behaved) SD file.

That is, if they weren't already in a database. (Or *cough* they were in a database but you thought this was simpler than writing the code to batch up queries.)

Tuesday, 4 June 2024

Every ChEMBL everywhere, all at once

The team over at ChEMBL has just announced a symposium to celebrate 15 years of ChEMBLdb and 10 of SureChEMBL. With this in mind (or perhaps for other reasons), I have attempted to extract data from all 34 releases of ChEMBLdb. Let's see how I got on...

One of the things I love about ChEMBL is that they still provide all of the download files for all previous releases. But before w'get wget, let's see if there's an easier way, namely Charles Tapley Hoyt's ChEMBL Downloader - if it can do the work for me, then that'll save a whole bunch of work. Looking into it, however, the ability to run SQL queries relies on an SQLite database being available, which is only the case for ChEMBL 21 or later.

So old skool it is, which leads to the question of which database format to use. Well, it turns out that there's only one database format that is available for every single release of ChEMBL, and that's MySQL. One bash loop later, and I have all 34 chembl_nn_mysql.tar.gz files (well, two extra wgets for 22_1 and 24_1). A bit of unzipping later and my home folder is 300G heavier and we are ready to roll.

That is, if we can navigate the slightly differently named/arranged files in every case. Is the unzipped folder chembl_nn_mysql, or is it chembl_nn with the chembl_nn_mysql folder inside that? Are the setup instructions in INSTALL or INSTALL_mysql? Is the SQL data in a .sql file or a .dmp file? It is of course nitpicky to even mention this given the nature of what I'm getting for free from ChEMBL, but have you ever tried to install 34 ChEMBLs? :-) In the end, a carefully crafted find/grep was able to pull out the correct create command:

for d in `find . | grep INSTALL | sort`; do grep "create database" $d |sed "s/mysql>//g" ; done

This is simply 'create database chembl_nn' from version 1 until version 22, and then from 23 onwards it changes to specify the character set as UTF-8. For the actual import, I used a Python script to work out the right commands, wrote them to a shell script and then ran them. Just another 700G later, and it's all available in MySQL.

What I wanted to do was simply export a set of SMILES, InChIs and CHEMBL Ids from each of the versions. What I thought would be a problem turned out not to be at all; while the schema as a whole has grown and changed quite a bit over the years, the only effect for me was that 'compounds' became 'compound_structures' in ChEMBL 9.

More of an issue was that the data stored has changed over the years. I had forgotten that ChEMBL 1 did not have ChEMBL Ids, nor version 2; in fact, it wasn't until version 8 that they appeared on the scene. Before that, there was an attempt to use ChEBI ids and at the very start there was just the molregno. Furthermore, version 1 doesn't have Standard InChI - it just has a non-standard one (why, or what the settings are, I don't know ...Added 30/06/2024: Standard InChI was not available until shortly before ChEMBL 1 - that would explain it!). This at least we can pull in from version 2 by matching on molregno and SMILES, and then calculate any missing Standard InChIs off-line.

There was a point to all this, but that'll wait for another day. What I've shown here is that despite 15 years covering 34 releases, it's possible to recreate every single release and relive those glory days (ah, ChEMBL 16, will I ever forget you?).

Thursday, 28 March 2024

Learning the LINGO

LINGO is a simple similarity measure for SMILES strings, based on shared n-grams between the strings themselves. In many respects the relationship between an ECFP4 fingerprint and a molecule is equivalent to the relationship between a LINGO and a SMILES string. While it has obvious drawbacks, its charm lies in its ease of calculation which lends itself to high-speed implementations by some very clever people (Haque/Pande/Walters, Grant...Sayle).

I'm not going to add an entry to that canon, but I recently realised that the Counter class in Python supports union and intersection, which makes it just perfect for an elegant implementation in Python:

import itertools
import collections

from mtokenize import tokenize

def sliding_window(iterable, n):
    """Collect data into overlapping fixed-length chunks or blocks
    This is one of the recipes in the itertools documentation.

    >>> ["".join(x) for x in sliding_window('ABCDEFG', 4)]
    ['ABCD', 'BCDE', 'CDEF', 'DEFG']
    """
    it = iter(iterable)
    window = collections.deque(itertools.islice(it, n-1), maxlen=n)
    for x in it:
        window.append(x)
        yield tuple(window)

def lingo(seqA, seqB):
    """Implement a 'count' version of LINGO given tokenized SMILES strings"""
    counts = [collections.Counter(sliding_window(seq, 4)) for seq in [seqA, seqB]]
    intersection = counts[0] & counts[1]
    union = counts[0] | counts[1]
    tanimoto = len(intersection)/len(union)
    return tanimoto

if __name__ == "__main__":
    import doctest
    doctest.testmod()

    smiA = "c1ccccc1C(=O)Cl"
    smiB = "c1ccccc1C(=O)N"
    print(lingo(tokenize(smiA), tokenize(smiB)))
For 'mtokenize' I used code for tokenizing SMILES from the last blogpost. As an aside, try renaming mtokenize.py to tokenize.py and making the corresponding change to the import statement. See how long it takes you to work out why this causes an error :-).

Thursday, 28 December 2023

Eyes on tokenize

I was writing a tokenizer for SMILES and came across a recent paper by the IBM Research team on reaction standardisation which contained a description of their tokenization method. It uses a regex:

(\%\([0-9]{3}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\||\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])

Looking at the source code, here's an adapted version of how this is applied:

SMILES_TOKENIZER_PATTERN = r"(\%\([0-9]{3}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\||\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
SMILES_REGEX = re.compile(SMILES_TOKENIZER_PATTERN)
def split_on_regexp(smi):
    """
    >>> split_on_regexp("Cl11%11%(111)C[C@@H](Br)I")
    ['Cl', '1', '1', '%11', '%(111)', 'C', '[C@@H]', '(', 'Br', ')', 'I']
    """
    return SMILES_REGEX.findall(smi)

My approach is to tokenize directly on a single pass through the string, and so I was curious how it compared...

Tokenize v1

The above code takes 7.7s to tokenize 1M SMILES from ChEMBL (Python 3.10 on Linux in a VM). "Surely a hand-rolled tokenizer will do better?" says I:

def tokenize_v1(smi):
    """
    >>> tokenize_v1("Cl11%11%(111)C[C@@H](Br)I")
    ['Cl', '1', '1', '%11', '%(111)', 'C', '[C@@H]', '(', 'Br', ')', 'I']
    """
    tokens = []
    i = 0
    N = len(smi)
    while i < N:
        x = smi[i]
        if x == 'C' and i+1<N and smi[i+1]=='l':
            tokens.append("Cl")
            i += 1
        elif x == 'B' and i+1<N and smi[i+1]=='r':
            tokens.append("Br")
            i += 1
        elif x=='[':
            j = i+1
            while smi[j] != ']':
                j += 1
            tokens.append(smi[i:j+1])
            i += j-i
        elif x == '%':
            if smi[i+1] == '(':
                j = i
                while smi[j] != ')':
                    j += 1
                tokens.append(smi[i:j+1])
                i += j-i
            else:
                tokens.append(smi[i:i+3])
                i += 2
        else:
            tokens.append(x)
        i += 1
    return tokens

10.4s, I'm afraid. Well, we can't give up without a fight. Time to optimise.

Tokenize v2

Consider that 'C' is the most common character but will be among the slowest to handle due to the check for 'Cl' (and then the subsequent 'if' statements). How about we check for the 'l' in 'Cl' instead of the 'C'. The previous token will be incorrect ('C'), but we can just correct it:

def tokenize_v2(smi):
    """
    >>> tokenize_v2("Cl11%11%(111)C[C@@H](Br)I")
    ['Cl', '1', '1', '%11', '%(111)', 'C', '[C@@H]', '(', 'Br', ')', 'I']
    """
    tokens = []
    i = 0
    N = len(smi)
    while i < N:
        x = smi[i]
        if x == 'l':
            tokens[-1] = 'Cl'
        elif x == 'r':
            tokens[-1] = 'Br'
        elif x=='[':
            j = i+1
            while smi[j] != ']':
                j += 1
            tokens.append(smi[i:j+1])
            i += j-i
        elif x == '%':
            if smi[i+1] == '(':
                j = i
                while smi[j] != ')':
                    j += 1
                tokens.append(smi[i:j+1])
                i += j-i
            else:
                tokens.append(smi[i:i+3])
                i += 2
        else:
            tokens.append(x)
        i += 1
    return tokens

Down to 9.8s. Not a major step forward, but it enables the next optimisation...

Tokenize v3

Optimising parsing for SMILES simply boils down to making 'C' fast even if everything else is slowed down, as the overall average will still be faster. With this in mind, let's minimise the 'if' statements that 'C' needs to go through by bundling all of them into a single test up-front:

chars = set('[lr%')
def tokenize_v3(smi):
    """
    >>> tokenize_v3("Cl11%11%(111)C[C@@H](Br)I")
    ['Cl', '1', '1', '%11', '%(111)', 'C', '[C@@H]', '(', 'Br', ')', 'I']
    """
    tokens = []
    i = 0
    N = len(smi)
    while i < N:
        x = smi[i]
        if x not in chars:
            tokens.append(x)
            i += 1
        else:
            if x == 'l':
                tokens[-1] = 'Cl'
                i += 1
            elif x == 'r':
                tokens[-1] = 'Br'
                i += 1
            elif x=='[':
                j = i+1
                while smi[j] != ']':
                    j += 1
                tokens.append(smi[i:j+1])
                i += j-i + 1
            else: # %
                if smi[i+1] == '(':
                    j = i
                    while smi[j] != ')':
                        j += 1
                    tokens.append(smi[i:j+1])
                    i += j-i + 1
                else:
                    tokens.append(smi[i:i+3])
                    i += 3
    return tokens

Which comes in as 6.1s, a modest improvement on the regex. But the story does not end here...

Python 3.12 vs Python 3.10

The Python 3.10 I was using earlier is the system Python on Ubuntu 22.04, but Python 3.12 is available via Conda. More recent Pythons contain significant speed-ups but it appears that these speed-ups have a greater effect on pure Python code rather than regex handling, which is presumably already handled by optimised C code. The associated timings are 7.2s (regex), 5.6s (v1), 4.9s (v2) and 4.1s (v3).

PyPy

I always like to keep an eye on PyPy, which is a drop-in replacement for Python 3 except that it cannot be used with OB or RDKit (though see this and this). With PyPy, the difference is even greater: 6.6s (regex), 1.7s (v1), 0.8s (v2), 1.0s (v3). Note that v2 is coming out ahead of v3 surprisingly. I guess that PyPy realises that it can use a switch statement for v2 but doesn't realise for v3.

Conclusion

Even the slowest of the speeds above isn't going to affect most applications, but hopefully the discussion around optimisations and Python versions is of interest. Ultimately my own preference is to avoid regexs unless necessary, as I find them difficult to read and check for errors, though others may prefer the one line simplicity of a carefully-crafted regex.

The full code is available as a gist.

Tuesday, 26 December 2023

Marvel at dc SMILES

In an alternate history, Perl would be of language of choice for cheminformatics, and Ivan Tubert-Brohman's plans for world domination would have come to fruition. He is the author of PerlMol (see also his origin story [hat-tip to Rich]):

PerlMol is a collection of Perl modules for chemoinformatics and computational chemistry with the philosophy that "simple things should be simple". It should be possible to write one-liners that use this toolkit to do meaningful "molecular munging". The PerlMol toolkit provides objects and methods for representing molecules, atoms, and bonds in Perl; doing substructure matching; and reading and writing files in various formats.

Chatting at the recent RDKit meeting, Ivan mentioned some cheminformatics code he was planning to release in a language/tool that predates Perl, and even C! It uses 'dc', the 'desktop calculator' utility on Linux and is now available on GitHub as dc-smiles:

I have come to realize that using RDKit to parse SMILES has been a distraction, because as it turns out, UNIX has shipped with a built-in SMILES parser since the 1970s! It is a special mode of the dc program; one only needs to invoke dc with the right one-word command to enable SMILES-parsing mode.

Given the legendary stability of UNIX, it only makes sense to switch back to using the 50-year old SMILES parser that comes with dc instead of relative newcomers such as RDKit.

Something tells me that the text above should not be taken too seriously. In any case, after checking it out of GitHub, you can try it out using something like:

./smi.sh "O[C@@H](Cl)C#N"

...which gives the following JSON output in Matt Swain's CommonChem format:

{
 "commonchem": {"version": 10},
 "defaults": {
  "atom":{"stereo":"unspecified"},
  "bond":{"stereo":"unspecified"}
 },
 "molecules": [{
  "atoms": [
   {"z": 8, "isotope": 0, "chg": 0},
   {"z": 6, "isotope": 0, "chg": 0, "impHs": 1},
   {"z": 17, "isotope": 0, "chg": 0},
   {"z": 6, "isotope": 0, "chg": 0},
   {"z": 7, "isotope": 0, "chg": 0}
  ],
  "bonds": [
   {"atoms": [3, 4], "bo": 3},
   {"atoms": [1, 3], "bo": 1},
   {"atoms": [1, 2], "bo": 1},
   {"atoms": [0, 1], "bo": 1}
  ]
 }]
}

How he got dc to do this, I don't understand, but may 2024 bring us all more weird and wacky cheminformatics!