Thursday, 28 December 2023

Eyes on tokenize

I was writing a tokenizer for SMILES and came across a recent paper by the IBM Research team on reaction standardisation, which contained a description of their tokenization method. It uses a regex:

(\%\([0-9]{3}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\||\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])

Looking at the source code, here's an adapted version of how this is applied:

SMILES_TOKENIZER_PATTERN = r"(\%\([0-9]{3}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\||\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
SMILES_REGEX = re.compile(SMILES_TOKENIZER_PATTERN)
def split_on_regexp(smi):
    """
    >>> split_on_regexp("Cl11%11%(111)C[C@@H](Br)I")
    ['Cl', '1', '1', '%11', '%(111)', 'C', '[C@@H]', '(', 'Br', ')', 'I']
    """
    return SMILES_REGEX.findall(smi)
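One thing to watch with findall is that characters which match none of the alternatives are silently skipped rather than flagged. A simple guard (my own addition, not something the IBM code does) is to check that the tokens round-trip to the original string:

```python
import re

# Same pattern as above, repeated so this snippet is self-contained
SMILES_TOKENIZER_PATTERN = r"(\%\([0-9]{3}\)|\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\||\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
SMILES_REGEX = re.compile(SMILES_TOKENIZER_PATTERN)

def tokenize_checked(smi):
    """Tokenize, raising if any character was silently dropped."""
    tokens = SMILES_REGEX.findall(smi)
    if "".join(tokens) != smi:
        raise ValueError(f"Unexpected character(s) in {smi!r}")
    return tokens
```

For example, tokenize_checked("C{N") raises a ValueError, whereas findall alone would quietly return ['C', 'N'].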

My approach is to tokenize in a single pass through the string, and so I was curious how the two compared...

Tokenize v1

The above code takes 7.7s to tokenize 1M SMILES from ChEMBL (Python 3.10 on Linux in a VM). "Surely a hand-rolled tokenizer will do better?" says I:

def tokenize_v1(smi):
    """
    >>> tokenize_v1("Cl11%11%(111)C[C@@H](Br)I")
    ['Cl', '1', '1', '%11', '%(111)', 'C', '[C@@H]', '(', 'Br', ')', 'I']
    """
    tokens = []
    i = 0
    N = len(smi)
    while i < N:
        x = smi[i]
        if x == 'C' and i+1<N and smi[i+1]=='l':
            tokens.append("Cl")
            i += 1
        elif x == 'B' and i+1<N and smi[i+1]=='r':
            tokens.append("Br")
            i += 1
        elif x=='[':
            j = i+1
            while smi[j] != ']':
                j += 1
            tokens.append(smi[i:j+1])
            i += j-i
        elif x == '%':
            if smi[i+1] == '(':
                j = i
                while smi[j] != ')':
                    j += 1
                tokens.append(smi[i:j+1])
                i += j-i
            else:
                tokens.append(smi[i:i+3])
                i += 2
        else:
            tokens.append(x)
        i += 1
    return tokens

10.4s, I'm afraid. Well, we can't give up without a fight. Time to optimise.
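For what it's worth, timings like those quoted here can be gathered with a minimal harness along the following lines (a sketch; the filename and the one-SMILES-per-line format are assumptions, as the actual benchmarking code isn't shown in this post):

```python
import time

def benchmark(tokenize, smiles_list):
    """Return the wall-clock seconds taken to tokenize every SMILES."""
    start = time.perf_counter()
    for smi in smiles_list:
        tokenize(smi)
    return time.perf_counter() - start

# Hypothetical input file with one SMILES per line:
# with open("chembl_1M.smi") as f:
#     smiles = [line.strip() for line in f]
# print(f"{benchmark(tokenize_v1, smiles):.1f}s")
```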

Tokenize v2

Consider that 'C' is the most common character but will be among the slowest to handle due to the check for 'Cl' (and then the subsequent 'if' statements). How about we check for the 'l' in 'Cl' instead of the 'C'? The previous token will be incorrect ('C'), but we can just correct it:

def tokenize_v2(smi):
    """
    >>> tokenize_v2("Cl11%11%(111)C[C@@H](Br)I")
    ['Cl', '1', '1', '%11', '%(111)', 'C', '[C@@H]', '(', 'Br', ')', 'I']
    """
    tokens = []
    i = 0
    N = len(smi)
    while i < N:
        x = smi[i]
        if x == 'l':
            tokens[-1] = 'Cl'
        elif x == 'r':
            tokens[-1] = 'Br'
        elif x=='[':
            j = i+1
            while smi[j] != ']':
                j += 1
            tokens.append(smi[i:j+1])
            i += j-i
        elif x == '%':
            if smi[i+1] == '(':
                j = i
                while smi[j] != ')':
                    j += 1
                tokens.append(smi[i:j+1])
                i += j-i
            else:
                tokens.append(smi[i:i+3])
                i += 2
        else:
            tokens.append(x)
        i += 1
    return tokens

Down to 9.8s. Not a major step forward, but it enables the next optimisation...

Tokenize v3

Optimising parsing for SMILES simply boils down to making 'C' fast even if everything else is slowed down, as the overall average will still be faster. With this in mind, let's minimise the 'if' statements that 'C' needs to go through by bundling all of them into a single test up-front:

chars = set('[lr%')
def tokenize_v3(smi):
    """
    >>> tokenize_v3("Cl11%11%(111)C[C@@H](Br)I")
    ['Cl', '1', '1', '%11', '%(111)', 'C', '[C@@H]', '(', 'Br', ')', 'I']
    """
    tokens = []
    i = 0
    N = len(smi)
    while i < N:
        x = smi[i]
        if x not in chars:
            tokens.append(x)
            i += 1
        else:
            if x == 'l':
                tokens[-1] = 'Cl'
                i += 1
            elif x == 'r':
                tokens[-1] = 'Br'
                i += 1
            elif x=='[':
                j = i+1
                while smi[j] != ']':
                    j += 1
                tokens.append(smi[i:j+1])
                i += j-i + 1
            else: # %
                if smi[i+1] == '(':
                    j = i
                    while smi[j] != ')':
                        j += 1
                    tokens.append(smi[i:j+1])
                    i += j-i + 1
                else:
                    tokens.append(smi[i:i+3])
                    i += 3
    return tokens

Which comes in at 6.1s, a modest improvement on the regex. But the story does not end here...

Python 3.12 vs Python 3.10

The Python 3.10 I was using earlier is the system Python on Ubuntu 22.04, but Python 3.12 is available via Conda. More recent Pythons contain significant speed-ups, but these appear to have a greater effect on pure Python code than on regex handling, which is presumably already implemented in optimised C. The associated timings are 7.2s (regex), 5.6s (v1), 4.9s (v2) and 4.1s (v3).

PyPy

I always like to keep an eye on PyPy, which is a drop-in replacement for Python 3 except that it cannot be used with OB or RDKit (though see this and this). With PyPy, the difference is even greater: 6.6s (regex), 1.7s (v1), 0.8s (v2), 1.0s (v3). Surprisingly, v2 comes out ahead of v3. My guess is that PyPy realises it can use a switch statement for v2 but fails to do so for v3.

Conclusion

Even the slowest of the speeds above isn't going to affect most applications, but hopefully the discussion around optimisations and Python versions is of interest. Ultimately my own preference is to avoid regexes unless necessary, as I find them difficult to read and check for errors, though others may prefer the one-line simplicity of a carefully-crafted regex.
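That said, for those who do prefer the regex, its readability can be improved with re.VERBOSE, which permits whitespace and comments inside the pattern. Here is a sketch of the same pattern in that style (the comments are my own interpretation of the alternatives):

```python
import re

SMILES_REGEX_VERBOSE = re.compile(r"""(
    \%\([0-9]{3}\)    # extended ring-closure labels, e.g. %(123)
  | \[[^\]]+]         # bracket atoms, e.g. [C@@H]
  | Br? | Cl?         # two-letter elements (and bare B/C)
  | N|O|S|P|F|I|b|c|n|o|s|p
  | \||\(|\)|\.|=|\#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$
  | \%[0-9]{2}        # two-digit ring closures, e.g. %11
  | [0-9]
)""", re.VERBOSE)
```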

The full code is available as a gist.

Tuesday, 26 December 2023

Marvel at dc SMILES

In an alternate history, Perl would be the language of choice for cheminformatics, and Ivan Tubert-Brohman's plans for world domination would have come to fruition. He is the author of PerlMol (see also his origin story [hat-tip to Rich]):

PerlMol is a collection of Perl modules for chemoinformatics and computational chemistry with the philosophy that "simple things should be simple". It should be possible to write one-liners that use this toolkit to do meaningful "molecular munging". The PerlMol toolkit provides objects and methods for representing molecules, atoms, and bonds in Perl; doing substructure matching; and reading and writing files in various formats.

Chatting at the recent RDKit meeting, Ivan mentioned some cheminformatics code he was planning to release in a language/tool that predates Perl, and even C! It uses 'dc', the 'desktop calculator' utility on Linux and is now available on GitHub as dc-smiles:

I have come to realize that using RDKit to parse SMILES has been a distraction, because as it turns out, UNIX has shipped with a built-in SMILES parser since the 1970s! It is a special mode of the dc program; one only needs to invoke dc with the right one-word command to enable SMILES-parsing mode.

Given the legendary stability of UNIX, it only makes sense to switch back to using the 50-year old SMILES parser that comes with dc instead of relative newcomers such as RDKit.

Something tells me that the text above should not be taken too seriously. In any case, after checking it out from GitHub, you can try it out using something like:

./smi.sh "O[C@@H](Cl)C#N"

...which gives the following JSON output in Matt Swain's CommonChem format:

{
 "commonchem": {"version": 10},
 "defaults": {
  "atom":{"stereo":"unspecified"},
  "bond":{"stereo":"unspecified"}
 },
 "molecules": [{
  "atoms": [
   {"z": 8, "isotope": 0, "chg": 0},
   {"z": 6, "isotope": 0, "chg": 0, "impHs": 1},
   {"z": 17, "isotope": 0, "chg": 0},
   {"z": 6, "isotope": 0, "chg": 0},
   {"z": 7, "isotope": 0, "chg": 0}
  ],
  "bonds": [
   {"atoms": [3, 4], "bo": 3},
   {"atoms": [1, 3], "bo": 1},
   {"atoms": [1, 2], "bo": 1},
   {"atoms": [0, 1], "bo": 1}
  ]
 }]
}
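As an aside, one nice consequence of emitting CommonChem is that the output can be consumed with nothing but the standard library. A quick sketch (my own snippet, not part of dc-smiles):

```python
import json

def summarise(commonchem_text):
    """Return (atom count, bond count) for the first molecule."""
    mol = json.loads(commonchem_text)["molecules"][0]
    return len(mol["atoms"]), len(mol["bonds"])
```

For the output above, this gives (5, 4) - five atoms and four bonds.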

How he got dc to do this, I don't understand, but may 2024 bring us all more weird and wacky cheminformatics!

Friday, 3 November 2023

Cheminformatics position at Sosei Heptares, and other updates

I don't seem to find/make time for blogging these days, so here's a three-in-one post. First off, congratulations to Morgan Thomas, aka Dr Thomas, for passing his PhD viva on "Improving de novo molecular generation for structure-based drug design". Morgan started his PhD around the same time I joined Sosei Heptares, and it has been a pleasure to co-supervise him. The PhD was at the University of Cambridge (Andreas Bender) and sponsored by Sosei Heptares (Chris de Graaf, myself, and also Rob Smith for a period). We're still writing up a few final pieces of work, so keep an eye out. In the meantime, Morgan has moved on to the next stage of his career and joined the group of Gianni De Fabritiis at Pompeu Fabra University in Barcelona.

Second, I want to mention the recent Special Issue of Current Opinion in Structural Biology on AI Methodologies in Structural Biology, which I co-edited alongside Chris de Graaf and Andreas Bender. Here's a link to the overview, but all credit to the authors for their coverage of the current state-of-the-art in the field.

And finally, we have an opening for a cheminformatician at Sosei Heptares - apply, tell your friends, etc. The details are all at that link. The position is based in Cambridge, UK. We have no specific grade in mind - you could be just finishing your PhD, or be much more experienced. If you have any questions, e.g. you would like to apply but are not sure whether you meet the requirements, just email me at noel.oboyle@soseiheptares.com. The closing date is 19th Jan 2024, but I'd recommend applying now if interested.

Saturday, 23 September 2023

SmiZip at the 12th RDKit UGM

 

I really enjoyed the recent RDKit UGM in Mainz, hosted by Paul Czodrowski and his group and presided over as usual by Greg Landrum. Great talks, and it was also good to catch up with old and new friends. In particular, it was great to meet Tim Vandermeersch again, with whom I worked on Open Babel many years ago now. He blew people's minds with a talk about converting SMARTS patterns to highly-efficient C++ at compile-time.
 
My talk was more about shrinking things, specifically SMILES strings. Back in 2001 Roger gave a talk where he described SmiZip, a method to compress SMILES strings. I revisited this and provided the first public implementation of SmiZip, going on to discuss some potential applications, including use as a measure of structural complexity:

 

If SmiZip sounds interesting, check out the info on the GitHub site and then "pip install smizip".