Monday 19 September 2022

I am the walrus operator

During my time at NextMove Software I learnt quite a bit about writing efficient C++. I applied some of these learnings to Open Babel during that period, e.g. a 55X improvement in SMILES reading/writing speeds. This was gratifying at the time, but of course the flip side of a 55X improvement is that it was 55X slower before that. Really, code should be written to be 'lean' in the first place.

These days, when I write code, it's in Python. Even though it's slower than C++ (or maybe, because it's slower) I still try to write efficiently. For example, consider the following:

if key in mydict:
    myfunction(mydict[key])
else:
    notfound += 1

Both the first and second lines repeat the same work, looking up the same value in a dictionary. These days, I just can't bring myself to write code that does something twice. Instead, I want to look it up once and use the result twice:

val = mydict.get(key, None)
if val:
    myfunction(val)
else:
    notfound += 1

A downside to this is that it loses a bit of readability. In addition, if 'val' is not used elsewhere then it seems untidy to define it with an apparent broader scope than it needs. (Note that some care is needed with the 'if' statement to avoid problems with values of 'val' that evaluate False. "if val is not None" can help here.)

Enter the walrus operator, :=, which arrived with Python 3.8 (the name comes from the similarity to a pair of eyes with two tusks). This allows you to assign a value to a variable in the middle of an expression; for example in the expression used in an 'if' statement. It was/is quite controversial, and the arguments around it led Guido to step down from the BDFL position he had held since the start of Python.

Here is how it could be used in this case:

if val:=mydict.get(key, None):
    myfunction(val)
else:
    notfound += 1

Is this more readable, or is it confusing? I'm familiar with this idiom from C++, but its overuse and/or abuse can really confuse things. See this stack overflow question for example; imagine seeing that in someone's code and trying to figure out what it's doing. The docs I link to above end with the following request: "Try to limit use of the walrus operator to clean cases that reduce complexity and improve readability." - we'll see how that works out.

My feeling is that the walrus operator is best avoided, or used very sparingly; what do you think?

Sunday 28 August 2022

Tournament sizes and their effect on selection distribution

If you've ever looked at genetic algorithms, you may have been struck by the counter-intuitive nature of the selection step. In each iteration, you don't take the best results from the previous iteration and combine them for the next iteration; in fact, if you do this, it works very poorly as you get premature convergence to a local minimum as you've gone all exploitation and no exploration. What you need to do is hold yourself back, and select 'goodish' solutions, because goodish can lead to much better in the long run.

One way to do this is with tournament selection. The one parameter is the size of the tournament, T. It works very simply; choose T items at random and keep the best. Remove that item from the pool and repeat this procedure until you have selected the required number of items.

I usually use a value of T=3, but I was curious what exactly this distribution looked like, and the effect of the tournament size. Since I was not able to find an analytical solution (feel free to leave a comment with the details if you do), I wrote a simulation which selects 1000 numbers between 0 and 99 inclusive with different tournament sizes (here I did selection with replacement just to get the distribution of the first selected item). The code also prints out the smallest number where the cumulative sum of the distribution exceeds 50%:

Tournament size of 1: 50% reached at 49
Tournament size of 2: 50% reached at 29
Tournament size of 3: 50% reached at 20
Tournament size of 4: 50% reached at 15
Tournament size of 5: 50% reached at 12
Tournament size of 6: 50% reached at 10
Tournament size of 7: 50% reached at 9
import random
random.seed(1)
import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":
    tour_sizes = [1, 2, 3, 4, 5, 6, 7]
    num_repetitions = 1000
    num_selections = 1000
    for tour_size in tour_sizes:
        averages = [[] for x in range(100)]
        for rep in range(num_repetitions): # repeat to get the mean
            count = [0] * 100
            for i in range(num_selections): # observe the distribution across selections
                # tournament selection of size tour_size
                select = min(random.sample(range(100), tour_size))
                count[select] += 1
            for i in range(100):
                averages[i].append(count[i])
        for i in range(100):
            averages[i] = np.mean(averages[i])
        plt.plot(range(100), averages, label=f'{tour_size}')
        cumsum = 0
        for i in range(100):
            cumsum += averages[i]
            if cumsum >= num_selections / 2:
                print(f"Tournament size of {tour_size}: 50% reached at {i}")
                break
    plt.legend()
    plt.ylabel("Count")
    plt.title("Effect of tournament size on distribution")
    plt.xlim(0, 100)
    plt.ylim(bottom=0)
    plt.show()

Update (2023-06-22): I've guessed the analytical solution as y = 10 . N . f**(N-1) where N is the tournament size and f = 1-(x/100) in the diagram above.

Monday 15 August 2022

Notes from the 12th International Conference on Chemical Structures

Following on from my earlier blog post, I recently attended the 12th ICCS in Noordwijkerhout, the Netherlands. This was a really great conference, this time organised by Gerard van Westen and colleagues. If you're interested in attending next time, unfortunately you'll have to wait another 3 years...

The conference started with the awarding of the CSA Trust Mike Lynch Award to Dr Greg Landrum, for his work on RDKit and the community around it. As a CSA Trustee, I had the pleasure of presenting this along with fellow trustee Wendy Warr.

But it wasn't all presenting awards. There was also the cruise on the IJsselmeer to be done, as ably illustrated by Peter Ertl on the right.

And most importantly, the conference itself. Once again I captured my notes as tweets which are available here. Sosei Heptares had three presentations, slides for which are available here:

  • Morgan Thomas (oral presentation): Augmented Hill-Climb improves language-based de novo molecule generation as benchmarked via the open source MolScore platform
  • Dominique Sydow (oral presentation): Integrated Structural Cheminformatics Analysis Tools for Customisable Chemogenomics Driven Kinase and GPCR Drug Design
  • Noel O’Boyle (poster presentation): Application of DeepSMILES to machine-learning of chemical structures

Monday 25 July 2022

Post-hoc correction of generated representations

The performance of generative models based on SMILES is much improved compared to the early days when only 5-20% of the SMILES generated would be considered valid. In the previous post, an RNN was able to achieve ~96% valid SMILES. Is it possible to increase this still further via post-hoc correction, and if so, is this a good idea?

As a strawman, let's consider a simple approach to ensure that a formerly invalid SMILES becomes valid, which is to trim back the SMILES from the right until only a valid SMILES remains (code at end). Let's look at the results for the failures reported in the previous blog post:

Aromatic system cannot be kekulized
Orig:    Cc1c(NS(=O)(=O)c2cccc3cccnc23)c(=O)n(-c2ccccc2Cl)c1C#N
Fixed!:  C

Unmatched close parenthesis
Orig:    c1cn(-c2ccc3nc(-c4ccc5c(c4)OCO5)nc4ccccc34)[nH]c2=O)cn1
Fixed!:

1 branch has not been closed
Orig:    O=C(NC(C(=O)Nc1cc(Cc2n[nH]c(=O)c3ccccc23)ccc1F)C1CC1
Fixed!:  O=C

2 ring openings have not been closed
Orig:    OCC1CCC2(CC1)CC(Oc1ccccc1)C(C#N)[N+](=O1)[O-]
Fixed!:  OCC

1 ring opening has not been closed
Orig:    CN1C(=O)NC2(CC2(Cl)Br)c1ccccc1
Fixed!:  CN1C(=O)NC2(CC2(Cl)Br)c1ccccc

Aromatic system cannot be kekulized
Orig:    COc1cc(OC(F)F)c(C(=O)NCc2ccn(C)c(=O)c2)c(-c2ccc(N(C)S)cc2)cc1OC
Fixed!:  CO

Aromatic system cannot be kekulized
Orig:    COc1ccc2c(c1)c(Cc1ccc(F)cc1)c(C)[n+]2[O-]
Fixed!:  CO

1 ring opening has not been closed
Orig:    COc1cccc(Oc2ccc3c(c2)C(=O)c2c4c(ccc2CCO)OCO3)c1
Fixed!:  CO

Aromatic system cannot be kekulized
Orig:    O=c1c(Cn2cc(-c3ccccc3)nn2CC2CCCO2)ccn1Cc1ccccc1F
Fixed!:  O

Aromatic system cannot be kekulized
Orig:    Cc1c(Cc2cccc(-c3cc(=O)n(O)c(=O)s3)c2)coc2cc(O)cc(O)c12
Fixed!:  C

Aromatic system cannot be kekulized
Orig:    CC(=O)c1sc2c(c1NC(=O)c1ccc(O)c(OC)c1)c(C)n2C(C)C
Fixed!:  CC(=O)c1sc2c(c1NC(=O)c1ccc(O)c(OC)c1)c(C)n2

1 ring opening has not been closed
Orig:    O=C(Cn1cc([N+](=O)[O-])cn1)N1C2CCC1CC(n1c(=O)c(C(=O)O)nc3cccc1c32)OCOC4
Fixed!:  O=C(Cn1cc([N+](=O)[O-])cn1)N1C2CCC1CC(n1c(=O)c(C(=O)O)nc3cccc1c32)OCOC

Aromatic system cannot be kekulized
Orig:    CC(C)c1cccc(-c2nc3nccc(C(F)(F)F)n3c(=O)[nH]2)c1
Fixed!:  CC

Aromatic system cannot be kekulized
Orig:    CCOc1c(S(=O)(=O)Nc2ccc(F)cc2)c(O)c2ccccc2n1C
Fixed!:  CCOc1c(S(=O)(=O)Nc2ccc(F)cc2)c(O)c2ccccc2n1

Aromatic system cannot be kekulized
Orig:    CC(C)(C)c1ccc(-c2noc(O)n2N)cc1
Fixed!:  CC

1 branch has not been closed
Orig:    Cc1ccc(CN2CCC(C3CC3C(=O)Nc3ccc(S(=O)(=O)N4CCC4)cc32)cc1-n1cnnn1
Fixed!:  C

2 ring openings have not been closed
Orig:    c1ccc(Cn2cnc3c(OCC4CCCN5)nccc32)cc1
Fixed!:

2 ring openings have not been closed
Orig:    CCc1ncc(C(=O)Nc2cc(C3C4CN(Cc5cnn(C)c5C)CCOCC53)c(O)cn2)cn1
Fixed!:  CC

Aromatic system cannot be kekulized
Orig:    C#Cc1ccc2nc3ccc(C(=O)NCCN(C)C)cc3oc2c1
Fixed!:  C#C

Aromatic system cannot be kekulized
Orig:    O=C(c1nc2cc3ccccc3sc2c(=O)[nH]1)N1CCCc2ccccc21
Fixed!:  O=C

Unmatched close parenthesis
Orig:    CC1CN2c3c(nc4c(C(=O)O)nn(-c5cc(F)ccc5F)c4F)n3CC12)OCCO4
Fixed!:  CC

3 ring openings have not been closed
Orig:    COC1CCCc2c1nc1c(c2N)CCc2ccccc2-23
Fixed!:  COC

Uncommon valence or charge state
Orig:    CCOC(=O)C1=C(C)N=C2SC(=C[N+](=O)[O-])NC2=OC1=O
Fixed!:  CCOC(=O)C

Aromatic system cannot be kekulized
Orig:    CN(C)CCn1c(=C2C(=O)Nc3ccc(Br)cc32)c(O)c2ccccc21
Fixed!:  CN(C)CC

Aromatic system cannot be kekulized
Orig:    Oc1ccc(-c2ccc3c(O)c(OC4OCCCO4)c(Cl)c3c2)cc1
Fixed!:  O

Cannot have a bond opening and closing on the same atom
Orig:    Cc1ccc2[nH]c(C(=O)N3CC44C=CC(CC3)c3ccccc34)cc2n1
Fixed!:  C

1 ring opening has not been closed
Orig:    O=C(c1ccccc1)c1cc2c3c(c(OC4OC(CO)C(O)C(O)C4O)c(=O)cc3c(=O)n12)CCCCC3
Fixed!:  O=C(c1ccccc1)c1cc2c3c(c(OC4OC(CO)C(O)C(O)C4O)c(=O)cc3c(=O)n12)CCCCC

Aromatic system cannot be kekulized
Orig:    CCc1cnc(CN(C)CCCn2c(=O)c3cc(F)ccc3n2C)n1
Fixed!:  CC

1 ring opening has not been closed
Orig:    CNC(=O)c1cc2[nH]c(=O)c3c(n2n1)CC1CC3C(C)C(O)CCC12C
Fixed!:  CNC(=O)c1cc2[nH]c(=O)c3c(n2n1)CC1CC3C(C)C(O)CCC1

Aromatic system cannot be kekulized
Orig:    O=[N+]([O-])c1ccc2c(c1)SC(c1nc(Nc3ccccc3[N+](=O)[O-])[nH]c1=O)O2
Fixed!:  O

Aromatic system cannot be kekulized
Orig:    COC1COC(c2c(NC(c3ncccc3Br)c3ccccc3)[nH]n2)C1
Fixed!:  COC

Unmatched close parenthesis
Orig:    O=C(C1CCN(c2cc(N3CCC(n4c(=O)[nH]nc(-c5ccccc5)C4)cc3)OC4)[nH]c2=O)CC1)N1CCCC1
Fixed!:  O=C

Aromatic system cannot be kekulized
Orig:    O=C1COc2c(NC3CCCC3)c(F)cc3c(=O)c(CCc4nc5ccccc5[nH]4)sc(=O)n1c23
Fixed!:  O=C

2 ring openings have not been closed
Orig:    CC(=O)C1=C(O)C=C2Oc3c4c(c(C)c(O)c3C24C(C)CC1=O)OC(C)(C)C(C(=O)N1CCOCC1)C45
Fixed!:  CC(=O)C1=C(O)C=C2Oc3c4c(c(C)c(O)c3C24C(C)CC1=O)OC(C)(C)C(C(=O)N1CCOCC1)C

Unmatched close parenthesis
Orig:    Nc1ncc(-c2ccc(N3CCOC4CCNC4)cc3)cc2)C(=O)[nH]1
Fixed!:  N

1 branch has not been closed
Orig:    CC(O)(C#Cc1ccc2c(c1)C(O)(c1cn(C3CC(O)C(CO)O3)c(=O)[nH]c1=O)CC2(C)C
Fixed!:  CC

1 ring opening has not been closed
Orig:    CNCCCn1nc(COc2ccc3c4c(c(O)c(C)cc3c2=O)C(=O)N2CCCC2)c2c(N)ncnc21
Fixed!:  CNCCC

Aromatic system cannot be kekulized
Orig:    Cc1cc(NC(=O)NC(C)c2ccccc2)n2ccsc2n1
Fixed!:  C

Aromatic system cannot be kekulized
Orig:    COc1ccc2c(c1)CCc1c-2c2c(c3c1[nH]c1ccc(O)c1c3=O)C(=O)CCC2
Fixed!:  CO

Aromatic system cannot be kekulized
Orig:    Cn1ccc(Nc2cccc3ccccc23)c(N)n1
Fixed!:  C

Aromatic system cannot be kekulized
Orig:    CN1C(=O)c2c(n[nH]c2=O)Cc2ccccc21
Fixed!:  CN

2 ring openings have not been closed
Orig:    C=C1CCC2=CC(O)C3Oc4c(O)ccc5c4C34CCN(CC1C2C)C3O5
Fixed!:  C=C

Aromatic system cannot be kekulized
Orig:    O=C(O)c1c2c(c3cc(F)ccc3nc1-c1ccccc1)n(C)c(=O)n2C
Fixed!:  O=C(O)c1c2c(c3cc(F)ccc3nc1-c1ccccc1)n(C)c(=O)n2

As can be seen, the trimming can be rather drastic. But one could imagine more subtle methods. Errors about aromatic systems would disappear if Kekule SMILES were used. If ring openings were found to not be closed, then either close them on the last atom or ignore the opening. Ditto for branches.

If we look at SELFIES, we can see some of these approaches applied. Any characters beyond an invalid valence are ignored ("[C][=O][...]" as "[C][=O]"). If a branch is indicated but the number of atoms is fewer than this or information is missing, then the branch is assumed not to exist ("[C][Branch1][O]" treated as "[C]"). Similarly for rings.

I was curious how often these post-hoc corrections occur for a typical dataset. Given the same RNN trained on SELFIES (v1.0.4), 10K SELFIES were sampled and I calculated the % of times where removing the last symbol left the resultant SMILES unchanged (i.e. where at least the last SELFIES symbol is not being interpreted). This gave 10.5%.

Returning to the SMILES failures above, where the NN cannot learn the syntax, does it makes sense to apply post-hoc corrections to ensure validity? For example, if kekulized SMILES are used, then the errors about aromaticity will go away; if there is no change to the distribution of generated aromatic ring systems then this would be a win. The trimming method above is more problematic: two very similar representations could be interpreted as molecules of very different size (causing a discontinuity in the search/latent space). Indeed, given that all characters after an invalid state are ignored, the DNN could end up being trained on noise. A potentially better approach (although still symptomatic of a failure to adequately learn the syntax) would be to handle an uncommon valence or charge state by resampling from the generative distribution in the first place (the partialsmiles library might be helpful here as it can catch such failures as early as possible).

Code

import partialsmiles as ps

if __name__ == "__main__":
    fname = "Regular_SMILES_1K.smi"

    failures = []
    with open(fname) as inp:
        for line in inp:
            smi = line.rstrip()

            ok = False
            msg = None
            try:
                mol = ps.ParseSmiles(smi, partial=False)
                ok = True
            except ps.SMILESSyntaxError as e:
                msg = e.message
            except ps.ValenceError as e:
                msg = e.message
            except ps.KekulizationFailure as e:
                msg = e.message
            if not ok:
                failures.append( (smi, msg) )

    for smi, msg in failures:
        i = len(smi)
        ok = False
        while not ok and i >= 1:
            try:
                mol = ps.ParseSmiles(smi[:i], partial=False)
                ok = True
            except:
                i -= 1
        # Invariant: Either ok or i == 0

        print(msg)
        print("Orig:   ", smi)
        print("Fixed!: ", smi[:i])
        print()

Credits

This blog post originates in a conversation with Janosch Menke, Michael Blakey and John Mayfield that started at the ICCS. Thanks also to Morgan Thomas for data and feedback. Image via Flickr by epictop10.com (licensed CC-BY 2.0).

Friday 3 June 2022

Diagnosing problems with SMILES

For my poster at the upcoming ICCS, I wanted to categorise any problems with the SMILES strings generated by a recurrent neural network. I did this using the partialsmiles library, a validating SMILES parser I wrote a little while ago.

The speciality of this library is dealing with partial SMILES strings as they are being generated - this potentially allows you to choose an alternative token if the original token causes a problem. However, it can equally well be used with full SMILES strings. Reported errors are broken down into three catagories: valence errors, kekulisation failures and syntax errors. The error message describes the specific problem, and the index of the relevant point in the SMILES string is available. As the docs state, errors associated with the semantics of cis/trans stereo symbols are not currently handled but that's not a problem here.

I mentioned valence errors; this involves a check against a table of allowed valences. I edited the defaults to allow hypervalent nitrogen (i.e. valence 5) as it may be present in the training data. Here's a typical output:
smiles_syntax=20 smiles_valence=1 smiles_kek=22
Total errors: 43   %conversion: 95.7
 22 cases of Aromatic system cannot be kekulized
  4 cases of Unmatched close parenthesis
  3 cases of 1 branch has not been closed
  5 cases of 2 ring openings have not been closed
  6 cases of 1 ring opening has not been closed
  1 cases of 3 ring openings have not been closed
  1 cases of Uncommon valence or charge state
  1 cases of Cannot have a bond opening and closing on the same atom
...and here's the code. Note the use of defaultdict, a hidden gem of the Python library which appears in almost all of my scripts:
from collections import defaultdict

import partialsmiles as ps

if __name__ == "__main__":
    verbose = False
    fname = "Regular_SMILES_1K.smi"

    smiles_syntax = smiles_valence = smiles_kek = 0
    msgs = defaultdict(int)
    N = 0
    with open(fname) as inp:
        for line in inp:
            N += 1
            smi = line.rstrip()
            try:
                mol = ps.ParseSmiles(smi, partial=False)
            except ps.SMILESSyntaxError as e:
                if verbose:
                    print(f"SMILESSyntaxError: {e}")
                smiles_syntax += 1
                msgs[e.message] += 1
            except ps.ValenceError as e:
                if verbose:
                    print(f"ValenceError: {e}")
                smiles_valence += 1
                msgs[e.message] += 1
            except ps.KekulizationFailure as e:
                if verbose:
                    print(f"KekulizationFailure: {e}")
                smiles_kek += 1
                msgs[e.message] += 1

    print(f"{smiles_syntax=} {smiles_valence=} {smiles_kek=}")

    tot_errors = smiles_syntax + smiles_valence + smiles_kek
    print(f"Total errors: {tot_errors}   %conversion: {(N-tot_errors)*100/N:0.1f}")

    for x, y in msgs.items():
        print(f"{y:3d} cases of {x}")

Tuesday 26 April 2022

Threading time through Vortex

Vortex (a chemical spreadsheet/visualisation software from Dotmatics) has a plugin system built around Jython. Simply drop a .vpy file into a specific scripts folder, and a menu item immediately appears in the application. Here are some notes on using this to communicate with a webserver.

Code organisation

I found it best to separate Vortex-specific code (in the .vpy files) from supporting code that could be written and tested independently. This also naturally enables reuse of code across plugins. This supporting code I put in a folder adjacent to the scripts folder, and accessed it as follows:

import os
import sys
sys.path.append(os.path.join(vortex.getVortexFolder(), "MYFOLDERNAME"))

Something to note is that the application needs to be restarted to pick up on edits made to the supporting Python codebase. This is in contrast to edits made to the .vpy files, which can be tested immediately.

Access to a JSON parser

Communicating with a webservice is most easily done with JSON. Unfortunately, the 'json' module is missing in the bundled Jython. To install your own, you can go down the Java route and download an appropriate .jar file (see for example Chris Swain here or here). As an alternative, I prefer to use Python and the same module I used back in the days of ye olde Python, simplejson, a library which is still supported. I downloaded and extracted this into the MYFOLDERNAME folder mentioned earlier, so that it was available as:

import simplejson as json
Call a webservice without blocking

Passing several hundred or more molecules to a webservice that is calculating some property can take several seconds or indeed minutes. The simple approach to this looks something like the following. Note that here we use time.sleep() as a stand-in for the webservice call:

def main():
    time.sleep(5) # pretend to contact a webservice

if __name__ == "__main__":
    main()

Unfortunately, this causes the entire application to become unresponsive until the time.sleep() is complete. This is because the main thread of a GUI application is supposed to spend its time listening out for events like you clicking on something; if it's busy doing some other work, then it can't respond to those events. The solution is to run your code on a separate thread:

class MyRunnable(java.lang.Runnable):
    def run(self):
        main()
        
if __name__ == "__main__":
    t = java.lang.Thread(MyRunnable())
    t.start()

This seems to works perfectly. The only problem I found was that it prevented any useful error messages from appearing in the console (accessed via Help/Console - diagnostic) apart from "Exception in Thread - 1" or somesuch. If this happened, I temporarily changed the code to call main() directly.

Create and close a dialog box

Given that the calculation might take a while, it's a good idea to indicate to the user that the calculation has started; for example to avoid the user starting the calculation multiple times while waiting for the result. One solution is to pop up an info box immediately to let the user know that the calculation has started, and then close it if still present at the end of the calculation. To do this, we need to create the dialog box ourselves:

import javax.swing.JOptionPane as JOptionPane

pane = JOptionPane("This might take a while...", JOptionPane.INFORMATION_MESSAGE)
dialog = pane.createDialog(None, "The calculation has begun")
dialog.setModal(False)
dialog.setVisible(True)
...
dialog.setVisible(False) # close it if still present
Conclusion

If you find this useful or have any additional tips/tricks feel free to leave a comment.

[Update 27/04/2022] On Twitter, John Mayfield adds "From a few years ago now but I remember the jython version of the Python http request was really slow and was much much faster to use Java’s libs (still via jython)". Chris Swain pointed to his Cambridge MedChem Consulting website which has a large number of useful scripts.