Monday, 15 August 2022

Notes from the 12th International Conference on Chemical Structures

Following on from my earlier blog post, I recently attended the 12th ICCS in Noordwijkerhout, the Netherlands. This was a really great conference, this time organised by Gerard van Westen and colleagues. If you're interested in attending next time, unfortunately you'll have to wait another 3 years...

The conference started with the awarding of the CSA Trust Mike Lynch Award to Dr Greg Landrum, for his work on RDKit and the community around it. As a CSA Trustee, I had the pleasure of presenting this along with fellow trustee Wendy Warr.

But it wasn't all presenting awards. There was also the cruise on the IJsselmeer to be done, as ably illustrated by Peter Ertl on the right.

And most importantly, the conference itself. Once again I captured my notes as tweets which are available here. Sosei Heptares had three presentations, slides for which are available here:

  • Morgan Thomas (oral presentation): Augmented Hill-Climb improves language-based de novo molecule generation as benchmarked via the open source MolScore platform
  • Dominique Sydow (oral presentation): Integrated Structural Cheminformatics Analysis Tools for Customisable Chemogenomics Driven Kinase and GPCR Drug Design
  • Noel O’Boyle (poster presentation): Application of DeepSMILES to machine-learning of chemical structures

Monday, 25 July 2022

Post-hoc correction of generated representations

The performance of generative models based on SMILES is much improved compared to the early days when only 5-20% of the SMILES generated would be considered valid. In the previous post, an RNN was able to achieve ~96% valid SMILES. Is it possible to increase this still further via post-hoc correction, and if so, is this a good idea?

As a strawman, let's consider a simple approach to ensure that a formerly invalid SMILES becomes valid, which is to trim back the SMILES from the right until only a valid SMILES remains (code at end). Let's look at the results for the failures reported in the previous blog post:

Aromatic system cannot be kekulized
Orig:    Cc1c(NS(=O)(=O)c2cccc3cccnc23)c(=O)n(-c2ccccc2Cl)c1C#N
Fixed!:  C

Unmatched close parenthesis
Orig:    c1cn(-c2ccc3nc(-c4ccc5c(c4)OCO5)nc4ccccc34)[nH]c2=O)cn1
Fixed!:

1 branch has not been closed
Orig:    O=C(NC(C(=O)Nc1cc(Cc2n[nH]c(=O)c3ccccc23)ccc1F)C1CC1
Fixed!:  O=C

2 ring openings have not been closed
Orig:    OCC1CCC2(CC1)CC(Oc1ccccc1)C(C#N)[N+](=O1)[O-]
Fixed!:  OCC

1 ring opening has not been closed
Orig:    CN1C(=O)NC2(CC2(Cl)Br)c1ccccc1
Fixed!:  CN1C(=O)NC2(CC2(Cl)Br)c1ccccc

Aromatic system cannot be kekulized
Orig:    COc1cc(OC(F)F)c(C(=O)NCc2ccn(C)c(=O)c2)c(-c2ccc(N(C)S)cc2)cc1OC
Fixed!:  CO

Aromatic system cannot be kekulized
Orig:    COc1ccc2c(c1)c(Cc1ccc(F)cc1)c(C)[n+]2[O-]
Fixed!:  CO

1 ring opening has not been closed
Orig:    COc1cccc(Oc2ccc3c(c2)C(=O)c2c4c(ccc2CCO)OCO3)c1
Fixed!:  CO

Aromatic system cannot be kekulized
Orig:    O=c1c(Cn2cc(-c3ccccc3)nn2CC2CCCO2)ccn1Cc1ccccc1F
Fixed!:  O

Aromatic system cannot be kekulized
Orig:    Cc1c(Cc2cccc(-c3cc(=O)n(O)c(=O)s3)c2)coc2cc(O)cc(O)c12
Fixed!:  C

Aromatic system cannot be kekulized
Orig:    CC(=O)c1sc2c(c1NC(=O)c1ccc(O)c(OC)c1)c(C)n2C(C)C
Fixed!:  CC(=O)c1sc2c(c1NC(=O)c1ccc(O)c(OC)c1)c(C)n2

1 ring opening has not been closed
Orig:    O=C(Cn1cc([N+](=O)[O-])cn1)N1C2CCC1CC(n1c(=O)c(C(=O)O)nc3cccc1c32)OCOC4
Fixed!:  O=C(Cn1cc([N+](=O)[O-])cn1)N1C2CCC1CC(n1c(=O)c(C(=O)O)nc3cccc1c32)OCOC

Aromatic system cannot be kekulized
Orig:    CC(C)c1cccc(-c2nc3nccc(C(F)(F)F)n3c(=O)[nH]2)c1
Fixed!:  CC

Aromatic system cannot be kekulized
Orig:    CCOc1c(S(=O)(=O)Nc2ccc(F)cc2)c(O)c2ccccc2n1C
Fixed!:  CCOc1c(S(=O)(=O)Nc2ccc(F)cc2)c(O)c2ccccc2n1

Aromatic system cannot be kekulized
Orig:    CC(C)(C)c1ccc(-c2noc(O)n2N)cc1
Fixed!:  CC

1 branch has not been closed
Orig:    Cc1ccc(CN2CCC(C3CC3C(=O)Nc3ccc(S(=O)(=O)N4CCC4)cc32)cc1-n1cnnn1
Fixed!:  C

2 ring openings have not been closed
Orig:    c1ccc(Cn2cnc3c(OCC4CCCN5)nccc32)cc1
Fixed!:

2 ring openings have not been closed
Orig:    CCc1ncc(C(=O)Nc2cc(C3C4CN(Cc5cnn(C)c5C)CCOCC53)c(O)cn2)cn1
Fixed!:  CC

Aromatic system cannot be kekulized
Orig:    C#Cc1ccc2nc3ccc(C(=O)NCCN(C)C)cc3oc2c1
Fixed!:  C#C

Aromatic system cannot be kekulized
Orig:    O=C(c1nc2cc3ccccc3sc2c(=O)[nH]1)N1CCCc2ccccc21
Fixed!:  O=C

Unmatched close parenthesis
Orig:    CC1CN2c3c(nc4c(C(=O)O)nn(-c5cc(F)ccc5F)c4F)n3CC12)OCCO4
Fixed!:  CC

3 ring openings have not been closed
Orig:    COC1CCCc2c1nc1c(c2N)CCc2ccccc2-23
Fixed!:  COC

Uncommon valence or charge state
Orig:    CCOC(=O)C1=C(C)N=C2SC(=C[N+](=O)[O-])NC2=OC1=O
Fixed!:  CCOC(=O)C

Aromatic system cannot be kekulized
Orig:    CN(C)CCn1c(=C2C(=O)Nc3ccc(Br)cc32)c(O)c2ccccc21
Fixed!:  CN(C)CC

Aromatic system cannot be kekulized
Orig:    Oc1ccc(-c2ccc3c(O)c(OC4OCCCO4)c(Cl)c3c2)cc1
Fixed!:  O

Cannot have a bond opening and closing on the same atom
Orig:    Cc1ccc2[nH]c(C(=O)N3CC44C=CC(CC3)c3ccccc34)cc2n1
Fixed!:  C

1 ring opening has not been closed
Orig:    O=C(c1ccccc1)c1cc2c3c(c(OC4OC(CO)C(O)C(O)C4O)c(=O)cc3c(=O)n12)CCCCC3
Fixed!:  O=C(c1ccccc1)c1cc2c3c(c(OC4OC(CO)C(O)C(O)C4O)c(=O)cc3c(=O)n12)CCCCC

Aromatic system cannot be kekulized
Orig:    CCc1cnc(CN(C)CCCn2c(=O)c3cc(F)ccc3n2C)n1
Fixed!:  CC

1 ring opening has not been closed
Orig:    CNC(=O)c1cc2[nH]c(=O)c3c(n2n1)CC1CC3C(C)C(O)CCC12C
Fixed!:  CNC(=O)c1cc2[nH]c(=O)c3c(n2n1)CC1CC3C(C)C(O)CCC1

Aromatic system cannot be kekulized
Orig:    O=[N+]([O-])c1ccc2c(c1)SC(c1nc(Nc3ccccc3[N+](=O)[O-])[nH]c1=O)O2
Fixed!:  O

Aromatic system cannot be kekulized
Orig:    COC1COC(c2c(NC(c3ncccc3Br)c3ccccc3)[nH]n2)C1
Fixed!:  COC

Unmatched close parenthesis
Orig:    O=C(C1CCN(c2cc(N3CCC(n4c(=O)[nH]nc(-c5ccccc5)C4)cc3)OC4)[nH]c2=O)CC1)N1CCCC1
Fixed!:  O=C

Aromatic system cannot be kekulized
Orig:    O=C1COc2c(NC3CCCC3)c(F)cc3c(=O)c(CCc4nc5ccccc5[nH]4)sc(=O)n1c23
Fixed!:  O=C

2 ring openings have not been closed
Orig:    CC(=O)C1=C(O)C=C2Oc3c4c(c(C)c(O)c3C24C(C)CC1=O)OC(C)(C)C(C(=O)N1CCOCC1)C45
Fixed!:  CC(=O)C1=C(O)C=C2Oc3c4c(c(C)c(O)c3C24C(C)CC1=O)OC(C)(C)C(C(=O)N1CCOCC1)C

Unmatched close parenthesis
Orig:    Nc1ncc(-c2ccc(N3CCOC4CCNC4)cc3)cc2)C(=O)[nH]1
Fixed!:  N

1 branch has not been closed
Orig:    CC(O)(C#Cc1ccc2c(c1)C(O)(c1cn(C3CC(O)C(CO)O3)c(=O)[nH]c1=O)CC2(C)C
Fixed!:  CC

1 ring opening has not been closed
Orig:    CNCCCn1nc(COc2ccc3c4c(c(O)c(C)cc3c2=O)C(=O)N2CCCC2)c2c(N)ncnc21
Fixed!:  CNCCC

Aromatic system cannot be kekulized
Orig:    Cc1cc(NC(=O)NC(C)c2ccccc2)n2ccsc2n1
Fixed!:  C

Aromatic system cannot be kekulized
Orig:    COc1ccc2c(c1)CCc1c-2c2c(c3c1[nH]c1ccc(O)c1c3=O)C(=O)CCC2
Fixed!:  CO

Aromatic system cannot be kekulized
Orig:    Cn1ccc(Nc2cccc3ccccc23)c(N)n1
Fixed!:  C

Aromatic system cannot be kekulized
Orig:    CN1C(=O)c2c(n[nH]c2=O)Cc2ccccc21
Fixed!:  CN

2 ring openings have not been closed
Orig:    C=C1CCC2=CC(O)C3Oc4c(O)ccc5c4C34CCN(CC1C2C)C3O5
Fixed!:  C=C

Aromatic system cannot be kekulized
Orig:    O=C(O)c1c2c(c3cc(F)ccc3nc1-c1ccccc1)n(C)c(=O)n2C
Fixed!:  O=C(O)c1c2c(c3cc(F)ccc3nc1-c1ccccc1)n(C)c(=O)n2

As can be seen, the trimming can be rather drastic. But one could imagine more subtle methods. Errors about aromatic systems would disappear if Kekule SMILES were used. If ring openings were found to not be closed, then either close them on the last atom or ignore the opening. Ditto for branches.

If we look at SELFIES, we can see some of these approaches applied. Any characters beyond an invalid valence are ignored ("[C][=O][...]" as "[C][=O]"). If a branch is indicated but the number of atoms is fewer than this or information is missing, then the branch is assumed not to exist ("[C][Branch1][O]" treated as "[C]"). Similarly for rings.

I was curious how often these post-hoc corrections occur for a typical dataset. Given the same RNN trained on SELFIES (v1.0.4), 10K SELFIES were sampled and I calculated the % of times where removing the last symbol left the resultant SMILES unchanged (i.e. where at least the last SELFIES symbol is not being interpreted). This gave 10.5%.

Returning to the SMILES failures above, where the NN cannot learn the syntax, does it makes sense to apply post-hoc corrections to ensure validity? For example, if kekulized SMILES are used, then the errors about aromaticity will go away; if there is no change to the distribution of generated aromatic ring systems then this would be a win. The trimming method above is more problematic: two very similar representations could be interpreted as molecules of very different size (causing a discontinuity in the search/latent space). Indeed, given that all characters after an invalid state are ignored, the DNN could end up being trained on noise. A potentially better approach (although still symptomatic of a failure to adequately learn the syntax) would be to handle an uncommon valence or charge state by resampling from the generative distribution in the first place (the partialsmiles library might be helpful here as it can catch such failures as early as possible).

Code

import partialsmiles as ps

if __name__ == "__main__":
    fname = "Regular_SMILES_1K.smi"

    failures = []
    with open(fname) as inp:
        for line in inp:
            smi = line.rstrip()

            ok = False
            msg = None
            try:
                mol = ps.ParseSmiles(smi, partial=False)
                ok = True
            except ps.SMILESSyntaxError as e:
                msg = e.message
            except ps.ValenceError as e:
                msg = e.message
            except ps.KekulizationFailure as e:
                msg = e.message
            if not ok:
                failures.append( (smi, msg) )

    for smi, msg in failures:
        i = len(smi)
        ok = False
        while not ok and i >= 1:
            try:
                mol = ps.ParseSmiles(smi[:i], partial=False)
                ok = True
            except:
                i -= 1
        # Invariant: Either ok or i == 0

        print(msg)
        print("Orig:   ", smi)
        print("Fixed!: ", smi[:i])
        print()

Credits

This blog post originates in a conversation with Janosch Menke, Michael Blakey and John Mayfield that started at the ICCS. Thanks also to Morgan Thomas for data and feedback. Image via Flickr by epictop10.com (licensed CC-BY 2.0).

Friday, 3 June 2022

Diagnosing problems with SMILES

For my poster at the upcoming ICCS, I wanted to categorise any problems with the SMILES strings generated by a recurrent neural network. I did this using the partialsmiles library, a validating SMILES parser I wrote a little while ago.

The speciality of this library is dealing with partial SMILES strings as they are being generated - this potentially allows you to choose an alternative token if the original token causes a problem. However, it can equally well be used with full SMILES strings. Reported errors are broken down into three catagories: valence errors, kekulisation failures and syntax errors. The error message describes the specific problem, and the index of the relevant point in the SMILES string is available. As the docs state, errors associated with the semantics of cis/trans stereo symbols are not currently handled but that's not a problem here.

I mentioned valence errors; this involves a check against a table of allowed valences. I edited the defaults to allow hypervalent nitrogen (i.e. valence 5) as it may be present in the training data. Here's a typical output:
smiles_syntax=20 smiles_valence=1 smiles_kek=22
Total errors: 43   %conversion: 95.7
 22 cases of Aromatic system cannot be kekulized
  4 cases of Unmatched close parenthesis
  3 cases of 1 branch has not been closed
  5 cases of 2 ring openings have not been closed
  6 cases of 1 ring opening has not been closed
  1 cases of 3 ring openings have not been closed
  1 cases of Uncommon valence or charge state
  1 cases of Cannot have a bond opening and closing on the same atom
...and here's the code. Note the use of defaultdict, a hidden gem of the Python library which appears in almost all of my scripts:
from collections import defaultdict

import partialsmiles as ps

if __name__ == "__main__":
    verbose = False
    fname = "Regular_SMILES_1K.smi"

    smiles_syntax = smiles_valence = smiles_kek = 0
    msgs = defaultdict(int)
    N = 0
    with open(fname) as inp:
        for line in inp:
            N += 1
            smi = line.rstrip()
            try:
                mol = ps.ParseSmiles(smi, partial=False)
            except ps.SMILESSyntaxError as e:
                if verbose:
                    print(f"SMILESSyntaxError: {e}")
                smiles_syntax += 1
                msgs[e.message] += 1
            except ps.ValenceError as e:
                if verbose:
                    print(f"ValenceError: {e}")
                smiles_valence += 1
                msgs[e.message] += 1
            except ps.KekulizationFailure as e:
                if verbose:
                    print(f"KekulizationFailure: {e}")
                smiles_kek += 1
                msgs[e.message] += 1

    print(f"{smiles_syntax=} {smiles_valence=} {smiles_kek=}")

    tot_errors = smiles_syntax + smiles_valence + smiles_kek
    print(f"Total errors: {tot_errors}   %conversion: {(N-tot_errors)*100/N:0.1f}")

    for x, y in msgs.items():
        print(f"{y:3d} cases of {x}")

Tuesday, 26 April 2022

Threading time through Vortex

Vortex (a chemical spreadsheet/visualisation software from Dotmatics) has a plugin system built around Jython. Simply drop a .vpy file into a specific scripts folder, and a menu item immediately appears in the application. Here are some notes on using this to communicate with a webserver.

Code organisation

I found it best to separate Vortex-specific code (in the .vpy files) from supporting code that could be written and tested independently. This also naturally enables reuse of code across plugins. This supporting code I put in a folder adjacent to the scripts folder, and accessed it as follows:

import os
import sys
sys.path.append(os.path.join(vortex.getVortexFolder(), "MYFOLDERNAME"))

Something to note is that the application needs to be restarted to pick up on edits made to the supporting Python codebase. This is in contrast to edits made to the .vpy files, which can be tested immediately.

Access to a JSON parser

Communicating with a webservice is most easily done with JSON. Unfortunately, the 'json' module is missing in the bundled Jython. To install your own, you can go down the Java route and download an appropriate .jar file (see for example Chris Swain here or here). As an alternative, I prefer to use Python and the same module I used back in the days of ye olde Python, simplejson, a library which is still supported. I downloaded and extracted this into the MYFOLDERNAME folder mentioned earlier, so that it was available as:

import simplejson as json
Call a webservice without blocking

Passing several hundred or more molecules to a webservice that is calculating some property can take several seconds or indeed minutes. The simple approach to this looks something like the following. Note that here we use time.sleep() as a stand-in for the webservice call:

def main():
    time.sleep(5) # pretend to contact a webservice

if __name__ == "__main__":
    main()

Unfortunately, this causes the entire application to become unresponsive until the time.sleep() is complete. This is because the main thread of a GUI application is supposed to spend its time listening out for events like you clicking on something; if it's busy doing some other work, then it can't respond to those events. The solution is to run your code on a separate thread:

class MyRunnable(java.lang.Runnable):
    def run(self):
        main()
        
if __name__ == "__main__":
    t = java.lang.Thread(MyRunnable())
    t.start()

This seems to works perfectly. The only problem I found was that it prevented any useful error messages from appearing in the console (accessed via Help/Console - diagnostic) apart from "Exception in Thread - 1" or somesuch. If this happened, I temporarily changed the code to call main() directly.

Create and close a dialog box

Given that the calculation might take a while, it's a good idea to indicate to the user that the calculation has started; for example to avoid the user starting the calculation multiple times while waiting for the result. One solution is to pop up an info box immediately to let the user know that the calculation has started, and then close it if still present at the end of the calculation. To do this, we need to create the dialog box ourselves:

import javax.swing.JOptionPane as JOptionPane

pane = JOptionPane("This might take a while...", JOptionPane.INFORMATION_MESSAGE)
dialog = pane.createDialog(None, "The calculation has begun")
dialog.setModal(False)
dialog.setVisible(True)
...
dialog.setVisible(False) # close it if still present
Conclusion

If you find this useful or have any additional tips/tricks feel free to leave a comment.

[Update 27/04/2022] On Twitter, John Mayfield adds "From a few years ago now but I remember the jython version of the Python http request was really slow and was much much faster to use Java’s libs (still via jython)". Chris Swain pointed to his Cambridge MedChem Consulting website which has a large number of useful scripts.

Sunday, 30 May 2021

Combining protein structure with deep generative models for ligands

Journal of Cheminformatics has just published the first result from a collaboration between ourselves at Sosei Heptares and the Andreas Bender group. Morgan Thomas, the PhD student who did all the work, has presented early versions of this at various AI/Chemistry meetings but it's finally out there:

Morgan Thomas, Robert T. Smith, Noel M. O'Boyle, Chris de Graaf, Andreas Bender. Comparison of structure- and ligand-based scoring functions for deep generative models: a GPCR case study J. Cheminform. 2021, 13, 39.

Deep generative models have shown the ability to devise both valid and novel chemistry, which could significantly accelerate the identification of bioactive compounds. Many current models, however, use molecular descriptors or ligand-based predictive methods to guide molecule generation towards a desirable property space. This restricts their application to relatively data-rich targets, neglecting those where little data is available to sufficiently train a predictor. Moreover, ligand-based approaches often bias molecule generation towards previously established chemical space, thereby limiting their ability to identify truly novel chemotypes.

In this work, we assess the ability of using molecular docking via Glide—a structure-based approach—as a scoring function to guide the deep generative model REINVENT and compare model performance and behaviour to a ligand-based scoring function. Additionally, we modify the previously published MOSES benchmarking dataset to remove any induced bias towards non-protonatable groups. We also propose a new metric to measure dataset diversity, which is less confounded by the distribution of heavy atom count than the commonly used internal diversity metric. 
With respect to the main findings, we found that when optimizing the docking score against DRD2, the model improves predicted ligand affinity beyond that of known DRD2 active molecules. In addition, generated molecules occupy complementary chemical and physicochemical space compared to the ligand-based approach, and novel physicochemical space compared to known DRD2 active molecules. Furthermore, the structure-based approach learns to generate molecules that satisfy crucial residue interactions, which is information only available when taking protein structure into account.

Overall, this work demonstrates the advantage of using molecular docking to guide de novo molecule generation over ligand-based predictors with respect to predicted affinity, novelty, and the ability to identify key interactions between ligand and protein target. Practically, this approach has applications in early hit generation campaigns to enrich a virtual library towards a particular target, and also in novelty-focused projects, where de novo molecule generation either has no prior ligand knowledge available or should not be biased by it.

For further background, a Q&A with Morgan appears over on Andreas's blog.

Monday, 18 January 2021

Data/cheminf/compchem openings at Sosei Heptares

A year and a half into my new life in pharma, and I'm really enjoying it. And now we're looking for new members of the team at Sosei Heptares in Cambridge (UK), with the advertised posts covering everything from data management, cheminformatics through to computational chemistry, both junior and senior.

I've pasted in the basic details of the posts below, but there are more details if you follow the links. Feel free to reach out to me if you have questions (noel.oboyle@soseiheptares.com), or contact Chris de Graaf who heads the Computational Chemistry team.

Computational Chemist – 3 positions at Sosei Heptares (Cambridge, UK) (Research Scientist, Senior Scientist, Principal Scientist)

We are growing our Computer-Aided Drug Design and Cheminformatics/AI capabilities by extending the Sosei Heptares Computational Chemistry team with three additional positions:

Link to advertised positions

These cover all experience levels from recent PhD to a well experienced senior computational chemist in drug discovery. The positions are flexible, so different combinations of skills and/or experience are acceptable for the right candidate, so please forward to those who you feel passionate about joining the Sosei Heptares CompChem team where scientific excellence and passion combine in a friendly fun environment to impact drug discovery projects and create new cutting-edge approaches.

Discovery Data Manager at Sosei Heptares (Cambridge, UK)

In addition, Sosei Heptares is looking to recruit an experienced Discovery Data Manager to support our Research team:

Link to advertised position

This position is an exciting opportunity to work at the interface between Computational Chemistry, Medicinal Chemistry, Molecular Pharmacology, Translational Sciences, and Platform groups to streamline the GPCR structure-based drug discovery process in an industry-leading biotech company.

Closing dates for all applications is 14th March.

Saturday, 24 October 2020

The SMILES reading benchmark - two years on

In August 2017, after attending an InChI meeting at the NIH in Bethesda, I had the idea of putting together a SMILES reading benchmark. I already had the bones of one to test my rewrite of Open Babel's reading of aromatic SMILES, but after attending a workshop led by Greg Landrum on Open File Formats for Chemical Information I decided to tidy it up and broaden the scope.

My goals were to identify issues affecting interoperability, to resolve those issues by working with developers, and to provide a resource to help future implementations avoid problems. This last goal has recently been realised through Rich Apodaca's work on a Rust-based SMILES parser where he gives an extensive write-up on the role of the SMILES benchmark. The benchmark has also been of use to the IUPAC SMILES+ project, which grew out of Greg's workshop at the NIH and is led by Vin Scalfani.

Results and progress were described in a poster at the ICCS in June 2018, and subsequently (with updates) at the ACS in Aug of that year in "A de facto standard or a free-for-all? A benchmark for reading SMILES". I've thought about writing up a paper but I was never really keen - the point wasn't to write a paper, or point out software that had problems, but to improve SMILES. 


Back in the heady days of 2017-18, my approach with the benchmark was to work with, or at least nudge, various software vendors/developers towards improved interoperability. A tricky task when I worked for a software vendor myself, was a developer of a cheminformatics toolkit, and was sometimes neither a customer nor a user. Despite this, the benchmark was reasonably successful...but not completely, and two years down the line I find myself in a different environment relying on different tools, and wondering if some more nudging in the right direction might help.

In this spirit, let's take a look at an example from the ChemDraw results in the benchmark (to be found here), illustrate the problem and work out the solution by hand.

Figure 1 (left) shows entry 26359 in the benchmark. The CDK generates the following aromatic SMILES for this structure: c1(=O)c2c(c3=c1n(nco3)C)cccc2. However, when this SMILES is pasted into ChemDraw, the depiction in Figure 1 (middle) is obtained, which resolves to the structure on the right on hitting Alt+K. No error or warning appeared that might indicate problems when reading the SMILES.

Figure 1

Now let's do this by hand. Figure 2 shows the structure as described by the SMILES string. A key point to note/remember is that a SMILES string exactly describes the hydrogen count on every atom - we 'just' need to work out the bond orders of the aromatic bonds making sure that every atom that needs a double bond gets exactly one.

Figure 2

For the actual details of the algorithm, check out the source code of Open Babel or my partialsmiles project (also the CDK, but that's a different algorithm than described here). But you can think of it like solving Minesweeper - to begin with we tackle the bits we are sure about, before we have to start guessing. The two bonds to the carbonyl carbon must be single bonds; ditto for the bonds to NMe, and to the O in the ring (see here for some details). The remaining bonds to be kekulized are shown in black in Figure 3 (left):

Figure 3

We'll call this point A. Each of the remaining black atoms needs to have a double bond. But which to start with? If we put the first double bond in the wrong place we might end up having to start over. Again, you should start with those you are certain about - and that's those black atoms that have a single black bond. This must be a double bond. Once you've placed those, set the other neighbouring bonds to single, and updated the list of atoms that need a double bond, your structure will look like Figure 3 (middle). 

At this point, there are no black atoms with just a single black bond, so it's time to guess: just choose one and place a double bond. Now update the list of atoms that need a double bond, and go back to point A. Keep repeating until all the bonds are kekulized...or there are no bonds left to choose.

For more than 95% of the cases in the benchmark this will result in a kekulized structure. For the remaining cases, you instead end up with a pair of black atoms that don't have a double bond. To fix this, do a DFS to find an alternating path ('augmenting path') that joins them, and then flip the bond orders along the path. For example, consider the situation below, where I started by placing the double bond along the bond joining the 6-membered rings. To fix, just flip the bond orders from C-C=C-C to C=C-C=C.

Figure 4

The described procedure will successfully kekulize any structure that can be kekulized. Feel free to reach out if you have any questions.