Friday, 3 June 2022

Diagnosing problems with SMILES

For my poster at the upcoming ICCS, I wanted to categorise any problems with the SMILES strings generated by a recurrent neural network. I did this using the partialsmiles library, a validating SMILES parser I wrote a little while ago.

The speciality of this library is dealing with partial SMILES strings as they are being generated - this potentially allows you to choose an alternative token if the original token causes a problem. However, it can equally well be used with full SMILES strings. Reported errors are broken down into three categories: valence errors, kekulisation failures and syntax errors. The error message describes the specific problem, and the index of the relevant point in the SMILES string is available. As the docs state, errors associated with the semantics of cis/trans stereo symbols are not currently handled, but that's not a problem here.
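To make the partial/full distinction concrete, here is a toy sketch that looks only at branch parentheses. It is not how partialsmiles works internally; the function name check_parens, its return format and the index it reports for an unclosed branch are all made up for illustration. The point is that an unclosed branch is only a problem once the string is known to be complete, whereas an unmatched close parenthesis can never be repaired by adding more tokens.

```python
def check_parens(smi, partial):
    """Toy check of branch parentheses only (hypothetical helper, not partialsmiles).

    Returns None if no problem is found, otherwise a (message, index) tuple.
    """
    depth = 0
    for idx, ch in enumerate(smi):
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                # Wrong in both modes: no later token can fix this
                return ("Unmatched close parenthesis", idx)
            depth -= 1
    if depth and not partial:
        # Only an error when the string is supposed to be finished
        noun = "branch has" if depth == 1 else "branches have"
        # Reporting the last character is an arbitrary choice for this sketch
        return (f"{depth} {noun} not been closed", len(smi) - 1)
    return None


print(check_parens("CC(C", partial=True))
print(check_parens("CC(C", partial=False))
print(check_parens("CC)C", partial=True))
```

In a generator, a check of this kind would be run after each candidate token with partial=True, and once more with partial=False when the string is finished.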

I mentioned valence errors; this involves a check against a table of allowed valences. I edited the defaults to allow hypervalent nitrogen (i.e. valence 5) as it may be present in the training data. Here's a typical output:
smiles_syntax=20 smiles_valence=1 smiles_kek=22
Total errors: 43   %conversion: 95.7
 22 cases of Aromatic system cannot be kekulized
  4 cases of Unmatched close parenthesis
  3 cases of 1 branch has not been closed
  5 cases of 2 ring openings have not been closed
  6 cases of 1 ring opening has not been closed
  1 cases of 3 ring openings have not been closed
  1 cases of Uncommon valence or charge state
  1 cases of Cannot have a bond opening and closing on the same atom
...and here's the code. Note the use of defaultdict, a hidden gem of the Python library which appears in almost all of my scripts:
from collections import defaultdict

import partialsmiles as ps

if __name__ == "__main__":
    verbose = False
    fname = "Regular_SMILES_1K.smi"

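    # Per-category error counters, plus a tally of each distinct error message
    smiles_syntax = smiles_valence = smiles_kek = 0
    msgs = defaultdict(int)
    N = 0  # number of SMILES strings read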
    with open(fname) as inp:
        for line in inp:
            N += 1
            smi = line.rstrip()
            try:
                mol = ps.ParseSmiles(smi, partial=False)
            except ps.SMILESSyntaxError as e:
                if verbose:
                    print(f"SMILESSyntaxError: {e}")
                smiles_syntax += 1
                msgs[e.message] += 1
            except ps.ValenceError as e:
                if verbose:
                    print(f"ValenceError: {e}")
                smiles_valence += 1
                msgs[e.message] += 1
            except ps.KekulizationFailure as e:
                if verbose:
                    print(f"KekulizationFailure: {e}")
                smiles_kek += 1
                msgs[e.message] += 1

    print(f"{smiles_syntax=} {smiles_valence=} {smiles_kek=}")

    tot_errors = smiles_syntax + smiles_valence + smiles_kek
    print(f"Total errors: {tot_errors}   %conversion: {(N-tot_errors)*100/N:0.1f}")

    for x, y in msgs.items():
        print(f"{y:3d} cases of {x}")
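For anyone who hasn't met defaultdict before, here is the tallying pattern from the script in isolation. The message strings are copied from the output above, but the list itself is made up for the example, and sorting by count is an extra the script doesn't do.

```python
from collections import defaultdict

# Made-up sample of error messages (the strings come from the output above)
messages = [
    "Unmatched close parenthesis",
    "Aromatic system cannot be kekulized",
    "Unmatched close parenthesis",
    "Aromatic system cannot be kekulized",
    "Aromatic system cannot be kekulized",
]

counts = defaultdict(int)
for msg in messages:
    # A missing key is created with int() == 0, so no "if msg in counts" check is needed
    counts[msg] += 1

# Most frequent message first
for msg, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{n:3d} cases of {msg}")
```

With a plain dict the increment would raise a KeyError the first time each message is seen; defaultdict(int) removes that special case.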
