Monday 30 December 2019

No charge - A simple approach to neutralising charged molecules

There are several reasons you might want to convert a molecule with charges to its neutral form. For example, as a step in standardising a representation for registration, to generate a parent form during registration, to deduplicate vendor-supplied databases, or to simplify SMARTS pattern searches for a particular ionizable group. Here I describe a simple procedure to generate the neutral form of a molecule, and briefly compare it to existing approaches. While a more nuanced approach may be more appropriate to prepare molecules for registration, the described method appears to be suitable for the other applications listed above.

Algorithm
To a first approximation the procedure is simply to identify all atoms with either a +1 or -1 charge, set their charges to zero and adjust their hydrogen counts by -1 or +1 (i.e. we are adding/removing H+). The first minor wrinkle is that +1 charged atoms must have a hydrogen, or otherwise we can't remove H+. The second wrinkle is that we must avoid altering charge-separated representations of multiple bonds, such as nitro which is often represented as [N+](=O)[O-]. I do this by checking whether a neighbor atom has a negative charge (for the +1 atoms) or a positive charge (for the -1 atoms).

Test set
For the purposes of timing implementations of the algorithm and comparing results to another algorithm, I created a test set from a ChEMBL 23 SMILES file I had to hand:
> obabel chembl_23.smi -r -O singlecomp.smi
> obabel singlecomp.smi -aa -s "[!+0]" -O charged.smi
> C:\cygwin\bin\gawk '{print length($0), $1, $2 }' < charged.smi | C:\cygwin\bin\sort -n > tmp.txt
> C:\cygwin\bin\cut -f 2,3 -d " " tmp.txt > charged.sorted.smi

Preamble
First, my usual boilerplate. The code below does not actually use Pybel, but I find it convenient to import both Pybel and openbabel as follows:
from openbabel import pybel
ob = pybel.ob
Implementations
I will describe 4 implementations of the algorithm above which show my progress in optimising the code for speed. Each implementation gives identical results on the test set, but runs faster. Note that hydrogens are assumed to be implicit (this will be the case if SMILES written by Open Babel are used as input, as in the test set above); if not, then they must either be converted to implicit, or the algorithm adjusted to take this into account.

Version 1
This would be the fastest version, if this were written in C++. When applied to the 125,536 molecules in the test set, this version spends 11.3s in the neutralize_v1 function.
def NoNegativelyChargedNbr(atom):
    for nbr in ob.OBAtomAtomIter(atom):
        if nbr.GetFormalCharge() < 0:
            return False
    return True

def NoPositivelyChargedNbr(atom):
    for nbr in ob.OBAtomAtomIter(atom):
        if nbr.GetFormalCharge() > 0:
            return False
    return True

def neutralize_v1(mol):
    """Neutralize charges of +1 or -1"""
    for atom in ob.OBMolAtomIter(mol):
        chg = atom.GetFormalCharge()
        if chg == 1:
            hcount = atom.GetImplicitHCount()
            if hcount >= 1 and NoNegativelyChargedNbr(atom):
                atom.SetFormalCharge(0)
                atom.SetImplicitHCount(hcount - 1)
        elif chg == -1: # -ve charge
            hcount = atom.GetImplicitHCount()
            if NoPositivelyChargedNbr(atom): # avoid nitro, etc.
                atom.SetFormalCharge(0)
                atom.SetImplicitHCount(hcount + 1)

Version 2
Iterating over all of the atoms of all of the molecules in Python takes some time due to the memory allocation for each atom on the Python side. We can avoid this by using a SMARTS pattern to find just those atoms that have a +1/-1 charge; now the Python code will only have to iterate over the matched atoms, usually zero or a small number.

With this change, only 5.3s is spent in the neutralize function.
pat_v2 = ob.OBSmartsPattern()
pat_v2.Init("[+1,-1]")

def neutralize_v2(mol):
    """Neutralize charges of +1 or -1"""
    pat_v2.Match(mol)
    matches = pat_v2.GetMapList()
    for match in matches:
        atom = mol.GetAtom(match[0])
        chg = atom.GetFormalCharge()
        if chg == 1: # +1 charge
            hcount = atom.GetImplicitHCount()
            if hcount >= 1 and NoNegativelyChargedNbr(atom):
                atom.SetFormalCharge(0)
                atom.SetImplicitHCount(hcount - 1)
        else: # -1 charge
            hcount = atom.GetImplicitHCount()
            if NoPositivelyChargedNbr(atom):
                atom.SetFormalCharge(0)
                atom.SetImplicitHCount(hcount + 1)

Version 3
Now that we're using a SMARTS pattern, we might as well jam more of the logic into it. The first is the hydrogen check for +1 charged atoms; the second is the test to ensure that there isn't a neighboring atom with a negative charge (for +1 atoms) or positive charge (for -1 atoms). Unfortunately SMARTS does not provide a direct test for an atom bearing positive or negative charge, so I just end up listing a bunch of charges.

With these changes, the helper functions for checking charges on neighboring atoms are no longer needed, and in addition the 'chg' branches in the code can be unified. As a result, Version 3 is considerably shorter than earlier versions and faster as well, taking 3.5s.
pat_v3 = ob.OBSmartsPattern()
pat_v3.Init("[+1!h0!$([*]~[-1,-2,-3,-4]),-1!$([*]~[+1,+2,+3,+4])]")

def neutralize_v3(mol):
    """Neutralize charges of +1 or -1"""
    pat_v3.Match(mol)
    for match in pat_v3.GetMapList():
        atom = mol.GetAtom(match[0])
        chg = atom.GetFormalCharge()
        hcount = atom.GetImplicitHCount()
        atom.SetFormalCharge(0)
        atom.SetImplicitHCount(hcount - chg)

Version 4
I was going to leave it there, but on checking the OBSmartsPattern API I noticed a method that enables re-use of the result vector (rather than returning a new one each time). This is useful, because it means that it's possible to have a single memory allocation for the vector up-front, and by re-using it, avoid memory allocations for all subsequent molecules (strictly speaking, there may be a handful of reallocations to make it bigger if necessary, but you're talking ~10 memory allocations versus 125K). I also realised that it's probably a good idea to check the Boolean returned by the Match method.

With these changes, Version 4 spends 1.5s in neutralise, a considerable improvement compared to the original.
matchvector_v4 = ob.vectorvInt()

def neutralize_v4(mol):
    """Neutralize charges of +1 or -1"""
    match_found = pat_v3.Match(mol, matchvector_v4)
    if match_found:
        for match in matchvector_v4:
            atom = mol.GetAtom(match[0])
            chg = atom.GetFormalCharge()
            hcount = atom.GetImplicitHCount()
            atom.SetFormalCharge(0)
            atom.SetImplicitHCount(hcount - chg)

Comparison to other algorithms
I wrote this code as a baseline implementation of neutralisation with which to think about the problem, identify corner cases, and consider existing implementations. However, it seems to be good enough as is to be ready for use. Let's look at the alternatives...

Both Francis Atkinson's Standardiser and Matt Swain's MolVS appear to be aimed at preparing a standard form for registering, and think about charge balancing and so forth. This is not the goal here, and this method is not suitable for that purpose. The form of neutralisation described above could be considered a "per-atom" neutralisation, rather than "whole-molecule"; I want all carboxylates present to be turned into COOH regardless of what else is going on in the molecule; in fact, by the time I run the neutralisation I have already chopped off all but the single largest component, and so it's likely to be unbalanced. I should look into those methods further though, and try to understand the rationale...or maybe just ask the authors at the next Cambridge Cheminformatics Network Meeting. :-) I did compare my results to MolVS and apart from charge balancing differences (e.g. for "C[S+](C)CCCC(=O)[O-]"), I note that MolVS handles [N+]([O-])[O-] better (e.g. "c1cc(cc(c1)F)NC(=O)C(=[N+]([O-])[O-])C#N" derived from CHEMBL1651804); I leave it alone because it appears to be a charge-separated form, but MolVS protonates one of the oxygens.

In terms of its intended goal, the code from Hans de Winter included in the RDKit Cookbook is the most similar. One difference is that that code sticks to a specific list of ionizable groups (and thus misses at least some, e.g. hydroxamic acids) whereas the code above is completely general. Comparing the results, I note that the HdW code incorrectly affects some charge-separated forms (e.g. "CC[S+](C)[O-]" to "CC[S+](O)C", "[O-][O+]=O" (ozone) to "O[O+]=O") but some tweaking of the SMARTS patterns will fix this, and there are some oddities (e.g. "[O-][Cl]=O" converted to "O[Cl+]O").

In short, for my purposes, I find that the simple code above does what I need. If I need something more sophisticated down the line, I'll look more closely at what problems the Standardiser and MolVS codes are attempting to solve.

Using this algorithm
If you are interested in applying this algorithm to your own data, you can find the code in a more convenient form at http://github.com/baoilleach/nocharge. If applying to millions of molecules, you may wish to combine this with my multiprocessing receipe to speed things up N times (where N is the number of cores you have).

[Update 02/01/2020]....or...you can wait until it arrives at an Open Babel near you. I've just added a "--neutralize" operation to my development version. When called from Python (see neutralize_v5 at the Github link above) this takes ~0.5s. Note that the C++ code is the same as in the Python v1.

[Update 06/01/2020] Changed H0 to h0 in the SMARTS pattern (doh!) to avoid an error when handling, for example, deuterated N+.