Thursday 28 March 2024

Learning the LINGO

LINGO is a simple similarity measure for SMILES strings, based on shared n-grams between the strings themselves. In many respects the relationship between an ECFP4 fingerprint and a molecule is equivalent to the relationship between a LINGO and a SMILES string. While it has obvious drawbacks, its charm lies in its ease of calculation which lends itself to high-speed implementations by some very clever people (Haque/Pande/Walters, Grant...Sayle).

I'm not going to add an entry to that canon, but I recently realised that the Counter class in Python supports union and intersection, which makes it just perfect for an elegant implementation in Python:

import itertools
import collections

from mtokenize import tokenize

def sliding_window(iterable, n):
    """Collect data into overlapping fixed-length chunks or blocks
    This is one of the recipes in the itertools documentation.

    >>> ["".join(x) for x in sliding_window('ABCDEFG', 4)]
    ['ABCD', 'BCDE', 'CDEF', 'DEFG']
    it = iter(iterable)
    window = collections.deque(itertools.islice(it, n-1), maxlen=n)
    for x in it:
        yield tuple(window)

def lingo(seqA, seqB):
    """Implement a 'count' version of LINGO given tokenized SMILES strings"""
    counts = [collections.Counter(sliding_window(seq, 4)) for seq in [seqA, seqB]]
    intersection = counts[0] & counts[1]
    union = counts[0] | counts[1]
    tanimoto = len(intersection)/len(union)
    return tanimoto

if __name__ == "__main__":
    import doctest

    smiA = "c1ccccc1C(=O)Cl"
    smiB = "c1ccccc1C(=O)N"
    print(lingo(tokenize(smiA), tokenize(smiB)))
For 'mtokenize' I used code for tokenizing SMILES from the last blogpost. As an aside, try renaming to and making the corresponding change to the import statement. See how long it takes you to work out why this causes an error :-).