Friday 30 December 2016

The clockwisdom of SMARTS

Image credit: Jonathan Cohen (CC-BY-NC)

In earlier posts I discussed/investigated how stereochemistry is represented in SMILES. Here I'm going to try to figure out what I thought would be relatively simple, how to write a SMARTS pattern that matches a chiral molecule and all of its superstructures. For example, consider wanting to search for glucopyranose-containing molecules in a database that also contained other hexopyranose epimers. As background, see the Mischievous SMARTS post by John Mayfield.

As a simple example, let's take the molecule represented by the SMILES F[C@@H](Br)Cl. I want to write a SMARTS pattern that matches this, as well as all superstructures. In this context, given that the halides typically are single valent, such superstructures are replacements of the hydrogen by arbitrary R groups.

Now, you may be aware that every SMILES string is also a valid SMARTS pattern. Unfortunately, it is also true that this is rarely the SMARTS pattern that you want. In this particular case, the original SMILES string, when interpreted as a SMARTS query, requires that the C has exactly 1 hydrogen attached. In other words, it won't match any superstructures (except for the elusive 5-valent carbon).

So let's leave out the H to give F[C@@](Br)Cl. This cannot be read as SMILES (at least not without warnings) since a chiral carbon requires four neighbours, but it is a valid SMARTS pattern. The question is what does it match?

The answer is more subtle than I, at least, expected. It will only match molecules that correspond to the following pseudosmiles, F[C@@](X)(Br)Cl, where X is anything including an implicit H.

Equally relevant is what it won't match. If X is F, thereby losing the chirality, then you are out of luck, but I would consider that a perfectly reasonable superstructure. And following on from this, it also won't match any other cases where the stereo is not defined at the carbon.

So in the end, I have come to the view that the best SMARTS pattern to use is F[C@@?](Br)Cl, which also matches the case where stereo is undefined. Better to cast the net wide and if someone really doesn't want to match those cases it is easy to do a search-and-replace.

Friday 23 September 2016

Open Babel 2.4.0 released

As announced by Geoff on the mailing list, Open Babel 2.4.0 is now available to download:
I'm pleased to announce that Open Babel 2.4.0 has finally been released.

This release represents a major update and should be a stable upgrade, strongly recommended for all users.

We intend to move to an annual major release every September, with bug fix releases as needed.

A sample of major new features:
  • Integration of the confab conformer generator
  • Improved partial charge models, including EEM methods, EQeq
  • ECFP radial fingerprints
  • Initial support for ring rotamer / conformer sampling
  • Improved GAFF atom typing and parameterization
  • New PHP scripting bindings
  • Many new file formats, features and bug fixes

For a full list of changes and to download:

Thanks to a cast of many for this release, including:
Alexandr Fonari, Anders Steen Christensen, Andreas Kempe, arkose, Benoit Leblanc, Björn Grüning, Casper Steinmann, Chris Morley, Christoph Willing, Craig James, Dagmar Lenk, David Hall, David Koes, David Lonie, David van der Spoel, Dmitriy Fomichev, Fulvio Ciriaco, Fredrik Wallner, Geoff Hutchison, Heiko Becker, Itay Zandbank, Jean-Noel Avila, Jeff Janes, Joaquin Peralta, Joshua Swamidass, Julien Nabet, Karol Langner, Karthik Rajagopalan, Katsuhiko Nishimra, Kevin Horan, Kirill Okhotnikov, Lee-Ping, Matt Harvey, Maciej Wójcikowski, Marcus Hanwell, Mathias Laurin, Matt Swain, Mohamad Mohebifar, Mohammad Ghahremanpour, Noel O'Boyle, Patrick Avery, Patrick Fuller, Paul van Maaren, Peng Bai, Philipp Thiel, Reinis Danne, Roger Sayle, Ronald Cohen, Scott McKechnie, Stefano Forli, Steve Roughley, Steffen Moeller, Tim Vandermeersch, Tomas Racek, Tomáš Trnka, Tor Colvin, Torsten Sachse, Yi-Shu Tu, Zhixiong Zhao

On a personal note, in particular I'd like to thank Stefano, Steve and Steffen for contributing; not to mention Matt, Matt, Mathias and Maciej and David, David, David and David. Without them, ordering the authors by first name would not have so richly paid off.

As Chris Morley has taken a step back from the project, I did the Windows release for the first time. One change is that now we have a 64-bit version along with the 32-bit. I've also moved to supporting "pip install openbabel" as the primary means of installing the Python bindings - there's already been some work on this for Mac/Linux by Matt Swain and others.

If you have any comments/criticisms or need help, the best place to go is to our mailing list ( or file bugs on Github (click the green "New Issue").

Sunday 21 August 2016

My new thing - providing manuscript images as PDFs

My latest oeuvre (on the topic of which fingerprint is best) was published by J. Cheminf. a few weeks ago. For the first time, instead of providing the images as PNGs, I submitted them as PDFs.

You see, John had worked me over. At the start, I thought of a PDF as the bad boy of the journal publishing scene, the hamburger and not the cow. What appears at first as text arranged into sentences, is just a haphazard arrangement of glyphs which through some trickery of the eye coalesces into scientific discourse. To generate a PDF myself would be to add to this madness.

But the thing is, when you strip a PDF down to its essentials, it's a relative of a PostScript file (details omitted due to ignorance), a vector graphics format. A more popular vector graphics format is an SVG file, but this is not supported by most publishers and so I spend a lot of time calculating DPI and inches per column and then generating a PNG. But they do often support PDFs, and these can readily be generated (with a bit of care) from many different programs. And all other things being equal, the best quality images will be generated by providing a vector graphics format as the publisher can resize it without any loss of quality.

Below I provide details about how I generated the PDFs, but let's look at their handling by Journal of Cheminformatics. This journal provides three views of the paper, a HTML page, an ePUB (which I won't discuss further) and a PDF. The HTML version contains embedded PNGs, they are a little small for my taste (maybe my fault - I don't know) but they are readable. So somehow they were able to convert the PDFs to images of whatever size they wanted for the HTML page. The PDF is a bit more interesting, as the images are now included as vector graphics. That is, if you keep zooming in on an image in the PDF, the lines remain sharp (in contrast to the PNGs in the HTML version).

So, in short, there seems little downside to providing PDFs, and much to gain. I'd be interested in hearing the viewpoint of anyone involved with the publishing side of things.

1. When using matplotlib to generate graphs, just give the file a .PDF extension, e.g. plt.savefig("overallperformance_%s.pdf" % benchmark, dpi=300)
2. When using Inkscape, save as PDF.
3. The hardest part was the chemical structures. I tried a variety of recipes with two different commercial programs. In the end, although ChemDraw's SVG export had the heteroatoms all over the place, the EMF export was openable by Inkscape and then I could save as PDF. (Apparently you can go direct to PDF from ChemDraw on a Mac.)

Tuesday 2 August 2016

Open Source Cheminformatics Toolkits - they keep on coming

It has been just over three years since I last surveyed the world of open source cheminformatics toolkits. So what's new?

  • Kekule.js - This is a JavaScript toolkit by Chen Jiang with an associated publication in JCIM this year. JavaScript cheminformatics is still in its infancy, and it's good a see a new player in this area. It's currently at version 0.7. Interestingly, it appears to include in _extras, a JavaScript version of Open Babel created using Emscripten.
  • OpenChemLib - This Java toolkit was developed by Thomas Sander at Actelion, and is the engine behind Data Warrior. It has since been converted to JavaScript (using GWT) by Luc Patiny and used, for example, in the impressive Wikipedia Chemical Structure Explorer.
  • Lilly MedChem Rules - Strictly speaking, this is not obviously a toolkit, but a commandline Ruby application. However, there is a C++ cheminf toolkit sitting behind that application, which was developed by Ian Watson at Eli Lilly.

Let me know if I've missed anything. For a more comprehensive overview of Open Source Molecular Modelling see the very recent paper by Pirhadi et al, which has an associated Github repo for keeping the information up to date.

Monday 18 July 2016

Your SD file will never be the same again...with ASCII depiction

A recent blog post showed how to depict a record in an SD file from within Vim. This of course is no help to those readers who have yet to successfully exit from a Vim session. But what if there was no need to create an ASCII depiction...because it was already there?

Yes, that's right, you have no clue what I'm talking about. What I'm saying is, why not bung in an ASCII depiction of the molecule in a property field? Well, apart from it being a bonkers idea that'll bloat the SD file to hitherto unimagined sizes (but think of the improved compression!), I can't think of any reason not to do this. It is my belief that this could finally unleash the untapped potential of ASCII depiction. And so I've added an option to the SD file writer in Open Babel to do exactly this.

John is fond of quoting Jurassic Park's "your scientists were so preoccupied with whether or not they could that they didn't stop to think if they should." I don't know why I just mentioned that.

Thursday 14 July 2016

Even more molecular depiction in Vim

Previous posts focused on popping up an ASCII depiction in Vim, but what doing something with the PNG output from Open Babel? Is there any way this can be viewed from Vim? Aged readers may remember a previous blog post of mine that looked into PNG to ASCII conversion, and that would be one approach.

A more direct approach is to use Vim's built-in capabilities to view bitmap files. Well, to be exact, a specific type of old-skool bitmap called an XPM. This has a very simple format and so Vim can show the entire contents of the file and use syntax-highlighting to visualise it. The alleged use for this is to enable people to directly edit bitmaps - that's right, someone out there is using Vim as a bitmap editor.

The only problem is generating the XPM file in the first place. We could probably do this directly from Open Babel as the format is fairly simple (e.g. as an option to the PNG writer) but in a certain light that just might ("just might", mind you) be viewed as feature creep. So instead, we can use ImageMagick's convert to do the job (which is available cross-platform). As before the required script is shown below.

Once I got it working I got to thinking, "well, that's a 79x79 bitmap it's showing, which is pretty small but what if I reduce the font size? aha! - then I can show an arbitrary sized bitmap and show much better detail". At which point I realised that in any environment where I can change the font size in Vim, I should probably just pop up an image viewer to display the PNG (left as an exercise for the reader).

In the end, are these depictions better than the ASCII ones? Meh, probably not - which I think was one of the conclusions also from my previous foray into PNG to ASCII conversion. Oh well.

noremap <silent> <leader>d :call SmiToPng(77)<CR>
function! SmiToPng(width)
  let smiles = expand("<cWORD>")
  " Strip quotation marks and commas
  let smiles = substitute(smiles, "[\"',]", "", "g")
  " Handle escaped backslashes, e.g. in C++ strings
  let smiles = substitute(smiles, "\\\\", "\\", "g")

  botright new
  setlocal buftype=nofile bufhidden=wipe nobuflisted noswapfile nowrap
  let fname = tempname().'.png'
  call system('obabel -:'.smiles. ' -O '.fname.' -d -xm -xp '. a:width)
  execute '$read ! "C:\Program Files (x86)\ImageMagick-6.5.8-Q16\convert.exe" '.fname.' xpm:-'
  setlocal filetype=xpm
  execute "normal! ggd/pixels\<cr>dd"
  silent! g/\v^"(\S)\1+",?/d
  execute "normal! Gdd"
  setlocal nomodifiable

The 7th Joint Sheffield Conference on Chemoinformatics - a real tweet

Just back from the Sheffield meeting, which takes place every 3rd year. Great meeting as ever - tribute was paid to John Holliday for the lion's share of the organisation. I got to meet some old friends and some new. For the first time at a meeting I decided to live-tweet the talks, joining such Twitter luminaries as Wendy Warr, Mireille Krier, Nathan Brown, and Jérémy Besnard.

It worked out quite well, and kept me completely engaged and awake. When you are aware that what you write is instantly publicly visible, you really make an effort to follow method descriptions etc so that you can adequately describe what's going on. To speed things up I decided to avoid editorialising; if the author described their method/result as the best thing since sliced bread, I dutifully reported a major advance in the field of baked goods even if I was thinking "bread is dead, baby, bread is dead". I have since learned that this is referred to as journalism.

With about 750 tweets covering 27 talks (I missed one due to flat batteries), I averaged about 27 tweets per talk, which may be just over one per slide. Afterwards I asked on Twitter whether people were annoyed or found my avalanche of tweets useful; based on 13 respondents, the results were 3 to 1 in favour of the tweets. If I do a repeat performance, next time I'll give a heads-up so people can mute me if uninterested.

I don't like my efforts disappearing into the void, so I've archived the complete list of #ShefChem16 tweets from all attendees and remotes that used that hashtag. You can relive the build-up, the talks themselves, the scones/doughnuts, the conference dinner, not to mention the queuing for taxis to the station. The talks and posters are being made available by-and-by on the conference website so you might find it interesting to look at the tweets in combination with the slides.

Notes on creating the download of tweets:
I tried to do this the hi-tech route via the Twitter API, but I think it's impossible if there were more than 100 tweets in a day. The API is geared towards streaming not historical analysis. In the end, I went to the Twitter website, searched for #ShefChem16, hit "All tweets", zoomed out and kept hitting Page Down until all the conference tweets were shown. Next I saved the generated HTML via Firebug (right click on the <body> element and choose "Copy HTML"), and extracted the tweets with the following script. Unfortunately, although it's possible to know to whom a reply has been made, the corresponding tweet id does not seem to be available so I didn't bother handling replies in a special way.

# vim: set fileencoding=utf-8 :
from bs4 import BeautifulSoup as bs

soup = bs(open("shefchem16.html"), "lxml")


data = []
name = None
for tag in soup.find_all("div"):
    if not tag.get("class"):
    if "stream-item-header" in tag.get("class"):
        name = tag.a['href'][1:]
    if "js-tweet-text-container" in tag.get("class"):
        tweet = tag.get_text().encode("utf-8").replace(" …", "")
        data.append("%10s %s" % (name[:HANDLESIZE], tweet.strip().replace("\n", "\n"+" "*(HANDLESIZE+1))))

with open("tmp.txt", "w") as f:
    for d in reversed(data):

Image credit: Egon Willighagen on Twitter

Tuesday 12 July 2016

Basic Graphviz input file generator in Python

Generating a Graphviz Dot input file is fairly simple, so I tend to code it up myself rather than use an existing library. However the need to normalise the node names complicates things a little bit. Here's a copy of the one I wrote today. Note that you may need to beef up the normalisation if your node names contain additional non-alphanumeric characters. (Note to future self: next time just autoincrement the node labels instead of trying to use a normalised form.)
class Graph:
    def __init__(self):
        self.lookup = {}
        self.edges = []
    def add_edge(self, x):
        if x in self.lookup: return
        nx = x
        for y in " -()[]":
            nx = nx.replace(y, "_") # normalise the label
        self.lookup[x] = nx
    def add(self, x, y):
        self.edges.append( (x, y) )
    def write(self):
        tmp = ["digraph {"]
        for x, y in self.edges:
            tmp.append( '%s -> %s;' % (self.lookup[x], self.lookup[y]))
        for x, y in self.lookup.iteritems():
            tmp.append('%s [label="%s"]' % (y, x))
        return "\n".join(tmp)

Friday 8 July 2016

Syntax highlighting for SMILES files in Vim

An image in a recent blogpost showed how the default syntax highlighting (*) in Vim looks when applied to SMILES files. A splash of pink here and there to brighten things up on a dull molecule. It's definitely doing something, but what it's doing I've no idea. So I've decided up with this I will not put any more.

Here's a simple syntax highlighting style for SMILES files. It supports three different colours for comments, the SMILES itself, and titles. If anyone has the necessary regexp-fu, it might be worth looking into highlighting different bracket levels in the SMILES string.

To use, save this file into vimfiles/syntax/smi.vim, and then use "set filetype=smi" to turn on the highlighting. To automatically do this for *.smi files, create a file vimfiles/ftdetect/smi.vim containing the line "autocmd BufRead,BufNewFile *.smi set filetype=smi".

I also looked into doing folding for such files, that is, to use any comment as a header and then fold the subsequent SMILES. This is a format I use for storing matched series, for example. It works (see foldexpr), but it's too slow for a large file, which defeats the purpose.
" Vim syntax file
" Language: SMILES strings
" Maintainer: Noel O'Boyle
" Latest Revision: 16 June 2016

if exists("b:current_syntax")

syn match smiTitle /\v\s+.*$/hs=s+1
syn match smiString "\v^\S*" nextgroup=smiTitle skipwhite
syn match smiComment "\v^#.*$"

let b:current_syntax = "smi"

hi def link smiComment     Comment
hi def link smiString      Identifier
hi def link smiTitle       Include

* As the Vim documentation says, apparently it's actually lexical highlighting, but everyone calls it syntax highlighting so...

Thursday 16 June 2016

More molecular depiction in Vim

The reception for my previous blog post on ASCII depiction in Vim was somewhat muted even, dare I say, lukewarm. Buoyed by this response, I updated that script to handle quoted SMILES strings (as found in my Python scripts, for example). And here I provide similar depiction support for SDF files ("\a" and "\A" for small and big depictions).

Disclaimer: Due to my use of complex regular expressions, this script is not suitable for use by Andrew Dalke, nor for any SDF file provided by same. If this script is used on a SDF file provided by Andrew Dalke, it may result in the summoning of Cthulhu. You have been warned.

function! SDFToAscii(width)
  let prev = search("$$$$", "nbW") + 1
  let end = search("$$$$", "nW")
  if end == 0
    let end = line('$')
  silent execute prev.",".end."yank"
  let output = system("obabel -isdf -oascii -d -xa 2.0 -xw ".a:width, @")
  botright new
  setlocal buftype=nofile bufhidden=wipe nobuflisted noswapfile nowrap
  put =output
  setlocal nomodifiable
noremap <silent> <leader>a :call SDFToAscii(79)<CR>
noremap <silent> <leader>A :call SDFToAscii(189)<CR>

Monday 13 June 2016

SMILES depiction in Vim

I spend a lot of time scrolling through files of SMILES in Vim, and copying and pasting SMILES into viewers. This is now a thing of the past. I bring you (from the depths of time) molecule depiction within Vim.

Using the script below (paste it into your .vimrc), I just hit "\s" with the cursor on a SMILES string, and up pops its ASCII depiction (":q" to close). Need a bigger image? "\S" is your friend. This is my first Vim script ("and I hope it's your last" sez you) so if you have any improvements, let me know.

[Update 16/06/2014]: Script updated to handle quoted SMILES strings and those with escaped backslashes.
noremap <silent> <leader>s :call SmiToAscii(79)<CR>
noremap <silent> <leader>S :call SmiToAscii(189)<CR>
function! SmiToAscii(width)
  let smiles = expand("<cWORD>")
  " Strip quotation marks and commas
  let smiles = substitute(smiles, "[\"',]", "", "g")
  " Handle escaped backslashes, e.g. in C++ strings
  let smiles = substitute(smiles, "\\\\", "\\", "g")
  botright new
  setlocal buftype=nofile bufhidden=wipe nobuflisted noswapfile nowrap
  execute '$read ! obabel -:'.shellescape(smiles, 1). ' -o ascii -d -xa 2.0 -xw '. a:width
  setlocal nomodifiable

Saturday 21 May 2016

Open Babel presentation at MIOSS 2016

Thanks again to George Papadatos, Mark Forster, et al for the invitation to present at the 2016 reboot of MIOSS (Molecular Informatics in Open Source Software). You may remember the earlier one back in 2011.

I gave a presentation on the Open Babel project. While I have been largely quiescent over the past 2 years, there has been much happening besides. Rather than cover features, I tried to present some interesting stats or views of the overall project. For example, the background of some of the current developers, which file formats were recently added, and so forth. A few of my slides discuss industry involvement in the project - this is because the MIOSS workshop is 50% industry and 50% open source developers.

For info on other presentations from the meeting, see the Lanyrd page (I'll update here when the rest of the talks become available) or search #MIOSS2016 on Twitter.

Tuesday 19 April 2016

Static static code analyser

I've just been refactoring/rewriting some C++ code and noticed that I've left some static functions around that are unused. "static" in this context means functions that are only accessible within the same file (it's a good idea to put static in front of every function until the point you need to call it from some other file). However, if nothing in the same file calls them, then they contain dead code. Unfortunately neither cppcheck nor MSVC's static code analyser will flag up static functions that are not called, and so I've written my own Python script to do so. It's fairly basic but worked for me.

Update (19/04): Doh! Adding --enable=all to cppcheck will find most (but not all) of these cases.

import os
import sys
import glob

def FindStaticFunctions(fname):
    fns = []
    for line in open(fname):
        tmp = line.strip()
        if not tmp.startswith("static"): continue
        if tmp.endswith(";"): continue
        broken = tmp.split()
        for i, x in enumerate(broken):
            if "(" in x:
                name = x[:x.find("(")].replace("*", "")
            elif i < len(broken)-1 and broken[i+1][0]=='(':
                fns.append(x.replace("*", ""))

    return fns

def FindMissingReferences(fname, staticfns):
    lines = open(fname).readlines()
    for staticfn in staticfns:
        refs = 0
        for line in lines:
            if line.find(staticfn) >= 0:
                refs += 1
        if refs <= 1:
            print "(%s) %s: %d refs" % (os.path.basename(fname), staticfn, refs)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("You need to specify a directory")
    dirname = sys.argv[1]
    if not os.path.isdir(dirname):
        sys.exit("%s is not a directory name" % dirname)

    for cppfile in glob.glob(os.path.join(dirname, "*.cpp")):
        staticfns = FindStaticFunctions(cppfile)
        FindMissingReferences(cppfile, staticfns)

Wednesday 13 January 2016

Do BEDROC results correlate with AUC?

Following up on a post over at NM where I was perhaps overly harsh on AUC, it is natural to instead consider early recognition metrics such as EF, BEDROC and so forth as potentially better measures of performance in virtual screens. Here I want to pour cold water on the idea that results for AUC and BEDROC are highly correlated and so we might as well just use the more straightfoward of the two (i.e. AUC).

In a 2010 review of current trends by Geppert, Vogt and Bajorath [1], they interpret the "What do we know and when do we know it?" paper by Nicholls [2] as presenting "evidence for a strong correlation between AUC and BEDROC [3], suggesting AUC as a sufficient measure for virtual screening performance." Now, Nicholls doesn't quite say that but I can see that Figure 8 and 9 in that paper could be interpreted in that way. For the Warren study [4] data shown, the Pearson R2 between the two is 0.901. It is unfortunate that the correlation is only shown for the averaged data (Figure 9) - it would be nice to see the Spearman correlation for Figure 8. But either way, it's clear that they are highly correlated.

So a large AUC tends to imply a large BEDROC (and v.v), and so forth. But the thing is, that's not quite what we are interested in. The real question is whether the relative ordering of methods obtained with AUC are the same as those obtained with BEDROC - if it is, then we might as well use AUC. If not, we'll have to exercise those little grey cells (I've been reading a lot of Agatha Christie recently) and figure out which is best. In other words, it could be that methods A and B both have a high AUC, and so have a high BEDROC, but does A>B in AUC space imply A>B in BEDROC space?

As it happens I have the results to hand (as one does) for AUC (*) and BEDROC(20) for the Riniker and Landrum dataset [5] (88 proteins, 28 fingerprints, 50 repetitions). This is a fingerprint-based virtual screen as opposed to a docking study, but the conclusions should be the same. The 4400 Spearman correlations between the AUC and BEDROC results are shown here as a histogram (bins of 0.05):

The median value of the Spearman correlation is 0.713. This is not that high (remember that the square is 0.51) indicating that while there is a moderate correlate between the orderings obtained by BEDROC and AUC, they are not highly correlated.

One more nail in the AUC's coffin I hope. Grab your hammer and join me!

[1] Current Trends in Ligand-Based Virtual Screening: Molecular Representations, Data Mining Methods, New Application Areas, and Performance Evaluation. Hanna Geppert, Martin Vogt, and Jürgen Bajorath. J. Chem. Inf. Model. 2010, 50, 205–216.
[2] What do we know and when do we know it? Anthony Nicholls. J. Comput. Aided Mol. Des. 2008, 22, 239–255.
[3] Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem. Jean-François Truchon and Christopher I. Bayly. J. Chem. Inf. Model. 2007, 47, 488-508.
[4] A critical assessment of docking programs and scoring functions. Warren GL, Andrews CW, Capelli AM, Clarke B, LaLonde J, Lambert MH, Lindvall M, Nevins N, Semus SF, Senger S, Tedesco G, Wall ID, Woolven JM, Peishoff CE, Head MS. J. Med. Chem. 2006, 49, 5912-5931.
[5] Open-source platform to benchmark fingerprints for ligand-based virtual screening. Sereina Riniker and Gregory A. Landrum. J. Cheminf. 2013, 5, 26.

* I'm lazy so I just used the mean rank of the actives instead of integrating ROC curves. You get an identical ordering (see Eq. 12 in BEDROC paper [3] for the relationship). And yes, I did check.