Noel O'Blog: 2009

Friday, 18 December 2009

Chemical Identifier Resolver + TwirlyMol = Easily add molecules to a webpage II

Noticed anything different about the TwirlyMols over at Carbon-based Curiosities, Chemical Quantum Images, or Node in the Noosphere? (Here's a hint - perhaps they look more awesome?)

Yes, indeed, TwirlyMol now uses proper spherical shading for the atoms. In addition, if you are using any of the 5 or so browsers that are not Internet Explorer (shame on you, IE), you will see that distant atoms are now wreathed in a fog (really it's a representation of the uniform electron gas). Through the magic of the internets, any TwirlyMol using the Chemical Identifier Resolver will suddenly look a whole lot better.

We've also addressed some minor bugs: shadows in wrong place, loading issues with multiple molecules, slow loading, issues with Opera.

Markus has also been doing some work on caching. To take advantage of this new work (and shave several seconds off load times), use the following new HTML:

<div id="DIVNAME" height="200" width="200"></div>

<script src="http://cactus.nci.nih.gov/chemical/structure/
             CHEMICAL_IDENTIFIER/twirl_cached/DIVNAME"
        type="text/javascript"></script>

Here's an example of it in action, where the CHEMICAL_IDENTIFIER is restasis and the DIVNAME is foggymol:

Wednesday, 16 December 2009

Cheminformatics Tutorial using Python and Silverlight II

Part I

I've updated the Interactive Cheminformatics Tutorial by adding sections on descriptors and molecular fingerprints, and removing the original Python tutorial (following Michael Foord's suggestion). In summary, the tutorial could now form the basis of a course on cheminformatics.

Currently the tutorial focuses on the possibilities of the Cinfony API but a possible improvement would be to include more didactic material, for example describing SMILES strings, their syntax and so forth.

One of the nice things about the Chemical Identifier Resolver is that it allows you to reference molecules by name which really brings a bit more life to the examples. Using this I've tried to think up interesting examples (e.g. "What's the Tanimoto similarity of aspirin to Dr. Scholl's Wart Remover?"). Additional suggestions are very welcome.
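For anyone who hasn't met the Tanimoto similarity before, here is a minimal sketch of the calculation itself. It treats a fingerprint as the set of its "on" bits and divides the number of shared bits by the number of bits set in either fingerprint. The bit numbers below are made up for illustration and are not real fingerprints of any molecule, and the function name is just a placeholder rather than part of the Cinfony API.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints: returning 0.0 is a convention chosen for this sketch
        return 0.0
    # Shared bits divided by bits set in either fingerprint
    return len(a & b) / len(union)

# Made-up on-bits, purely illustrative
fp1 = [1, 2, 3]
fp2 = [2, 3, 4]
print(tanimoto(fp1, fp2))  # 0.5
```

A value of 1.0 means the two sets of bits are identical, and 0.0 means they share nothing.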

Friday, 20 November 2009

Chemical Identifier Resolver + TwirlyMol = Easily add molecules to a webpage

Markus Sitzmann of the NCI/CADD team has been busy. He has combined the Chemical Identifier Resolver with TwirlyMol to enable you to convert any chemical identifier to a 3D model that you can interact with in your webpage. I'm very excited about this as I think that people will find this very useful.

Just put this in your webpage or blog post (note however that Blogger preview does not show the Twirlymol):

<div id="DIVNAME" height="200" width="200"></div>

<script src="http://cactus.nci.nih.gov/chemical/structure/
             CHEMICAL_IDENTIFIER/twirl?div_id=DIVNAME"
        type="text/javascript"></script>

Replace DIVNAME with a unique name, and replace CHEMICAL_IDENTIFIER with any of the chemical identifiers accepted by the Chemical Structure Resolver; for example, a common name for a chemical, an InChI, or a SMILES string. More details over at the /Chemical/Structure blog. For now, let's just see it in action.

Replacing DIVNAME with 'buckyball' and CHEMICAL_IDENTIFIER with 'buckminsterfullerene' gives the following (go on, give it a twirl! - right mouse button to zoom in):

That was too easy - let's take one of Henry Rzepa's crazy Möbius aromatic molecules. Steven Bachrach has written a review of some and very thoughtfully has included the InChIs. Replacing DIVNAME with 'crazymolecule' and CHEMICAL_IDENTIFIER with 'InChI=1/C14H14/c1-2-6-10-14-12-8-4-3-7-11-13(14)9-5-1/h1-14H/b2-1-,4-3-,9-5-,10-6-,11-7-,12-8-/t13-,14+' we have:

3D Nanoputians anyone? Here's the SMILES for a NanoKid: "c1(C2OCCO2)c(C#CC(C)(C)C)cc(C#Cc2cc(C#CCCC)cc(C#CCCC)c2)c(C#CC(C)(C)C)c1". Happy twirling.
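If you have several molecules to embed, the HTML above could be generated with a few lines of Python. The following is only a sketch: the function name and the `size` parameter are made up for illustration. It percent-encodes the identifier because characters such as '#' (common in SMILES strings with triple bonds) have a special meaning in URLs. It assumes that the resolver decodes percent-escapes in the usual way, and it leaves '/' alone so that the layers of an InChI stay as they appear in the example above.

```python
from urllib.parse import quote

# Base URL taken from the HTML snippet in this post
BASE = "http://cactus.nci.nih.gov/chemical/structure"

def twirl_snippet(identifier, div_id, size=200):
    """Return the div + script HTML needed for one TwirlyMol."""
    # Percent-encode the identifier; '/' is kept as-is (an assumption, see above)
    encoded = quote(identifier, safe="/")
    src = "%s/%s/twirl?div_id=%s" % (BASE, encoded, quote(div_id))
    return ('<div id="%s" height="%d" width="%d"></div>\n'
            '<script src="%s" type="text/javascript"></script>'
            % (div_id, size, size, src))

print(twirl_snippet("buckminsterfullerene", "buckyball"))
```

Remember that each DIVNAME on a page needs to be unique, so when generating several snippets it is worth deriving the div name from a counter or from the molecule name.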

Wednesday, 18 November 2009

My beaker overfloweth - New chemistry Q&A sites

Stackoverflow is one of the best Question & Answer websites for computer programming. It uses a carefully designed social model to build a community where people compete to give the best answer to questions in order to be rewarded with a better response to their own questions.

Recently, the people behind Stackoverflow have opened up the software to allow people to set up their own websites...but just for a beta period (money will then be required). Several chemistry 'stackoverflows' have already been set up. Here are a few I've heard about:

BlueObelisk: Questions about cheminformatics and computational chemistry leaning towards the open source or open data side of things. Update (07/10/2010): This website has moved to Shapado.
Chempedia Lab: Questions about experimental chemistry.
Chemistry: General chemistry (?)

These sites are all new so you won't find many questions there already. But give them a go. Go there and ask a question or two (even if you already know the answer), answer a question or two, and check back in a day to see what happens. You can log in with your Gmail address (among others) but do note that questions are not anonymous.

Such websites require a community. Some will gain such a community and flourish, others won't and will fail. In the meanwhile, go get some answers.

Image credit: Question Everything (Nullius in verba) Take nobody's word for it by Duncan Hull (CC BY 2.0)

Monday, 16 November 2009

Cheminformatics Tutorial using Python and Silverlight

Recently I introduced Webel, a Python cheminformatics module that runs entirely on web services. One of the advantages of such a module is that it can be used in places where it is difficult to install a traditional cheminformatics toolkit. Like in your browser.

It turns out that Silverlight ("Microsoft's Flash") provides a Python interpreter that runs in your browser. Using this, Michael Foord (of IronPython in Action) has developed an interactive Python interpreter which you can use at trypython.org. It consists of two windows, one with a Python tutorial and the other with a Python prompt so that you can work through the tutorial.

After a little work, I present Try Python...with Cheminformatics. This adds Webel as well as a short tutorial that introduces many of its features. With a few more tutorials that cover SMILES, InChI and so on in more detail, this could be useful for teaching purposes as well as bridging the gap to having students develop their own Python scripts that use the CDK, OpenBabel or the RDKit.

Here is the obligatory screenshot (click for a larger version):

Tuesday, 10 November 2009

In memory of Warren DeLano

Many of you will have heard the sad news about Warren. He passed away suddenly on November 3rd. His contribution to science through the development of PyMol is known to all of us.

I only met him once, but I was struck both by his enthusiasm for new ideas and his belief in open source software. He believed that such software was an enabler of science and its development should be supported.

His family are collecting memories and photographs of Warren, and have set up a fund in his memory to support achievements in Open Source scientific software:

Through PyMol and Open Source software, Warren DeLano exhibited the genius and generosity of science at its best. But he still had so much to give. In memory of his passing, we are creating a foundation to ensure that achievements in the field of Open Source scientific software are encouraged, recognized and rewarded. Warren was committed to the development of Open Source programs and how they would benefit humanity by allowing science to flourish in a collaborative environment. Your contribution will help us keep Warren's commitment alive.

Ní bheidh a leithéid arís ann.

Thursday, 5 November 2009

Introducing Webel - A cheminformatics toolkit built solely on webservices

I'd like to introduce a new Cinfony module, Webel. Like the other components of Cinfony, Webel implements a standard API (see for example, the Pybel API) that covers a large proportion of common cheminformatics operations including reading/writing SMILES strings and InChIs, calculation of molecular weight and formula, molecular fingerprints, SMARTS searching, and descriptor calculation.

However, unlike the other components, Webel runs entirely off web services. All cheminformatics analysis is carried out using Rajarshi's REST services (which use the CDK and are hosted at Uppsala) and the NIH's Chemical Identifier Resolver (by Markus Sitzmann, and which uses Cactvs for much of its backend).
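To give a feel for what "running off web services" means in practice, here is a rough sketch of the kind of HTTP request involved when talking to the Chemical Identifier Resolver. This is not Webel's actual code. The function names are made up, and the representation names used ("smiles", "names") are the ones I understand the resolver to accept, so treat them as assumptions.

```python
from urllib.parse import quote
from urllib.request import urlopen

RESOLVER = "http://cactus.nci.nih.gov/chemical/structure"

def resolver_url(identifier, representation):
    """Build the URL that asks the resolver for one representation."""
    return "%s/%s/%s" % (RESOLVER, quote(identifier), representation)

def resolve(identifier, representation):
    """Fetch the result as text (needs network access; not called here)."""
    with urlopen(resolver_url(identifier, representation)) as response:
        return response.read().decode("utf-8")

def split_lines(text):
    """Turn a multi-line plain-text response into a list of non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

print(resolver_url("aspirin", "smiles"))
```

Calling something like `split_lines(resolve("aspirin", "names"))` would then be one way to get a list of alternate names, assuming the service returns one name per line.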

To use Webel, all you need to do is download webel.py, and type "import webel" at a Python prompt (see example code below - it's basically the same as using Pybel if you're familiar with that).

So what are the advantages of running off webservices? First, as should be clear, there is the ease of installation. This means that Webel could easily be bundled in with some other software to provide some useful functionality. Second, Webel can still be used in environments where installation of a cheminformatics toolkit is simply not possible (more on this next week!). Third, webservices may provide additional functionality not available elsewhere (e.g. the Chemical Resolver provides name-to-structure conversion as well as InChIKey resolution). Fourth, webservices are accessed across HTTP rather than through some type of language binding. As a result, Webel works equally well from CPython, Jython or IronPython. And finally, it's just a cool idea. :-)

If you can think of any other advantages or potential applications, I'd be interested to hear them. In the meanwhile, here's some code that calculates the molecular weight of aspirin, its LogP, its InChI, gives alternate names for aspirin, and creates the PNG above:


import webel

mol = webel.readstring("name", "aspirin")
print "The molecular weight is %.1f" % mol.molwt
print "The InChI is %s" % mol.write("inchi")
print "LogP values are: %s" % mol.calcdesc(["ALOGPDescriptor"])
print "Aspirin is also known as: %s" % mol.write("names")
mol.draw(filename="aspirin.png", show=False)

...which gives...

C:\Tools\cinfony\trunk\cinfony>python example.py
The molecular weight is 180.2
The InChI is InChI=1/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)
/f/h11H AuxInfo=1/1/N:5,3,4,1,2,12,6,7,11,9,8,10,13/E:(11,12)/F:5,3,4,1,2,12,6,7
,11,9,10,8,13/rA:21CCCCCCCOOOCCOHHHHHHHH/rB:;a1;a2a3;;a1;a2a6;;;;s6d8s10;s5d9;s7
s12;s10;s1;s2;s3;s4;s5;s5;s5;/rC:6.3301,-.56,0;4.5981,-1.56,0;6.3301,-1.56,0;5.4
641,-2.06,0;2,-.06,0;5.4641,-.06,0;4.5981,-.56,0;4.5981,1.44,0;2.866,-1.56,0;6.3
301,1.44,0;5.4641,.94,0;2.866,-.56,0;3.7321,-.06,0;6.3301,2.06,0;6.8671,-.25,0;4
.0611,-1.87,0;6.8671,-1.87,0;5.4641,-2.68,0;2.31,.4769,0;1.4631,.25,0;1.69,-.596
9,0;
LogP values are: {'ALOGPDescriptor_ALogp2': 0.10304100000000004, 'ALOGPDescripto
r_AMR': 18.935400000000001}
Aspirin is also known as: ['2-Acetoxybenzoic acid', '50-78-2', '2-Acetoxybenzene
carboxylic acid', 'Acetylsalicylate', 'Acetylsalicylic acid', 'Aspirin', ...
'Claradin', 'Clariprin', 'Colfarit', 'Decaten', 'Dolean pH 8', ...
'Acetylsalicylsaure [German]', 'Acide acetylsalicylique [French]', ...
'A6810_SIGMA', 'Spectrum5_000740', 'CHEBI:15365',...]

Wednesday, 4 November 2009

In I go with Indigo, the new open source cheminformatics toolkit

SciTouch LLC have just announced the release of a dual licensed (GPL or commercial) cheminformatics toolkit, Indigo. See Depth-First and Rajarshi for some initial reactions.

It's a C++ toolkit, and right now what seems to be available are several .NET wrappers that enable specific uses as well as an Oracle cartridge. Access from Python, etc. is on the to-do list, and hopefully this will also give access to the core Molecule object so that all aspects of the toolkit will be available.

Charlie Zhu has already written an example application using C#. Rather than wait for CPython bindings, I installed IronPython and used it to access Indigo's .NET libraries (Dingo, in this case) to do a SMILES to png conversion:

C:\Tools\Indigo\dingonet-1.0-3669>"C:\Program Files\IronPython 2.6\ipy.exe"
IronPython 2.6 (2.6.10920.0) on .NET 2.0.50727.3603
Type "help", "copyright", "credits" or "license" for more information.
>>> import clr
>>> clr.AddReference("dingonet")
>>> import indigo
>>> dir(indigo)
['Dingo', 'DingoException']
>>> dingo = indigo.Dingo()
>>> dir(dingo)
['Dispose', 'Equals', ......, 'getResult', 'isEmpty', 'loadMolecule', 'loadMolec
uleFromFile', 'loadReaction', 'loadReactionFromFile', 'render', 'renderToBitmap'
, 'renderToMetafile', 'setAAMColor', 'setBackgroundColor', 'setBondLength', 'set
Coloring', 'setHighlightBold', 'setHighlightColor', 'setImageSize', 'setImplicit
HydrogenMode', 'setLabelMode', 'setLoadHighlighting', 'setLogPath', 'setMarginFa
ctor', 'setOutputFile', 'setOutputFormat', 'setOutputHDC', 'setOutputPrintingHDC
', 'setRelativeThickness', 'setStereoOldStyle']
>>> dingo.loadMolecule("CC(=O)Cl")
>>> dingo.setOutputFile("test.png")
>>> dingo.setOutputFormat("png")
>>> dingo.render()
>>> ^Z

Sunday, 25 October 2009

How to correct 3D coordinates at stereocenters

Given a set of 3D coordinates for a molecule, and whether the stereochemistry at particular atoms is correct or not, how would you fix any errors?

This is a problem that I've been working on for the 3D builder in OpenBabel. Given a connection table (e.g. a SMILES string), OpenBabel builds up the structure of a molecule using some basic geometric rules as well as ring templates (SMARTS strings for rings, and associated coordinates). Afterwards, the stereochemistry is corrected where necessary.

Well, for any tetrahedral center with at least two non-ring bonds, those two bonds can be swapped to correct stereochemistry. For the special case of a spiro atom (an atom with four ring bonds which, if broken, split the molecule into three fragments), one of the rings involved can be rotated 180 degrees to correct the stereochemistry.

How about for a stereocenter with three ring bonds? This is typically found where two rings join along an edge, or in bridged ring systems. Well, that's a bit tricky as you can't swap bonds around. But what you can do is invert the coordinates of the entire ring system. Of course, the ring system may contain more than one stereocenter (actually, I think such a ring system is guaranteed to contain at least one other stereocenter) in which case it will not always be possible to satisfy the stereochemistry at all centers simultaneously.
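One way to convince yourself that both fixes (swapping two substituents, or inverting the coordinates of a whole ring system) flip the handedness at a center is to look at the sign of the volume spanned by its four neighbours. The sketch below is not OpenBabel's implementation; it is a small standalone illustration using idealised, made-up tetrahedral coordinates, and the function names are placeholders.

```python
def signed_volume(a, b, c, d):
    """Six times the signed volume of the tetrahedron a, b, c, d."""
    # Edge vectors from d to the other three points
    u = [a[i] - d[i] for i in range(3)]
    v = [b[i] - d[i] for i in range(3)]
    w = [c[i] - d[i] for i in range(3)]
    # Scalar triple product u . (v x w); its sign encodes the handedness
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))

def mirror(coords):
    """Reflect every point through the yz plane (negate x)."""
    return [(-x, y, z) for (x, y, z) in coords]

# Four neighbours of a center at the origin (idealised tetrahedron)
nbrs = [(1.0, 1.0, 1.0), (1.0, -1.0, -1.0),
        (-1.0, 1.0, -1.0), (-1.0, -1.0, 1.0)]

original = signed_volume(*nbrs)

# Fix 1: exchange the positions of two substituents
swapped = [nbrs[1], nbrs[0], nbrs[2], nbrs[3]]
after_swap = signed_volume(*swapped)

# Fix 2: mirror all coordinates (as when inverting a whole ring system)
after_mirror = signed_volume(*mirror(nbrs))

print(original > 0, after_swap < 0, after_mirror < 0)
```

Note that mirroring changes the sign at every stereocenter whose neighbours are included, which is exactly why inverting a ring system cannot always satisfy all of its centers at once.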

This is as far as I've currently gotten.

The next step is to include some stereochemistry information in the ring templates themselves. That is, to include different versions of the ring templates for the various stereochemistry arrangements. This should increase the coverage of ring systems that OpenBabel can successfully handle.

Of course, there is a limit to how far one can get with ring templates, but it'll be interesting to find out where that limit is.

Image credit: nickzeff

Avogadro is 1.0 today

The 1.0 release of Avogadro has just come out as announced by Geoff, reported by Depth-First, blogged by Marcus (check out the video), and interviewed and microblogged by SourceForge.

To quote avogadro.sf.net:

Avogadro is an advanced molecular editor designed for cross-platform use in computational chemistry, molecular modeling, bioinformatics, materials science, and related areas. It offers flexible rendering and a powerful plugin architecture.

Why am I interested in this? Well, firstly it's useful for comp chem, an area in which I still dabble a bit. Secondly, it's going to become more useful for cheminformatics with time (will need to add handling for multi-mol sdf files first). And thirdly, many new features of OpenBabel have been added to address requirements for Avogadro such as 3D conformer generation from SMILES and forcefields, both of which I now use regularly.

Well done, and best of luck to all involved. And what better release date? 6:02 on the 23rd of the 10th.

Wednesday, 21 October 2009

Really really final deadline extended to 23rd Oct for CINF symposia

It seems that all of the CINF symposia have had their final deadlines extended to this Friday, 23rd October. So it's your last chance (again) to send in an abstract to the Visual Analysis of Chemical Data symposium, or any of the other symposia listed on the CINF website. For anything that doesn't fit a specific symposium, there's General Papers (I've one in here myself). The COMP division also has several symposia of interest to cheminformaticians (I'd link to the list of symposia but their website doesn't list them).

Tuesday, 13 October 2009

One week left to submit - Symposium on Visual Analysis of Chemical Data (ACS Spring 2010)

Final Call for Papers:
Visual Analysis of Chemical Data
239th ACS National Meeting
San Francisco, March 21-25, 2010
CINF Division

Update(20/Oct): Closing date now 23rd Oct.

Dear Colleagues,

The submission deadline of 23rd Oct is approaching for an upcoming symposium focusing on innovative methods for visual representation and analysis of chemical data. Just as Edward Tufte has championed maximizing clarity and information content in statistical graphics, there is a need for methods to display chemical information that will maximize understanding, and allow rapid analysis and decision making.

We invite you to submit contributions that address various aspects of visualization of chemical data (such as structures, SAR data, literature, patents) including, but not limited to, the following topics:

With an ever increasing pool of descriptors, along with new and more sophisticated machine learning methods, QSAR models are becoming more difficult to interpret. How can information on model reliability, the presence of activity cliffs, and the range of applicability of a model and other relevant model properties be easily depicted?
Recently, 3D virtual worlds such as Second Life have presented new opportunities and challenges for the representation of chemical data. What is the potential of such a medium in education and communicating with the chemistry community?
Social software allows for rapid and convenient sharing of chemical data. Examples include Google Spreadsheets, ManyEyes, DabbleDB, and wikis, including Wikipedia. What are the implications for chemical research and education?
The visualization of the contents of large chemical datasets presents particular problems. How can an overview of the dataset be visualized so that it presents both the nature of the contents as well as the degree of diversity and similarity within the dataset? How can different datasets be visually compared?
Depicting 3D chemical information in 2D involves a loss of information. However, innovative 2D visualization methods can restore the most relevant information.
Chemical information comprises a diverse array of data types including chemical structures and diagrams (2D and 3D), associated assay results, conformations, QSAR models and their predictions. The visualization and integration of all these data into a single interface that aids interpretation and analysis is a continuing challenge.

We would also like to point out that sponsorship opportunities are available.

The on-line abstract submission system (PACS) will be open for submissions until 23rd October.

Please contact Andrew, Jean-Claude or myself if you have any questions.

Yours sincerely,
Noel O'Boyle

On behalf of the symposium organizers:

Dr. Jean-Claude Bradley,
Drexel University, PA
bradlejc@drexel.edu

Dr. Andrew Lang,
Oral Roberts University, OK
alang@oru.edu

Dr. Noel O’Boyle,
University College Cork, Ireland
n.oboyle@ucc.ie

Image credit: process/rum do/radial by Henry Cooke (CC BY-SA-NC 2.0)

Thursday, 8 October 2009

Browser-based chemistry is here - its name is ChemDoodle Web Components

So...what's to say? Just check out ChemDoodle Web Components. It's Javascript. It's Open Source. It's running in your browser. It's doing funky chemistry.

Don't think it's going to affect you? Hear that noise? That's a paradigm shift.

Let's chart a brief timeline of what has led up to this:

1995 Nov - JavaScript (then LiveScript) first released
2008 Jul - Rich surveys all prior work at the intersection of Javascript and Chemistry, and identifies where Javascript can make the most impact on the web
2008 Oct - blahbleh implements a Javascript 3D molecular editor and viewer, molecools
2008 Dec - Duan Lian uses GWT to translate Rich Apodaca's lightweight Java cheminformatics toolkit, MX, into Javascript (website, demo)
2009 Jan - I develop a Javascript 3D molecule viewer, TwirlyMol
2009 Jan-Feb - Duan Lian releases a preview of the world's first Javascript molecular editor, jsMolEditor
2009 Aug - Kevin Theisen releases ChemDoodle Web Components

Sunday, 4 October 2009

Keep your publication list up to date with Javascript and Google Spreadsheet

Adding a new publication to a HTML page is a fiddly business, especially if you want to add some markup or links. This might explain why there are so many websites of scientists whose last publication appears to be four years ago. If only adding a new publication were as easy as, oh, let's say...as easy as adding a row to a spreadsheet.

Well, you're in luck. The following procedure makes it as easy as just that. You can maintain the same list of publications on several web sites all of which will automatically be kept up-to-date. If you're familiar with Javascript and CSS, you can also easily change the markup used and its appearance. The result should look something like the following image:

Here's how it's done:
(1) Create a google spreadsheet, and use the same column names as shown in this spreadsheet.
(2) Add some information on your papers. Again, see the example spreadsheet for the format (note especially the author list format).
(3) Click on Share/Publish as Web Page, and make note of the key (i.e. the text between "key=" and "&single").
(4) Download addpapers.js, edit the line 'me = "N. M. O'Boyle";', and the line with the email address baoilleach@gmail.com, and put it in the same directory as a HTML page, papers.html (for example).
(5) Edit papers.html to load addpapers.js in its HEAD ("<script type='text/javascript' src='addpapers.js'></script>")
(6) Download publishious.css, and put it in the same directory as papers.html.
(7) Edit papers.html to apply publishious.css in its HEAD ("<link rel='stylesheet' media='all' type='text/css' href='publishious.css' />").
(8) Add the following to papers.html after replacing MYKEY by the value of the key for your spreadsheet:

<div id="paperentries"></div>
<script src="http://spreadsheets.google.com/feeds/list/MYKEY/od6/public/values?alt=json-in-script&callback=handlejson" 
type="text/javascript"></script>

Hopefully that works. If it doesn't, check your browser's error console (in Firefox Tools/Error console) for some idea of the problem.

It's probably not a good idea to rely totally on Google spreadsheets, so what I do is view the generated HTML code using the Web Developer plugin and paste it into the HTML page as the content of the paperentries div. That way, even if Google spreadsheets goes down (or changes its API), a couple of papers will still appear.
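If you would rather generate that static fallback with a script than copy it out of the browser, something along the following lines might do. It is only a sketch: it assumes the JSON list feed wraps each column in a "gsx$" key holding a "$t" value, the column names (authors, title, journal, year) are hypothetical rather than those of the example spreadsheet, the markup is not what addpapers.js produces, and the sample entry is a placeholder rather than a real paper.

```python
import html

def entry_to_row(entry):
    """Flatten one feed entry: keep the 'gsx$' fields and drop the prefix."""
    return {key[4:]: value["$t"]
            for key, value in entry.items() if key.startswith("gsx$")}

def render_paper(row, me):
    """Return one <li> for a paper, with the page owner's name in bold."""
    authors = html.escape(row["authors"])
    escaped_me = html.escape(me)
    authors = authors.replace(escaped_me, "<b>%s</b>" % escaped_me)
    return "<li>%s <i>%s</i> %s, %s.</li>" % (
        authors, html.escape(row["title"]),
        html.escape(row["journal"]), html.escape(row["year"]))

def render_list(feed, me):
    """Render every entry of the feed as an HTML list."""
    rows = [entry_to_row(e) for e in feed["feed"]["entry"]]
    return "<ul>\n%s\n</ul>" % "\n".join(render_paper(r, me) for r in rows)

# Placeholder data in the assumed feed layout
sample = {"feed": {"entry": [{
    "gsx$authors": {"$t": "A. N. Other, N. M. O'Boyle"},
    "gsx$title": {"$t": "An example paper"},
    "gsx$journal": {"$t": "J. Example"},
    "gsx$year": {"$t": "2009"},
}]}}

print(render_list(sample, "N. M. O'Boyle"))
```

The output can then be pasted into the paperentries div in the same way as the HTML copied from the browser.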

Feel free to adapt this code for your own use, although I'd appreciate it if you could add a comment below with a link to the resulting webpage.

Wednesday, 23 September 2009

ANN: Cinfony 0.9 released

The most anticipated software of the year, the one we've all been waiting for, has just been released. But enough about Google Wave - Cinfony 0.9 is now available.

Cinfony allows you to access RDKit, the CDK and OpenBabel from Python (and Jython) all with the same API.

What this all means is that you can quickly get up and running testing out any of these three toolkits. The Cinfony API makes it easy to read in a file and carry out some basic manipulations, and for anything more complicated you still have access to the underlying toolkit. For a complete description, see the docs.

This release supports OpenBabel 2.2.3, CDK 1.2.3 and RDKit Q2_2009.

Wednesday, 9 September 2009

Running the Windows OpenBabel GUI under Linux on the Windows desktop - Need some Wine?

A friend of mine (Ed Cannon) recently showed me the OpenBabel GUI running on Linux. The surprising thing about this is that OpenBabel currently does not have a Linux version of the GUI (Update 16/Sept/09: Now it does). He was running our Windows release on Linux using Wine, a Windows compatibility layer ("sudo apt-get install wine"). Cool, I thought - I didn't realise that that would even work. Cue blog post.

To get a screenshot I needed Linux. As I described earlier, it's easy (and free) to run Linux on Windows using VMWare Player. This time I installed an Ubuntu 9.04 image. And then (after running "sudo vmware-config-tools.pl") I discovered a new feature called Unity mode. This allows you to use the virtual machine to start Linux applications that appear directly on your Windows desktop (rather than enclosed in a Linux desktop). So I decided to get a screenshot of the Windows OpenBabel GUI running under Wine/Linux together with it running natively on Windows.

The only catch is that I wasn't able to screen capture the Linux application in Windows so in the end, despite all my hard work, I had to Gimp two images together. The resulting image is accurate though, and you can click for a larger version.

Monday, 7 September 2009

Moving to pastures new, but still in the same field

Friday was my last day at the Cambridge Crystallographic Data Centre (CCDC). I've been a postdoc there for the last three years working on the GOLD protein-ligand docking software, specifically on scoring function improvements for virtual screening. It has been a great learning experience, and I've enjoyed working there a lot.

Recently I was awarded my first grant, a career development fellowship from an Irish funding agency, the Health Research Board. From today I will be an HRB Postdoctoral Fellow based in University College Cork (UCC) working on pharmacophore software based on OpenBabel.

I am very grateful to the HRB for giving me the chance to do this, and I'm really looking forward to getting started on this project. It's early days yet, but I am very much interested in collaborations with experimental drug design groups especially those working in the absence of protein structural data (for example, GPCRs). Feel free to drop me an email at baoilleach@gmail.com.

Image credit: Kman999

Thursday, 3 September 2009

Now open for submissions - Symposium on Visual Analysis of Chemical Data (ACS Spring 2010)

Second Call for Papers:
Visual Analysis of Chemical Data
239th ACS National Meeting
San Francisco, March 21-25, 2010
CINF Division

Dear Colleagues,

We are now accepting papers for an upcoming symposium focusing on innovative methods for visual representation and analysis of chemical data. Just as Edward Tufte has championed maximizing clarity and information content in statistical graphics, there is a need for methods to display chemical information that will maximize understanding, and allow rapid analysis and decision making.

We invite you to submit contributions that address various aspects of visualization of chemical data (such as structures, SAR data, literature, patents) including, but not limited to, the following topics:

With an ever increasing pool of descriptors, along with new and more sophisticated machine learning methods, QSAR models are becoming more difficult to interpret. How can information on model reliability, the presence of activity cliffs, and the range of applicability of a model and other relevant model properties be easily depicted?
Recently, 3D virtual worlds such as Second Life have presented new opportunities and challenges for the representation of chemical data. What is the potential of such a medium in education and communicating with the chemistry community?
Social software allows for rapid and convenient sharing of chemical data. Examples include Google Spreadsheets, ManyEyes, DabbleDB, and wikis, including Wikipedia. What are the implications for chemical research and education?
The visualization of the contents of large chemical datasets presents particular problems. How can an overview of the dataset be visualized so that it presents both the nature of the contents as well as the degree of diversity and similarity within the dataset? How can different datasets be visually compared?
Depicting 3D chemical information in 2D involves a loss of information. However, innovative 2D visualization methods can restore the most relevant information.
Chemical information comprises a diverse array of data types including chemical structures and diagrams (2D and 3D), associated assay results, conformations, QSAR models and their predictions. The visualization and integration of all these data into a single interface that aids interpretation and analysis is a continuing challenge.

We would also like to point out that sponsorship opportunities are available.

The on-line abstract submission system (PACS) is now open for submissions until 19th October.

Please contact Andrew, Jean-Claude or myself if you have any questions.

Yours sincerely,
Noel O'Boyle

On behalf of the symposium organizers:

Dr. Jean-Claude Bradley,
Drexel University, PA
bradlejc@drexel.edu

Dr. Andrew Lang,
Oral Roberts University, OK
alang@oru.edu

Dr. Noel O’Boyle,
University College Cork, Ireland
n.oboyle@ucc.ie

Image credit: prehensile

Wednesday, 26 August 2009

Using OpenBabel from Java

OpenBabel 2.2.3 has just been released and with that, a new release of the Java bindings. The full details on using OpenBabel from Java are available on our wiki. On Windows openbabel.jar is included with the OpenBabel GUI so no additional installation is necessary. You just start Eclipse, add the jar file and away you go.

The following example shows how to use OpenBabel from Java. It includes an example of file format conversion, iteration over atoms, and using the SMARTS matcher.

import org.openbabel.*;

public class Test {

   public static void main(String[] args) {
       // Initialise
       System.loadLibrary("openbabel_java");

       // Read molecule from SMILES string
       OBConversion conv = new OBConversion();
       OBMol mol = new OBMol();
       conv.SetInFormat("smi");
       conv.ReadString(mol, "C(Cl)(=O)CCC(=O)Cl");
     
       // Print out some general information on the molecule, atoms
       conv.SetOutFormat("can");
       System.out.print("Canonical SMILES: " + conv.WriteString(mol));
       System.out.println("The molecular weight of the molecule is "
                  + mol.GetMolWt());
       for(OBAtom atom : new OBMolAtomIter(mol)) {
           System.out.println("Atom " + atom.GetIdx() +
                              ": atomic number = " + atom.GetAtomicNum() +
                              ", hybridisation = " + atom.GetHyb());
       }

       // What are the indices of the carbon atoms
       // of the acid chloride groups?
       OBSmartsPattern acidpattern = new OBSmartsPattern();
       acidpattern.Init("C(=O)Cl");
       acidpattern.Match(mol);
     
       vvInt matches = acidpattern.GetUMapList();
       System.out.println("There are " + matches.size() +
                          " acid chloride groups");
       System.out.print("The carbon atoms of the matches are: ");
       for(int i=0; i<matches.size(); i++)
           System.out.print(matches.get(i).get(0) + " ");
   }
}

The output is as follows:

Canonical SMILES: ClC(=O)CCC(=O)Cl
The molecular weight of the molecule is 154.97935999999999
Atom 1: atomic number = 6, hybridisation = 2
Atom 2: atomic number = 17, hybridisation = 0
Atom 3: atomic number = 8, hybridisation = 2
Atom 4: atomic number = 6, hybridisation = 3
Atom 5: atomic number = 6, hybridisation = 3
Atom 6: atomic number = 6, hybridisation = 2
Atom 7: atomic number = 8, hybridisation = 2
Atom 8: atomic number = 17, hybridisation = 0
There are 2 acid chloride groups
The carbon atoms of the matches are: 1 6

Note: although using OpenBabel from Eclipse on Windows works fine, some users have reported problems on Linux with the default OpenBabel build. You probably need to build OpenBabel statically on Linux if you want to use it from Eclipse, but I haven't tested this. In any case, you can just compile it from the command line.

Thursday, 20 August 2009

MolCore - a new beginning for OpenBabel and RDKit

Is it possible to design something exactly right first time? In the world of software design, the answer is no. There are some design decisions whose impact you will only realise years down the line, perhaps as you try to extend the software to handle unforeseen uses. At that point, you're stuck with design decisions that you cannot easily change without major work.

A case in point - in OpenBabel, atoms are numbered from 1 but bonds from 0. Bug heaven.

A few weeks ago the first steps were made in sorting out these sorts of issues; a new project, MolCore, was registered on SourceForge with the goal of developing a common Molecule object for both RDKit and OpenBabel. This will largely be based on RDKit code, but will pool together the collective wisdom of developers on both sides regarding things they wished had been done differently.

As ever with an open source project, all the discussion occurs in public so if interested check out the wiki pages and subscribe to the mailing list.

Monday, 17 August 2009

How does rescoring improve results in docking?

Despite more than a decade of research into improved scoring functions, a scoring function that can accurately predict binding affinities remains an elusive goal. Even the simpler problem of identifying ligands from a data set of inactive molecules is a challenge for modern scoring functions, although for a given protein a particular scoring function may work very well. While there is certainly a need for the development of improved scoring functions with better performance over a wider range of protein families, it is also important to make the maximal use of currently available scoring functions. One of the ways to do this is to combine existing scoring functions in a so-called rescoring experiment.

Testing Assumptions and Hypotheses for Rescoring Success in Protein−Ligand Docking Noel M. O'Boyle, John W. Liebeschuetz and Jason C. Cole, Journal of Chemical Information and Modeling, 2009, ASAP.

A rescoring experiment simply involves taking the docking poses found by Scoring Function A, and assessing them (after local optimization if you want to avoid artifacts) with Scoring Function B. Compared to the length of time a docking requires, rescoring is almost instant. Although rescoring has the potential to improve results in a virtual screen, it won't always. This means that it is important to understand the underlying reasons for success in rescoring. This would then allow the choice of appropriate Scoring Functions A and B.
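The workflow just described can be sketched in a few lines of Python. This is only a toy illustration: the two scoring functions, the pose dictionaries and the ligand names are all invented, and I am assuming that lower scores are better. The local optimisation step is left out.

```python
# Toy rescoring sketch: every name and number here is hypothetical.

def score_a(pose):
    """Stand-in for Scoring Function A (the one used for docking)."""
    return pose["a"]

def score_b(pose):
    """Stand-in for Scoring Function B (used only to rescore)."""
    return pose["b"]

def rescore(docked_poses, dock_score, new_score):
    """Keep the pose that dock_score prefers for each ligand,
    then rank the ligands by new_score of that pose (lower is better)."""
    best_pose = {
        ligand: min(poses, key=dock_score)
        for ligand, poses in docked_poses.items()
    }
    return sorted(best_pose, key=lambda ligand: new_score(best_pose[ligand]))

docked = {
    "ligand1": [{"a": -9.0, "b": -3.0}, {"a": -7.5, "b": -8.0}],
    "ligand2": [{"a": -8.0, "b": -6.5}],
    "ligand3": [{"a": -6.0, "b": -7.0}, {"a": -5.0, "b": -9.0}],
}

ranking_a = rescore(docked, score_a, score_a)   # plain docking ranking
ranking_ab = rescore(docked, score_a, score_b)  # poses from A, ranked by B
print(ranking_a)
print(ranking_ab)
```

Note that the pose is always chosen by A; B only changes the order in which the ligands are ranked.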

JCIM has just published some work of mine in which I investigate two hypotheses for rescoring success:

That rescoring success occurs due to some consensus effect between the two scoring functions that eliminates false positives
That rescoring success occurs due to complementarity between the scoring functions; that is, the first scoring function is better at pose prediction, while the second is better at scoring actives relative to inactives

As far as I am aware, this is the first study to investigate why rescoring can improve results in a virtual screen.

A cheminformatics journal by any other name...

Over at Wiley, QSAR & Combinatorial Science is retiring to make way for Molecular Informatics from 2010. The website is molinf.com.

The journal's scope includes but is not limited to the fields of drug discovery and chemical biology, protein and nucleic acid engineering and design, the design of nanomolecular structures, strategies for modeling of macromolecular assemblies, molecular networks and systems, pharmaco- and chemogenomics, computer-assisted screening strategies, as well as novel technologies for the de novo design of biologically active molecules. As a unique feature Molecular Informatics will publish so-called "Methods Corner" review-type articles which will feature important technological concepts and advances within the scope of the journal.

Apparently there's an "open access" option but I cannot find any details.

Wednesday, 22 July 2009

Services built around open source software

Computer-aided chemistry is used today by all the major high-technology companies that are active in chemistry. Just like the meteorologist uses computers to forecast the weather, computers can be used to simulate and predict properties of molecules. This approach is documented to give companies and scientists a high return on investment. But few companies have the resources and skills to make it a reality. The cost of hardware, software, and specialized scientists makes this approach unattainable to most. hBar Lab addresses this problem by putting the required technology online. With hBar Lab there is:
No need for expensive hardware
No upfront payment for software
User-friendly interface makes it accessible for everyone, no specialized scientist necessary.

Source: hBar Lab - Computer-aided Chemistry On Demand

Support and consulting have always been ways of deriving income from open source software, but the web introduces new possibilities centered around web services. I have recently become aware of hBar Lab, whose web application is built entirely on open source software (MPQC, OpenBabel, Jmol) and who perform on-demand calculation of molecular properties:

The user login, select the property, e.g. ionization energy or geometry, and the molecule of interest, and then submit the query. The required calculations are seamlessly executed on computers in the background and once the calculations are done, the results will be returned in the user's inbox. It is as simple as that.

An interesting idea.

TwirlyMol - Status update re world domination

TwirlyMol was the world's first Javascript molecular viewer with shadows. It has been described as "and of course the shadows are cool" by Felix of Chemical Quantum Images.

Although TwirlyMol was only released into the wild to fend for itself in January, it has swiftly outpaced Chime and is rapidly approaching Jmol-like levels of deployment.

Well, almost. At least one ~~other~~ person is using it anyway. As part of a chemistry education project at the University of Wisconsin, TwirlyMol is being used on the ChemPrime wiki and on a student education portal, both of which look like interesting resources under development. However, you should be warned - the TwirlyMol shadows have been removed!

TwirlyMol is freely available under a do-what-you-want-with-it license. You can even (*sob*) remove the shadows.

Wednesday, 15 July 2009

ANN: Symposium on Visual Analysis of Chemical Data (ACS Spring 2010)

Update 06/Sept/09: See second call for papers.

First Call for Papers:
Visual Analysis of Chemical Data
239th ACS National Meeting
San Francisco, March 21-25, 2010
CINF Division

Dear Colleagues,

We wish to announce an upcoming symposium focusing on innovative methods for visual representation and analysis of chemical data. Just as Edward Tufte has championed maximizing clarity and information content in statistical graphics, there is a need for methods to display chemical information that will maximize understanding, and allow rapid analysis and decision making.

We invite you to submit contributions that address various aspects of visualization of chemical data (such as structures, SAR data, literature, patents) including, but not limited to, the following topics:

With an ever increasing pool of descriptors, along with new and more sophisticated machine learning methods, QSAR models are becoming more difficult to interpret. How can information on model reliability, the presence of activity cliffs, and the range of applicability of a model and other relevant model properties be easily depicted?
Recently, 3D virtual worlds such as Second Life have presented new opportunities and challenges for the representation of chemical data. What is the potential of such a medium in education and communicating with the chemistry community?
Social software allows for rapid and convenient sharing of chemical data. Examples include Google Spreadsheets, ManyEyes, DabbleDB, and wikis, including Wikipedia. What are the implications for chemical research and education?
The visualization of the contents of large chemical datasets presents particular problems. How can an overview of the dataset be visualized so that it presents both the nature of the contents as well as the degree of diversity and similarity within the dataset? How can different datasets be visually compared?
Depicting 3D chemical information in 2D involves a loss of information. However, innovative 2D visualization methods can restore the most relevant information.
Chemical information comprises a diverse array of data types including chemical structures and diagrams (2D and 3D), associated assay results, conformations, QSAR models and their predictions. The visualization and integration of all these data into a single interface that aids interpretation and analysis is a continuing challenge.

We would also like to point out that sponsorship opportunities are available.

The on-line abstract submission system (PACS) will be open for submissions from 24th August. A second announcement will be made at that time.

Please contact Andrew, Jean-Claude or myself if you have any questions.

Yours sincerely,
Noel O'Boyle

On behalf of the symposium organizers:

Dr. Jean-Claude Bradley,
Drexel University, PA
bradlejc@drexel.edu

Dr. Andrew Lang,
Oral Roberts University, OK
alang@oru.edu

Dr. Noel O’Boyle,
Cambridge Crystallographic Data Centre, U.K.
oboyle@ccdc.cam.ac.uk

Image credit: prehensile

Tuesday, 7 July 2009

Sledgehammer, meet nut - Using Eclipse for Python

I usually use gvim or IDLE to edit Python files, but today I thought I'd try something a bit more heavyweight: Eclipse. Eclipse is widely used in the Java world. It's open source and freely available, and most importantly there is a Python plugin for Eclipse called PyDev.

So what does Eclipse have that IDLE doesn't? Well, integration with the Python debugger for a start. Also, this sort of code completion is quite handy:

It also has nice integration with PyLint (see the bottom pane in the following figure) which catches various errors (e.g. misspelled variables) before you run a script:

Here are some notes:

I followed these installation instructions and then sped through the manual.
PyDev currently supports Eclipse 3.2 to 3.4. It took a while to find an Eclipse download page with version 3.4 but here it is. I installed Eclipse SDK 3.4.2.
Start Eclipse, and click on Help/Software Updates. Add http://pydev.sourceforge.net/updates/ to the list of update sites. Tick the box and click Install to install PyDev.
Following the details at http://www.fabioz.com/pydev/manual_101_interpreter.html, I added a Python interpreter (Name="Python 2.5", Executable="C:\Python25\python.exe").
Installing pylint on Windows is a pain, so I used easy_install:
```
C:\Python25\Scripts\easy_install.exe pylint
```
In the PyLint configuration, you need to specify the location of lint.py. Mine was at C:\Python25\Lib\site-packages\pylint-0.18.0-py2.5.egg\pylint\lint.py.

Monday, 29 June 2009

I'll fix the bug...but only if you give me a public domain test file

Recently, Avogadro/OpenBabel have been increasing their support for computational chemistry log files. I am hoping that they will learn from our experience at GaussSum/cclib.

GaussSum was the first Python program I ever wrote, and still bears the hallmarks. When I first started GaussSum (a program which analyses the results of comp chem calculations), I would use the test cases from users to fix bugs. Then over time, I'd lose the test cases as I moved from computer to computer. I couldn't place the test cases in my version control system as the test cases might have been the results of someone's research, and they mightn't be happy to see them publicly available.

Things came to a head when dealing with the parsing of vibrational frequencies in the various versions of GAMESS. It turned out that each version of GAMESS (PC-GAMESS, WinGAMESS and GAMESS US) had slightly different output for vibrational frequencies. I ended up bouncing between code that worked for WinGAMESS but not GAMESS and vice versa, depending on who sent me the last bug report. In other words, I was wasting my time fixing bugs which might reappear later. It was around this time that I realised that (a) I needed a test suite, and (b) I needed public domain test files, so I could use them in my test suite.

The parser used by GaussSum is now available as a separate project, cclib, and is developed in collaboration with Adam Tenderholt and Karol Langner. This time I put a lot of thought into the test suite, and I think we've done very well. The parsers are initially developed using a set of calculations which are the same for each comp chem package; our test suite ensures that the same results are found in each case and that the units are consistent. We only fix bugs for which a public domain test file is provided ("I place this file in the public domain" is all we need to hear), and regression tests are easily added to the test suite. Our test suite has the final say on commits; commits are reverted if they cause an existing test to fail. This guarantees that cclib can only improve over time.
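To give a flavour of what such a cross-package test looks like, here is a minimal sketch using Python's unittest module. The two parser functions and the "log files" are invented stand-ins, not cclib's actual API or test data; the point is only the pattern of checking that every parser gives consistent results, in consistent units, for the same calculation.

```python
import unittest

# Hypothetical stand-ins for per-package parsers; these are NOT cclib's API.
# Each returns vibrational frequencies (assumed cm-1) for the same calculation.
def parse_package_a(text):
    return [float(x) for x in text.split()]

def parse_package_b(text):
    # Pretend this package separates its values with commas.
    return [float(x) for x in text.split(",")]

# Invented "log files", one per package.
LOGFILES = {
    parse_package_a: "1650.2 3700.1 3810.5",
    parse_package_b: "1648.9,3702.4,3809.0",
}

class TestVibrationalFrequencies(unittest.TestCase):
    def test_same_number_of_modes(self):
        counts = {len(parser(text)) for parser, text in LOGFILES.items()}
        self.assertEqual(len(counts), 1)

    def test_units_look_like_wavenumbers(self):
        for parser, text in LOGFILES.items():
            for freq in parser(text):
                self.assertTrue(0 < freq < 5000)

    def test_packages_agree(self):
        results = [parser(text) for parser, text in LOGFILES.items()]
        for x, y in zip(*results):
            self.assertAlmostEqual(x, y, delta=5.0)

def run_suite():
    """Run the tests and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(TestVibrationalFrequencies)
    return unittest.TextTestRunner(verbosity=0).run(suite)

if __name__ == "__main__":
    run_suite()
```

A regression test for a newly reported bug would then be one more method plus one more public domain file.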

The inevitable consequence of this policy is that some reported bugs don't get fixed. Sometimes the reporter simply does not respond to the query to place it in the public domain. On two occasions, the reporter was working in a pharmaceutical company and felt it was more hassle than it was worth to do the necessary paperwork to place it in the public domain. So it goes... On the other hand, we do now have a set of more than 200 comp chem log files which go a long way to ensuring that our parsers can handle anything that is thrown at them. The best way of getting these files is to check the data directory of cclib out of subversion and run wget.sh.

In conclusion, if you are thinking of writing software that handles comp chem files, either try to collaborate with others who are working on the same problem (e.g. cclib or OpenBabel), or at the very least take into account some of the comments here. Otherwise, you are simply building a house of cards.

Friday, 19 June 2009

Using PyActiveResource to access ChemCaster

ChemCaster, from Rich's Metamolecular, is a platform for developing web-based cheminformatics applications. The advantage of such a system is that the user does not need to install any special software, nor does the application developer need to maintain a server.

Rich invited me to take it for a spin, so I signed up for a trial account and moved quickly on to my first problem: how do I access the API through Python?

It turns out that RESTful APIs tend to have common patterns, a fact which is taken advantage of by Active Resource, a Ruby library for defining classes which directly map onto the objects implied by a RESTful API. Or something like that - I neglected to read any documentation. Instead I just took Rich's example and tried to code it up in Python using PyActiveResource (this is a documentation-free project so using it is quite exciting).

Et voilà

Tuesday, 9 June 2009

From zero to Zotero - One man's journey out of PDF hell

Zotero is reference management software. Sorry, let me correct that - Zotero is THE reference management software. I had tried Zotero before, and it certainly looked good; but frankly I couldn't figure out how to get it to work and so reverted to my usual system, the 'zero' of the title. Hearing the news that Endnote vs. Zotero was just thrown out of court, I decided to try it again.

And it's just amazing.

Let me begin by describing a typical workflow:
(1) Go to the summary page for an ACS paper online
(2) Click on the icon that appears in the address bar (looks like a sheet of paper with writing).

That's it. You've just saved the PDF, the HTML full-text and the paper's metadata.

If you've created an account on zotero.org (free of course!), you can synch your library so that multiple computers can share the same data. And best of all you can also synch the attachments (i.e. PDFs, HTML pages) if you have a WebDAV account (e.g. from your university or in my case, JungleDisk Plus/Amazon S3). If that wasn't enough, it also integrates with Word to make it easy to prepare a publication (~~though I haven't tested this~~ Update: it works just fine, but you first need to install the bibliographic styles you need from Zotero settings/Preferences/Styles/Get additional styles).

In other words, Zotero makes it easy to download papers, back them up, make them accessible from any computer and reference them in papers.

Zotero is open source and freely available from www.zotero.org.

Notes: I'm using Zotero 2.0b5. In the Zotero preferences (click on the gear icon), choose "Automatically attach PDFs and other files when saving items" in the General Tab. JungleDisk and Amazon cost money (we're talking around $1.50 a month), but there may be free alternatives for WebDAV. For any websites that aren't currently supported by Zotero, adding new translators has been made easy. All of the JavaScript files for the translators are stored in a folder on your computer and can easily be extended or added to. That said, I've had no trouble downloading PDFs from Sciencedirect, ACS, RSC, Wiley or BMC.

Image credit: jazzmodeus

Friday, 5 June 2009

The best time to optimise

As a scientist, I worry more about bugs in software than about speed. Changing correct code to improve speed can introduce errors as well as make it unreadable for others. Sometimes though it's nice to find cases where simple changes can improve the performance.

The 3D structure generation code in OpenBabel uses templates to handle the geometry of rings. There are about 2500 templates, which are represented by SMARTS patterns and associated coordinates (see fragments.txt in the distribution). The SMARTS patterns are ordered from large to small. Now, testing 2500 SMARTS patterns against a molecule takes a wee while so I was interested in seeing whether the process could be speeded up.

To begin with, I timed the code for a test set of 1000 PubChem molecules: it took 60ms per structure. Considering that the easiest way to speed something up is to avoid doing it in the first place, I changed the loop to terminate once all ring atoms had been matched. This brought it down to 38ms per structure. Then I changed it so that it skipped any SMARTS patterns that had more atoms than the number of ring atoms in the molecule: now down to 30ms. This is now within an order of magnitude of greased lightning.
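The two shortcuts are easier to see in code. The following Python sketch is not the OpenBabel implementation (which is C++); the template names, the toy matcher and the atom numbering are made up for illustration. It simply counts how many times the expensive matching step is called.

```python
def assign_ring_templates(templates, ring_atoms, match):
    """Try templates (ordered large to small) against a molecule.

    templates:  list of (number of atoms in pattern, pattern) tuples
    ring_atoms: set of indices of the ring atoms in the molecule
    match:      callable(pattern) -> list of sets of matched atom indices
    Returns (ring atoms still unassigned, number of calls to match).
    """
    unassigned = set(ring_atoms)
    match_calls = 0
    for n_atoms, pattern in templates:
        if not unassigned:
            break      # shortcut 1: every ring atom has been matched
        if n_atoms > len(ring_atoms):
            continue   # shortcut 2: pattern is too big to fit
        match_calls += 1
        for matched_atoms in match(pattern):
            unassigned -= matched_atoms
    return unassigned, match_calls

# Hypothetical toy data: a single six-membered ring, atoms numbered 1 to 6.
toy_matches = {"six_ring": [{1, 2, 3, 4, 5, 6}]}

def toy_match(pattern):
    """Stand-in for the SMARTS matcher."""
    return toy_matches.get(pattern, [])

templates = [
    (10, "fused_bicycle"),
    (6, "six_ring"),
    (5, "five_ring"),
    (3, "three_ring"),
]

unassigned, match_calls = assign_ring_templates(
    templates, {1, 2, 3, 4, 5, 6}, toy_match)
print(unassigned, match_calls)
```

Here the matcher is called once instead of four times: the ten-atom template is skipped as too large, and the loop stops as soon as the six-membered ring is matched.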

In fact, I could have done slightly better than this; I could have skipped any SMARTS patterns with more atoms than the number of atoms in the largest isolated ring system in the molecule. Calculating this value is a bit of work though and may offset the associated performance gain, and so this has been left as an exercise for the reader.

How else could this code be speeded up? Well, the SMARTS matcher can itself be improved. It currently uses an exhaustive depth-first search algorithm instead of something more optimal like Vflib2. This would improve performance across the board as the SMARTS code is widely used for a variety of tasks. Alternatively, the SMARTS patterns could be fingerprinted based on particular common patterns, e.g. 5-membered rings. If a molecule had no 5-membered rings, such patterns could be skipped.
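As a rough sketch of the fingerprinting idea, suppose each template is annotated in advance with the ring sizes it requires. The annotations and template names below are hypothetical, and a real fingerprint would probably encode more than ring sizes.

```python
def prescreen(patterns, ring_sizes_in_molecule):
    """Yield only the patterns whose required ring sizes all occur
    in the molecule; the rest cannot match and are never tested."""
    for name, required_ring_sizes in patterns:
        if required_ring_sizes <= ring_sizes_in_molecule:
            yield name

# Hypothetical templates, each tagged with the ring sizes it contains.
patterns = [
    ("fused_five_six", {5, 6}),
    ("six_ring", {6}),
    ("five_ring", {5}),
]

# A molecule with no 5-membered rings only needs one pattern tested.
survivors = list(prescreen(patterns, {6}))
print(survivors)
```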

To begin with, though, the code should be profiled more precisely. It may be that 25 of those 30ms have nothing to do with this loop. In that case, further optimisations may be more work than they are worth.
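For the OpenBabel C++ code itself you would reach for a C++ profiler, but the principle is the same in any language. As an illustration only, here is how Python's built-in cProfile module splits the time between a loop of interest and everything else; both functions are invented stand-ins.

```python
import cProfile
import io
import pstats

def match_templates(n):
    """Hypothetical stand-in for the template-matching loop."""
    return sum(i * i for i in range(n))

def other_work(n):
    """Hypothetical stand-in for the rest of structure generation."""
    return sorted(range(n), reverse=True)

def build_structure():
    match_templates(20000)
    other_work(20000)

profiler = cProfile.Profile()
profiler.enable()
build_structure()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

The cumulative time column shows at a glance what fraction of build_structure is actually spent in match_templates.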

These are the sorts of small studies that would fit nicely into a summer project for an undergrad computer science or chemistry student. If you want to sponsor OpenBabel development in this way, contact us.

Image credit: jpctalbot

Wednesday, 27 May 2009

The RSC - Value for money?

I don't usually advertise for chemical societies, but in these recessionary times I thought the following might be of interest to some readers.

RSC members have:

free access to Wiley, Elsevier, and Springer chemistry journals
free access to 913 chemistry e-books from a variety of sources
20% off Pearson Education Books, 30% off Wiley, 35% off Blackwell
and most importantly, £5 off Pizza Express Club membership

Sure, chemistry societies organise conferences, enable networking, provide travel grants, and lobby politicians; but any society that doesn't look after its most vulnerable members by providing discounted pizza is not a society I want to be a member of.

Thursday, 21 May 2009

Have your hamburger and eat it - Edit molecules in PDFs II

In Part I, I showed how to hack some code together that allowed you to paste images directly from the clipboard (e.g. from a PDF) into Beda's BKChem, a 2D drawing program. The magic conversion from image to chemical was done by Igor's OSRA.

Well, Igor has taken this idea and run with it. The latest version of OSRA now includes plugins for BKChem, Symyx Draw, MolSketch and Pipeline Pilot.

If you use the Windows installer, the Symyx Draw plugin is automatically installed and adds an "Import Structures from OSRA" option to the File menu. The first time you choose it, you will need to change the path to something like "C:\Program Files\osra\osra.exe" under "Settings...". Here's the plugin in action:

Note that the other plugins appear to be only available from the Windows .zip release.

Saturday, 16 May 2009

How do enzyme mechanisms evolve?

Evolution is a fascinating topic. Although the principal mechanism by which evolution occurs is quite simple to understand, namely the introduction of changes (mutations) into the DNA, the consequences that follow are enormous.

The term selective pressure is used to describe an imaginary operator that affects the incidence of particular mutations in a population. What makes evolution difficult for me to get my head around is that selection operates on many levels. In a population, a particular physical characteristic might be more advantageous (think of the famous finches) or more attractive. In your DNA, a particular mutation might preserve the amino acid coded for, or it may change to another amino acid that does not affect the protein's function. On the other hand, if the amino acid is involved in the catalytic action of the protein it's going to be conserved, right? But then how do new mechanisms evolve?

My former postdoc supervisor, Dr. John Mitchell, is currently advertising a PhD position on "Modelling the Evolution of Enzyme Catalysis" at the University of St. Andrews. I'm particularly interested in this project as it builds on earlier work I carried out in the Mitchell Group along with Gemma Holliday and Daniel Almonacid. Here's an excerpt from the project description:

We will create a simulation using a population of model enzyme-catalysed reactions, mimicking a state early in evolutionary history, and allow them to evolve in EC space. The reactions will consist of steps and be represented, in a manner familiar from genetic algorithms, by "chromosomes" describing the chemical properties of each step. Parameters will control the likelihood of different kinds of evolutionary event, such as a change of substrate with the same underlying chemical mechanism, taking place. The simulations will be calibrated, and then compared with the results from a study of real-world convergent and divergent evolution.

Cool. Closing date 31 July.

Image credit: Colin Purrington

Tuesday, 5 May 2009

Manipulating PDFs with Open Source Tools - Part II

Part I

There are a couple of journals that have rather generous margins around the text. I prefer to print out two pages per sheet to avoid wasting paper, so it would be nice to be able to remove those margins and increase the size of the text instead.

pdfcrop by Heiko Oberdiek allows you to do just that. It's included with Debian Linux, and "sudo apt-get install texlive-pdfetex" will install it. On other (lesser) Linux distributions you may have to check CPAN or CTAN. Once installed, a straightforward "pdfcrop withmargin.pdf nomargin.pdf" will do the conversion.

Update: A reader points out that TeX Live 2008 includes pdfcrop and is available for both Windows and Linux.

Note that pdfcrop should not be confused with PDFCrop. There's also a patch of pdfcrop called pdfcrop2, but I think that covers the current crop.

And spare a thought for Fermat.

Image credit: B.G. Johnson

ChemPad - Protecting you from chemical structure software

A colleague has just pointed me to ChemPad, an interesting piece of software that allows chemists to draw chemical diagrams directly into a computer on a Tablet PC. It was designed as a tool to enable undergraduate chemists to quickly enter chemical structures and generate 3D diagrams with which they could gain a deeper understanding of structures. Hash and wedge bonds are understood and are converted faithfully to 3D, for example.

From the point of view of teaching chemistry undergraduates, such software may be preferable to ChemDraw and friends as it allows them to develop the skill of drawing diagrams by hand. Of course, not everyone has a tablet PC but one could imagine similar software for the iPhone or just a regular PC driven by a mouse.

ChemPad is free but not open source, and Windows only. The ChemPad website has tutorials and videos of the software in use.

Friday, 24 April 2009

Broken symmetry - Can SMILES and InChI ensure canonicalisation?

I've been working on the SMILES code in OpenBabel over the last while. The longer I've spent on it, the more impressed I've been with how it has been handled in the code and also with what a great idea SMILES was in the first place. The same goes for InChI, which has a slightly different goal, but which goes the extra mile and solves normalisation problems which I didn't even know existed.

But do they work? Can their canonicalisation procedures ensure that two identical molecular graphs result in the same canonical SMILES or InChI?

The InChI canonicalisation procedure is summarised in Rich's post. The Daylight algorithm is in Weininger*3, JCICS, 1989, 29, 97. And the review of the field that throws both into question is Ivanciuc's review of Processing Constitutional Information in Gasteiger's Handbook of Cheminformatics.

The key question here is whether the SMILES and InChI algorithms are capable of identifying automorphisms. There is a brute force way to do this, but both SMILES and InChI try to avoid this by identifying symmetry classes using extended connectivity and various graph invariants. An explicit automorphism check is not described as part of either algorithm, yet Ivanciuc argues repeatedly (e.g. at the end of 5.1.4) that any canonicalisation algorithm that does not include an explicit automorphism check "is incomplete, and its use in a chemical database...is unreliable".
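To make "brute force" concrete, here is a small Python sketch that tries every permutation of the atoms and keeps those that map the set of bonds onto itself. It is an illustration under simplifying assumptions, not anything taken from the SMILES or InChI code: element types and bond orders are ignored, and the factorial cost makes it usable only for tiny graphs.

```python
from itertools import permutations

def automorphisms(n_atoms, bonds):
    """Return every permutation of the atom indices 0..n_atoms-1
    that maps the set of bonds onto itself."""
    bond_set = {frozenset(bond) for bond in bonds}
    found = []
    for perm in permutations(range(n_atoms)):
        mapped = {frozenset((perm[a], perm[b])) for a, b in bonds}
        if mapped == bond_set:
            found.append(perm)
    return found

def symmetry_classes(n_atoms, bonds):
    """Group together atoms that some automorphism maps onto each other."""
    autos = automorphisms(n_atoms, bonds)
    orbits = {frozenset(perm[i] for perm in autos) for i in range(n_atoms)}
    return sorted(sorted(orbit) for orbit in orbits)

# Three atoms in a chain (a propane skeleton) and a six-membered ring.
chain = [(0, 1), (1, 2)]
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]

print(len(automorphisms(3, chain)))
print(symmetry_classes(3, chain))
print(len(automorphisms(6, ring)))
```

The chain has two automorphisms (the identity and the end-to-end flip), putting the two terminal atoms in one symmetry class, while the ring has twelve.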

The funny thing is that although the SMILES paper came out several years prior to Gasteiger's handbook (1989 vs. 2003), it is not referenced. Furthermore, the InChI developers have followed the same route more recently.

I leave the following question as an exercise for the reader: if a counterexample to the SMILES or InChI algorithms existed, how would one find it?

Image credit: _Blaster_

Tuesday, 14 April 2009

Are you on my side or not? It's E/Z

Handling cis/trans stereochemistry with SMILES should be easy, right? You have the canonical examples for trans:

A. I/C=C/Cl
(I is down, Cl is up)
B. I\C=C\Cl
(I is up, Cl is down)

and cis:

C. I/C=C\Cl (both are down)
D. I\C=C/Cl (both are up)

The "/" or "\" symbols should be chosen based on whether the substituent occurs before or after the atom attached to the double-bond. Bearing this in mind, the following represents the same trans structure as A:

E. C(=C/Cl)\I

Note that the effect of moving the "I" from one side of the "C" to the other (that is, A vs E) causes the bond symbol to change.
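The before/after rule can be written down as a tiny Python function. This is a sketch of the rule as stated above rather than a SMILES parser: each substituent is described by hand as a pair (bond symbol, whether it is written after the double-bond atom), and the function names are my own.

```python
def side(bond_symbol, written_after_atom):
    """Return 'up' or 'down' for a substituent on a double-bond atom.

    bond_symbol:        the slash or backslash joining substituent and atom
    written_after_atom: True if the substituent follows the double-bond
                        atom in the SMILES string, False if it precedes it
    """
    if bond_symbol not in ("/", "\\"):
        raise ValueError("bond symbol must be a slash or a backslash")
    is_slash = bond_symbol == "/"
    return "up" if is_slash == written_after_atom else "down"

def cis_or_trans(left, right):
    """Compare the substituents at either end of the double bond."""
    return "cis" if side(*left) == side(*right) else "trans"

# Examples A to E from above, encoded by hand.
examples = {
    "A": (("/", False), ("/", True)),    # I/C=C/Cl
    "B": (("\\", False), ("\\", True)),  # I\C=C\Cl
    "C": (("/", False), ("\\", True)),   # I/C=C\Cl
    "D": (("\\", False), ("/", True)),   # I\C=C/Cl
    "E": (("\\", True), ("/", True)),    # C(=C/Cl)\I, listed as (I, Cl)
}

results = {name: cis_or_trans(*ends) for name, ends in examples.items()}
print(results)
```

In example E the iodine is written after its carbon, so its backslash means "down", exactly as the slash did in A when the iodine came first.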

When ring closures occur on the double bond, a further complication arises as the stereobond appears twice, once at each end of the ring closure. The symbol indicating the stereochemistry should only appear at the end on the double bond:

F. I/C=C\1/CCCN1

Of course, where two substituents are shown explicitly at one end of a double bond, it's not necessary to show the stereochemistry for both of the bonds (although it makes things clearer). That is, the following two representations are identical to F:

G. I/C=C1/CCCN1
H. I/C=C\1CCCN1

Image credit: suttonhoo

Friday, 3 April 2009

Some short stories

I want to flag up Andrew Dalke's course at the end of April on Python and cheminformatics. While I might disagree with Andrew's toolkit of choice, there's no doubt that the skills learnt will be of great benefit to any cheminformatician in their day-to-day work. As well as a cheminformatics portion, the course includes matplotlib (plotting), communicating with Excel, XML processing, subprocess (for calling command-line programs), NumPy, R, SQL and Django.
The first issue of Journal of Cheminformatics has hit the electronic shelves. Point your RSS reader to the feed. Best of luck to Christoph and David.
Is 2009 the year of OChRe on the desktop? After almost a decade of little development in this area, we have in quick succession papers on ChemReader, OSRA and now Clide Pro, an update of the venerable Clide. The techniques used by the new version are described in detail in the paper. Unfortunately, there is little in the way of comparison either to the original Clide or other OChRe software. On the plus side, the dataset of images discussed in the paper has been made available as supporting material with the intention of forming part of a community benchmark for performance comparisons (although it's not clear whether this dataset was also used for training the software).
There seems to be some confusion over the name of this field. Is it OSR (Optical Structure Recognition, according to OSRA), OCR (Optical Chemical Recognition, à la chemOCR), OCSR (you guessed it, Optical Chemical Structure Recognition, as referred to in the Clide Pro paper), or OChRe (Optical Chemical Recognition again, but spelling out a real word; it also has that InChI up-and-down thing going on)?
Did your experiments fail again? Tell me about it. I mean that literally, because you've got your choice of journals to publish in. There's the All Results Journal ("all results are good results") or (for the more mathematically inclined) Rejecta Mathematica.

Tuesday, 24 March 2009

The Clockwisdom of SMILES Part II

As many readers of this blog will be aware, a chiral SMILES is not a lopsided grin. Instead it is a way of describing the relative spatial arrangement of groups around a chiral centre using SMILES notation.

The following examples investigate this notation. The stereotypical examples are the following:

A. C[C@](Br)(Cl)I - ACW(C,Br,Cl,I)
B. C[C@H](Br)I    - ACW(C,H,Br,I)

where ACW(w,x,y,z) indicates anticlockwise in terms of x,y,z when looking from w. However, note that the first group does not necessarily need to appear before the chiral carbon:

C. [C@](C)(Br)(Cl)I - ACW(C,Br,Cl,I) (Same as A)
D. [C@H](C)(Br)I    - ACW(H,C,Br,I)  (Opposite of B)

What about ring closures? These are handled as follows:

E. C[C@H]1CCN1 -
ACW(C,H,N,C) (the 1 indicates a bond to the chiral C)
F. C[C@]12N(CC2)C1 -
ACW(C, the C1 carbon, the C2 carbon, N)
G. C[C@@]21N(CC2)C1 -
CW(C, the C2 carbon, the C1 carbon, N) (the same as F)
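The comparisons above (C versus A, D versus B, G versus F) can be checked mechanically. What follows is only a minimal sketch, not anything taken from a SMILES toolkit: it assumes the neighbour order has already been read off the SMILES by hand, the labels such as "CH3", "C1" and "C2" are made up for illustration, and it relies on the rule that swapping any two neighbours reverses the sense of rotation.

```python
def swaps_needed(order_a, order_b):
    """Count pairwise swaps used to turn order_a into order_b."""
    current = list(order_a)
    swaps = 0
    for i, wanted in enumerate(order_b):
        j = current.index(wanted)
        if j != i:
            # Put the wanted group into position i with a single swap.
            current[i], current[j] = current[j], current[i]
            swaps += 1
    return swaps


def same_stereo(order_a, mark_a, order_b, mark_b):
    """True if two (neighbour order, @ or @@) descriptions agree."""
    odd = swaps_needed(order_a, order_b) % 2 == 1
    # An odd number of swaps reverses the sense, so the marks must differ;
    # an even number preserves it, so the marks must match.
    return (mark_a != mark_b) == odd


# Neighbour orders read off examples A-D, F and G by hand (labels are illustrative).
A = (["CH3", "Br", "Cl", "I"], "@")
C = (["CH3", "Br", "Cl", "I"], "@")
B = (["CH3", "H", "Br", "I"], "@")
D = (["H", "CH3", "Br", "I"], "@")
F = (["CH3", "C1", "C2", "N"], "@")
G = (["CH3", "C2", "C1", "N"], "@@")

print(same_stereo(A[0], A[1], C[0], C[1]))  # C is the same as A
print(same_stereo(B[0], B[1], D[0], D[1]))  # D is the opposite of B
print(same_stereo(F[0], F[1], G[0], G[1]))  # G is the same as F
```

Only whether the swap count is odd or even matters here, so any way of reordering one list into the other would give the same answer.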

Note that ring closures directly before an atom do not indicate a bond to that atom. Try to draw the following and compare your result to that given by Daylight's Depict service:

H. [C@@]123[C@H](C(C=C3)(C)C)CC[C@@](C1)(CCC2)C

If you got that right, consider yourself a SMILES ninja.

Credit: Thanks to Craig James for his patient explanations.
Image credit: sean-b

Friday, 20 March 2009

Time for a test - Any questions?

I'm a great believer in tests for code quality. In fact, I don't want to contribute code to a project if I can't add a test to the test suite. This is particularly important in collaborative projects where changes by others might impact on bugs I've fixed or features I've added. I've learned my lesson in the past. With a test suite, I can be sure that everything is still working the way I expect.

I've recently started a new test suite for OpenBabel called obunittest. Although OpenBabel already has a test suite ("make test"), I wanted to have a test suite written in Python where people could easily add new tests.

obunittest is hosted at github and all of the necessary instructions are available at the obunittest website (just scroll down). Git itself isn't required, but you may find it interesting to use - to do so, just create an account on github and fork my project.

So this is your chance to add a test for OpenBabel. Now while this might not be everyone's idea of a fun time, if there's some feature of OpenBabel upon which you rely, write a test for it and send it to me (or "git it" to me). This will ensure that that particular feature will always work in future OpenBabel releases. The same goes if there's something that you know is currently broken - just write a test. Remember that a stitch in time means you won't be saying "darn".

Image credit: Duncan Hull (hi!)

Wednesday, 18 March 2009

The Clockwisdom of SMILES

I was recently confronted with a question that many of us face at some point in our lives: how many ways can the groups attached to a chiral C be moved around in a SMILES string while retaining the clockwisdom?

What's all this about clockwisdom? Well, a chiral SMILES string can indicate R or S around a tetrahedral centre using C@ or C@@. The difference is that R or S refer to clockwisdom of groups arranged by CIP priority (with the lowest priority facing away), whereas @ and @@ refer to clockwisdom of groups arranged in order of their appearance in the SMILES string (with the first appearing facing towards) [1]. Whether this was a good design decision by the Daylight gurus, I'm not 100% sure, but that's how it is.

So in short, if you change the order of groups in the SMILES string, you may need to change the clockwisdom to ensure that stereochemistry is preserved. Specifically, if you swap two groups you will get the other enantiomer ("putting the SMILES on the other face"?) unless you flip the clockwisdom; that is, Cl[C@@](Br)(C)I is the same enantiomer as Cl[C@](Br)(I)C. Another swap and we get back a SMILES string with the original clockwisdom.

So I started off by trying to think of a clever program to identify how many swaps were required to convert between two orderings of groups. Next I tried to write a few loops that would simply perform all possible swaps of groups to generate all of the rearrangements, but that missed a few. In the end, I just wrote the dumbest program I could think of and got the following results. For an original ordering of groups 0123, the following orderings have the same clockwisdom: 1032, 3021, 2013, 3210, 1320, 3102, 0123, 0231, 0312, 2301, 1203, 2130.
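The program itself isn't shown here, so the following is only a guess at what a suitably dumb approach might look like, not a reconstruction of it. It starts from 0123, applies every possible swap of two groups over and over, and records for each new ordering whether it was reached by an odd or even number of swaps. The function name same_clockwisdom is made up for this sketch.

```python
from itertools import combinations


def same_clockwisdom(start="0123"):
    """Return the orderings reachable from start by an even number of swaps."""
    parity = {start: 0}  # ordering -> number of swaps used, modulo 2
    frontier = [start]
    while frontier:
        next_frontier = []
        for ordering in frontier:
            # Try swapping every pair of positions.
            for i, j in combinations(range(len(ordering)), 2):
                groups = list(ordering)
                groups[i], groups[j] = groups[j], groups[i]
                swapped = "".join(groups)
                if swapped not in parity:
                    # One more swap flips the parity.
                    parity[swapped] = 1 - parity[ordering]
                    next_frontier.append(swapped)
        frontier = next_frontier
    return sorted(o for o, p in parity.items() if p == 0)


print(same_clockwisdom())
```

Each swap reverses the clockwisdom, so the orderings with an even swap count are the ones that keep the original sense. There should be 12 of them, half of the 24 possible orderings.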

And the point of all this? OpenBabel was not generating the correct stereochemistry around tetrahedral carbons in canonical SMILES. Now fixed.

Update (19/03/09): Tim Vandermeersch pointed out to me a neat way of determining the parity of a particular ordering of groups. Simply count the number of pairs in the ordering where one number is larger than another number to its right. For example, for 1032, there are two pairs (10, 32); for 3021, there are four pairs (30, 32, 31, 21). Orderings with an even number of pairs have one parity while orderings with an odd number of pairs have the opposite parity.
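Tim's counting rule takes only a few lines of Python. This is a minimal sketch; the names count_inversions and same_parity are invented here, and it assumes each group is labelled with a single digit so that comparing characters is the same as comparing numbers.

```python
from itertools import permutations


def count_inversions(ordering):
    """Count pairs where a number is larger than another number to its right."""
    return sum(
        1
        for i in range(len(ordering))
        for j in range(i + 1, len(ordering))
        if ordering[i] > ordering[j]
    )


def same_parity(a, b):
    """True if two orderings have the same parity (both even or both odd)."""
    return count_inversions(a) % 2 == count_inversions(b) % 2


print(count_inversions("1032"))

# All orderings of four groups with the same parity as the original 0123.
even_orderings = [
    "".join(p) for p in permutations("0123") if same_parity(p, "0123")
]
print(even_orderings)
```

Since 0123 has no such pairs at all, the orderings that share its clockwisdom are exactly those with an even count.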

[1] The OpenSMILES specification on stereochemistry

Image credit: Swamibu

Monday, 2 March 2009

Review of Hello World - Computer Programming for Kids and Other Beginners

My generation were among the first children to learn programming. Thanks to the BBC Microcomputer (in my part of the world), kids were provided with a computer and a manual that taught computer programming in both BASIC and Logo. The local library had a complete set of Usborne books that covered everything from arcade games, to fantasy adventures (the old-school text only type, that is), to assembly language programming and sorting algorithms. And these were for children.

So what was it about programming on the BBC (or ZX Spectrum or Commodore 64) that drew kids in? For me it was all about graphics. Drawing circles could only be done dot by dot and led easily to drawing ellipses, and then to hyperboloids of revolution (think cooling towers) whose top you could twist. I read Chaos by James Gleick, couldn't believe the simplicity of generating the Mandelbrot fractal, and lifted my jaw off the floor the first time my BBC drew the little Mandelbeetle (I'm not the only one - see also PMR). I did some astronomy at school and plotted the night sky for different months of the year. And so on.

Since then we've seen the rise of the PC and Windows, which in fairness had QBasic for quite some time (I am a Nibbles master). However, as David Brin pointed out ("Why Johnny can't code", 2006), today there's no easy way for kids to get hooked on programming. Even my favourite language, Python, is lacking here. Out of the box the only usable graphics library for kids is the turtle module, an implementation of Logo:

C:\Documents and Settings\oboyle> python
Python 2.6.1 (r261:67517, Dec  4 2008, ...
Type "help", "copyright", "credits" ...
>>> from turtle import *
>>> for i in range(10):
...     for j in range(5):
...         forward(100)
...         left(360/5)
...     left(360/10)
...
>>>

It seems to work quite well, although the documentation is aimed at computer science majors rather than teachers (never mind kids). Also, the demo files are only available in the source distribution (you can get them from SVN here).

While Logo might be quite good for introducing the basics of programming languages, its graphics capabilities are limited. pygame is really the way to go. This is one of the big third-party Python extensions that incorporates support for sound, graphics and input devices. As the name implies it has everything necessary to write a decent computer game (see for example, the list of pygame arcade games). The downside is that this library makes no effort to cater for kids.

Enter a recent publication from Manning, "Hello World! - Computer programming for kids and other beginners" by Warren and Carter Sande. Written with 12-year-old kids in mind, the preface makes it clear that the authors (one a 12-year-old kid himself) know their target audience well:

"For kids especially, one of the most fun parts of using a computer is playing games, with graphics and sound. We’re going to learn how to make our own games and do lots of things with graphics and sound as we go along. Here are pictures of some of the programs we’ll be making:"

(Figure published with permission from Manning Publications)

Lunar lander! Slalom racing! The Sandes have reinvented the Usborne books for the YouTellyTub generation, and then some. Assuming no previous programming knowledge (a reasonable assumption when you're 12), the book teaches Python programming with the goal of writing computer games. The initial chapters cover the basics from variables, through maths, "if" statements and loops. But there's also already the fun stuff like getting input and simple graphical dialogs, and in case attention is waning Chapter 10 (of 24) has the complete listing for a Skiing game. As the book says: "One of the great traditions of learning to program is typing in code you don’t understand. Really!"

After introducing lists, functions, objects and modules, pygame enters the picture in Chapter 16 which covers drawing, images and animation. The following chapters cover sprites and collision detection, events and sound. The final chapters return to useful Python modules such as handling strings, file input and output, and using random numbers. All of the code examples are available for download from the book's website, together with a simple installer that bundles the examples, the required modules and Python itself.

As you might have guessed, I think this is a great book that fills a real niche - I don't know of any other programming book on the market that targets kids. What's amazing is that it has set its sights so high, and yet manages to meet its goals. I think it would be great to see this book promoted as a way of teaching programming in primary schools. In the meantime, if you know any 12+ kids interested in computers, give them an opportunity to develop a fascinating hobby and get them this book.