Friday, 3 February 2012

LICSS hits the press, and an update on the OB/BO papers

I've written here before about how Kevin Lawson (of Syngenta) has developed a way to incorporate chemistry into Excel using only freely-available software, namely the CDK and JChemPaint (and also now OPSIN it seems). This system is called LICSS, and the corresponding paper has just appeared in Journal of Cheminformatics where it has been highlighted as an Editor's Pick.

So go check it out. I'm particularly interested in the use of this software in an academic teaching setting. It would seem to be ideal for introducing students to cheminformatics.

While on the subject of papers in J. Cheminf., I've been keeping an eye on accesses and citations of the Open Babel and Blue Obelisk papers since publication in October of last year (see also my earlier post on the topic).

Both have remained in the top 10 most accessed papers in the last 30 days (now at positions 4 and 6 for OB and BO respectively). In terms of accesses over the last year, the OB paper is now at position 5 (BO at 23) behind Peter Ertl and Ansgar Schuffenhauer, Mikhail Elyashberg et al (including Tony Williams), Peter Ertl again, and Matthias Samwald et al (including Egon Willighagen) at #1. In terms of all-time accesses, there's still some way to go for OB (now at 24) and BO (now at 46).

Keeping an eye on accesses is fun, but do they translate into the traditional academic coin of citations? Well, the Open Babel paper has already been cited four times, although the Blue Obelisk paper still has only the initial citation from the corresponding editorial (early days yet though).

Thursday, 26 January 2012

Visualising the fragments in a path-based fingerprint

With any tool, users will often come up with ways of applying it that the developers of the tool did not originally anticipate. In the Open Babel paper, I gave an example where the InChI was used (by Fábián and Brock) as part of a workflow to identify a specific class of racemic crystals - I don't think that NIST had this use-case originally in mind.

Similarly, although Open Babel's path-based (or Daylight-type) fingerprint FP2 was developed for similarity searching of databases, we realised that users wanted to use the information in the fingerprint for other purposes. From time to time, someone would ask on the mailing list what fragments corresponded to each of the 1024 bits. At first, our response was to point out that we couldn't really say as (a) more than one fragment might correspond to a particular bit and (b) the hashing algorithm that was used to link the fragments and the bits only worked one-way.

Eventually we realised that people wanted something more, and so Chris added an output option to describe the fragments and their corresponding bits. These can be used just like fragments from other fragmentation schemes (looking for privileged fragments, unusual fragments, whatever), and the purpose of this blog post is to show how to get to grips with these fragments by visualising them.

The example molecule is:
generated by:
obabel -:N1CC1C(=O)Cl -O example.png
And here are the corresponding fragments generated by the FP2 fingerprint (scroll to zoom in the image below, click+drag to pan):

Note: In the visualisation above, hydrogens should be ignored as they are not included in the paths (we could add an option to the SVG depiction to suppress these if necessary). Also, aromatic bonds are depicted as single bonds unless a complete aromatic ring is present in the fragment.

So...how is it done?

The first step in creating this visualisation is to generate a description of the bits in the corresponding FP2 fingerprint:
obabel -:N1CC1C(=O)Cl -ofpt -xs -xf FP2 > example.txt
example.txt will contain the following:
>
0 6 1 6 <670>
0 6 1 6 1 6 <260>
0 6 1 7 1 6 <693>
0 6 1 7 1 6 1 6 <9>
0 7 1 6 <82>
0 7 1 6 1 6 <906>
0 7 1 6 1 6 1 6 <348>
0 8 2 6 <623>
0 8 2 6 1 6 <329>
0 8 2 6 1 6 1 6 <652>
0 8 2 6 1 6 1 6 1 7 <635>
0 8 2 6 1 6 1 7 <653>
0 8 2 6 1 6 1 7 1 6 <46>
0 17 <17>
0 17 1 6 <328>
0 17 1 6 1 6 <219>
0 17 1 6 1 6 1 6 <1009>
0 17 1 6 1 6 1 6 1 7 <24>
0 17 1 6 1 6 1 7 <1010>
0 17 1 6 1 6 1 7 1 6 <456>
0 17 1 6 2 8 <329>
1 7 1 6 1 6 <225>
The help text for the FPT format explains what this means:
obabel -H fpt
...
For the path-based fingerprint FP2, the output from the ``-xs`` option is
instead a list of the chemical fragments used to set bits, e.g.::

 $ obabel -:"CCC(=O)Cl" -ofpt -xs -xf FP2
 >
 0 6 1 6 <670>
 0 6 1 6 1 6 <260>
 0 8 2 6 <623>
 ...etc

where the first digit is 0 for linear fragments but is a bond order
for cyclic fragments. The remaining digits indicate the atomic number
and bond order alternatively. Note that a bond order of 5 is used for
aromatic bonds. For example, bit 623 above is the linear fragment O=C
(8 for oxygen, 2 for double bond and 6 for carbon).
...
If we want to visualise these fragments, a small Python script can read example.txt, create the corresponding molecules, and write out their SMILES strings to output.smi.

Visualising a file full of SMILES strings is then easy. The following line generates the SVG depiction shown above:
obabel output.smi -O fragments.svg -xC
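
For anyone who wants to play with the conversion step (fragment description to SMILES) without a toolkit, here is a minimal sketch that builds each SMILES string directly from the numbers in a line. It is an illustration under assumptions rather than the script used for this post: the names fragment_to_smiles, SYMBOLS, BONDS and LINE are made up for the example, the element table is deliberately partial, a non-zero first digit is read as the order of the bond that closes the ring between the last and first atoms, and writing ":" for bond order 5 is a guess at how to keep the aromatic information (a SMILES reader may treat it differently).

```python
import re

# Partial atomic number -> element symbol table (assumption: extend as needed).
SYMBOLS = {5: "B", 6: "C", 7: "N", 8: "O", 9: "F",
           15: "P", 16: "S", 17: "Cl", 35: "Br", 53: "I"}

# Bond order -> SMILES bond symbol; 5 is the aromatic code in the FPT output.
BONDS = {1: "", 2: "=", 3: "#", 5: ":"}

# One or more integers followed by the bit number in angle brackets.
LINE = re.compile(r"^\s*((?:\d+\s+)+)<(\d+)>\s*$")


def fragment_to_smiles(line):
    """Return (smiles, bit) for a line like '0 8 2 6 <623>', else None."""
    match = LINE.match(line)
    if match is None:
        return None  # e.g. the '>' title line
    numbers = [int(tok) for tok in match.group(1).split()]
    bit = int(match.group(2))
    ring_bond, path = numbers[0], numbers[1:]
    # After the first digit, atoms and bond orders alternate.
    atoms, bonds = path[0::2], path[1::2]
    parts = [SYMBOLS[atoms[0]]]
    if ring_bond:
        parts.append("1")  # open a ring closure on the first atom
    for bond, atom in zip(bonds, atoms[1:]):
        parts.append(BONDS[bond] + SYMBOLS[atom])
    if ring_bond:
        parts.append(BONDS[ring_bond] + "1")  # close the ring on the last atom
    return "".join(parts), bit


# A few lines taken from example.txt above.
example = [">", "0 8 2 6 <623>", "0 17 1 6 2 8 <329>", "1 7 1 6 1 6 <225>"]
for text in example:
    result = fragment_to_smiles(text)
    if result is not None:
        print("%s\t%d" % result)
```

Applying the function to every line of example.txt and writing the tab-separated results to output.smi gives a file that the obabel command above can depict, with the bit number as the title of each fragment.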

Tuesday, 17 January 2012

What's up dock? - Calculate the RMSD between docked and crystal poses

In cheminformatics, there are two reasons why one might want to calculate the RMSD between two conformers. The first is to check whether two conformers are very close in structure - e.g. for the purpose of generating a diverse set of conformers. This problem is solved using (1) a least squares alignment followed by (2) calculation of the RMSD.

The other situation is comparing two sets of 3D coordinates to see whether a prediction method has accurately reproduced experimental coordinates (e.g. docking). This just requires step (2) above.

The situation is complicated a little bit by the fact that only "heavy" atoms (i.e. non-H atoms in this context) are typically used to calculate the RMSD. A much greater complication is that automorphisms (well, isomorphisms of two molecules which are identical, to be exact) must be taken into account in both cases above. For example, consider the case where two para-substituted benzene rings must be compared; the RMSD calculation must take into account the fact that a 180 degree flip of the ring might yield a smaller RMSD.

Anyhoo, here's some Pybel code that will calculate the RMSD between a crystal pose and a set of docked poses. The code also illustrates how to access the isomorphisms. You should modify the code for your specific purpose:
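
Independently of any toolkit, the core idea (step (2) only, minimised over the isomorphisms) can be sketched in a few lines of plain Python. This is a hedged illustration, not the Pybel listing referred to above: the function names rmsd and symmetry_aware_rmsd are made up for the example, coordinates are assumed to be heavy-atom (x, y, z) tuples, and the list of mappings is assumed to come from elsewhere (for instance a toolkit's graph isomorphism routine), with mapping[i] giving the index of the pose atom that corresponds to reference atom i.

```python
import math


def rmsd(ref, pose, mapping):
    """RMSD between ref[i] and pose[mapping[i]], with no alignment step."""
    total = 0.0
    for i, j in enumerate(mapping):
        total += sum((a - b) ** 2 for a, b in zip(ref[i], pose[j]))
    return math.sqrt(total / len(mapping))


def symmetry_aware_rmsd(ref, pose, mappings):
    """Smallest RMSD over all supplied atom mappings (isomorphisms)."""
    return min(rmsd(ref, pose, mapping) for mapping in mappings)


# Toy case: two equivalent atoms that are listed in swapped order in the pose.
crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
mappings = [[0, 1], [1, 0]]  # identity, plus the symmetry-related swap

print(rmsd(crystal, docked, [0, 1]))                   # naive, identity only
print(symmetry_aware_rmsd(crystal, docked, mappings))  # minimum over mappings
```

The toy case is the para-substituted benzene problem in miniature: the identity mapping reports a non-zero RMSD even though the two poses are indistinguishable, while taking the minimum over the mappings gives zero.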

Wednesday, 21 December 2011

Open Babel: 10 Years and Future Directions

On the Open Babel mailing list, Geoff looks back on 10 years of the project, and looks forward to the future:
As 2011 draws to a close, Open Babel is over 10 years old! At this point, it's used by over 40 open source projects, downloaded over 200,000 times, and been used in over 400 academic papers. And of course, there have been 15 releases and dozens of contributors.
...Read the rest

See you in 2012!

Monday, 5 December 2011

Poll over and discuss Goslar

So maybe my poll question was not very difficult (given that 50% of you got it right), but I thought it was quite surprising nonetheless when I came across a guest editorial by Steve Heller* in Anal. Chim. Acta in 1982, entitled "Where have all the data gone?":
"Unless something real and practical is done in the near future, it will become impossible to find or use scientific data with the resulting loss of time and money for those who need to repeat experiments."
The future alluded to is the one in which we now live. A follow-up letter, "Computer readable analytical chemical data - comments on a critical need" in Trends in Anal. Chem. discusses this further.

I was put in mind of these articles at the recent German Conference on Chemoinformatics (GCC2011) in Goslar, when (a) I met Steve Heller, and (b) in Prof Johnny Gasteiger's talk, he highlighted this same problem as one of the outstanding challenges that we should be sorting out. PMR of course has been discussing this issue for some time, but it's the first time I'd heard Prof Gasteiger mention it.

Since I'm on the subject of the GCC, it was good to meet several people who I know through the Open Babel mailing lists, and in particular Michael Banck, who plays a major role in curating chemistry software for Debian. For example, see this list of packages. His talk is available on Lanyrd.

In a recent blogpost I mentioned that Open Access makes it easy to redistribute copies of papers, and I wondered why OA journals don't take advantage of this. Well, it turns out they do - Jan Kuras of Chemistry Central was giving out nice colour copies of the Open Babel paper printed in booklet form, along with similar booklets summarising the three series they have recently published on RDF, PubChem3D and PMR's Symposium.

And finally here's a picture of me trying to steal a pretzel from one of the FIZ-Chemie Berlin Award winners, Dr. Volker Dirk Hähnke, who gave a very interesting talk on using sequence alignment methods to align a string representation of a chemical graph:

Footnote:
* You may know of Steve from such string representations as the InChI. Incidentally, I thought I was blazing a trail putting my talks on the web, but check out Steve's page.

Thursday, 1 December 2011

Cinfony 1.1 released

Cinfony presents a common API to several cheminformatics toolkits. It uses the Python programming language, and builds on top of Open Babel, the RDKit, the CDK, Indigo, OPSIN and cheminformatics webservices.

Cinfony 1.1 is now available for download.

The two major additions in this release are support for using the Indigo cheminformatics toolkit (the indy module) and support for OPSIN (IUPAC name to structure, the opsin module).

As usual, Cinfony has been updated to use the latest stable releases of each toolkit: Open Babel 2.3.1, CDK 1.4.5, RDKit 2011.09, Indigo 1.0 and OPSIN 1.1. Installation on Windows has also been simplified somewhat as Open Babel 2.3.1 now includes the necessary .jar file and .NET libraries (for use from Jython and IronPython).

The Cinfony website has a somewhat condensed (and only slightly contrived :-) example showing the use of all of these resources in just 12 lines of Python. Here's a small example showing that roundtripping of IUPAC names is now easy to play with:
>>> from cinfony import opsin, webel
>>> mol = opsin.readstring("iupac",
...                        "1-chloro-2-bromopropane")
>>> print webel.Molecule(mol).write("iupac")
2-Bromo-1-chloropropane

To support Cinfony, please cite:
N.M. O'Boyle, G.R. Hutchison, Chem. Cent. J., 2008, 2, 24. [link]

Friday, 25 November 2011

Your turn - Poll up and answer

It's been a while, so here's a poll (see the sidebar on the left).

In which decade was the following statement made? Guess before you google it, and extra marks if you also know who the author was.
With the ever-increasing number of publications, coupled with higher printing costs, there is great pressure brought on journal editors to keep manuscripts as short as possible. While this is quite understandable, it has, in my opinion, lead to a very serious problem. The vast majority of (hopefully) good analytical data, such as spectroscopic, kinetic and thermodynamic measurements, is never readily made available to the scientific community. Published data are often so "compressed" that one is unable to examine alternative interpretations, as the published data are not sufficient. Partial data are preferred to complete data...

Why doesn't [______] make it a policy to require authors to submit full data on spectroscopic and other data for which there are existing data centres? Furthermore, I propose that the editors of this journal and other such journals establish criteria for collecting relevant data for which no data center exists today in order to prepare for the future. Perhaps it is time for a conference of journal editors to meet and propose a solution to this problem. Unless something real and practical is done in the near future, it will become impossible to find or use scientific data with the resulting loss of time and money for those who need to repeat experiments.
Note:
(1) I've mixed up the spelling of center/re to protect the innocent.
(2) Poll closes in 7 days.
(3) Please - no spoilers in the comments.