Thursday, 4 December 2008

Cinfony paper published in Chemistry Central Journal

My Cinfony paper has just come out:
Cinfony - combining Open Source cheminformatics toolkits behind a common interface NM O'Boyle, GR Hutchison. Chem. Cent. J. 2008, 2, 24.

The paper describes the Why? and How? of Cinfony, shows some examples of use, and discusses performance. To download Cinfony, or for documentation on using Cinfony, please see the Cinfony web site.

Update (05/12/08): Table 3 in the paper caused a stir over at Chem-Bla-ics (Who Says Java is Not Fast and Cheminformatics Benchmark Project #1) and Depth-First (Choose Java for Speed).

Here's the abstract:
Background

Open Source cheminformatics toolkits such as OpenBabel, the CDK and the RDKit share the same core functionality but support different sets of file formats and forcefields, and calculate different fingerprints and descriptors. Despite their complementary features, using these toolkits in the same program is difficult as they are implemented in different languages (C++ versus Java), have different underlying chemical models and have different application programming interfaces (APIs).

Results

We describe Cinfony, a Python module that presents a common interface to all three of these toolkits, allowing the user to easily combine methods and results from any of the toolkits. In general, the run time of the Cinfony modules is almost as fast as accessing the underlying toolkits directly from C++ or Java, but Cinfony makes it much easier to carry out common tasks in cheminformatics such as reading file formats and calculating descriptors.

Conclusion

By providing a simplified interface and improving interoperability, Cinfony makes it easy to combine complementary features of OpenBabel, the CDK and the RDKit.

7 comments:

Geoffrey Hutchison said...

Gee, with all the discussion, this should be a highly-cited paper, right? :-)

Egon Willighagen said...

Give it some time... Should happen... I spotted Noel's own item at the front page of blogs.nature.com :)

BTW, Noel, if you had cited my and Rich's blog items, I'm sure we would have scored high on PostGenomic.com too...

baoilleach said...

I don't think I'm on PG actually - but I think I was just in a bit of a panic to get out this blog post after seeing your items. I'll add links to your posts above as soon as I have a chance.

Anonymous said...

Hi Noel,
I posted some comments on Egon's blog (sorry :-) but besides the elegant and easy approaches using Python (I am basically ignoring what the paper is all about), how could you possibly speed up such calculations by using parallelization concepts?

You did your benchmarks on a 3.2 GHz Dual core and the time for MW calculations was something between 40 and 100 seconds. Using SMILES on a Dual Core 2 GHz it is possible to do that in 8 seconds.
Command Line : cxcalc mass 3_p0.0.smi
Elapsed Time : 00:00:08.453
The SDF file takes 20 seconds.
So with SMILES on a quad-core it could be 4 seconds, or on an 8-core CPU it could be 2-3 seconds.

Would it be easy to parallelize the Cinfony code in Python using threading APIs or SMP multiprocessing functionality?

Ooh and see the cool language benchmarks at
http://shootout.alioth.debian.org


Kind regards
Tobias Kind

baoilleach said...

@Tobias: I saw your comments over at Egon's. The purpose of the comparison was to investigate the overhead associated with using cinfony, a Python module, rather than writing a C++ or Java program to access the toolkits natively. I don't think parallelisation would help answer this question.

Anonymous said...

Noel,
I know, I know, you publish a nice paper about a common interface in Python, and someone picks a single table and from there a story develops :-)

BTW. I would not dismiss my concurrency thoughts, because as an example you say in the paper Cinfony makes it easier to calculate descriptors.

If you start using PubChem with 19,495,751 compounds in SDF files, such parallelization ideas become very important. How long would it take to calculate all masses (MW as a very fast 1D descriptor) for PubChem?

20 seconds for 25,000 compounds =
16,000 seconds for 20,000,000 compounds.
16,000 seconds = 267 min = 4.4 h

Now if you take some of the other 2D or 3D descriptors, which are maybe 10 or 100 times slower - it becomes days. I would like to do my descriptor analysis in minutes or seconds but not hours and days.
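The back-of-the-envelope estimate above can be written as a small function. This is only a sketch: the name `estimate_seconds` and its parameters are made up for illustration, and it assumes run time scales linearly with the number of compounds and that extra workers give an ideal speed-up, so the numbers are optimistic lower bounds.

```python
def estimate_seconds(total_compounds, batch_size=25_000, batch_seconds=20.0,
                     slowdown=1.0, workers=1):
    """Scale a measured batch time linearly to a larger data set.

    Assumes linear scaling with compound count and ideal parallel
    speed-up, so the result is a lower bound, not a prediction.
    """
    batches = total_compounds / batch_size
    return batches * batch_seconds * slowdown / workers


if __name__ == "__main__":
    # Defaults are the figures quoted in the comment: 20 s per 25,000 compounds.
    for slowdown in (1, 10, 100):
        secs = estimate_seconds(20_000_000, slowdown=slowdown)
        print(f"{slowdown:>3}x slower: {secs / 3600:8.1f} h = {secs / 86400:6.2f} days")
```

With the quoted figures this gives 16,000 s (about 4.4 h) for the mass calculation, roughly 1.9 days for a descriptor 10 times slower and about 18.5 days for one 100 times slower, all on a single core.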

If the tools are single threaded, the python interface could provide a parallel wrapper for such purposes (without using cluster computing or MPI utils). So people on a normal quad core workstation could already "feel the speed".
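A minimal sketch of such a parallel wrapper, using only the Python standard library, might look like the following. It is not part of Cinfony: `chunked`, `describe_chunk` and `parallel_describe` are hypothetical names, and `describe_chunk` is a stand-in that just returns string lengths so the example runs without any toolkit installed. In real use it would call a single-threaded toolkit function on each molecule.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def describe_chunk(smiles_chunk):
    """Hypothetical stand-in for a single-threaded descriptor calculation.

    A real version would parse each SMILES string with a toolkit and
    return a descriptor value; this one only returns string lengths.
    """
    return [len(smi) for smi in smiles_chunk]


def parallel_describe(smiles, chunk_size=1000, workers=None):
    """Run describe_chunk on chunks of the input in separate processes.

    Results come back in input order. workers=None lets the executor
    choose a default based on the number of available CPUs.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(describe_chunk, chunked(smiles, chunk_size))
        return [value for chunk in results for value in chunk]
```

Calling `parallel_describe(smiles, workers=4)` from a script (under an `if __name__ == "__main__":` guard, which process pools need on some platforms) would spread the chunks over four processes. Sending whole chunks rather than single molecules keeps the inter-process overhead small relative to the work done per task.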

Furthermore, parallelism using streaming GPUs, as in CUDA for NVIDIA, becomes very interesting. PyCUDA is the Python API wrapper for CUDA.
A colleague of mine (Gert Wohlgemuth) tested PyCUDA and got some nice results using 128 stream processors instead of a dual-core machine. Of course this would require native Python code, but what I want to say is: it is all there, people can use multiple CPU cores, but they don't do it (much).

If you go one step further and use a workflow system like Pipeline Pilot, Taverna or Kepler (and every single drug company uses workflow systems) then it becomes even more convenient. Such parallelization tasks can be performed automatically (without user intervention and depending on the number of CPUs the person has). That is really cool I think.

Our BinBase system uses a Java cluster API for our Rocks Linux cluster (Sun Grid Engine), so time-consuming calculations for mass spectral alignments and identification in metabolomics can be done outside the person's workstation.

I think parallelization is a *big* deal because the speed-up allows computational approaches which were not reasonable to perform before, hence such compute-intensive calculations are possible on normal quad core or oct core workstations and clusters. Hence parallelization becomes an enabler for new discoveries.

That said, I am still waiting for my calculations to finish...

Cheers
Tobias Kind
fiehnlab.ucdavis.edu/staff/kind

baoilleach said...

Tobias, it's not that I don't think performance is important, but just that this wasn't the focus of the paper. The paper was really about enhancing the usability of these toolkits and making their features more widely available.

As you point out, if you're going to be analysing the whole of PubChem, performance is key. In that instance, you're probably not going to want to use Cinfony, although it may be possible to spawn subprocesses for each node (I haven't looked into this area much).

I agree that the trend towards multicore processors is going to make this an increasingly hot topic in future, although I am not convinced that more computational power will yield better answers.