Wednesday, 23 July 2008

Chemistry in R

For many cheminformaticians, R is the preferred way of analysing multivariate data and developing predictive models. However, it is not so widely known that there are R packages available that are directly aimed at handling chemical data.

Over the last few years, Rajarshi Guha (Indiana University) has been doing some nice work integrating the CDK and R. His publication in J. Stat. Soft., "Chemical Informatics Functionality in R", describes the rcdk and rpubchem packages. The rcdk package allows the user to read SDF files directly into R, calculate fingerprints and descriptors, calculate Tanimoto values, view molecules in 2D (JChemPaint) and 3D (Jmol), generate SMILES strings, and access the property fields of file formats. The rpubchem package is focused on downloading compounds, property values and assay data from PubChem. See also articles in R News and CDK News [1], [2], [3].
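To give a flavour of what this looks like in practice, here is a minimal sketch using rcdk together with Rajarshi's fingerprint package: parse two SMILES strings, compute CDK fingerprints, and compare them with the Tanimoto coefficient. Function names are taken from the package documentation and may differ slightly between versions.

```r
# Sketch of an rcdk workflow: parse structures, compute fingerprints,
# and compare them. Assumes the rcdk and fingerprint packages are installed.
library(rcdk)
library(fingerprint)

# Parse two SMILES strings into CDK molecule objects
mols <- parse.smiles(c("CCO", "CCN"))  # ethanol, ethylamine

# Standard (path-based) CDK fingerprints
fps <- lapply(mols, get.fingerprint, type = "standard")

# Tanimoto similarity, courtesy of the fingerprint package
sim <- distance(fps[[1]], fps[[2]], method = "tanimoto")
print(sim)
```

From here the fingerprints drop straight into the usual R machinery (clustering, plotting, modelling) with no export/import step.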

A more recent development is ChemmineR, described in the latest issue of Bioinformatics: "ChemmineR: a compound mining framework for R". The authors appear to be unaware of the earlier work by Rajarshi, and so there is no comparison of available features. However, based on the documentation on their website, it seems that much of the functionality revolves around a type of fingerprint called atom-pair descriptors (APD). SDF files, when read in, are converted to a database of APDs and these can be used for similarity searching, clustering, removal of duplicates and so on. Sets of molecules can be visualised using a web connection to the ChemMine portal (I'm not sure what software is used). According to the documentation, future work will include descriptor calculation with JOELib.
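For comparison, a hypothetical ChemmineR session along the lines described above. The function names (cmp.parse, cmp.similarity, cmp.search, cmp.duplicated) are my reading of the package documentation and may not match every version exactly, and "compounds.sdf" is a placeholder file name.

```r
# Sketch of the ChemmineR atom-pair workflow: an SD file is read in and
# converted to a database of atom-pair descriptors (APDs), which then
# drives similarity searching and duplicate removal.
library(ChemmineR)

# SDF -> atom-pair descriptor database ("compounds.sdf" is a placeholder)
apdb <- cmp.parse("compounds.sdf")

# Pairwise similarity between the first two compounds
cmp.similarity(apdb[[1]], apdb[[2]])

# Search the whole database for neighbours of the first compound
cmp.search(apdb, apdb[[1]], cutoff = 0.4)

# Flag duplicate structures by identical descriptor sets
cmp.duplicated(apdb)
```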

So, there you have it. An exhaustive survey of the two available methods for bringing chemistry into R. Is the time ripe for a cheminformatics equivalent to Bioconductor?


Egon Willighagen said...

Every now and then new chemoinformatics libraries pop up, but none really seem to have the functionality coverage the CDK does. And, most of them are GPL.

J. Fred Muggs said...

It would be nice if the CDK weren't so Java-centric. Cinfony seems to be headed in the right direction to fix this problem, at least for Python. Personally, I think there's a crying need for idiomatic APIs in high-level languages that practitioners are likely to use (pybel, z.B.).

Is Bioconductor the best way to describe what an R cheminformatics platform should be? Heaven knows that microarray analysis is big, but it is just one application area. A chemically-aware R is a good thing, but I would hope we could have a really strong, idiomatic core API in place before the decision is made to press for a single application area as the main thrust of new development.

Rich Apodaca said...

>Personally, I think there's a crying need for idiomatic APIs in high-level languages that practitioners are likely to use (pybel, z.B.).

If you haven't seen them already, you might want to check out Ruby Open Babel and Ruby CDK.

Off topic: Noel, is there any way to enable comments on your blog without using OpenID or Google/Blogger ID?

Noel O'Boyle said...

@jfred: The CDK isn't Java-centric; it's just written in Java. If you look at Rajarshi's R/CDK bridge, that's quite a high-level API. In fact, it reminds me of the Cinfony API. I think you might find that quite an idiomatic (for R) API.

I mention Bioconductor, but I must admit I'm not very familiar with it. The ChemmineR paper also mentions something about it.

z.B. means...for example, right? Assuming I know German would be a mistake :-)

@Rich: I think we need to do more on the OpenBabel side to advertise the Ruby capabilities.

Regarding comments, I simply don't want to deal with spam. However, I've enabled word verification and anonymous commenting for a trial period.

Rajarshi said...

ChemmineR looks interesting, but in the end it is limited to doing descriptor calculations on a set of molecules. Given that JOELib is no longer developed, I wonder why they're basing their work on it?

The whole idea of accessing cheminformatics functionality in R is to make it so that you don't need to go outside the R environment to do stuff (sort of along the lines of: start Emacs once and never leave :). So if all you can do is evaluate descriptors, that could be limiting.

Regarding the Java centricity of the CDK - it's written in Java!

Regarding Bioconductor - I'm not sure that such a thing is justified. How much of cheminformatics modeling is cheminformatics-specific and how much is machine-learning specific?

Regarding idiomatic cheminformatics - I like to think the rcdk and rpubchem packages represent idiomatic R. That's why I didn't use SWIG or anything like that and instead did it manually. And it's also the reason that the CDK API is not fully represented in an idiomatic form.

Rajarshi said...

On a related note, another useful package for cheminformatics in R is fingerprint
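The fingerprint package is pure R (no Java required) and represents fingerprints as S4 objects, so it can be tried on its own. A minimal sketch, constructing two bit-vector fingerprints by hand:

```r
# Minimal use of the fingerprint package: build two 8-bit fingerprints
# directly and compare them with the Tanimoto coefficient.
library(fingerprint)

f1 <- new("fingerprint", nbit = 8, bits = c(1, 2, 3, 5))
f2 <- new("fingerprint", nbit = 8, bits = c(2, 3, 5, 8))

# Tanimoto coefficient: 3 shared bits / 5 bits in the union = 0.6
distance(f1, f2, method = "tanimoto")
```

The same distance() call accepts fingerprints produced by rcdk's get.fingerprint, which is what makes the two packages such a natural pairing.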

Egon Willighagen said...

Dear J. Fred Muggs... CDK is written in Java, so using it in a Java environment is easy, but that is certainly not the only option. People have been using it from other programming languages, such as Python and Ruby.

Integration, like bindings for Python, Ruby, etc., or in other frameworks like KNIME, Taverna, R, is not my personal priority; I have my hands full just maintaining and improving the CDK. Others make these useful contributions, such as bindings for other languages. At some point, I hope that these projects become an integral part of the CDK project.

Anonymous said...

It is so nice that I can import data in SDF format without converting to CSV or TXT when using R for QSAR work. Although good workflow engines (Taverna, Pipeline Pilot, KNIME, etc.) enable almost any integration like that, doing it in a far simpler environment is preferable for me, as it requires no additional learning time.

Unknown said...

Since hope is not a strategy ... I agree with the importance of pragmatic, large-scale, and data-driven solutions.

In other words, call me bold and ignorant, but I am nowadays less interested in API comparisons, only in solutions.

Their strategy is creating solutions using several packages, so I do not see any reason why they should not use JOELib ;-)
If academic people want to choose among OELib derivatives, then I would strongly recommend a modular design using OEChem (academic license) and OpenBabel (GPL), or any other package out there. If the cheminformatics kernel itself is important, then the basic design is questionable, right?

Anyway, please note that OEChem and OpenBabel have many interfaces, and some very experienced developers with industrial experience.

Rajarshi said...

Joerg, I hope you didn't take the comment personally.

But the fact remains that if OSS product X is based on OSS product Y and the community around Y does not provide further development, it forces X either to do the new development itself or else be stuck with whatever Y gives. If X is happy with what Y provides, well and good. But what if X wants more?

I can see license arguments becoming religious - but more important to me is what support I can get from an OSS project - especially one that is not in my core expertise area.

In that sense, if people are looking for OSS cheminformatics toolkits, then OpenBabel or CDK would do. Which one is chosen would obviously depend on platform, license, features, personal preference etc.

Egon Willighagen said...

Regarding maintenance of JOELib... It's GPL, so it cannot be integrated into the CDK. However, there is a relatively small GPL extension of the CDK, with some Weka code by Miguel in there.

If things go next year the way I'd like them to, I would like to port JOELib code into this GPL extension of the CDK (yeah, I need a proper name for it... it's not CDK, as CDK is LGPL, not GPL).

I see mutual benefit: for people using the CDK, a single point of entry; for JOELib, continued maintenance.

In Bioclipse, one would just load both the CDK and the JOELib plugins to allow calculation of both sets of descriptors; rather user-friendly, I think.

Unknown said...

Guys, I hope you know that I do not like black-and-white scenarios. I am not following the NIH (not-invented-here) principle. In contrast, I just want to encourage people to think about the following scenario.

You are in a situation with limited resources in academia or industry and you have to deliver something.
What are you using for processing your data? OSS or commercial source code?
In most cases I would hope that people are combining the best tools out there for providing the most transparent solution with the highest scientific standard. In reality, people are limited by time, money, and knowledge for doing this.

This is exactly the reason why I highly appreciate any effort at merging and integrating tools. If some tools cannot play with each other due to license issues, then there are still multiple possible scenarios.
Any solution requires time, and people in industry and academia willing to spend that time on it; there are multiple examples where extensions have become either a commercial product or a new OSS addition.

Anyway, if tools offer both academic licenses and commercial support, I appreciate this strongly, because people in academia can do what they want (as long as they have no commercial interest) and people in industry get the paid software support they require. In other words, I believe in 'both', as mentioned by
D. C. Weaver, Build vs. Buy vs. Both, Pharmaceutical Discovery, 2005.