Monday 26 September 2011

chemfp 1.0 - Get your fingerprints off this

Andrew Dalke has just released version 1.0 of his chemical fingerprint package, chemfp. The goal of the project is to produce a standardised file format for chemical fingerprints, called FPS, as well as a set of tools built around it. The project itself is a Python module with some superfast C code.
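
To give an idea of what this actually looks like, an FPS file is just line-oriented text: a handful of '#' header lines describing the fingerprints, followed by one record per line consisting of the hex-encoded fingerprint, a tab, and an identifier. Something along these lines (the values below are invented for illustration, and the tab is shown as spaces here):

    #FPS1
    #num_bits=16
    #type=Example-Tiny/1
    #software=Example/1.0
    07de    mol00001
    ffa1    mol00002

Real fingerprints are of course much wider; Open Babel's FP2, for instance, is 1021 bits, so each record carries 256 hex characters.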

Why is this needed? Well, right now, if you wanted to use several different toolkits to generate fingerprints and then compare and contrast their usefulness for some application (e.g. 2D similarity searching), you would have to go to quite some lengths to figure out how to handle each toolkit's own file format. You could of course avoid the file formats completely and use a standard API such as Cinfony, but it's often useful to precalculate the fingerprints and store them in a file (also, you may not have access to the toolkit, only to a file of fingerprints).

So, if you have the appropriate toolkit, you can use chemfp to generate fingerprints in the FPS format. For example, if you have the Python bindings for Open Babel, you can generate the FP2, FP3, FP4 and MACCS fingerprints in FPS format. Other toolkits are supported, namely OEChem and RDKit, with their own fingerprints. Of course, it would make sense for this format to be supported by the toolkits themselves, and indeed Cactvs already supports the FPS format, as does Rajarshi's fingerprint R package. Direct FPS support in Open Babel is also on the cards.
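
To give a flavour of the Python side, here is a minimal sketch that computes FP2 fingerprints via Open Babel and writes the FPS lines out by hand. The filenames are made up, and the function and attribute names are taken from my reading of the chemfp documentation, so treat them as assumptions and check the docs for your version. In practice you would simply run the bundled ob2fps command-line tool, which does all of this (and writes the full set of header lines) for you:

    import binascii
    import chemfp  # chemfp 1.0 is a Python 2 package

    # Assumes read_structure_fingerprints() takes a fingerprint type string and
    # a structure file, and that iterating over the reader yields
    # (id, fingerprint-bytes) pairs -- see the chemfp docs for the exact API.
    reader = chemfp.read_structure_fingerprints("OpenBabel-FP2/1", "drugs.sdf")

    out = open("drugs.fps", "w")
    out.write("#FPS1\n")
    out.write("#num_bits=%d\n" % reader.metadata.num_bits)
    out.write("#type=%s\n" % reader.metadata.type)
    for (id, fp) in reader:
        # Each record is the hex-encoded fingerprint, a tab, then the identifier.
        out.write("%s\t%s\n" % (binascii.hexlify(fp), id))
    out.close()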

What about the tools around this format? Right now, there's only a similarity search tool, but that is already very useful as (for example) it supports "many against many" searches, a feature I have heard requested by several Open Babel users. More tools are on the way though. It's also possible to write your own tools using the Python API. For example, Andrew has written up examples on generating a distance matrix and drawing a cluster dendrogram (in about 20 lines of code, albeit with the help of a few libraries), and on Taylor-Butina clustering (that one maybe 40 lines).
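
To make the "many against many" idea concrete, here is a naive pure-Python sketch of the same job the search tool does: read two FPS files and report the most similar targets for each query. The filenames are invented, and chemfp's C code does this orders of magnitude faster; this is only to show the idea:

    def read_fps(filename):
        # Parse an FPS file into (id, fingerprint) pairs, storing each
        # fingerprint as a Python integer. Header lines start with '#'.
        fps = []
        for line in open(filename):
            if line.startswith("#") or not line.strip():
                continue
            hexfp, id = line.rstrip("\r\n").split("\t")
            fps.append((id, int(hexfp, 16)))
        return fps

    def tanimoto(fp1, fp2):
        # Tanimoto = bits in common / bits in either. The bit order used when
        # building the integers doesn't matter, as long as it's consistent.
        both = bin(fp1 & fp2).count("1")
        either = bin(fp1 | fp2).count("1")
        return float(both) / either if either else 0.0

    queries = read_fps("queries.fps")
    targets = read_fps("targets.fps")
    for qid, qfp in queries:
        hits = sorted(((tanimoto(qfp, tfp), tid) for tid, tfp in targets),
                      reverse=True)
        print("%s -> %r" % (qid, hits[:3]))  # the three nearest targets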

So check it out at http://chem-fingerprints.googlecode.com. In particular, the documentation is excellent.

Image credit: Jack of Spades

Tuesday 13 September 2011

Comments on 5th Meeting on U.S. Government Chemical Databases and Open Chemistry

As previously discussed, I attended and presented at the 5th Meeting on U.S. Government Chemical Databases and Open Chemistry in Frederick, Maryland. Talks were 20 minutes plus 5 for questions, so quite a lot of ground was covered over the two days.

I understand that Marc is going to put up many of the talks on the web somewhere (insert link here from future), but in the meantime you can check out several of the talks over at Lanyrd. The image above is from John Overington's talk (included with permission) and is meant as a joke; in reality the various databases cooperate quite well. In fact there was some talk(ing) of just having a single database maintained by several partners (as the PDB does).

Now some random notes on the meeting...

It's interesting to note that many of the ideas championed by PMR have now gone mainstream. As an example, the database people were very clear that, in this day and age, chemical data should be available directly from publications, rather than having to be read out of the literature (manually or automatically) and re-entered, as ChEMBL (and many others) are doing. Attendees on various editorial boards were planning to try to make this happen, although it was noted that submissions to non-prestige journals decrease with every additional hurdle placed in front of authors. The counterargument was made that providing these data would make the publication more discoverable, something academia is very keen on.

Jonathan Goodman took a break from finding InChIKey collisions to describe some work on the development of RInChI, a variant of InChI for indexing reactions.

Jasmine Young from the PDB pointed out that reviewers have a role in ensuring data quality too; the PDB is an archival site, not a policing site, and journal reviewers should request the validation report from the PDB when reviewing papers with protein crystal structures.

I learnt, at least a bit, about computational toxicology programmes. Ann Richard from the National Center for Computational Toxicology (part of the EPA) described Tox21 and ToxCast, two large screening programmes. Tox21 covers 10,000 chemicals with 50-100 assays, while ToxCast covers 960 chemicals and 500 assays (E&OE). The overall idea is to generate high-quality tox data that can be used to develop tox models and essentially push the whole field forward. QC review of the chemicals bought from suppliers was an essential part of the process; 20-30% of purchased chemicals had "issues" (purity, but also identity).

Nina Jeliazkova described the EU OpenTox programme (see slides at link above). Apparently, this programme has just finished, but it has successfully developed a set of resources, principally a REST interface to a series of webservices for QSAR. Essentially, these webservices can be used to build a model and make predictions, and applications such as ToxPredict and ToxCreate have been built on top of them.

On the way back from the meeting, I decided to stop off at Keflavik airport in Iceland due to smoke in the cockpit. Any recommendations for must-have Icelandic merch for next time? I held back from buying a jar of volcanic ash from Eyjafjallajökull, and went for the liquorice instead.