Friday, 22 June 2007

4th Joint Sheffield Conference on Chemoinformatics

I recently attended the 4th Joint Sheffield Conference on Chemoinformatics, and enjoyed it a lot.

Some of the things I noted were:
(1) Industry talks have suddenly got more interesting (than they were, that is). After describing methods run on their own in-house data, they suddenly say "In order to compare with other methods, I ran this method on a publicly available dataset". Great. About time. The big free datasets are now so well established, they've even heard about them in industry. Thanks go to ZINC, DUD, PubChem, and the NIH guys who are even making available some HTS data (is this correct? I cannot easily find it on the web).

(2) In this postmodern age, it is now a requirement for all cheminformatics conferences to start with a talk that tells us we're wasting our time trying to dock anything, as it doesn't work. Full marks for shock value, but perhaps the more interesting content of Anthony Nicholls' talk was the statistical comparison of AUC values (area under the ROC curve) for a published study of multiple docking problems (Warren et al.). Basically, the error bars are so large that we cannot say that any program is significantly different from any other (according to him :-) [disclaimer: I'm developing GOLD]).

(3) Of course, no cheminformatics conference would be complete without dodgy statistics, and I'm as guilty of it (not knowingly, I hope) as anyone. Multiple tests on the same dataset require a correction to the significance threshold, such as the Bonferroni correction - "if it passes the Bonferroni it's probably true" was the quote from Martin Packer (AZ). Everyone wrote that one down when it was mentioned. But for the cheminformaticians who skipped Statistics 101, there was extra homework. Jonathan Hirst directed us to read the appendix of one of his papers for some more light reading on hard-core statistics such as the Nemenyi test and the improved Friedman statistic.

(4) Open Source chemistry software got a mention from some of the academic speakers. Jonathan Hirst in particular gave JOELib2 a big thumbs up, and made it clear that his own software is Openly available from his web page (although no license is mentioned in the README there). The author of JOELib2, Joerg Kurt Wegner, was in the audience, and it would have been nice if the speaker had put Joerg's name on his slide along with the name of the program and the website. After all, it's nice to get some personal recognition if you put a lot of work into such a program and then make it Openly available. Jonathan had to skip his next-to-last slide promoting the Blue Obelisk group, but it was still good to see the reference flash by. Irilenia Nobeli used the CDK, as did David Wild, who is very active in the development of Web services with open source software.
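
To make points (2) and (3) a bit more concrete, here is a minimal sketch of how one might check whether AUC differences between docking programs survive a multiple-testing correction. Several things here are assumptions and not taken from any of the talks: the Hanley-McNeil approximation is used for the standard error of an AUC, the programs are treated as independent (AUCs measured on the same dataset are correlated, so this is only a rough check), and the program names, AUC values, and dataset sizes are invented for illustration.

```python
import math
from itertools import combinations


def auc_standard_error(auc, n_actives, n_decoys):
    """Hanley-McNeil approximation to the standard error of an AUC."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_actives - 1) * (q1 - auc ** 2)
        + (n_decoys - 1) * (q2 - auc ** 2)
    ) / (n_actives * n_decoys)
    return math.sqrt(variance)


def two_sided_p(auc_a, auc_b, n_actives, n_decoys):
    """Two-sided p-value for a z-test on the difference of two AUCs.

    Assumes the two AUC estimates are independent, which is a
    simplification when both come from the same dataset.
    """
    se_a = auc_standard_error(auc_a, n_actives, n_decoys)
    se_b = auc_standard_error(auc_b, n_actives, n_decoys)
    z = (auc_a - auc_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that pass the Bonferroni-corrected threshold alpha/m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical AUCs for three programs on one target (20 actives, 1000 decoys).
aucs = {"program_a": 0.72, "program_b": 0.68, "program_c": 0.65}
pairs = list(combinations(sorted(aucs), 2))
p_values = [two_sided_p(aucs[a], aucs[b], 20, 1000) for a, b in pairs]
flags = bonferroni_significant(p_values)

for (a, b), p, flag in zip(pairs, p_values, flags):
    print(f"{a} vs {b}: p = {p:.3f}, significant after Bonferroni: {flag}")
```

With only 20 actives, the standard error on each of these made-up AUCs is around 0.065, so none of the pairwise differences comes close to significance even before the correction is applied.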

8 comments:

James said...

Re: the Nottingham code using JOELib2. The TMACC code is available under the GPL, just like JOELib. I've updated the webpage and the README to explicitly state the licence. Cheers.

Egon Willighagen said...

In reply to the comments on docking: the last time I saw such plots, at the German Chemoinformatics Conference, I was amazed by the prediction errors, and I am happy that more people think so too :) I also found in my own research that those error bars apply to many QSAR and QSPR studies (see ). Below 100 objects you can forget about purely statistical validation giving reasonable results. The error margin on, for example, R^2/Q^2 is on the order of 0.05. Any model that does not have an R^2 or Q^2 more than 0.05 better is not significantly better. The more objects in your data set, the smaller this error margin (obviously).
(BTW, I only learned about ROC curves about a year ago... too late for my research; it seems really useful.)

Thanx for the links to the other statistics. Sounds interesting.

Good to see further adoption of open source here too. Did you talk with Joerg about the state of JOELib2?

Egon Willighagen said...

Sorry, forgot to add the DOI for that paper: 10.1021/ci050282s.

baoilleach said...

@james: Great, thanks for clarifying that

@egon: I don't think people realise that all docking programs perform well on some targets and not so well on others (and it's different for each program). What does 'prediction error' mean in this context, though? Anthony Nicholls wants to improve the average AUC of a ROC curve across all targets. I think this is reasonable. However, even then, since the AUC of a ROC curve is directly related to the average rank of an active (as derived in the BEDROC paper), maybe the AUC value itself already hides some useful information, i.e. which is better: to predict some actives in the top 1% and some in the bottom 1%, or all actives in the centre around 50%?
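
A small sketch of that last point, assuming a ranked list with no ties where rank 1 is the best score. The formula follows from counting, for each active, the decoys ranked above it; the two rankings and the function names are invented for illustration.

```python
def auc_from_active_ranks(active_ranks, n_total):
    """AUC computed from the ranks of the actives (rank 1 = best, no ties).

    AUC = 1 - (mean_rank - (n_actives + 1) / 2) / n_decoys
    """
    n_actives = len(active_ranks)
    n_decoys = n_total - n_actives
    mean_rank = sum(active_ranks) / n_actives
    return 1.0 - (mean_rank - (n_actives + 1) / 2.0) / n_decoys


def actives_in_top(active_ranks, cutoff_rank):
    """Number of actives ranked at or above the given cutoff rank."""
    return sum(1 for rank in active_ranks if rank <= cutoff_rank)


n_total = 1000  # hypothetical library: 10 actives, 990 decoys

# Half of the actives at the very top, half at the very bottom.
split = [1, 2, 3, 4, 5, 996, 997, 998, 999, 1000]
# All actives clustered around the middle of the list.
centred = list(range(496, 506))

for name, ranks in (("split", split), ("centred", centred)):
    auc = auc_from_active_ranks(ranks, n_total)
    top = actives_in_top(ranks, 10)  # top 1% of 1000 compounds
    print(f"{name}: AUC = {auc:.3f}, actives in top 1% = {top}")
```

Both rankings have the same mean active rank (500.5) and therefore the same AUC of 0.5, yet the first one puts five actives in the top 1% and the second puts none there.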

I spoke briefly with Joerg about Joelib2, and other things. I think at the moment he is busy getting to grips with his new job.

Joerg Kurt Wegner said...

Besides the things already mentioned, David also highlighted Rajarshi's and Egon's support for the Web 2.0 work presented.

And there was a critical and legitimate question from Anthony Nicholls: why is this being done? The problem is that no company will ever submit molecular structures to the WWW, because this is just not safe and violates intellectual property protection. There he is absolutely right, and he was also the one who freaked out the whole docking community by giving a brilliant and refreshing talk.

David answered that education is the main goal, and the next time anyone asks, we should not forget that setting standards is another goal. Also, companies can always use intranet solutions built on the same technology. So there is indeed a reason for working on Web 2.0 technologies. And 'the problem' of companies is precisely the reason why universities should work on this. As David already said, such combinations might lead to very innovative ways of looking at data.

Joerg Kurt Wegner said...

Two PhD students from my former institute have taken over responsibility for JOELib2, but it might take a while for them to become productive. They told me that they have already done some GUI and mining extensions, but those things are still internal (I guess for publishing reasons).

As said, I am completely busy with my PostDoc and my job at Tibotec. Extending JOELib2 is not an active part of it, not at the moment. And I think this status will not change for a while, because I have higher priorities elsewhere. And, as usual, I am working on them very seriously.

chris said...

" I don't think people realise that all docking programs perform well on some targets and not so well on others (and it's different for each program)."

You should also add that it is not possible in advance to tell which program will work best for which target.

Also for targets with known ligands it would be interesting to see how the docking/scoring tools compare with the use of simple 2D descriptors.

Mukesh Yadav said...

"Docking Tools need cross validation and a essential set of input parameters concerned with the job assigned." i have used GOLD and many others and found what u offer to calculate using a statistical method should be compatible under limitations using the extreme of it...
Ultimately it's you who decides how you want to use the software...