Tuesday, 20 November 2012

What's taking so long? - Profiling Open Babel

Profiling code shows where all the runtime is spent. In the case of Open Babel, profiling is a bit awkward due to its use of dynamically-loaded libraries (the format plugins, etc.). So here's how you do it on Linux...
[openbabel/build]$ rm CMakeCache.txt 
[openbabel/build]$ CXXFLAGS="-pg" LDFLAGS="-pg" cmake ../trunk -DCMAKE_INSTALL_PREFIX=../tree -DCMAKE_BUILD_TYPE=DEBUG -DBUILD_SHARED=OFF
[openbabel/build]$ make
This should successfully compile the library and plugins with profiling information, but will fail when it comes to linking one of the executables:
[openbabel/build]$ make obabel
[100%] Built target openbabel
Scanning dependencies of target obabel
[100%] Building CXX object tools/CMakeFiles/obabel.dir/obabel.o
Linking CXX executable ../bin/obabel
/usr/bin/ld: dynamic STT_GNU_IFUNC symbol `strcmp' with pointer equality in `/usr/lib/../lib64/libc.a(strcmp.o)' can not be used when making an executable; recompile with -fPIE and relink with -pie
collect2: error: ld returned 1 exit status
make[3]: *** [bin/obabel] Error 1
make[2]: *** [tools/CMakeFiles/obabel.dir/all] Error 2
make[1]: *** [tools/CMakeFiles/obabel.dir/rule] Error 2
make: *** [obabel] Error 2
Yikes! Using VERBOSE=1 we can see the offending command:
[openbabel/build]$ VERBOSE=1 make obabel 
cd /home/noel/Tools/openbabel/profile/build/tools && /usr/local/bin/cmake -E cmake_link_script CMakeFiles/obabel.dir/link.txt --verbose=1
/usr/local/bin/c++   -static -pg  -g -g3 -fno-inline   -pg CMakeFiles/obabel.dir/obabel.o  -o ../bin/obabel -rdynamic ../src/libopenbabel.a -Wl,-Bstatic -lpthread -Wl,-Bdynamic -lm -lz -Wl,-Bstatic 
/usr/bin/ld: dynamic STT_GNU_IFUNC symbol `strcmp' with pointer equality in `/usr/lib/../lib64/libc.a(strcmp.o)' can not be used when making an executable; recompile with -fPIE and relink with -pie
collect2: error: ld returned 1 exit status
With Roger's help, I was able to change this to something simpler which will compile:
[build/tools]$ /usr/local/bin/c++ -pg  -g -g3 -fno-inline   -pg CMakeFiles/obabel.dir/obabel.o  -o ../bin/obabel ../src/libopenbabel.a -lpthread -lm -lz
Success is mine AT LAST!! Now let's profile it:
[build/bin]$ export BABEL_DATADIR=wherever
[build/bin]$./obabel bigfile.smi -onul
[build/bin]$ gprof ./obabel > gprof.out
Now time to read the gprof manual.

We can design molecular wires for you wholesale Part II

As described previously, last year I published a paper in J. Phys. Chem. C with Geoff Hutchison on Computational Design and Selection of Optimal Organic Photovoltaic Materials.

A week or two ago someone emailed me for a copy of the paper. Fortunately, after a one year embargo, if you have an open access mandate you can request permission from the editor of an ACS journal to deposit the PDF in an institutional repository.

I went through this process a little while ago, and I am pleased to say that a copy of the PDF can now be found in University College Cork's institutional repository at http://hdl.handle.net/10468/748.

I'm not sure how people are going to find this though - it doesn't seem to be very prominent in Google. In particular, the PDF is not indexed (google search with filetype:pdf).

Friday, 9 November 2012

Tricks with SMILES and SMARTS Part II

A much-underused feature of SMILES is the ability to apply 'atom classes' to atoms using a colon (inside square brackets). So, for example, CC and C[CH3:6] both represent ethane, but in the latter case one of the carbons is labelled as being a member of atom class 6.

So what's the meaning of atom class 6? Well, it's whatever you want - it's simply a label that you use to indicate some related information. For example, you might want to record reaction locations, or locations of common substitutions, or mappings between different molecules (reactant/product, or sub/superstructures).

Anyhoo, here's how you access the atom class information in Open Babel from Python:
>>> import pybel
>>> ob = pybel.ob
>>> mol = pybel.readstring("smi", "C[CH3:6]")
>>> print mol.write("smi")
CC

>>> print mol.write("smi", opt={"a":True})
C[CH3:6]

>>> data = ob.toAtomClassData(mol.OBMol.GetData("Atom Class"))
>>> data.HasClass(1)
False
>>> data.GetClassString(1)
''
>>> data.HasClass(2)
True
>>> data.GetClassString(2)
':6'
>>>

Thursday, 8 November 2012

Plotting accesses on the axis Part III

Following on from Parts I and II, this is the last in a series of posts exploring how journal access statistics provided by Open Access journals such as Journal of Cheminformatics give an insight into relative impact of papers.

It's now over a year since the Blue Obelisk and Open Babel papers were published so let's look at the accesses over that period: ...close to straight lines with a defined slope.

And here's an update of the view over the first month, this time including the Universal SMILES paper: ...it seems like the accesses to this paper mirror almost exactly those of the Blue Obelisk paper.

In conclusion, it would be nice if journals provided these sorts of graphs, or if some third-party website (e.g. one of altmetrics ones) did it. All of the data is on the website; it just needs to be collated as I've done here.

Wednesday, 7 November 2012

Tricks with SMILES and SMARTS

Because of the relationship between SMILES and SMARTS, there are some fun tricks you can do (for some value of fun). For example, over at the NextMove blog I have written about creating a substructure hierarchy (here and here).

Here's another example I came up with in response to a recent question on the OB mailing list from Pascal Muller:
Having a molecule (let's say ethylpyridine CCc1cccnc1) and its scaffold (pyridine c1ccncc1), I would like to create a generic scaffold (smarts) for substructure searches: considered atoms become "any" atom (*), and bonds becomes "any" bond (~).
I.e., the smarts should be  CC*~1~*~*~*~*~*~1 (parts not belonging to the scaffold don't change).

Is there a way in Pybel to mutate atom / bond into "*~" apart from string replacement in the smiles? I anticipate problems with brackets doing so.
And my reply:
Atoms with atomic number 0 are written as *. Bonds with BO 4 are
written as $. So...the following hack may work for you in most cases.
:-)

>>> import pybel
>>> mol = pybel.readstring("smi", "CC(=O)Cl")
>>> mol.atoms[0].OBAtom.SetAtomicNum(0)
>>> print mol
*C(=O)Cl
>>> mol.atoms[2].OBAtom.SetAtomicNum(0)
>>> print mol
*C(=*)Cl
>>> bond = mol.OBMol.GetBond(2, 3)
>>> bond.GetBO()
2
>>> bond.SetBO(4)
>>> print mol
*C($*)Cl
>>> print mol.write("smi").replace("$", "~")
*C(~*)Cl
It probably isn't a general solution, but tricks like these can go a long way to solving a problem in many cases.

Wednesday, 31 October 2012

A non-random method to improve your QSAR results - works every time! Part II

In an earlier post, I argued that when developing a predictive QSAR model that non-random division of a QSAR dataset into training and test is a bad idea, and in particular that diversity selection should be avoided.

Not everyone agrees with me (surprise!). See for example this excellent review by Scior et al in 2009 on "How to Recognize and Workaround Pitfalls in QSAR Studies: A Critical Review". Well, the paper is excellent except for the bit about training/test set selection which explicitly pours cold water on random selection in favour of, for example, diversity selection, hand selection or *cough* rational selection.

I had vague thoughts about writing a paper on this topic, but now there's no need. A paper by Martin et al has just appeared in JCIM: "Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling?"

And the answer? No.

Combining their discussion with my own randomly-selected thoughts, the reasons it's a bad idea can be summarised as follows:
  1. You are violating the first rule of statistical testing - the training and test sets must be chosen in the same way (this was pointed out to me by a bioinformatician on FF - sorry I can't recall whom).
  2. All the weird molecules are sucked into your training set
  3. Every item in the test set is going to be close to an item in your training set, and your internal predictions are going to be overoptimistic compared to reality. (You do care about reality, don't you?)
Dr Kennard-Stone has a lot to answer for, but hopefully this is the final nail in the coffin (it being Halloween after all) for diversity selection in QSAR.

Update (7/11/12): In the comments, George Papadatos points out that "There's also related evidence that sophisticated diversity selection methods do not actually perform better than random picking".

Thursday, 25 October 2012

Open Access and Ireland

It's Open Access week, so let's see how Ireland (ROI) is shaping up.

The good news first. The Government has just this week announced a national OA mandate, presumably following on the recent happenings in the UK. Indeed, most of the national funding agencies have had their own OA mandates for several years, but this formalises this policy at a national level.

So let's check out the OA output of the various universities and see whether taxpayers are getting their money's worth. Each university has its own institutional repository, and these are collected in a national repository, Rian. I generated the following bar charts using the statistics on deposited journal articles in Rian, combined with academic staff numbers taken from the Wikipedia pages for each university.

Are academics in TCD really 9.5 times more productive than their peers in UCC? Possibly not. I think it likely that OA deposition rates across most Irish universities could do with improvement.