Friday, 20 February 2009

Repeat after me - for loop in Python

for loops allow you to repeat the same set of instructions multiple times with a different value of a particular variable.

How do I loop in Python?


The Python for loop has the following general form:
for myvar in myiterator:
    # do something with myvar

How do I loop over a list?


If you are coming from other languages such as C, you might be used to:
for i in range(len(mylist)):
    myvar = mylist[i]
    # do something with myvar

In Python, it's much better (that is, the intention is clearer) if you do the following instead:
for myvar in mylist:
    # do something with myvar

If you suddenly realise you want the value of the index i, you can easily adapt the previous loop:
for i, myvar in enumerate(mylist):
    # do something with myvar
    # do something with i
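
As a small illustration of the enumerate version, here is a sketch with a made-up list (the contents of mylist and the names lines and numbered are just for this example). If you'd rather count from 1, enumerate also takes a start value as its second argument (Python 2.6 and later).

```python
mylist = ["alpha", "beta", "gamma"]  # made-up data for illustration

lines = []
for i, myvar in enumerate(mylist):
    lines.append("%d: %s" % (i, myvar))

# enumerate(mylist, 1) starts counting at 1 instead of 0
numbered = ["%d: %s" % (i, myvar) for i, myvar in enumerate(mylist, 1)]
```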

How do I loop over two lists simultaneously?

Let's say you have a list of people's first names and another list containing their surnames, and you want to write them on the screen. Don't do the following:
for i in range(len(firstnames)):
    print("%s %s" % (firstnames[i], surnames[i]))

Instead, you should make the intention of the code clear by using zip as follows:
for firstname, surname in zip(firstnames, surnames):
    print("%s %s" % (firstname, surname))
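
One thing worth knowing about zip is that it stops as soon as the shortest list runs out. The following sketch uses made-up names, with one surname deliberately missing, to show what happens:

```python
firstnames = ["Ada", "Alan", "Grace"]   # made-up data
surnames = ["Lovelace", "Turing"]       # deliberately one item short

fullnames = []
for firstname, surname in zip(firstnames, surnames):
    fullnames.append("%s %s" % (firstname, surname))

# "Grace" is silently dropped because surnames has only two items
```

If the two lists are supposed to be the same length, it may be worth checking len(firstnames) == len(surnames) before looping.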

Loop pattern #1 - Building a new list from an old

newlist = []
for mynum in mylist:
    newnum = mynum*2
    newlist.append(newnum)

In the case of this simple example, it should be done as a list comprehension instead:
newlist = [mynum*2 for mynum in mylist]

Here is a slightly more complicated example, involving an "if" statement that acts as a filter for even numbers:
newlist = []
for mynum in mylist:
    if mynum % 2 == 0:
        newnum = mynum*2
        newlist.append(newnum)

Again, this can be done instead as a list comprehension:
newlist = [mynum*2 for mynum in mylist if mynum % 2 == 0]
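
To convince yourself that the two forms really are equivalent, here is a quick sketch with made-up numbers (the names newlist_loop and newlist_comp are only used here to keep the two results apart):

```python
mylist = [1, 2, 3, 4, 5, 6]  # made-up data

# Explicit loop with a filter
newlist_loop = []
for mynum in mylist:
    if mynum % 2 == 0:
        newlist_loop.append(mynum * 2)

# Equivalent list comprehension
newlist_comp = [mynum * 2 for mynum in mylist if mynum % 2 == 0]
```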

Loop pattern #2 - Summing things up

Another common pattern is totalling up the items in a list:
total = 0
for mynum in mylist:
    total += mynum

Although in this case, it would be easier to just use:
total = sum(mylist)

One thing to be careful of, though, is that using "+" to build up a long string in a loop is slow. If speed is important, avoid the following:
longstring = ""
for myvar in mylist:
    # Create smallstring here somehow
    longstring += smallstring

Instead use pattern #1 to create a list of strings, and join them at the end:
stringlist = []
for myvar in mylist:
    # Create smallstring here somehow
    stringlist.append(smallstring)
longstring = "".join(stringlist)
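
Here is a concrete sketch of the join pattern. The way smallstring is built here is just a stand-in I made up for "create smallstring somehow":

```python
mylist = [3, 1, 4, 1, 5]  # made-up data

stringlist = []
for myvar in mylist:
    smallstring = "item %d\n" % myvar  # stand-in for "create smallstring somehow"
    stringlist.append(smallstring)
longstring = "".join(stringlist)

# The same thing, passing a generator expression straight to join
longstring2 = "".join("item %d\n" % myvar for myvar in mylist)
```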

Loop pattern #3 - Filling in a dictionary

A common pattern is to update information in a dictionary using a loop. Let's take an example of counting up how many people have a particular firstname given a list of firstnames.

The problem here is that it is necessary to check whether a particular key is in the dictionary before updating it. This can lead to awkward code like the following:
name_freq = {}
for firstname in firstnames:
    if firstname in name_freq:
        name_freq[firstname] += 1
    else:
        name_freq[firstname] = 1

Now while this can be improved somewhat by using name_freq.get(firstname, 0), that's getting a bit complicated (and doesn't extend to dictionaries of lists). Instead you should use a defaultdict, a special dictionary that has a default value, as follows:
from collections import defaultdict
name_freq = defaultdict(int)
for firstname in firstnames:
    name_freq[firstname] += 1
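
For comparison, the name_freq.get(firstname, 0) version mentioned above would look something like the following sketch (the list of names is made up):

```python
firstnames = ["Ann", "Bob", "Ann", "Cat", "Ann"]  # made-up data

name_freq = {}
for firstname in firstnames:
    # get() returns 0 when the key is not in the dictionary yet
    name_freq[firstname] = name_freq.get(firstname, 0) + 1
```

It works, but you have to mention name_freq[firstname] twice, and there is no equally neat trick when the values are lists.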

And what about where you wanted to store the corresponding surnames in a dictionary by firstnames? Use a defaultdict(list), of course:
from collections import defaultdict
same_firstname = defaultdict(list)
for firstname, surname in zip(firstnames, surnames):
    same_firstname[firstname].append(surname)
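
One gotcha with defaultdict: simply looking up a missing key with square brackets creates an entry for it. The sketch below, using made-up names, shows the effect and a way around it:

```python
from collections import defaultdict

firstnames = ["Ann", "Bob", "Ann"]        # made-up data
surnames = ["Smith", "Jones", "Murphy"]

same_firstname = defaultdict(list)
for firstname, surname in zip(firstnames, surnames):
    same_firstname[firstname].append(surname)

# Looking up a missing key with [] creates an empty entry as a side effect...
before = len(same_firstname)
same_firstname["Zoe"]
after = len(same_firstname)

# ...so use "in" (or .get()) when you only want to check
has_dave = "Dave" in same_firstname
```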

Loop pattern #4 - Don't use a loop


Think dictionary, set, and sort. A lot of tricky algorithms can be implemented in a few lines with one or two of these guys.

A trivial example is finding unique items in a list: set(mylist). Want to check whether that genome sequence only contains 4 different letters? Use: assert len(set(mygenome)) == 4
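
Note that the length check only counts distinct letters; it doesn't care which letters they are. Here is a sketch with a made-up sequence, plus a stricter variant (my own addition, not part of the original check) that uses a subset test:

```python
mygenome = "ACGTTGCAAC"  # made-up sequence

unique_letters = set(mygenome)

# Exactly four distinct letters, as in the assert above
assert len(unique_letters) == 4

# A stricter check: every letter must be one of A, C, G or T
only_dna = unique_letters <= set("ACGT")
```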

The following example uses sort. Given a set of docking scores for 10 poses, find which poses have the top three scores. Here's a solution using the so-called decorate-sort-undecorate paradigm:
# Decorate ("stick your data in a tuple with other stuff")
tmp = [(x,i) for i,x in enumerate(pose_scores)]
# Sort (uses the items in the first position of the tuple)
tmp.sort(reverse=True)
# Undecorate ("get your data back out of the tuple")
top_poses = [z[1] for z in tmp]

print(top_poses[:3])
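
Here is the same thing run on some made-up scores, together with an alternative that sorts the indices directly using a key function (a different technique from decorate-sort-undecorate, shown only for comparison). If two poses have the same score, the two approaches may list the tied poses in a different order.

```python
pose_scores = [3.2, 7.5, 1.1, 9.8, 5.0, 7.9, 2.4, 6.3, 0.5, 4.4]  # made-up scores

# Decorate-sort-undecorate, as above
tmp = [(x, i) for i, x in enumerate(pose_scores)]
tmp.sort(reverse=True)
top_poses = [z[1] for z in tmp]

# Alternative: sort the indices, using the score as the sort key
by_key = sorted(range(len(pose_scores)), key=lambda i: pose_scores[i], reverse=True)
```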



Image credit: Loop by Panca Satrio Nugroho (CC BY-ND 2.0)

Thursday, 19 February 2009

Extract of chemical - Paper on OSRA published

It seems like only last week that another paper was published on Optical Chemical Recognition software. This week it's the turn of OSRA, the work of Igor Filippov, which has previously been discussed in several posts on this blog:
Igor V. Filippov and Marc C. Nicklaus. Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution. J. Chem. Inf. Model. 2009, In press.

Unlike all other existing software of this type, OSRA is open source. You can get it right now from its web site.

One interesting point raised in the paper is how best to assess performance. It appears that a Tanimoto-type fingerprint similarity is heavily penalised if even a single error is made. This is because an error such as replacing an O with a C, although visually minor and easily corrected in a drawing package, is highly significant chemically. Perhaps a simple count of (bonds in common)/(union of bonds) would suffice, although this would require mapping bonds from the original structure to the recognised structure (the new SMSD tool by Syed Asad Rahman from Janet Thornton's group might help here).
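
To make the suggested measure concrete, here is a rough Python sketch. It assumes the hard part (mapping atoms between the two structures) has already been done, so that each bond can be written as a tuple of (lower atom index, higher atom index, bond order). That representation and the function name bond_overlap are my own assumptions for illustration, not anything taken from the paper or from OSRA.

```python
def bond_overlap(bonds_a, bonds_b):
    """Return (bonds in common) / (union of bonds) for two sets of bonds."""
    bonds_a, bonds_b = set(bonds_a), set(bonds_b)
    union = bonds_a | bonds_b
    if not union:
        return 1.0  # two structures without bonds are treated as identical
    return len(bonds_a & bonds_b) / float(len(union))

# Hypothetical example: each bond is (lower atom index, higher atom index, order).
# The recognised structure has one bond order wrong.
original = {(0, 1, 1), (1, 2, 2), (2, 3, 1)}
recognised = {(0, 1, 1), (1, 2, 1), (2, 3, 1)}
score = bond_overlap(original, recognised)
```

To make the score sensitive to a wrong element as well, the element symbols of the two atoms could be added to each bond tuple.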

Friday, 6 February 2009

New software for OChRe - ChemReader

The good news is that the field of Optical Chemical Recognition (OChRe) has just received a new entry, ChemReader:
J Park, GR Rosania, KA Shedden, M Nguyen, N Lyu, K Saitou. Automated extraction of chemical structure information from digital raster images. Chem. Cent. J. 2009, 3, 4.

According to the test set used (and it's not stated whether the test set was also used to train the software), the results are very good, and not just for ChemReader; OSRA is doing pretty well too.

And the bad news? Rajarshi recently discussed the phenomenon of publishing papers describing software which can neither be purchased nor obtained for free. This is also the case for ChemReader. Because the software is not open source but relies on the open source libraries GOCR and GREYCstoration (also used by OSRA), it cannot be made available. However, the authors are "planning that ChemReader becomes commercially available in the near future, with removal of open source parts". Doesn't that mean that the final performance of the software, as well as the algorithms used, will be different from that described in the paper?

Tuesday, 3 February 2009

Of OChRe, OSRA and OASA (but not OSCAR) Part II

In an earlier post I showed a way of testing the performance of an optical chemical recognition (OChRe) program, OSRA.

Since then, a new version of OSRA has been released (1.1.0). In addition, a new version of OASA has been released which can handle multiple molecules (this caused problems with the earlier test where a disconnected molecule was produced).

The results are shown here.

Notes:
(1) I used Xe to indicate an unknown atom type (only occurs once).
(2) If OSRA detects multiple molecules in the original image, a multi-mol SDF is created where the coordinates do not correspond to the original location in the image. Where this occurred here, I have just depicted the first molecule.

Friday, 30 January 2009

TwistyMol is dead - Long live TwirlyMol

My first attempt at a Javascript molecular viewer culminated in TwistyMol. For TwistyMol, I took as my starting point processing.js and consequently ran into some difficulties on IE, which doesn't support Canvas.

TwirlyMol is my attempt to start over with a browser-independent Javascript vector graphics library. I googled a bit and came across some pointers on Stack Overflow (a Q&A website that does things in a new way). I chose to go ahead with dojox.gfx, which is an experimental component of the Dojo Javascript library. On IE, it uses VML, while on Firefox, it uses SVG (Canvas is also supported). I reckoned that this might improve performance on IE. Also, Dojo is likely to be better supported going forward, unlike processing.js, which is essentially a one-man job (sure, the one man is the author of jQuery, but still) and whose goal is quite different.

So, here's the demo.

It's ready to use for whatever you want. If you think about it, the code behind TwirlyMol can easily be adapted to displaying other types of 3D data, for example principal components graphs. What else would be useful? That's up to your imagination (comments below welcome).

And I think this draws to a close my involvement in Javascript molecular viewers. Hope y'all had fun. I've put the code up on Github just to see what happens. Maybe someone wants to add perspective.

Notes:
(1) Colours of atoms taken from the Blue Obelisk Data Repository. Even if you don't understand what the Blue Obelisk is about, I'm sure you can appreciate the benefit of shared chemical resources like these data.
(2) IE performance is still much slower than Firefox, but is better than with TwistyMol.
(3) With Firefox, adding a large number of SVG molecules to a page is much slower than with Canvas. If doing this in practice, you'll need to add them in chunks of 50 or so, with a setTimeout() to add the next 50.
(4) Coding Javascript for IE really is frustrating.
(5) If you use mouse gestures, rotating or zooming with the right mouse button can trigger a gesture event. One way around this would be to implement support for modifier keys so that shift+left button causes right button behaviour, while alt+left would cause middle button behaviour. This would be quite easy as the dojox.gfx library has a function for testing whether shift is held down (I can't remember about the alt).

Thursday, 22 January 2009

Turn, turn, turn - TwistyMol ready for action

A new version of TwistyMol is available. It's now ready for use by others (that means YOU). Here's a simple example and here's another with a few more molecules.

What's changed since last time? Well I've added in an SDF file parser, made it possible to have multiple TwistyMols on the same page, sorted out the rotation (at least for the moment) and gone a bit overboard on the shadows. Performance on IE is still a big problem, and displaying more than a handful of molecules at the same time is out of the question (too slow). I still don't know if this is an intrinsic problem with VML or just the whole Canvas to Excanvas to VML conversion.

If you want to use it yourself, download the Javascript files referenced in the HEAD section of the demo page, and look at the code on the demo page itself. The only tricky part is getting the SDF file into Javascript - you'll either have to place it on a webserver and use an Ajax call or do as I did and convert it to a Javascript string variable (hint: read() it in with Python and convert \n to \\n).
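
The Python hint could be done along the following lines. This is only a sketch: the function name sdf_to_js, the Javascript variable name sdf and the file name mymol.sdf are all made up for illustration. As well as converting newlines, it escapes backslashes and double quotes, just in case they turn up in the file.

```python
def sdf_to_js(sdf_text, varname="sdf"):
    """Return a Javascript statement assigning sdf_text to a string variable."""
    escaped = sdf_text.replace("\\", "\\\\")   # escape backslashes first
    escaped = escaped.replace('"', '\\"')      # then double quotes
    escaped = escaped.replace("\r\n", "\n").replace("\n", "\\n")  # then newlines
    return 'var %s = "%s";' % (varname, escaped)

# Usage, assuming a file called mymol.sdf exists:
# js_line = sdf_to_js(open("mymol.sdf").read())
example = sdf_to_js("line one\nline two\n")
```

The resulting line can then be pasted into a script block on the page.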

Feel free to modify and distribute the code. You could add a comment below with a link if you use it anywhere.

Tuesday, 13 January 2009

Molecular viewer now works on Internet Explorer

I've managed to get my minimal molecular viewer (see previous post) working on Internet Explorer. Smooth, it is not, but at least it works (I recommend Firefox instead). I've also renamed it to TwistyMol. Minimol sounded a bit boring, like Microsoft. You can try it out here.

So what was the problem with Internet Explorer? Well, for graphics, TwistyMol uses processing.js which draws the molecule using the Canvas tag. Unfortunately, Internet Explorer is the only browser that doesn't support Canvas; instead, it has invented its own system called VML (Vector Markup Language). So, for IE users, I need to import the ExplorerCanvas Javascript library developed by Google; this converts all calls to Canvas to their equivalent in VML. (I also needed to make a few changes to processing.js to enable IE support.)

Next steps - sort out the rotation (Simon suggested a possible solution), turn the Javascript into proper Javascript (encapsulate it a bit), stick in a file format parser (I'm hoping Rich will come through on this one), and deploy to the masses.