Friday 20 February 2009

Repeat after me - for loop in Python

for loops allow you to repeat the same set of instructions multiple times with a different value of a particular variable.

How do I loop in Python?

The Python for loop has the following general form:
for myvar in myiterator:
    # do something with myvar

How do I loop over a list?

If you are coming from other languages such as C, you might be used to:
for i in range(len(mylist)):
    myvar = mylist[i]
    # do something with myvar

In Python, it's much better (that is, the intention is clearer) if you do the following instead:
for myvar in mylist:
    # do something with myvar

If you suddenly realise you want the value of the index i, you can easily adapt the previous loop:
for i, myvar in enumerate(mylist):
    # do something with myvar
    # do something with i
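
To make the enumerate version concrete, here is a minimal sketch. The contents of mylist are made-up sample data, and labelled is just a name I've picked for illustration:

```python
# Hypothetical sample data, for illustration only
mylist = ["alpha", "beta", "gamma"]

labelled = []
for i, myvar in enumerate(mylist):
    # i counts up from 0, myvar is the corresponding item
    labelled.append("%d: %s" % (i, myvar))

print(labelled)
```
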
How do I loop over two lists simultaneously?

Let's say you have a list of people's firstnames and another list containing their surnames, and you want to write them on the screen. Don't do the following:
for i in range(len(firstnames)):
    print("%s %s" % (firstnames[i], secondnames[i]))

Instead, you should make the intention of the code clear by using zip as follows:
for firstname, secondname in zip(firstnames, secondnames):
    print("%s %s" % (firstname, secondname))
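
One thing worth knowing about zip is what happens when the lists differ in length. The sketch below uses made-up names (and a second list that is deliberately one item short) to show that zip simply stops at the end of the shorter list:

```python
# Hypothetical sample data; the second list is deliberately one item short
firstnames = ["Ada", "Alan", "Grace"]
secondnames = ["Lovelace", "Turing"]

fullnames = []
for firstname, secondname in zip(firstnames, secondnames):
    fullnames.append("%s %s" % (firstname, secondname))

# zip stops at the end of the shorter list, so "Grace" is silently dropped
print(fullnames)
```

If that silent truncation would hide a bug in your data, it may be worth checking that the two lists have the same length before looping.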

Loop pattern #1 - Building a new list from an old

newlist = []
for mynum in mylist:
    newnum = mynum*2
    newlist.append(newnum)

In the case of this simple example, it should be done as a list comprehension instead:
newlist = [mynum*2 for mynum in mylist]

Here is a slightly more complicated example, involving an "if" statement that acts as a filter for even numbers:
newlist = []
for mynum in mylist:
    if mynum % 2 == 0:
        newnum = mynum*2
        newlist.append(newnum)

Again, this can be done instead as a list comprehension:
newlist = [mynum*2 for mynum in mylist if mynum % 2 == 0]
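
If you want to convince yourself that the loop and the list comprehension really do the same thing, a quick check along these lines should do it. The numbers are sample data, and loop_result and comp_result are names chosen just for this sketch:

```python
mylist = [1, 2, 3, 4, 5, 6]  # hypothetical sample data

# Explicit loop: keep the even numbers and double them
loop_result = []
for mynum in mylist:
    if mynum % 2 == 0:
        loop_result.append(mynum * 2)

# The same thing written as a list comprehension
comp_result = [mynum * 2 for mynum in mylist if mynum % 2 == 0]

assert loop_result == comp_result
print(comp_result)
```
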
Loop pattern #2 - Summing things up

Another common pattern is totalling things up using a loop:
total = 0
for mynum in mylist:
    total += mynum

Although in this case, it would be easier to just use:
total = sum(mylist)

One thing to be careful of though is that using "+" to add strings is slow. If speed is important, avoid the following:
longstring = ""
for myvar in mylist:
    # Create smallstring here somehow
    longstring += smallstring

Instead, use pattern #1 to create a list of strings, and join them at the end:
stringlist = []
for myvar in mylist:
    # Create smallstring here somehow
    stringlist.append(smallstring)
longstring = "".join(stringlist)
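
Here is a runnable sketch of the join approach. How smallstring gets created is entirely made up here (a simple string format), as is the sample data:

```python
mylist = [1, 2, 3]  # hypothetical sample data

stringlist = []
for myvar in mylist:
    # One hypothetical way of creating smallstring
    smallstring = "item%d;" % myvar
    stringlist.append(smallstring)

# A single join at the end instead of repeated "+"
longstring = "".join(stringlist)
print(longstring)
```
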
Loop pattern #3 - Filling in a dictionary

A common pattern is to update information in a dictionary using a loop. Let's take an example of counting up how many people have a particular firstname given a list of firstnames.

The problem here is that it is necessary to check whether a particular key is in the dictionary before updating it. This can lead to awkward code like the following:
name_freq = {}
for firstname in firstnames:
    if firstname in name_freq:
        name_freq[firstname] += 1
    else:
        name_freq[firstname] = 1

Now while this can be improved somewhat by using name_freq.get(firstname, 0), that's getting a bit complicated (and doesn't extend to dictionaries of lists). Instead you should use a defaultdict, a special dictionary that has a default value, as follows:
from collections import defaultdict
name_freq = defaultdict(int)
for firstname in firstnames:
    name_freq[firstname] += 1
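
For comparison, here is a sketch that runs the get() version and the defaultdict version side by side on some made-up names. The variable names freq_get and freq_dd are just for this example:

```python
from collections import defaultdict

firstnames = ["Anna", "Brian", "Anna", "Cara", "Anna"]  # hypothetical sample data

# Plain dictionary with get(): a missing key counts as 0
freq_get = {}
for firstname in firstnames:
    freq_get[firstname] = freq_get.get(firstname, 0) + 1

# defaultdict(int): a missing key is created with the value int(), i.e. 0
freq_dd = defaultdict(int)
for firstname in firstnames:
    freq_dd[firstname] += 1

print(dict(freq_dd))
```

One thing to watch out for: merely looking up a missing key in a defaultdict adds it to the dictionary, so use "in" if you only want to test whether a name has been seen.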

And what about where you wanted to store the corresponding surnames in a dictionary by firstnames? Use a defaultdict(list), of course:
from collections import defaultdict
same_firstname = defaultdict(list)
for firstname, surname in zip(firstnames, surnames):
    same_firstname[firstname].append(surname)
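
A small sketch of the defaultdict(list) idea with made-up names, so you can see what ends up in the dictionary:

```python
from collections import defaultdict

# Hypothetical sample data
firstnames = ["Anna", "Brian", "Anna"]
surnames = ["Smith", "Jones", "Murphy"]

same_firstname = defaultdict(list)
for firstname, surname in zip(firstnames, surnames):
    # A missing key starts life as an empty list, so append always works
    same_firstname[firstname].append(surname)

print(dict(same_firstname))
```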

Loop pattern #4 - Don't use a loop

Think dictionary, set, and sort. A lot of tricky algorithms can be implemented in a few lines with one or two of these guys.

A trivial example is finding unique items in a list: set(mylist). Want to check whether that genome sequence only contains 4 letters?: assert len(set(mygenome))==4
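
Both of those one-liners can be tried out as below. The list and the toy sequence are made up. The subset test at the end is my own addition rather than part of the example above: it checks which letters are present, not just how many different ones there are.

```python
mylist = [3, 1, 3, 2, 1]  # hypothetical sample data
unique_items = set(mylist)

mygenome = "ACGTTGCAAC"  # hypothetical toy sequence
letters = set(mygenome)
assert len(letters) == 4

# A stricter variant: are all of the letters drawn from A, C, G and T?
only_dna = letters.issubset("ACGT")
print(unique_items, only_dna)
```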

The following example uses sort. Given a set of docking scores for 10 poses, find which poses have the top three scores. Here's a solution using the so-called decorate-sort-undecorate paradigm:
# Decorate ("stick your data in a tuple with other stuff")
tmp = [(x,i) for i,x in enumerate(pose_scores)]
# Sort (uses the items in the first position of the tuple)
tmp.sort(reverse=True)  # assumes a higher score is better; use tmp.sort() if lower is better
# Undecorate ("get your data back out of the tuple")
top_poses = [z[1] for z in tmp]

print(top_poses[:3])
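
As a sketch, here is decorate-sort-undecorate run on some made-up scores, next to an alternative using sorted() with a key function. I'm assuming here that a higher score is better and that no two scores are tied. The names dsu_top and key_top are just for this example:

```python
# Hypothetical scores for 5 poses; higher is assumed to be better
pose_scores = [5.2, 7.1, 3.3, 9.8, 6.4]

# Decorate-sort-undecorate
tmp = [(x, i) for i, x in enumerate(pose_scores)]
tmp.sort(reverse=True)
dsu_top = [i for x, i in tmp][:3]

# The same idea using sorted() with a key function on the indices
key_top = sorted(range(len(pose_scores)),
                 key=lambda i: pose_scores[i], reverse=True)[:3]

print(dsu_top, key_top)
```

With tied scores the two versions may order the tied poses differently, since the decorated tuples fall back on comparing the index.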

Image credit: Loop by Panca Satrio Nugroho (CC BY-ND 2.0)

Thursday 19 February 2009

Extract of chemical - Paper on OSRA published

It seems like only last week that another paper was published on an Optical Chemical Recognition software. This week it's the turn of OSRA, the work of Igor Filippov, which has previously been discussed in several posts on this blog:
Igor V. Filippov and Marc C. Nicklaus. Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution. J. Chem. Inf. Model. 2009, In press.

Unlike all existing software, OSRA is open source. You can get it right now from its web site.

One interesting point raised in the paper is how best to assess performance. It appears that a Tanimoto-type fingerprint similarity is heavily penalised if even a single error is made. This is because an error such as replacement of an O with a C, although visually minor and easily corrected in a drawing package, is highly significant chemically. Perhaps a simple count of (bonds in common)/(union of bonds) would suffice, although this would require mapping of bonds from the original structure to the recognised structure (the new SMSD tool by Syed Asad Rahman from Janet Thornton's group might help here).
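
To give a rough idea of what such a bond-based measure might look like, here is a minimal sketch. It assumes the hard part (mapping atoms between the two structures) has already been done, so that each bond can be written as a tuple of two atom labels and a bond order. The function name bond_similarity and the toy data are hypothetical, and this is not how the paper or SMSD does it.

```python
def bond_similarity(bonds_a, bonds_b):
    """Return (bonds in common) / (union of bonds) for two sets of bonds."""
    common = bonds_a & bonds_b
    union = bonds_a | bonds_b
    if not union:
        return 1.0  # two structures without bonds are treated as identical
    return len(common) / float(len(union))

# Hypothetical toy data: (atom label, atom label, bond order), with the
# atom numbering assumed to be already mapped between the two structures
original = set([("C1", "C2", 1), ("C2", "O3", 2), ("C2", "C4", 1)])
# The same structure with O3 misread as a carbon
recognised = set([("C1", "C2", 1), ("C2", "C3", 2), ("C2", "C4", 1)])

score = bond_similarity(original, recognised)
print(score)
```

Because the atom labels here include the element, a single misread atom changes every bond it takes part in. Labelling atoms by mapped index alone would make the measure more forgiving.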

Friday 6 February 2009

New software for OChRe - ChemReader

The good news is that the field of Optical Chemical Recognition (OChRe) has just received a new entry, ChemReader:
J Park, GR Rosania, KA Shedden, M Nguyen, N Lyu, K Saitou. Automated extraction of chemical structure information from digital raster images. Chem. Cent. J. 2009, 3, 4.

According to the test set used (and it's not stated whether the test set was also used to train the software) the results are very good, and not just for ChemReader - OSRA is doing pretty well also.

And the bad news? Rajarshi recently discussed the phenomenon of publishing papers describing software which can neither be purchased nor obtained for free. This is also the case for ChemReader. As the software is not open source, it cannot be made available, because it relies on the open source libraries GOCR and GREYCstoration (also used by OSRA). However, the authors are "planning that ChemReader becomes commercially available in the near future, with removal of open source parts". Doesn't that mean that the final performance of the software, as well as the algorithms used, will be different from that described in the paper?

Tuesday 3 February 2009

Of OChRe, OSRA and OASA (but not OSCAR) Part II

In an earlier post I showed a way of testing the performance of an optical chemical recognition (OChRe) software, OSRA.

Since then, a new version of OSRA has been released (1.1.0). In addition, a new version of OASA has been released which can handle multiple molecules (this caused problems with the earlier test where a disconnected molecule was produced).

The results are shown here.

(1) I used Xe to indicate an unknown atom type (only occurs once).
(2) If OSRA detects multiple molecules in the original image, a multi-mol SDF is created where the coordinates do not correspond to the original location in the image. Where this occurred here, I have just depicted the first molecule.