Friday, 20 February 2009

Repeat after me - for loop in Python

for loops allow you to repeat the same set of instructions multiple times with a different value of a particular variable.

How do I loop in Python?

The Python for loop has the following general form:
for myvar in myiterator:
    # do something with myvar
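
As a rough illustration of what myiterator can be, the sketch below loops over a string, a range and a dictionary. The sample values are made up for this example and are not from the original post.

```python
# Hypothetical sample data, chosen only to show different kinds of iterables
collected = []

for letter in "abc":        # a string yields its characters one by one
    collected.append(letter)

for number in range(3):     # range yields 0, 1, 2
    collected.append(number)

for key in {"x": 1}:        # a dictionary yields its keys
    collected.append(key)

print(collected)  # ['a', 'b', 'c', 0, 1, 2, 'x']
```

In each case the loop body sees one item at a time, without any index bookkeeping.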

How do I loop over a list?

If you are coming from other languages such as C, you might be used to:
for i in range(len(mylist)):
    myvar = mylist[i]
    # do something with myvar

In Python, it's much better (that is, the intention is clearer) if you do the following instead:
for myvar in mylist:
    # do something with myvar

If you suddenly realise you want the value of the index i, you can easily adapt the previous loop:
for i, myvar in enumerate(mylist):
    # do something with myvar
    # do something with i
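
To make the enumerate loop concrete, here is a small sketch. The list contents are invented for illustration.

```python
mylist = ["alpha", "beta", "gamma"]  # hypothetical sample data

lines = []
for i, myvar in enumerate(mylist):
    # i is the position in the list, myvar is the item at that position
    lines.append("%d: %s" % (i, myvar))

print(lines)  # ['0: alpha', '1: beta', '2: gamma']
```

The index starts at 0, just as it would with range(len(mylist)).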

How do I loop over two lists simultaneously?

Let's say you have a list of people's firstnames and another list containing their surnames, and you want to write them on the screen. Don't do the following:
for i in range(len(firstnames)):
    print("%s %s" % (firstnames[i], secondnames[i]))

Instead, you should make the intention of the code clear by using zip as follows:
for firstname, secondname in zip(firstnames, secondnames):
    print("%s %s" % (firstname, secondname))
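
One thing worth knowing about zip is that it stops at the end of the shorter list. The sketch below uses made-up names, with one surname left out on purpose, to show the effect.

```python
firstnames = ["Ada", "Alan", "Grace"]     # hypothetical sample data
secondnames = ["Lovelace", "Turing"]      # deliberately one item short

fullnames = []
for firstname, secondname in zip(firstnames, secondnames):
    fullnames.append("%s %s" % (firstname, secondname))

# "Grace" has no partner in secondnames, so zip silently leaves it out
print(fullnames)  # ['Ada Lovelace', 'Alan Turing']
```

If the two lists are supposed to be the same length, it may be worth checking len(firstnames) == len(secondnames) before looping.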

Loop pattern #1 - Building a new list from an old

newlist = []
for mynum in mylist:
    newnum = mynum*2
    newlist.append(newnum)

In the case of this simple example, it should be done as a list comprehension instead:
newlist = [mynum*2 for mynum in mylist]

Here is a slightly more complicated example, involving an "if" statement that acts as a filter for even numbers:
newlist = []
for mynum in mylist:
    if mynum % 2 == 0:
        newnum = mynum*2
        newlist.append(newnum)

Again, this can be done instead as a list comprehension:
newlist = [mynum*2 for mynum in mylist if mynum % 2 == 0]
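
As a quick check that the loop and the list comprehension really do the same job, here is a sketch that runs both on the same made-up list and compares the results.

```python
mylist = [1, 2, 3, 4, 5, 6]  # hypothetical sample data

# Explicit loop: keep the even numbers and double them
newlist = []
for mynum in mylist:
    if mynum % 2 == 0:
        newlist.append(mynum * 2)

# The same thing as a list comprehension
comprehension = [mynum * 2 for mynum in mylist if mynum % 2 == 0]

print(newlist)                   # [4, 8, 12]
print(newlist == comprehension)  # True
```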

Loop pattern #2 - Summing things up

Another common pattern is totalling up the items in a list:
total = 0
for mynum in mylist:
    total += mynum

Although in this case, it would be easier to just use:
total = sum(mylist)

One thing to be careful of, though, is that using "+" to build up a long string in a loop can be slow, because each addition creates a new string. If speed is important, avoid the following:
longstring = ""
for myvar in mylist:
    # Create smallstring here somehow
    longstring += smallstring

Instead, use pattern #1 to create a list of strings, and join them at the end:
stringlist = []
for myvar in mylist:
    # Create smallstring here somehow
    stringlist.append(smallstring)
longstring = "".join(stringlist)
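
Here is a filled-in sketch of the join approach. The way smallstring is created ("item" plus the number) is just an assumption for the sake of the example.

```python
mylist = [1, 2, 3]  # hypothetical sample data

stringlist = []
for myvar in mylist:
    # Assumed way of creating smallstring, for illustration only
    smallstring = "item%d;" % myvar
    stringlist.append(smallstring)

# Join once at the end instead of adding strings inside the loop
longstring = "".join(stringlist)
print(longstring)  # item1;item2;item3;

# The string that join is called on is used as the separator
print(", ".join(["a", "b", "c"]))  # a, b, c
```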

Loop pattern #3 - Filling in a dictionary

A common pattern is to update information in a dictionary using a loop. Let's take an example of counting up how many people have a particular firstname given a list of firstnames.

The problem here is that it is necessary to check whether a particular key is in the dictionary before updating it. This can lead to awkward code like the following:
name_freq = {}
for firstname in firstnames:
    if firstname in name_freq:
        name_freq[firstname] += 1
    else:
        name_freq[firstname] = 1

Now while this can be improved somewhat by using name_freq.get(firstname, 0), that's getting a bit complicated (and doesn't extend to dictionaries of lists). Instead you should use a defaultdict, a special dictionary that has a default value, as follows:
from collections import defaultdict
name_freq = defaultdict(int)
for firstname in firstnames:
    name_freq[firstname] += 1
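
To see the counting in action, the sketch below runs the defaultdict version on a made-up list of names, alongside the get(firstname, 0) variant mentioned above so the two can be compared.

```python
from collections import defaultdict

firstnames = ["Anna", "Ben", "Anna", "Carl", "Anna"]  # hypothetical sample data

# defaultdict(int) supplies 0 for any name not seen before
name_freq = defaultdict(int)
for firstname in firstnames:
    name_freq[firstname] += 1

# The same count with a plain dictionary and get()
plain_freq = {}
for firstname in firstnames:
    plain_freq[firstname] = plain_freq.get(firstname, 0) + 1

print(name_freq["Anna"])              # 3
print(dict(name_freq) == plain_freq)  # True
```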

And what if you want to store the surnames that go with each firstname in a dictionary? Use a defaultdict(list), of course:
from collections import defaultdict
same_firstname = defaultdict(list)
for firstname, surname in zip(firstnames, surnames):
    same_firstname[firstname].append(surname)
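
A small sketch of defaultdict(list) with invented names follows. It also shows one thing to watch out for: looking up a missing key with square brackets adds that key to the dictionary, whereas testing with "in" does not.

```python
from collections import defaultdict

firstnames = ["Anna", "Ben", "Anna"]      # hypothetical sample data
surnames = ["Smith", "Jones", "Brown"]

same_firstname = defaultdict(list)
for firstname, surname in zip(firstnames, surnames):
    # A missing firstname starts out with an empty list
    same_firstname[firstname].append(surname)

print(same_firstname["Anna"])  # ['Smith', 'Brown']

# Testing membership with "in" does not create a new entry...
has_carl = "Carl" in same_firstname
size_before = len(same_firstname)

# ...but looking up a missing key with [] does
empty = same_firstname["Carl"]
size_after = len(same_firstname)

print(has_carl, size_before, size_after)
```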

Loop pattern #4 - Don't use a loop

Think dictionary, set, and sort. A lot of tricky algorithms can be implemented in a few lines with one or two of these guys.

A trivial example is finding unique items in a list: set(mylist). Want to check whether that genome sequence contains exactly 4 different letters? Use: assert len(set(mygenome))==4
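
Both set tricks can be tried out with the sketch below. The list and the genome string are made up. Note that the length check only counts distinct letters, so the sketch also includes a stricter test (an addition, not from the original post) that the letters are actually drawn from A, C, G and T.

```python
mylist = [3, 1, 3, 2, 1]  # hypothetical sample data

# Duplicates disappear when the list is turned into a set
unique_items = set(mylist)
print(sorted(unique_items))  # [1, 2, 3]

mygenome = "ACGTTGCAAC"      # hypothetical sequence
letters = set(mygenome)

# Passes for any sequence with exactly four distinct characters
assert len(letters) == 4

# Stricter check: every letter must be one of A, C, G, T
only_acgt = letters <= set("ACGT")
print(only_acgt)  # True
```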

The following example uses sort. Given a set of docking scores for 10 poses, find which poses have the top three scores. Here's a solution using the so-called decorate-sort-undecorate paradigm:
# Decorate ("stick your data in a tuple with other stuff")
tmp = [(x,i) for i,x in enumerate(pose_scores)]
# Sort (uses the items in the first position of the tuple)
tmp.sort() # lowest to highest
tmp.reverse() # highest to lowest
# Undecorate ("get your data back out of the tuple")
top_poses = [z[1] for z in tmp]

print(top_poses[:3])
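
Here is a worked version of the example with made-up scores, assuming that a higher score is better and with the sort and reverse steps written out. For comparison, the last lines get the same answer with the key argument of sorted; this is offered as a sketch of an alternative, not as the way the original code was written.

```python
pose_scores = [5.2, 9.1, 3.3, 7.8, 8.4]  # hypothetical scores, higher assumed better

# Decorate: pair each score with the index of its pose
tmp = [(x, i) for i, x in enumerate(pose_scores)]
# Sort on the score, then flip so the highest score comes first
tmp.sort()
tmp.reverse()
# Undecorate: keep only the pose indices
top_poses = [z[1] for z in tmp]
print(top_poses[:3])  # [1, 4, 3]

# Alternative: sort the indices directly, using each score as the key
by_key = sorted(range(len(pose_scores)),
                key=lambda i: pose_scores[i], reverse=True)
print(by_key[:3])  # [1, 4, 3]
```

The two approaches agree here because all the scores are different; with tied scores the order of the tied poses may differ between them.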

Image credit: PanCa SatRio


Brad said...

Nice post. One small note on your last sorting example. The sort alone will get you the scores ordered from lowest to highest. Unless lower scores are better you will want to reverse the sorted list to get the top scores:

tmp = zip(scores, items)
tmp.sort() # lowest to highest
tmp.reverse() # highest to lowest
top_items = [z[1] for z in tmp][:3]

Can you tell I have pulled the worst things out of lists a number of times?


baoilleach said...

Me too, Brad, obviously :-) I've corrected the code.

Paddy3118 said...

Two points Noel:

You might want to use izip instead of zip for longer lists to save memory.

Decorate-sort-undecorate can be a lot easier as sort and sorted now have the key argument

baoilleach said...

Thanks for the comments, Paddy3118.

I know that "key" was added for just this reason, but using operator.itemgetter(1) does not make things easier in my book, only more complicated (and this article was aimed more at beginners than Python ninjas). On the other hand, I have found "key" useful for sorting strings, with something like key=str.lower.

Paddy3118 said...

Unless it's needed for speed, I too don't like going to the expense of importing itemgetter for one sort, and often use:

key=lambda x: x[3]

or whatever :-)

- Paddy.