Tuesday, 31 December 2024

push and pop directories

Here's a tool that I've been using daily since 2005 at least. I had a write-up on my old website, but with its recent disappearance it seems like a good time to update it and publish it here.

If you add the code below to your .bashrc, then you can type "push here" in one terminal window to record the current directory under the name 'here', and then "pop here" in another terminal window to change to the same directory.

In other words, it's a simple bookmark system for directories backed by a persistent on-disk file-based database (i.e. a text file :-) ). You may find this useful to support sshing into multiple machines that have a shared home folder, or to synchronise windows in a screen session or tabs in a terminal emulator.

See you all next year...🎉

popnamelist="$HOME/.popnames.txt"

# Supporting code for 'pop'
function popname () {

  if [ -z "$1" ]; then
      echo "  Syntax: popname TAG"
      return 1
  fi

  if [ -f "$popnamelist" ]; then
      # Match the whole tag (not just a prefix) and print everything after
      # it, so that directory names containing spaces survive intact
      if ! awk -v tag="$1" '$1 == tag { sub(/^[^ ]+ /, ""); print; found = 1 }
                            END { exit !found }' "$popnamelist"; then
          echo "No such tag"
          return 1
      fi
  else
      echo "  There isn't anything to pop!"
      return 1
  fi
}

# Pushes the current directory into a 'memory slot' indexed by a tag
# See also 'pop'
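function push() {

  if [ -z "$1" ]; then
      echo "  Syntax: push TAG"
      return 1
  fi

  touch "$popnamelist"
  # Drop any existing entry for exactly this tag; a plain grep -v "^$1"
  # would also delete every other tag that starts with the same letters
  awk -v tag="$1" '$1 != tag' "$popnamelist" > "$popnamelist.tmp"

  echo "$1 $(pwd)" >> "$popnamelist.tmp"
  sort "$popnamelist.tmp" > "$popnamelist"
  rm "$popnamelist.tmp"
}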

# Pops the directory associated with 'tag' and makes it the current
# directory.
# Then you can use: pop mytag
# or echo `popname mytag`, or cat `popname mytag`/readme.txt

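function pop() {
  if [ -z "$1" ]; then
    echo "  Syntax: pop TAG"
    echo
    cat "$popnamelist"
  else
    # Only change directory if the lookup succeeded; otherwise pass on
    # whatever message popname printed
    local dir
    dir=$(popname "$1") || { echo "$dir"; return 1; }
    cd "$dir"
  fi
}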

Saturday, 28 December 2024

The effect of ties on evaluation of virtual screens

For a virtual screening method where the associated score is prone to ties (e.g. a fingerprint comparison), care must be taken to handle these appropriately or the results will be biased, making the method look either better or worse than it really is.

Let's work through a specific example, starting with some setup code assigning scores to a set of actives and separately to a set of inactives:

import random
random.seed(42)

INACTIVE, ACTIVE = range(2)

def mean_rank_of_actives(data):
    ranks = [i for (i, x) in enumerate(data) if x[1]==ACTIVE]
    return sum(ranks)/len(ranks)

if __name__ == "__main__":
    # Generate scores for actives and inactives
    actives = [1.0]*10 + [0.5]*10
    random.shuffle(actives)
    inactives = [0.5]*10 + [0.0]*70
    random.shuffle(inactives)

I've baked in a large number of ties. Half of the actives have a score of 0.5, a value that is shared by an equal number of inactives. We will use mean rank of the actives as a proxy here for ROC AUC - if I remember correctly, one is a linear function of the other (the lower the mean rank, the higher the AUC).
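To make the link between mean rank and ROC AUC concrete, here's a minimal sketch. The function names roc_auc and tie_averaged_mean_rank are made up for illustration and don't come from any library. It counts a tie as half a win for the AUC, averages the ranks within a tie for the mean rank, and checks that, with 0-based ranks, mean rank = (n_actives - 1)/2 + n_inactives * (1 - AUC):

```python
def roc_auc(actives, inactives):
    """Fraction of (active, inactive) pairs ranked correctly; ties count half."""
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))

def tie_averaged_mean_rank(actives, inactives):
    """Mean 0-based rank of the actives, with tied items sharing an average rank."""
    everything = actives + inactives
    total = 0.0
    for a in actives:
        higher = sum(1 for s in everything if s > a)
        tied = sum(1 for s in everything if s == a)  # includes the item itself
        total += higher + (tied - 1) / 2
    return total / len(actives)

actives = [1.0]*10 + [0.5]*10
inactives = [0.5]*10 + [0.0]*70

auc = roc_auc(actives, inactives)
mean_rank = tie_averaged_mean_rank(actives, inactives)
# The linear relationship, for 0-based ranks
predicted = (len(actives) - 1) / 2 + len(inactives) * (1 - auc)
print(auc, mean_rank, predicted)
```

The tie-averaged mean rank here is the value that an unbiased treatment of ties should reproduce on average.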

Our first attempt at evaluating performance is as follows:

def first(actives, inactives):
    everything = [(x, ACTIVE) for x in actives] + [(y, INACTIVE) for y in inactives]
    everything.sort(reverse=True) # rank the entries - highest score first
    return mean_rank_of_actives(everything)

It turns out that all of the actives are sorted ahead of the inactives, giving an overoptimistic mean rank of 9.5. Why? Because the ties are resolved based on the second item in the tuples, and ACTIVE (i.e. 1) is greater than INACTIVE (i.e. 0). In fact, swapping the values of these flags changes the results to the overpessimistic value of 14.5. Using a piece of text, e.g. "active" or "inactive", has the same problem.
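Here's a small standalone illustration of that tie-breaking, using a handful of made-up tied entries rather than the dataset above. Python compares tuples element by element, so equal scores fall through to the label:

```python
INACTIVE, ACTIVE = range(2)

# Four entries with identical scores; only the label differs
tied = [(0.5, INACTIVE), (0.5, ACTIVE), (0.5, INACTIVE), (0.5, ACTIVE)]
by_flag = sorted(tied, reverse=True)
print(by_flag)  # [(0.5, 1), (0.5, 1), (0.5, 0), (0.5, 0)] - actives first

# Text labels are compared alphabetically, so this time the inactives win
tied_text = [(0.5, "active"), (0.5, "inactive")]
by_text = sorted(tied_text, reverse=True)
print(by_text)  # [(0.5, 'inactive'), (0.5, 'active')]
```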

No worries - when we sort, let's make sure we sort using the score values only:

def second(actives, inactives):
    everything = [(x, ACTIVE) for x in actives] + [(y, INACTIVE) for y in inactives]
    everything.sort(reverse=True, key=lambda x:x[0])
    return mean_rank_of_actives(everything)

Unfortunately, the result is still 9.5. Python's sort is a stable sort, which means that items retain their original order if there are ties. Since all of the actives come first in the original list, the tied actives still come before the tied inactives after sorting. To get rid of the influence of the original order, we need to shuffle the list into a random order:

def third(actives, inactives):
    everything = [(x, ACTIVE) for x in actives] + [(y, INACTIVE) for y in inactives]
    random.shuffle(everything)
    everything.sort(reverse=True, key=lambda x:x[0])
    return mean_rank_of_actives(everything)

This gives 12.0 +/- 0.67 (from 1000 repetitions), a more accurate assessment of the performance.
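The repetition code isn't shown above; a minimal sketch of how it could be done follows, assuming that the +/- figure is a sample standard deviation. The exact spread will depend on the random seed and how the repetitions are run.

```python
import random
import statistics

INACTIVE, ACTIVE = range(2)

def mean_rank_of_actives(data):
    ranks = [i for (i, x) in enumerate(data) if x[1] == ACTIVE]
    return sum(ranks) / len(ranks)

def third(actives, inactives):
    everything = [(x, ACTIVE) for x in actives] + [(y, INACTIVE) for y in inactives]
    random.shuffle(everything)  # remove any influence of the original order
    everything.sort(reverse=True, key=lambda x: x[0])
    return mean_rank_of_actives(everything)

random.seed(42)
actives = [1.0]*10 + [0.5]*10
inactives = [0.5]*10 + [0.0]*70

# Each call reshuffles, so the ties are broken differently every time
results = [third(actives, inactives) for _ in range(1000)]
mean = statistics.mean(results)
spread = statistics.stdev(results)  # assumption: +/- means standard deviation
print(f"{mean:.1f} +/- {spread:.2f}")
```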

I think that this is an interesting little problem because it's rather subtle; even when you think you've solved it (as in the second solution above), some other issue rears its head. The only way to be sure is to test with tied data.

Note: If we are interested specifically in handling ties correctly in ranks, a common approach is to assign the same rank to all members of the tie. That's not really the point I want to make here. In practice, the evaluation metric might be something quite different such as enrichment at 1%, which would not be amenable to such a solution.

Friday, 20 December 2024

Engagement on X vs Mastodon vs BlueSky

Egon posted recently on the situation with scientists leaving X and moving to another microblogging platform. He pointed out some pros and cons of BlueSky vs Mastodon and how it may be possible to bridge between the two.

This post doesn't really speak to the points that Egon raises, but rather gives an n=1 measure of the current level of community engagement on different platforms. I posted a link to my previous blog post on the three platforms mentioned in the title, and here's what I found:
  • X - Since Sept 2011 - I have 1190 followers - 270 views (34 link clicks), 4 comments, 11 additional likes (including one from the poteen brewery I often get confused with :-) )
  • Mastodon - Since Dec 2022 - I have 147 followers - 2 comments, 1 additional like
  • BlueSky - Since 15 Dec 2024, i.e. one day before the blog post - I have 28 followers - 2 comments, 11 additional likes/reposts (including one from "Low Quality Facts" - make of that what you will)
  • My blog - Since April 2007 - I no longer have visibility of the number of followers, but there are 57 people who use the follow.it link to follow by email - 428 views, 3 comments

First of all, the numbers confirm that X is no longer where the community is; despite my having been on BlueSky for only 5 days and having just a handful of followers, the engagement there just about matches X. Mastodon is an interesting platform, and great to have as an alternative should BlueSky not pan out, but the numbers and engagement have not taken off; it feels quiet over there, in a way that BlueSky doesn't. In each case, to actually find out the news, the user had to click through to the blog, so the figure of 428 there should give the overall total across everything (including LinkedIn where I also posted).

From the little time I have spent on BlueSky, it has the feel of the old Twitter. The Android app needs a bit of work (scrolling is very jerky on the main feed) but I'm sure it will be sorted. I moved early to Mastodon as I would have liked it to take off, but that hasn't happened. I note that BlueSky has plans to monetise somehow and time will tell whether this turns it into something closer to the current X. In the meanwhile, I guess I'll see you over there.

@baoilleach.bsky.social/@baoilleach@mstdn.science/@baoilleach

Monday, 16 December 2024

Moving to pastures new, but still in the same field Part IV

Following on from my previous post on the topic, I will shortly be leaving Nxera Pharma (Sosei Heptares as was).

The last five years at Nxera have passed quickly. I've gone from "what is this GPCR you speak of?" to providing tools and resources to support structure-based GPCR drug discovery, and indeed doing it myself. I'll miss working with my colleagues in CompChem and throughout Nxera, but I'm leaving in the knowledge that the Cheminformatics line is in very capable hands.

Which brings me to my next move...

I am honoured and super excited to take up the role of Chemical Biology Resources Team Leader at the EBI, leading the team responsible for ChEMBL, SureChEMBL, UniChem and ChEBI. There are some big shoes to fill, but I can't wait to start working with the fantastic team behind these resources. The start date is early February, so wish me luck!