Sunday 28 August 2022

Tournament sizes and their effect on selection distribution

If you've ever looked at genetic algorithms, you may have been struck by the counter-intuitive nature of the selection step. In each iteration, you don't simply take the best results from the previous iteration and combine them for the next one; in fact, if you do this it works very poorly, because you get premature convergence to a local minimum: you've gone all exploitation and no exploration. What you need to do is hold yourself back and select 'goodish' solutions, because goodish can lead to much better in the long run.

One way to do this is with tournament selection. Its one parameter is the size of the tournament, T. It works very simply: choose T items at random and keep the best. Remove that item from the pool and repeat this procedure until you have selected the required number of items.
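To make that concrete, here is a minimal sketch of the procedure in Python for a minimisation problem; the function and parameter names (and the fitness function itself) are placeholders for illustration, not code from this post:

import random

def tournament_select(pool, num_to_select, tour_size, fitness):
    # Select num_to_select items from the pool using tournaments of size tour_size.
    # 'fitness' is a placeholder scoring function; lower is assumed to be better.
    pool = list(pool)  # work on a copy so we can remove winners
    selected = []
    for _ in range(num_to_select):
        contestants = random.sample(pool, tour_size)  # choose T items at random
        winner = min(contestants, key=fitness)        # keep the best of them
        selected.append(winner)
        pool.remove(winner)                           # remove the winner from the pool
    return selected

For example, tournament_select(range(100), 10, 3, fitness=lambda x: x) picks 10 'goodish' small numbers rather than simply the 10 smallest.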

I usually use a value of T=3, but I was curious what exactly this distribution looked like, and the effect of the tournament size. Since I was not able to find an analytical solution (feel free to leave a comment with the details if you do), I wrote a simulation which selects 1000 numbers between 0 and 99 inclusive with different tournament sizes, averaged over 1000 repetitions (here I did selection with replacement just to get the distribution of the first selected item). The code also prints out the smallest number at which the cumulative sum of the distribution reaches 50%:

Tournament size of 1: 50% reached at 49
Tournament size of 2: 50% reached at 29
Tournament size of 3: 50% reached at 20
Tournament size of 4: 50% reached at 15
Tournament size of 5: 50% reached at 12
Tournament size of 6: 50% reached at 10
Tournament size of 7: 50% reached at 9

import random

import matplotlib.pyplot as plt
import numpy as np

random.seed(1)

if __name__ == "__main__":
    tour_sizes = [1, 2, 3, 4, 5, 6, 7]
    num_repetitions = 1000
    num_selections = 1000
    for tour_size in tour_sizes:
        # averages[i] collects, for each repetition, the number of times value i was selected
        averages = [[] for x in range(100)]
        for rep in range(num_repetitions): # repeat to get the mean
            count = [0] * 100
            for i in range(num_selections): # observe the distribution across selections
                # tournament selection of size tour_size: pick tour_size values and keep the smallest
                select = min(random.sample(range(100), tour_size))
                count[select] += 1
            for i in range(100):
                averages[i].append(count[i])
        for i in range(100):
            averages[i] = np.mean(averages[i])
        plt.plot(range(100), averages, label=f'{tour_size}')
        # report the smallest value at which the cumulative count reaches 50% of the selections
        cumsum = 0
        for i in range(100):
            cumsum += averages[i]
            if cumsum >= num_selections / 2:
                print(f"Tournament size of {tour_size}: 50% reached at {i}")
                break
    plt.legend()
    plt.ylabel("Count")
    plt.title("Effect of tournament size on distribution")
    plt.xlim(0, 100)
    plt.ylim(bottom=0)
    plt.show()

Update (2023-06-22): I've guessed the analytical solution as y = 10·N·f^(N-1), where N is the tournament size and f = 1 - (x/100) in the diagram above.
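If you want to eyeball that guess, one option is to overlay the curves on the simulated ones; the snippet below is just a sketch (not part of the original script), and to compare directly you would add these plt.plot calls to the simulation above before plt.show():

import numpy as np
import matplotlib.pyplot as plt

x = np.arange(100)
for N in (1, 2, 3, 4, 5, 6, 7):
    f = 1 - x / 100
    # guessed analytical curve: y = 10 * N * f**(N-1)
    plt.plot(x, 10 * N * f ** (N - 1), linestyle="--", label=f"guess, N={N}")
plt.legend()
plt.xlim(0, 100)
plt.ylim(bottom=0)
plt.show()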

Monday 15 August 2022

Notes from the 12th International Conference on Chemical Structures

Following on from my earlier blog post, I recently attended the 12th ICCS in Noordwijkerhout, the Netherlands. This was a really great conference, this time organised by Gerard van Westen and colleagues. If you're interested in attending next time, unfortunately you'll have to wait another 3 years...

The conference started with the awarding of the CSA Trust Mike Lynch Award to Dr Greg Landrum, for his work on RDKit and the community around it. As a CSA Trustee, I had the pleasure of presenting this along with fellow trustee Wendy Warr.

But it wasn't all presenting awards. There was also the cruise on the IJsselmeer to be done, as ably illustrated by Peter Ertl on the right.

And most importantly, the conference itself. Once again I captured my notes as tweets which are available here. Sosei Heptares had three presentations, slides for which are available here:

  • Morgan Thomas (oral presentation): Augmented Hill-Climb improves language-based de novo molecule generation as benchmarked via the open source MolScore platform
  • Dominique Sydow (oral presentation): Integrated Structural Cheminformatics Analysis Tools for Customisable Chemogenomics Driven Kinase and GPCR Drug Design
  • Noel O’Boyle (poster presentation): Application of DeepSMILES to machine-learning of chemical structures