Sunday, 28 August 2022

Tournament sizes and their effect on selection distribution

If you've ever looked at genetic algorithms, you may have been struck by the counter-intuitive nature of the selection step. In each iteration, you don't take the best results from the previous iteration and combine them for the next iteration; in fact, if you do this, it works very poorly as you get premature convergence to a local minimum as you've gone all exploitation and no exploration. What you need to do is hold yourself back, and select 'goodish' solutions, because goodish can lead to much better in the long run.

One way to do this is with tournament selection. The one parameter is the size of the tournament, T. It works very simply; choose T items at random and keep the best. Remove that item from the pool and repeat this procedure until you have selected the required number of items.

I usually use a value of T=3, but I was curious what exactly this distribution looked like, and the effect of the tournament size. Since I was not able to find an analytical solution (feel free to leave a comment with the details if you do), I wrote a simulation which selects 1000 numbers between 0 and 99 inclusive with different tournament sizes (here I did selection with replacement just to get the distribution of the first selected item). The code also prints out the smallest number where the cumulative sum of the distribution exceeds 50%:

Tournament size of 1: 50% reached at 49
Tournament size of 2: 50% reached at 29
Tournament size of 3: 50% reached at 20
Tournament size of 4: 50% reached at 15
Tournament size of 5: 50% reached at 12
Tournament size of 6: 50% reached at 10
Tournament size of 7: 50% reached at 9
import random
random.seed(1)
import numpy as np
import matplotlib.pyplot as plt

if __name__ == "__main__":
    tour_sizes = [1, 2, 3, 4, 5, 6, 7]
    num_repetitions = 1000
    num_selections = 1000
    for tour_size in tour_sizes:
        averages = [[] for x in range(100)]
        for rep in range(num_repetitions): # repeat to get the mean
            count = [0] * 100
            for i in range(num_selections): # observe the distribution across selections
                # tournament selection of size tour_size
                select = min(random.sample(range(100), tour_size))
                count[select] += 1
            for i in range(100):
                averages[i].append(count[i])
        for i in range(100):
            averages[i] = np.mean(averages[i])
        plt.plot(range(100), averages, label=f'{tour_size}')
        cumsum = 0
        for i in range(100):
            cumsum += averages[i]
            if cumsum >= num_selections / 2:
                print(f"Tournament size of {tour_size}: 50% reached at {i}")
                break
    plt.legend()
    plt.ylabel("Count")
    plt.title("Effect of tournament size on distribution")
    plt.xlim(0, 100)
    plt.ylim(bottom=0)
    plt.show()

Update (2023-06-22): I've guessed the analytical solution as y = 10 . N . f**(N-1) where N is the tournament size and f = 1-(x/100) in the diagram above.

No comments: