Tuesday 4 June 2024

Every ChEMBL everywhere, all at once

The team over at ChEMBL has just announced a symposium to celebrate 15 years of ChEMBLdb and 10 of SureChEMBL. With this in mind (or perhaps for other reasons), I have attempted to extract data from all 34 releases of ChEMBLdb. Let's see how I got on...

One of the things I love about ChEMBL is that they still provide all of the download files for all previous releases. But before w'get wget, let's see if there's an easier way, namely Charles Tapley Hoyt's ChEMBL Downloader - if it can do the work for me, then that'll save a whole bunch of work. Looking into it, however, the ability to run SQL queries relies on an SQLite database being available, which is only the case for ChEMBL 21 or later.

So old skool it is, which leads to the question of which database format to use. Well, it turns out that there's only one database format that is available for every single release of ChEMBL, and that's MySQL. One bash loop later, and I have all 34 chembl_nn_mysql.tar.gz files (well, two extra wgets for 22_1 and 24_1). A bit of unzipping later and my home folder is 300G heavier and we are ready to roll.

That is, if we can navigate the slightly differently named/arranged files in every case. Is the unzipped folder chembl_nn_mysql, or is it chembl_nn with the chembl_nn_mysql folder inside that? Are the setup instructions in INSTALL or INSTALL_mysql? Is the SQL data in a .sql file or a .dmp file? It is of course nitpicky to even mention this given the nature of what I'm getting for free from ChEMBL, but have you ever tried to install 34 ChEMBLs? :-) In the end, a carefully crafted find/grep was able to pull out the correct create command:

for d in `find . | grep INSTALL | sort`; do grep "create database" $d |sed "s/mysql>//g" ; done

This is simply 'create database chembl_nn' from version 1 until version 22, and then from 23 onwards it changes to specify the character set as UTF-8. For the actual import, I used a Python script to work out the right commands, wrote them to a shell script and then ran them. Just another 700G later, and it's all available in MySQL.

What I wanted to do was simply export a set of SMILES, InChIs and CHEMBL Ids from each of the versions. What I thought would be a problem turned out not to be at all; while the schema as a whole has grown and changed quite a bit over the years, the only effect for me was that 'compounds' became 'compound_structures' in ChEMBL 9.

More of an issue was that the data stored has changed over the years. I had forgotten that ChEMBL 1 did not have ChEMBL Ids, nor version 2; in fact, it wasn't until version 8 that they appeared on the scene. Before that, there was an attempt to use ChEBI ids and at the very start there was just the molregno. Furthermore, version 1 doesn't have Standard InChI - it just has a non-standard one (why, or what the settings are, I don't know ...Added 30/06/2024: Standard InChI was not available until shortly before ChEMBL 1 - that would explain it!). This at least we can pull in from version 2 by matching on molregno and SMILES, and then calculate any missing Standard InChIs off-line.

There was a point to all this, but that'll wait for another day. What I've shown here is that despite 15 years covering 34 releases, it's possible to recreate every single release and relive those glory days (ah, ChEMBL 16, will I ever forget you?).