Wednesday, 23 June 2010

Visualising orbitals and density with Molekel

Molekel is an open-source 3D molecular visualisation package for analysing the results of computational chemistry packages. It has been developed at the Swiss National Supercomputing Centre by Ugo Varetto (et al.?). Unlike Avogadro, it does not have facilities for setting up calculations or building structures; however, at least for what I want to do, it has the edge in visualising orbitals and density. So, my curent workflow is to set up Gaussian calculations with Avogadro, and then analyse them with Molekel.

One interesting aspect of Molekel is that the electron density (and orbitals, etc.) are calculated on the fly by Molekel rather than extracted from a cube file. This makes the program much more capable and completely avoids the need for cube files. On the other hand, it is not possible to save the volumetric information generated by Molekel, so there is no way to sanity check the accuracy against a Gaussian cube (e.g. by subtracting them with cubman, and then visualising the results).

It took me a little while to figure out a workflow for using Molekel successfully on Windows with Gaussian 09 log files from a Linux server. Here's how it goes:

  1. Run a single point calculation with POP=(FULL, ESP) and GFOLDPRINT. I also use IOP(3/36=-1) to reduce the file size. The current release of Molekel will not be able to colour surfaces with the electrostatic potential unless the ESP option is specified (this information is courtesy of src/old/readgauss.cpp around line 1043).
  2. Convert the log file to Gaussian03 format using the bash script available here. (I hope they will fix this in the next version of Molekel.)
  3. Convert to dos format with unix2dos. If you don't do this, you will get a segfault from Molekel after opening the file. (They should fix this.)
  4. scp the file over to your Windows machine. (It would be nice to be able to open the file directly on the server using ssh; maybe this is planned for the future?)
  5. Start Molekel, File, Open, Choose "G03", then the filename and "Open". If you don't choose G03 and instead use "All files", it will fail horribly.
  6. Under Display, 3D View Properties, change Samples to "10" to add reasonable anti-aliasing. This also affects images saved to disk so it's worth doing as the difference is quite noticeable.
  7. Also under Display, Set Background to white, and turn off the bounding box and the axes. (There is a bug if you leave the axes on and choose scaling > 1 when saving an image - the axes are repeated several times.)

In practice, I carried out steps (2) and (3) using Cygwin on Windows, but it would be easier to do this on the Linux server. (As a side issue, I note that only the File, Edit, Interaction and Help menu are accessible through keyboard shortcuts - hopefully this will be fixed in a future release.)

To actually draw the orbitals and so forth you go to Surfaces/Electron Density, but see the video tutorials on the Molekel website for more information. One thing that might save you some time is to generate all of the orbital isosurfaces you want first, and then go to File/Save Orbital Pictures and choose a folder - this automatically saves all of the orbital images one after the other with sensible names.

As a final point, if you have a better graphics card than I do, you will be able to apply some really impressive shading effects using the GLSL shaders under Display/Shaders. Again see the video tutorials on the Molekel website for a demo.

Tuesday, 15 June 2010

A non-random method to improve your QSAR results - works every time!

I was going to say this was a novel method, but unfortunately it isn't. The idea of splitting a data set into training and test sets non-randomly is one that is very appealing, seems quite reasonable at first, and leads to great results. And those results are completely misleading.

The safest procedure when doing the two-third/one-third split (or whatever) of a dataset into training and test sets is to choose the split randomly. As soon as you do anything else, the performance of your results on the test set will not be related to its real-life predictive performance.

Now and then a paper comes out where the training set is chosen to be a diverse set (whether by a Kennard-Stone type procedure, gridding the points, or initial clustering). At first this seems quite reasonable: you've made sure that you have chosen points that span the entire space of your dataset. In actual fact, what you've just done is to ensure that all of your test set points are close to a point in your training set. This means that the predictions on the test set are now woefully optimistic and unrepresentative of predictions on a true test set (which probably won't have the courtesy to be close to points in your training set, nor be distributed evenly across the whole space).

Here's another way of thinking about it. Let's imagine that our predictive model is meant to work on a particular chemical space of molecules (the domain of applicability or so). These molecules have a particular distribution of descriptor variables (or whatever). And both the training and test sets are supposed to be representative samples of this distribution. However, once you use a non-random method to split into training and test sets, both the training and test sets become less and less representative of the population and hence you will have little idea of the real-life predictive ability of your model.

I've never seen this idea written down anywhere, so I'd be interested to know whether anyone thinks there is justification for choosing a non-random training set in certain circumstances. My own advice though is to keep it real and keep it random.

Image credit: Arenamontanus

Thursday, 10 June 2010

How to get an experimental ligand structure from the PDB? Part 2

I have a list of 36 ligands whose structures I need to get from specific PDB entries. Matteo's method for getting their structures from the PDB Ligand Expo worked fine for all but 2.

The underlying problem was the same in both cases; the authors had decided that their ligand was actually composed of several 'residues' and had given the different parts different names. The PDB had not taken this craziness into account and had split the ligand into separate ones based on the HETATM residue names.

So, a little bit of extra legwork for the first case (1pph). I searched for the three ligands in the large SDF file, copied and pasted them into a new SDF, and joined the ligands into one entry with "babel --join". Finally I opened it in Avogadro and drew the missing bonds between the 'residues'.

Fine.

The second case was worse (1ppc). Only 3 of the 4 residues made it to the PDB's ligand SDF file; the fourth residue was a 'glycine' and was not labelled as HETATM in the PDB file - as a result it was not included in the ligand SDF file.

Time to get the sledgehammer. I don't use the PDB file format in Open Babel very often, so I wasn't sure if the following would work:
   babel 1PPC.pdb --separate split.sdf

The second molecule in split.sdf is the complete ligand. Nice.

Now you might be wondering why I didn't just do this in the first place for all of the ligands. Well, the PDB file format doesn't contain bond order information, and so I was hoping to take advantage of the work the PDB had put into preparing its own lignad SDF files (which I think was one of the issues addressed during the PDB remediation process a few years ago).

If 2 out of my 36 ligands had this same problem, I wonder how common it is in the PDB data? Maybe I should make a list of cases and send it to the PDB.

How to get an experimental ligand structure from the PDB?

The Blue Obelisk Q&A website at http://blueobelisk.shapado.com is really turning into a useful resource. There are now over 100 user accounts, and the quality of the questions and answers is generally quite high. Although several of the questions relate to open source cheminformatics software, many are simply general cheminformatics questions. You can log in with OpenID or your Google account, or even ask questions anonymously (but this doesn't work so well, as you won't be updated when it's answered).

Q&A sites are much better than forums or mailing lists for handling technical questions because there is no repetition; each question only gets asked once (how times have you seen the same question on the CCL for example?). If additional information becomes available, that can be added as a new answer (unlike a forum, where adding to an old forum thread is discouraged). Also, if a question is ambiguous, the original author can correct it after some prompting (in a forum, the corrected question may appear half-way down the page).

Anyway, I recently asked the following question which is related to my research:
Given a PDB code, what is the simplest way to get an SDF file with the experimental coordinates of a particular ligand?

My current procedure is to go to the PDB Ligand Expo, and use the “Search for instances of chemical components by 3-letter ID code” at the bottom with “PDB entry codes + coordinates files” chosen. This requires me to specify the 3-letter ID code, which I need to have obtained from the web page for the PDB file.

On the search results page, I can simply click on the SDF file for the particular PDB file in which I am interested to download the experimental coordinates.

Is there an easier way? For example, it does not seem to be possible to go directly from the website for a particular PDB entry.

(I would also be interested in the same question applied to a handling a large number of PDB codes at the same time.)
Matteo Floris provided the answer:
First, you should download this dictionary (Tabulation of PDB entries containing each chemical component); in this way you can get the full list of ligands associated to PDB entries.
Then, the best is to download the SDF file with all the ligands and use the dictionary for the mapping PDB identifier -> list of associated ligands.

...this gives unique combinations of PDB entry – ligand – PDB entry chain… let me show two examples: the first is the ligand ZU3, for which we have the coordinates in all the PDB where this ligand is present:

2zu3_ZU3_1_A_501B 2zu4_ZU3_1_A_501B

the second example, the ligand ZZC, that is present in the SDF file for the same PDB entry but in 2 different chains (A and B):

2wod_ZZC_1_A_401D 2wod_ZZC_1_B_401F

Now to test it out!