Thursday 26 January 2012

Visualising the fragments in a path-based fingerprint

With any tool, users will often come up with ways of applying it that the developers of the tool did not originally anticipate. In the Open Babel paper, I gave an example where the InChI was used (by Fábián and Brock) as part of a workflow to identify a specific class of racemic crystals - I don't think that NIST had this use-case originally in mind.

Similarly, although Open Babel's path-based (or Daylight-type) fingerprint FP2 was developed for similarity searching of databases, we realised that users wanted to use the information in the fingerprint for other purposes. From time to time, someone would ask on the mailing list what fragments corresponded to each of the 1024 bits. At first, our response was to point out that we couldn't really say as (a) more than one fragment might correspond to a particular bit and (b) the hashing algorithm that was used to link the fragments and the bits only worked one-way.

Eventually we realised that people wanted something more, and so Chris added an output option to describe the fragments and their corresponding bits. These can be used just like fragments from other fragmentation schemes (looking for privileged fragments, unusual fragments, whatever), and the purpose of this blog post is to show how to get to grips with these fragments by visualising them.

The example molecule is:
generated by:
obabel -:N1CC1C(=O)Cl -O example.png
And here are the corresponding fragments generated by the FP2 fingerprint (scroll to zoom in the image below, click+drag to pan):Note: In the visualisation above, hydrogens should be ignored as they are not included in the paths (we could add an option to the SVG depiction to suppress these if necessary). Also, aromatic bonds are depicted as single bonds unless a complete aromatic ring is present in the fragment.

So...how is it done?

The first step in creating this visualisation is to generate a description of the bits in the corresponding FP2 fingerprint:
obabel -:N1CC1C(=O)Cl -ofpt -xs -xf FP2 > example.txt
example.txt will contain the following:
>
0 6 1 6 <670>
0 6 1 6 1 6 <260>
0 6 1 7 1 6 <693>
0 6 1 7 1 6 1 6 <9>
0 7 1 6 <82>
0 7 1 6 1 6 <906>
0 7 1 6 1 6 1 6 <348>
0 8 2 6 <623>
0 8 2 6 1 6 <329>
0 8 2 6 1 6 1 6 <652>
0 8 2 6 1 6 1 6 1 7 <635>
0 8 2 6 1 6 1 7 <653>
0 8 2 6 1 6 1 7 1 6 <46>
0 17 <17>
0 17 1 6 <328>
0 17 1 6 1 6 <219>
0 17 1 6 1 6 1 6 <1009>
0 17 1 6 1 6 1 6 1 7 <24>
0 17 1 6 1 6 1 7 <1010>
0 17 1 6 1 6 1 7 1 6 <456>
0 17 1 6 2 8 <329>
1 7 1 6 1 6 <225>
The help text for the FPT format explains what this means:
obabel -H fpt
...
For the path-based fingerprint FP2, the output from the ``-xs`` option is
instead a list of the chemical fragments used to set bits, e.g.::

 $ obabel -:"CCC(=O)Cl" -ofpt -xs -xf FP2
 >
 0 6 1 6 <670>
 0 6 1 6 1 6 <260>
 0 8 2 6 <623>
 ...etc

where the first digit is 0 for linear fragments but is a bond order
for cyclic fragments. The remaining digits indicate the atomic number
and bond order alternatively. Note that a bond order of 5 is used for
aromatic bonds. For example, bit 623 above is the linear fragment O=C
(8 for oxygen, 2 for double bond and 6 for carbon).
...
If we want to visualise these fragments, a small Python script can read example.txt, create the corresponding molecules, and write out their SMILES strings to output.smi:Visualising a file full of SMILES strings is then easy. The following line generates the SVG depiction shown above:
obabel output.smi -O fragments.svg -xC

Tuesday 17 January 2012

What's up dock? - Calculate the RMSD between docked and crystal poses

In cheminformatics, there are two reasons why one might want to calculate the RMSD between two conformers. The first is to check whether two conformers are very close in structure - e.g. for the purpose of generating a diverse set of conformers. This problem is solved using (1) a least squares alignment followed by (2) calculation of the RMSD.

The other situation is comparing two sets of 3D coordinates to see whether a prediction method has accurately reproduced experimental coordinates (e.g. docking). This just requires step (2) above.

The situation is complicated a little bit by the fact that only "heavy" atoms (i.e. non-H atoms in this context) are typically used to calculate the RMSD. A much greater complication is that automorphisms (well, isomorphisms of two molecules which are identical, to be exact) must be taken into account in both cases above. For example, consider the case where two para-substituted benzene rings must be compared; the RMSD calculation must take into account the fact that a 180 degree flip of the ring might yield a smaller RMSD.

Anyhoo, here's some Pybel code that will calculate the RMSD between a crystal pose and a set of docked poses. The code also illustrates how to access the isomorphisms. You should modify the code for your specific purpose: