Thursday 26 January 2012

Visualising the fragments in a path-based fingerprint

With any tool, users will often come up with ways of applying it that the developers of the tool did not originally anticipate. In the Open Babel paper, I gave an example where the InChI was used (by Fábián and Brock) as part of a workflow to identify a specific class of racemic crystals - I don't think that NIST had this use-case originally in mind.

Similarly, although Open Babel's path-based (or Daylight-type) fingerprint FP2 was developed for similarity searching of databases, we realised that users wanted to use the information in the fingerprint for other purposes. From time to time, someone would ask on the mailing list what fragments corresponded to each of the 1024 bits. At first, our response was to point out that we couldn't really say as (a) more than one fragment might correspond to a particular bit and (b) the hashing algorithm that was used to link the fragments and the bits only worked one-way.

Eventually we realised that people wanted something more, and so Chris added an output option to describe the fragments and their corresponding bits. These can be used just like fragments from other fragmentation schemes (looking for privileged fragments, unusual fragments, whatever), and the purpose of this blog post is to show how to get to grips with these fragments by visualising them.

The example molecule is:
generated by:
obabel -:N1CC1C(=O)Cl -O example.png
And here are the corresponding fragments generated by the FP2 fingerprint (scroll to zoom in the image below, click+drag to pan):Note: In the visualisation above, hydrogens should be ignored as they are not included in the paths (we could add an option to the SVG depiction to suppress these if necessary). Also, aromatic bonds are depicted as single bonds unless a complete aromatic ring is present in the fragment.

So...how is it done?

The first step in creating this visualisation is to generate a description of the bits in the corresponding FP2 fingerprint:
obabel -:N1CC1C(=O)Cl -ofpt -xs -xf FP2 > example.txt
example.txt will contain the following:
>
0 6 1 6 <670>
0 6 1 6 1 6 <260>
0 6 1 7 1 6 <693>
0 6 1 7 1 6 1 6 <9>
0 7 1 6 <82>
0 7 1 6 1 6 <906>
0 7 1 6 1 6 1 6 <348>
0 8 2 6 <623>
0 8 2 6 1 6 <329>
0 8 2 6 1 6 1 6 <652>
0 8 2 6 1 6 1 6 1 7 <635>
0 8 2 6 1 6 1 7 <653>
0 8 2 6 1 6 1 7 1 6 <46>
0 17 <17>
0 17 1 6 <328>
0 17 1 6 1 6 <219>
0 17 1 6 1 6 1 6 <1009>
0 17 1 6 1 6 1 6 1 7 <24>
0 17 1 6 1 6 1 7 <1010>
0 17 1 6 1 6 1 7 1 6 <456>
0 17 1 6 2 8 <329>
1 7 1 6 1 6 <225>
The help text for the FPT format explains what this means:
obabel -H fpt
...
For the path-based fingerprint FP2, the output from the ``-xs`` option is
instead a list of the chemical fragments used to set bits, e.g.::

 $ obabel -:"CCC(=O)Cl" -ofpt -xs -xf FP2
 >
 0 6 1 6 <670>
 0 6 1 6 1 6 <260>
 0 8 2 6 <623>
 ...etc

where the first digit is 0 for linear fragments but is a bond order
for cyclic fragments. The remaining digits indicate the atomic number
and bond order alternatively. Note that a bond order of 5 is used for
aromatic bonds. For example, bit 623 above is the linear fragment O=C
(8 for oxygen, 2 for double bond and 6 for carbon).
...
If we want to visualise these fragments, a small Python script can read example.txt, create the corresponding molecules, and write out their SMILES strings to output.smi:Visualising a file full of SMILES strings is then easy. The following line generates the SVG depiction shown above:
obabel output.smi -O fragments.svg -xC

5 comments:

SteveR said...

As an alternative (this for a single line from the fp2 output, but easily adapted for the whole file):

line="5 7 5 6 5 6 5 6 5 6 5 6 <942>"
#Defined the SMILES strings for bond types in a dictionary
bondtypes={1:"-",2:"=",3:"#",5:":"}

part=map(int,line.split()[:-1])

#Do the first atom
SMILES="[#%s]"%((part[1],))

#Deal with rings
if part[0]!=0:
SMILES += bondtypes[part[0]]+"1"

#Now do the remaining atoms
for i in range(2,len(part)):
if i%2==0:
#bonds
SMILES += bondtypes[part[i]]
else:
#atoms
SMILES+="[#%s]" % ((part[i],))

#Now finish with rings
if part[0]!=0:
SMILES += "1"

print SMILES

sorry - python tabbing appears to have disappeared but should be able to figure it!

Noel O'Boyle said...

Ah! - a nice alternative, with no need to even bust out Pybel.

A minor correction though is that [#6] is not a valid SMILES (you're thinking of SMARTS where it matches both C and c). A simple lookup table would take of that of course.

mattarno said...

hi - just learning this concept for a chemoinformatics/bioinformatics course. very helpful, thanks. Just a question though: the coding scheme that 'translates' the fragments into the numbers (<826> etc.): is it a common coding system between all fingerprinting methods? i.e. does O=C always become 826? And if the coded molecule contains O=C then the 826th digit in the bitstring would be a 1 not a 0: is that correct?

cheers for any help,

Matt

mattarno said...

...i meant <623> not 826...!

Noel O'Boyle said...

Fragments are translated into numbers via a hashing algorithm. The algorithm is different from program to program. Also, some programs may set several bits (via different hashing algorithms) for one fragment. Not to mention that the fragments may be different in different cases, and that the fingerprints may be different lengths.

As you say, then the 826th bit would be 1.