Here are two different ways to generate multiple SMILES strings for the same molecule using Open Babel (without introducing dot disconnections). As an example, let's consider my favourite molecule: c1ccccc1C(=O)Cl.
The first approach is to use canonical SMILES...except that the canonical labels are generated randomly. You can do this directly at the command line (see "obabel -Hsmi" for more info):
>obabel -:c1ccccc1C(=O)Cl -osmi -xC
O=C(c1ccccc1)Cl
Each time you run it, a randomly chosen SMILES string is generated [1]. Repeat it often enough and you will see a total of 16 distinct variants (in this case):
C(=O)(Cl)c1ccccc1
C(=O)(c1ccccc1)Cl
ClC(=O)c1ccccc1
O=C(Cl)c1ccccc1
O=C(c1ccccc1)Cl
c1(C(=O)Cl)ccccc1
c1(ccccc1)C(=O)Cl
c1c(C(=O)Cl)cccc1
c1c(cccc1)C(=O)Cl
c1cc(C(=O)Cl)ccc1
c1cc(ccc1)C(=O)Cl
c1ccc(C(=O)Cl)cc1
c1ccc(cc1)C(=O)Cl
c1cccc(C(=O)Cl)c1
c1cccc(c1)C(=O)Cl
c1ccccc1C(=O)Cl
We can generate even more variants by specifying the output order directly - this overrides some decisions that are usually left to the SMILES writer and allows us, for example, to force single bonds to be followed before double bonds:
>obabel -:c1ccccc1C(=O)Cl -osmi -xo 1-2-3-4-5-6-7-9-8
c1ccccc1C(Cl)=O
Using this approach, 32 variants can be generated:
C(=O)(Cl)c1ccccc1
C(=O)(c1ccccc1)Cl
C(Cl)(=O)c1ccccc1
C(Cl)(c1ccccc1)=O
C(c1ccccc1)(=O)Cl
C(c1ccccc1)(Cl)=O
ClC(=O)c1ccccc1
ClC(c1ccccc1)=O
O=C(Cl)c1ccccc1
O=C(c1ccccc1)Cl
c1(C(=O)Cl)ccccc1
c1(C(Cl)=O)ccccc1
c1(ccccc1)C(=O)Cl
c1(ccccc1)C(Cl)=O
c1c(C(=O)Cl)cccc1
c1c(C(Cl)=O)cccc1
c1c(cccc1)C(=O)Cl
c1c(cccc1)C(Cl)=O
c1cc(C(=O)Cl)ccc1
c1cc(C(Cl)=O)ccc1
c1cc(ccc1)C(=O)Cl
c1cc(ccc1)C(Cl)=O
c1ccc(C(=O)Cl)cc1
c1ccc(C(Cl)=O)cc1
c1ccc(cc1)C(=O)Cl
c1ccc(cc1)C(Cl)=O
c1cccc(C(=O)Cl)c1
c1cccc(C(Cl)=O)c1
c1cccc(c1)C(=O)Cl
c1cccc(c1)C(Cl)=O
c1ccccc1C(=O)Cl
c1ccccc1C(Cl)=O
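To see exactly which strings the explicit output order adds, the two lists above can be compared as plain sets. This is only a sketch that works on the strings as text (no Open Babel needed); the variable names are mine and purely illustrative.

```python
# The 16 variants listed above for random canonical labels (-xC).
random_label_variants = """
C(=O)(Cl)c1ccccc1 C(=O)(c1ccccc1)Cl ClC(=O)c1ccccc1 O=C(Cl)c1ccccc1
O=C(c1ccccc1)Cl c1(C(=O)Cl)ccccc1 c1(ccccc1)C(=O)Cl c1c(C(=O)Cl)cccc1
c1c(cccc1)C(=O)Cl c1cc(C(=O)Cl)ccc1 c1cc(ccc1)C(=O)Cl c1ccc(C(=O)Cl)cc1
c1ccc(cc1)C(=O)Cl c1cccc(C(=O)Cl)c1 c1cccc(c1)C(=O)Cl c1ccccc1C(=O)Cl
""".split()

# The 32 variants listed above for an explicit output order (-xo).
output_order_variants = """
C(=O)(Cl)c1ccccc1 C(=O)(c1ccccc1)Cl C(Cl)(=O)c1ccccc1 C(Cl)(c1ccccc1)=O
C(c1ccccc1)(=O)Cl C(c1ccccc1)(Cl)=O ClC(=O)c1ccccc1 ClC(c1ccccc1)=O
O=C(Cl)c1ccccc1 O=C(c1ccccc1)Cl c1(C(=O)Cl)ccccc1 c1(C(Cl)=O)ccccc1
c1(ccccc1)C(=O)Cl c1(ccccc1)C(Cl)=O c1c(C(=O)Cl)cccc1 c1c(C(Cl)=O)cccc1
c1c(cccc1)C(=O)Cl c1c(cccc1)C(Cl)=O c1cc(C(=O)Cl)ccc1 c1cc(C(Cl)=O)ccc1
c1cc(ccc1)C(=O)Cl c1cc(ccc1)C(Cl)=O c1ccc(C(=O)Cl)cc1 c1ccc(C(Cl)=O)cc1
c1ccc(cc1)C(=O)Cl c1ccc(cc1)C(Cl)=O c1cccc(C(=O)Cl)c1 c1cccc(C(Cl)=O)c1
c1cccc(c1)C(=O)Cl c1cccc(c1)C(Cl)=O c1ccccc1C(=O)Cl c1ccccc1C(Cl)=O
""".split()

# Strings that only appear when the output order is specified explicitly.
only_with_explicit_order = sorted(
    set(output_order_variants) - set(random_label_variants)
)

print(len(random_label_variants), len(output_order_variants))
print(len(only_with_explicit_order))
for smi in only_with_explicit_order:
    print(smi)
```

Every string from the first list also appears in the second, and the 16 extra strings are the ones where the neighbours of the carbonyl carbon are visited in an order the SMILES writer would not normally choose, for example c1ccccc1C(Cl)=O.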
In summary, these approaches allow you to generate all possible SMILES strings consistent with a depth-first ordering of atoms [2], starting from different points and choosing different routes at each branch point. For machine learning, I'd imagine that the first approach would be preferred as the second approach will generate SMILES strings that will contain substrings that would never be observed normally (in Open Babel SMILES).
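To make "starting from different points and choosing different routes at each branch point" concrete, here is a toy depth-first enumerator in pure Python. It is a sketch under simplifying assumptions, not how Open Babel's SMILES writer works: it only handles acyclic molecules (no ring closures, aromaticity or stereo), so I use acetyl chloride, CC(=O)Cl, as a ring-free stand-in for the molecule above. The atom table, bond table and function names are all hypothetical.

```python
from itertools import permutations, product

# Toy molecule (assumption): acetyl chloride, CC(=O)Cl, chosen because it
# has no rings. Atom indices are 0-based positions in SYMBOLS.
SYMBOLS = ["C", "C", "O", "Cl"]
# Bond symbol keyed by (low index, high index); "" means a single bond.
BONDS = {(0, 1): "", (1, 2): "=", (1, 3): ""}


def neighbours(atom):
    """Return the atoms bonded to the given atom."""
    return [b if a == atom else a for (a, b) in BONDS if atom in (a, b)]


def bond_symbol(a, b):
    """Return the SMILES symbol for the bond between atoms a and b."""
    return BONDS[(min(a, b), max(a, b))]


def dfs_smiles(atom, parent=None):
    """All SMILES fragments from a depth-first walk rooted at this atom."""
    children = [n for n in neighbours(atom) if n != parent]
    results = []
    # Try every order in which the branches could be followed.
    for order in permutations(children):
        options = [
            [bond_symbol(atom, child) + frag for frag in dfs_smiles(child, atom)]
            for child in order
        ]
        for combo in product(*options):
            # All branches except the last one are wrapped in parentheses.
            branches = "".join("(%s)" % frag for frag in combo[:-1])
            last = combo[-1] if combo else ""
            results.append(SYMBOLS[atom] + branches + last)
    return results


def all_dfs_smiles():
    """Collect the distinct strings over every possible starting atom."""
    found = set()
    for start in range(len(SYMBOLS)):
        found.update(dfs_smiles(start))
    return sorted(found)


variants = all_dfs_smiles()
print(len(variants))
for smi in variants:
    print(smi)
```

For this toy molecule the enumeration gives 12 distinct strings, including both CC(=O)Cl and CC(Cl)=O. That mirrors the second approach above, where nothing forces the double bond to be followed before the single bond.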
Python code
import random
random.seed(1)

try:
    from openbabel import pybel  # Open Babel 3.x
except ImportError:
    import pybel  # Open Babel 2.x

def randomlabels(mol, N):
    """Write N SMILES using randomly generated canonical labels (-xC)."""
    ans = set()
    for i in range(N):
        ans.add(mol.write("smi", opt={"C": True}).rstrip())
    return sorted(ans)

def randomorder(mol, N):
    """Write N SMILES using a randomly shuffled output order (-xo)."""
    ans = set()
    numatoms = mol.OBMol.NumAtoms()
    for i in range(N):
        idxs = list(range(1, numatoms+1))
        random.shuffle(idxs)
        # The "o" option takes the atom order as "1-2-3-..."
        optval = "-".join(str(x) for x in idxs)
        ans.add(mol.write("smi", opt={"o": optval}).rstrip())
    return sorted(ans)

if __name__ == "__main__":
    mol = pybel.readstring("smi", "c1ccccc1C(=O)Cl")

    print("Random canonical labels")
    randomsmis = randomlabels(mol, 500)
    print(len(randomsmis))
    for smi in randomsmis:
        print(smi)
    print()

    print("Random output order")
    randomsmis = randomorder(mol, 500)
    print(len(randomsmis))
    for smi in randomsmis:
        print(smi)
    print()
Notes:
1. An alternative (but slower) way to generate these same SMILES would be to shuffle the atoms in the OBMol and then write it out as a SMILES string.
2. If dot disconnections are tolerated, then see Andrew Dalke's approach.