Friday, 29 June 2007

Parsing SMILES with a frown?

In a recent post, Andrew Dalke compares SMILES parsing in Frowns (a now unsupported Open Source Python cheminformatics library) and the CDK (an actively developed Open Source Java cheminformatics library). Andrew is somewhat of a parsing expert - indeed, BioPython is built upon a parser called Martel which Andrew developed. The Frowns SMILES parser, contributed by Andrew, shows what an expert can do. It would be interesting to know whether similiar code could be incorporated into the CDK and OpenBabel, and if so, what is the speed tradeoff involved?

Andrew also discusses the compression of SMILES strings, which James Melville and Johnathan Hirst have also recently been studying. Rajarshi has also reviewed this work.

10 comments:

Rajarshi said...

The CDK SMILES parsing code is certainly inelegant. I would love to see a Javacc based version which should be much cleaner and simpler (and easier on the eyes!)

Geoff Hutchison said...

I'm not sure if there are particularly elegant SMILES parsers. It can just get ugly with what some people do.

That said, I'd like to think that the Open Babel SMILES parser is fairly well "battle-tested." There are certainly some remaining bugs, but it works. One catch is that there are often ways of doing faster parsing and interpretation.

Unknown said...

I heavily recommend a proper EBNF solution, and the full MQL syntax was already published. BTW, this is the first time that a EBNF notation was published in the cheminformatics/modeling area!

See blog post and paper.

Reference
E. Proschak, J. K. Wegner, A. Schüller, G. Schneider, U. Fechner, Molecular Query Language (MQL)-A Context-Free Grammar for Substructure Matching, J. Chem. Inf. Model., 47, 295-301, 2007,. DOI 10.1021/ci600305h

Andrew Dalke said...

Geoff: I'm not sure if there are particularly elegant SMILES parsers.

Have you seen my SMILES parser in FROWNS? The entire tokenizer is under 200 lines of Python code.

It is not the complete parser, but the rest of the parser is not complicated.

One time I even ported it to Javascript so the client-side code could do some format autodetection. If the first line contained more than 3 SMILES tokens then it was almost certainly a SMILES string.

Out of idle curiosity I would like to get a blisteringly fast SMILES parser, hence why I was looking at Ragel. The tokenizer I did there should need no backtracking and no forward probing.

Andrew Dalke said...

Joerg: I heavily recommend a proper EBNF solution

Yes... and no. Depends on what you want to do. Suppose you want to allow depiction of partial SMILES. For example, as I'm typing "c1ccccc1" I see the depictions for "C", "CC", "CCC", "CCCC", "CCCCC", then "c1ccccc1". I think it was ChemAxon which had an ActiveX plugin which allowed just that.

You can't do that with BNF definition, at least not one which enforces balanced parens, closed ']', etc. Whereas at the tokenizer level you can, because it's simpler.

Of course the tokenizer is still a "proper" EBNF. It's just not enforcing things that a context-free grammar can handle directly. But for SMILES those are trivial to implement manually.

Andrew Dalke said...

Joerg: BTW, this is the first time that a EBNF notation was published in the cheminformatics/modeling area!

Are you sure about that? I came across a comment by John Bradshaw saying:

Beilstein ... came up with a notation called ROSDAL . ... It is defined by means of a formal grammar expressed in Backus-Naur form, the meta notation commonly used in the definition of computer languages.

The references seem to be

Barnard, J.M.; Jochum, C.J; Welford, S.M. (1989) ROSDAL: A universal structure/substructure representation for PC-host communication. In Chemical Structure Information Systems, Warr, W.A. (Ed.) ACS Symposium Series, 400, pp. 76-81. Washington: American Chemical Society.

Barnard, J.M. (1990) Draft specification for revised version of the Standard Molecular Data (SMD) Format, J. Chem. Inf. Comput. Sci., 30, 81-96.

I just ordered a used copy of the first book. I'll have to go to the library for the JCICS copy, but from what I've found so far it probably doesn't mention BNF.

Unknown said...

Mmmmh, Andrew can you please elaborate on the 'partial SMILES' thing. I am not sure if I get the message. And why are you switching from an aliphatic series (CCCCC) directly to an aromatic ring (c1ccccc1)?

With the 'first' EBNF publication I am not sure either, but the publications you mentioning are more than 15 years ago. Our goal was only overcoming some recent syntax drawbacks and we never wanted to offend or compete with anybody. This was a pure scientifically driven collaboraration and syntax! Anyway, if other people have indeed done that before I would love to read about it. If this means that we have to collect this information online then this is even better ;-)
Thanks for your comments and I have to leave now for some social life things, called Salza!;-)
Cheers, Joerg

Andrew Dalke said...

joerg: can you please elaborate on the 'partial SMILES' thing. ... why are you switching from an aliphatic series (CCCCC) directly to an aromatic ring (c1ccccc1)?

Sure. Computers are fast. If you are typing in a SMILES string interactively, the program can interpret the string as best it can, and display an incomplete/partial result.

Given that "c1ccc" is incomplete, one way to interpret it is to toss the "1" ring closure and change the aromatics to aliphatic, so the structure view shows "CCCC".

The input string is not a SMILES. It's a partially completed SMILES. And my point is that sometimes it's nice to have the software give you that partial display.

As another example, it could display a "~"-like underscore (as with some text editors when there is a misspelling) to show you the location of a syntax error, or highlight incomplete closures and branches.

These are things that can't be done with a BNF-style language description, at least not one which demands matching parens before it accepts the grammar.

On the other hand, there are few people that enter SMILES by hand this way. Most cut and paste the SMILES, or draw in the structure. So the feature I'm talking about isn't that important. It's only suggesting that there are cases where a full BNF grammar makes certain tasks harder.

By "salza" do you mean "salsa" dancing? I do that too, and tango. I got back into salsa because of a Python programmer at EBI, and I know another Python programmer (ex- MPI Heidelberg) who's a salsa dancer.

Unknown said...

Andrew, I agree that there might be several applications after the syntax was parsed. You can store the parsed syntax for efficiency reasons in trees or graphs. You can check chemical correctness, display partial or non-partial things which make sense, do a similarity search, give a value prediction, ... you name it!

I would distinguish here a little bit. What we need are three things:
1. A very flexible and extendable syntax (e.g. MQL, FlexMol).
I do not care if EBNF makes it more complicated to parse things, I am more interested for having a good syntax, which fullfills the gaps in actual query language systems and helping developing more collaboration. The clever computer scientist out there should make those things fast, we should ensure that we improve the 'molecular drug space' mining side of it.

2. A compact and maybe cryptical language like WLN for people which really want to enter it by hand. But usually I agree with Andrew that most people get the syntax from drawing programs, but then extend it with all the fancy things you can do with query languages, e.g. the $ in SMARTS, or the chiralities in FlexMol.

Both of those things should come with a clear standard for allowing exchanging information. So, we need XML or line notations with proper versioning and plug-in support.

3. After that there can be two levels of validation. Level one is checking the chemical expert system and creating proper molecules in 2D or 3D.
Level two is storing things in an efficient way for allowing fast queries. As usual for different tasks different 'efficiency' criterias might be used, e.g. 3D conformations, fingerprints, pharmacophores, chemogenomics, and whatever else.

To make it now short, I think we are talking about the same things and I would not mix everything in the query language, but would rather encourage a growing standard over time by extending or modifying it.
The more critical question is how all the actual projects can use this? I think the language is only useful if it comes with parsers in Python, Java, .NET, and C++. Only this would guarantee that all projects are really exchanging information on a readable chemical expert level defined by the query language.

Yepp 'salsa', I know that you are doing this, too! This was the reason mentioning it ;-) Though I am still on a beginner level we are a nice group now and are thinking about taking more courses.

Andrew Dalke said...

Some 7 years ago, Jörg Kurt Wegner commented "BTW, this is the first time that a EBNF notation was published in the cheminformatics/modeling area!"

I've been reading through old copies of J. Chem. Documentation, and came across a paper which brought that discussion to mind. The article "A Computer Program to Index or Search Linear Notations", J. Chem. Doc., 1968, 8 (3), pp 128–129, DOI: 10.1021/c160030a003 uses a BNF grammar description, and cites the original 1960 BNF paper.

I think this is a much better counter-example than the railroad track diagram version I pointed out previously.