Comments on Noel O'Blog: Pybel - Just how unique are your molecules?
8 comments (feed last updated 2024-01-31)

Noel O'Boyle (2009-09-05 23:40):
This post is from more than two years ago, so I don't really have those figures to hand at this stage. I should say that the canonical SMILES code in OpenBabel has been improved since then (particularly in terms of stereochemistry), and is undergoing further improvements right now (for 2.3). When that's done, it should be a very useful tool for accurately identifying duplicates.

Anonymous (2009-09-03 15:22):
Could you let me know how many duplicates you found with .smi and with canonical SMILES? I am using both to check for duplicates.

Noel O'Boyle (2007-07-09 22:25):
@Felix: It was originally intended as simply an example of using Pybel, rather than an analysis of ZINC.

However, due in particular to Geoff's comment on using canonical SMILES, there will be a follow-up post with more information.

Felix (2007-07-09 22:00):
Wouldn't it make sense to look at a few examples to see whether the molecules are different or not, and not just look at numbers?

Geoff Hutchison (2007-07-08 05:50):
Actually, you'd be better off using the "can" (Canonical SMILES) format, rather than SMILES, for detecting duplicates / uniques. If you use regular SMILES, you're assuming the atom ordering is the same in every entry, which is unlikely.

Both the FP and InChI approaches will perform some measure of canonicalization, so you might as well pick the Canonical SMILES. It's in Open Babel 2.1.x and later.

I'd be curious if you get different results (but please use 2.1.1 or later).

Anonymous (2007-07-07 17:59):
What about stereochemistry?

And, you should check

John D. MacCuish, Christos Nicolaou, and Norah E. MacCuish
J. Chemical Information & Computer Sciences 41(1):134–146, 2001.
DOI 10.1021/ci000069q

Cheers, Joerg

Noel O'Boyle (2007-07-07 09:22):
I think that path-based FPs (such as the Daylight-type FP used here), although not guaranteed to be unique, hardly ever have clashes. It would be interesting to select all molecules in the entire ZINC dataset which are unique by InChI (not including the stereochemistry level, which an FP doesn't detect) and see whether there are any clashes. I bet there won't be.

I would tend to agree about the structural feature fingerprints though (which Open Babel also has). These are probably more suited to ensuring that a dataset has a diverse range of chemistry, rather than to avoiding duplicates.

Rajarshi (2007-07-06 23:10):
I don't think using FPs to characterize uniqueness is a reliable approach. If a fingerprint is defined as N bits corresponding to N structural features, it is possible that two molecules share the same set of M structural features, but one of them also has a feature that is not considered by the fingerprint. The two structures would then have the same fingerprint even though they are actually different.

This situation is directly applicable to the structural keys (MACCS, BCI etc).

Obviously, this problem is alleviated by a large enough feature set, or else by using hashed fingerprints.
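
Rajarshi's point about structural keys can be shown with a toy sketch. This is only an illustration in plain Python, not Open Babel or Pybel code: the feature names, the `KEY_FEATURES` tuple, and the `structural_key` helper are all hypothetical, and molecules are reduced to sets of feature labels.

```python
# Toy illustration only: the feature names and KEY_FEATURES are hypothetical
# and are not taken from MACCS, BCI, or Open Babel.
KEY_FEATURES = ("aromatic_ring", "carbonyl", "hydroxyl", "amine")


def structural_key(features):
    """Return one bit per key feature: 1 if the molecule has it, else 0."""
    return tuple(int(name in features) for name in KEY_FEATURES)


# Two different "molecules", described only by the features they contain.
mol_a = {"aromatic_ring", "hydroxyl"}
mol_b = {"aromatic_ring", "hydroxyl", "nitro"}  # "nitro" is not a key feature

fp_a = structural_key(mol_a)
fp_b = structural_key(mol_b)

print("fingerprints:", fp_a, fp_b)
print("same fingerprint:", fp_a == fp_b)
print("same feature set:", mol_a == mol_b)
```

The two fingerprints compare equal although the feature sets differ, because the distinguishing feature has no bit in the key. Adding that feature to the key set would separate them, which is the "large enough feature set" remedy mentioned in the comment.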