SmiZip is a compressor for SMILES strings that works by assigning unused characters from the extended ASCII set to commonly occurring substrings (multigrams). So, for example, while the character "C" is already used for carbon, part of chlorine, etc., the character "J" could be used to indicate "C(=O)".
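To make the idea concrete, here is a minimal sketch of substitution-based compression. It is not SmiZip's actual code: the two-entry table, the function names and the greedy longest-match strategy are all assumptions chosen for illustration.

```python
# Hypothetical toy table: each otherwise-unused character stands for one
# multigram. These mappings are illustrative, not SmiZip's trained set.
MULTIGRAMS = {"J": "C(=O)", "Q": "c1ccccc1"}


def compress(smiles, table=MULTIGRAMS):
    """Greedy longest-match substitution (an assumed strategy)."""
    # Try longer multigrams first so they win over shorter ones.
    by_length = sorted(table.items(), key=lambda kv: len(kv[1]), reverse=True)
    out = []
    i = 0
    while i < len(smiles):
        for code, gram in by_length:
            if smiles.startswith(gram, i):
                out.append(code)
                i += len(gram)
                break
        else:
            # No multigram starts here: copy the character through unchanged.
            out.append(smiles[i])
            i += 1
    return "".join(out)


def decompress(text, table=MULTIGRAMS):
    """Expand each code character back to its multigram."""
    return "".join(table.get(ch, ch) for ch in text)


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
packed = compress(aspirin)
print(packed, len(aspirin), "->", len(packed))
```

Decompression is a plain lookup because the code characters never appear in an uncompressed SMILES string, so there is no ambiguity about what each one means.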
Most of the v2.0 changes (listed below) are tidy-ups or minor improvements, but one in particular simplifies a specific use case: encoding into printable ASCII characters only. This means that the compressed string can be stored in places where the full 256 characters are not supported, e.g. a JavaScript file (you could encode such characters as Unicode escapes, but that would increase the file size).
As an example, I trained on a dataset derived from ChEMBL and was able to compress to 30.8% of the original size on a hold-out test set using all available characters (202 multigrams). But even sticking to the printable ASCII character set, I was able to compress to 43.7% despite only having 43 characters available for multigrams. We benefit here from the fact that there are diminishing returns with each additional multigram: the majority of the benefit is derived from the initial multigrams.
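One way to see where a budget of spare characters comes from is to count the printable ASCII characters that never occur in the training SMILES. The following is a minimal sketch under stated assumptions: the helper name `spare_printable_chars` is hypothetical, treating the space character as usable is a guess, and this is not how find_best_ngrams.py chooses its list.

```python
def spare_printable_chars(smiles_list):
    """Return printable ASCII characters that appear in none of the SMILES."""
    # Printable ASCII is taken here as code points 32 (space) to 126 (~).
    printable = {chr(i) for i in range(32, 127)}
    used = set()
    for smiles in smiles_list:
        used.update(smiles)
    return sorted(printable - used)


# Tiny illustrative input; a real training set uses far more characters,
# leaving correspondingly fewer spare ones.
spare = spare_printable_chars(["CCO", "c1ccccc1"])
print(len(spare), "spare characters")
```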
As a specific example, here's CHEMBL505931 which is a polysaccharide attached to a sterol and compresses from 322 characters to 138 (42.9%):
[C@]12(O[C@@H](C[C@H]1C)C(=O)CC)[C@]1(CCC3=C(CC[C@H]4[C@@]([C@H](CC[C@]34C)O[C@@H]3O[C@@H]([C@H]([C@@H]([C@H]3O)O)O)CO[C@@H]3OC[C@@H]([C@@H]([C@H]3O[C@@H]3O[C@@H]([C@H]([C@@H]([C@H]3O[C@@H]3O[C@H]([C@@H]([C@H]([C@H]3O)O)O)C)O[C@@H]3OC[C@@H]([C@@H]([C@H]3O[C@@H]3OC[C@H]([C@@H]([C@H]3O)O)O)O)O)O)CO)O)O)(C)CO)[C@@]1(CC2)C)C
compresses to
]12(O>,D1C$ *):]1,*3=~*D4:@](D,C:]34CV>3O>(D(>(D3OVV$O>3OC>(>(D3O>3O>(D(>(D3O>3OD(>(D(D3OVV$V>3OC>(>(D3O>3OCD(>(D3OVVVVV$OVV),$O):@]1,C2$$
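The percentage quoted for CHEMBL505931 is simply the compressed length divided by the original length. A short check of that arithmetic, using the two character counts given above (the function name is just for illustration):

```python
def compression_percent(original_len, compressed_len):
    """Compressed size as a percentage of the original, to one decimal place."""
    return f"{compressed_len / original_len:.1%}"


print(compression_percent(322, 138))  # 138 / 322 = 0.4286, shown as 42.9%
```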
Changes: v2.0 (2025-01)
- find_best_ngrams.py: new option --non-printable to facilitate encoding into printable ASCII characters
- find_best_ngrams.py: --chars is now required (help text provides a reasonable starting point) to force the user to consider the list
- find_best_ngrams.py: if the end of the training .SMI file is reached, the script wraps around to the start
- find_best_ngrams.py: --cr corrected to --lf
- compress.py: a better error message is generated if an attempt is made to encode a character not present in the JSON file
- compress.py: support added for .SMI files without titles
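Two of the items above (wrapping around at the end of the training file, and accepting .SMI lines with no title) can be illustrated with a short sketch. This is not the code from either script: the function names are hypothetical, and the assumption is that an .SMI line holds a SMILES string optionally followed by whitespace and a title.

```python
import io


def parse_smi_line(line):
    """Split an .SMI line into (smiles, title); the title may be absent."""
    parts = line.strip().split(None, 1)
    smiles = parts[0]
    title = parts[1] if len(parts) > 1 else ""
    return smiles, title


def take_with_wraparound(handle, n):
    """Read n SMILES from a seekable file, restarting at the top on EOF."""
    out = []
    while len(out) < n:
        line = handle.readline()
        if not line:
            # End of file reached: wrap around to the start.
            handle.seek(0)
            line = handle.readline()
            if not line:
                raise ValueError("training file is empty")
        if line.strip():  # skip blank lines
            out.append(parse_smi_line(line)[0])
    return out


# In-memory stand-in for a small training file: one titled line, one without.
training = io.StringIO("CCO ethanol\nCCN\n")
print(take_with_wraparound(training, 5))
```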
Thanks to Adriano Rutz (@adafede) and Charles Tapley Hoyt (@cthoyt) for feedback.