Noel O'Blog: SmiZip 2.0 released

Saturday, 1 February 2025

SmiZip is a compressor for SMILES strings that puts the otherwise unused characters of the extended ASCII set to work: each one stands for a multigram, a frequently occurring run of SMILES characters. For example, the character "C" is already taken (carbon, part of chlorine, etc.), but the character "J" is free and could be used to indicate "C(=O)".
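
To make the idea concrete, here is a minimal sketch of multigram substitution in Python. It is a toy, not SmiZip's actual code or API: the function names, the greedy longest-match strategy and the second multigram ("Q" for a benzene ring) are assumptions made for illustration; only the "J" for "C(=O)" pairing comes from the text above.

```python
# Toy illustration only: hypothetical names, not part of SmiZip.
TOY_MULTIGRAMS = {
    "J": "C(=O)",     # example from the text above
    "Q": "c1ccccc1",  # hypothetical extra multigram
}

def toy_compress(smiles, multigrams):
    """Greedily replace the longest matching multigram at each position."""
    # Try longer multigrams first so they win over shorter ones.
    by_length = sorted(multigrams.items(), key=lambda kv: len(kv[1]), reverse=True)
    out = []
    i = 0
    while i < len(smiles):
        for char, gram in by_length:
            if smiles.startswith(gram, i):
                out.append(char)
                i += len(gram)
                break
        else:
            # No multigram starts here: copy the character through unchanged.
            out.append(smiles[i])
            i += 1
    return "".join(out)

def toy_decompress(compressed, multigrams):
    """Expand each replacement character back into its multigram."""
    return "".join(multigrams.get(ch, ch) for ch in compressed)

aspirin = "CC(=O)Oc1ccccc1C(=O)O"
packed = toy_compress(aspirin, TOY_MULTIGRAMS)
print(packed)                                  # CJOQJO
print(toy_decompress(packed, TOY_MULTIGRAMS))  # CC(=O)Oc1ccccc1C(=O)O
```

The round trip only works because "J" and "Q" never occur in the input SMILES, which is exactly why the replacement characters have to be ones that SMILES does not already use.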

Most of the v2.0 changes (listed below) are tidy-ups or minor improvements, but one of them simplifies a particular use case: encoding into printable ASCII characters only. This means that the compressed string can be stored in places where the full set of 256 characters is not supported, e.g. a JavaScript file (you could encode such characters with Unicode, but that would increase the file size).

As an example, I trained on a dataset derived from ChEMBL and was able to compress a hold-out test set to 30.8% of its original size using all available characters (202 multigrams). But even sticking to the printable ASCII character set, I was able to compress to 43.7% despite having only 43 characters available for multigrams. We benefit here from diminishing returns: each additional multigram helps less than the one before, so the majority of the benefit comes from the initial multigrams.
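
The number of characters left over for multigrams depends on which characters the SMILES strings themselves use. The sketch below shows one way such a list could be worked out; the function name and the `reserved` parameter are hypothetical, and it is not claimed that this is how the 43 characters above were chosen.

```python
# Hypothetical helper, not part of SmiZip: list the printable ASCII
# characters that a set of SMILES strings leaves unused.
def available_printable(smiles_list, reserved=""):
    used = set("".join(smiles_list))
    # Printable ASCII runs from space (0x20) to tilde (0x7E): 95 characters.
    printable = {chr(code) for code in range(0x20, 0x7F)}
    # 'reserved' holds characters to avoid for other reasons, e.g. a quote
    # character when the output goes into a quoted string (an assumption).
    return sorted(printable - used - set(reserved))

free = available_printable(["CC(=O)O"])
print(len(free))  # 90, since this tiny example uses only 5 distinct characters
```

A real dataset uses far more of the printable range (digits, brackets, bond symbols, element letters), which is why so few characters remain.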

As a specific example, here's CHEMBL505931, a polysaccharide attached to a sterol, which compresses from 322 characters to 138 (42.9%):

[C@]12(O[C@@H](C[C@H]1C)C(=O)CC)[C@]1(CCC3=C(CC[C@H]4[C@@]([C@H](CC[C@]34C)O[C@@H]3O[C@@H]([C@H]([C@@H]([C@H]3O)O)O)CO[C@@H]3OC[C@@H]([C@@H]([C@H]3O[C@@H]3O[C@@H]([C@H]([C@@H]([C@H]3O[C@@H]3O[C@H]([C@@H]([C@H]([C@H]3O)O)O)C)O[C@@H]3OC[C@@H]([C@@H]([C@H]3O[C@@H]3OC[C@H]([C@@H]([C@H]3O)O)O)O)O)O)CO)O)O)(C)CO)[C@@]1(CC2)C)C
compresses to
]12(O>,D1C$ *):]1,*3=~*D4:@](D,C:]34CV>3O>(D(>(D3OVV$O>3OC>(>(D3O>3O>(D(>(D3O>3OD(>(D(D3OVV$V>3OC>(>(D3O>3OCD(>(D3OVVVVV$OVV),$O):@]1,C2$$
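
For clarity on how the percentages in this post are read: they give the compressed length as a percentage of the original length, so smaller is better. A one-line check using the character counts quoted above (the function name is just for illustration):

```python
def percent_of_original(original_len, compressed_len):
    """Compressed size as a percentage of the original size, to 1 decimal place."""
    return round(100.0 * compressed_len / original_len, 1)

print(percent_of_original(322, 138))  # 42.9
```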

Changes: v2.0 (2025-01)

find_best_ngrams.py: new option --non-printable to facilitate encoding into printable ASCII characters
find_best_ngrams.py: --chars is now required (help text provides a reasonable starting point) to force the user to consider the list
find_best_ngrams.py: if the end of the training .SMI file is reached, the script wraps around to the start
find_best_ngrams.py: --cr corrected to --lf
compress.py: a better error message is generated if an attempt is made to encode a character not present in the JSON file
compress.py: support added for .SMI files without titles

Thanks to Adriano Rutz (@adafede) and Charles Tapley Hoyt (@cthoyt) for feedback.

2 comments:

Geoff Hutchison said...: If you then turn this over to gzip, how much does it compress further?; 5 February 2025 at 18:34
Noel O'Boyle said...: In the SmiZip presentation I did, I had some details on this for the normal case. Here, it's quite a bit more. For example, with ChEMBL as a JavaScript array of strings (lots of quotation marks and commas in there), it compresses from 72M to 22M. Are you thinking that most webservers serve JavaScript compressed these days?; 7 February 2025 at 21:50