> I explained our situation to Mark Adler, zlib author, and he recommended
> checking out this book for techniques that can be used.
>
> http://www.cs.mu.oz.au/mg/

I was about to suggest the same thing!  It's an excellent book.  It's about
a piece of software and the choices made there, but it explains in detail
many alternatives, why they weren't chosen for the mg software, and what
they might be useful for instead.

The mg software itself is designed for human-language text, and does
indexing as well.  So it does a fixed breakdown into alternating word and
non-word tokens, builds a lexicon with frequency tables, then uses the
frequencies to build Huffman trees, and Huffman-compresses each token.
The "word" dictionary is also used as part of the search index: each word
carries a (very cleverly compressed, to less than one byte on average)
pointer to each document it appears in, with a count of the number of
appearances.  The word encoding is straight zero-order Huffman, so
inter-word correlations are not used.

For software, with lots of punctuation and multi-word idioms
("for (i = 0; i < n; i++) {"), the basic design is not very well suited.
(To say nothing of binary data, like some people who want to check images
or multi-megabyte video files into git.)  But there is a good description
of many possible algorithms, with lots of emphasis on practicalities.

And the software *does* support dynamic collections, via auxiliary indexes
and escape codes for any new words not found in the main dictionary.  In
fact, generating the Huffman tree from as little as 1/8 of the material to
be compressed only loses you 5.7% of the compression.  (Table 9011,
p. 360.)  However, that is for English text, with a pretty Zipf-like
distribution.  In code, we generate new words (new function names)
frequently, and then proceed to use them heavily.

It's worth noting the similarity between generating a good base dictionary
and finding a good base version of a file for delta-encoding.  You may end
up having to divide the documents into classes (different source languages,
for example: C vs. asm vs. perl vs. python vs. docs vs. GIFs), and use a
different dictionary per class.  But that drives up the cost of compression
(you have to see which class the file to be compressed falls into).

Anyway, highly recommended.

(A few rough sketches of the mechanisms described above follow.)
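First, the alternating word/non-word tokenization with a zero-order Huffman
code per lexicon.  This is my own illustration under simple assumptions (an
ASCII-alphanumeric definition of "word", bit-strings instead of a packed
bitstream), not mg's actual code:

    import heapq
    import re
    from collections import Counter

    WORD = re.compile(r"[A-Za-z0-9]+")

    def tokenize(text):
        # Fixed breakdown into alternating word and non-word tokens.
        return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text)

    def huffman_code(freqs):
        # Textbook Huffman construction over a {token: count} table,
        # returning {token: bit-string}.
        if len(freqs) < 2:
            return {tok: "0" for tok in freqs}
        heap = [(n, i, {tok: ""}) for i, (tok, n) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            n1, _, a = heapq.heappop(heap)
            n2, _, b = heapq.heappop(heap)
            merged = {t: "0" + c for t, c in a.items()}
            merged.update({t: "1" + c for t, c in b.items()})
            heapq.heappush(heap, (n1 + n2, tie, merged))
            tie += 1
        return heap[0][2]

    def compress(text):
        tokens = tokenize(text)

        def is_word(t):
            return WORD.fullmatch(t) is not None

        # Two lexicons, one per token class, each with its own zero-order
        # Huffman code; every token is coded independently of its neighbours.
        codes = {
            True: huffman_code(Counter(t for t in tokens if is_word(t))),
            False: huffman_code(Counter(t for t in tokens if not is_word(t))),
        }
        return "".join(codes[is_word(t)][t] for t in tokens)

Because every token is coded independently against its class's frequency
table, the idiom problem above is visible directly: "for (i = 0; i < n;
i++) {" costs a dozen or so separate codes even though, in source code, it
is effectively one "word".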
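The sub-byte document pointers come from storing each inverted list as gaps
between sorted document numbers and coding the gaps with a variable-length
integer code; the book discusses several such codes (unary, Elias
gamma/delta, Golomb).  Here is a sketch using Elias gamma, which is not
necessarily the exact code mg uses:

    def gamma_encode(n):
        # Elias gamma code for a positive integer: a unary prefix giving
        # the number of bits in n, then n's binary digits minus the
        # leading 1 bit.
        assert n >= 1
        body = bin(n)[3:]              # strip the "0b1" prefix
        return "0" * len(body) + "1" + body

    def encode_postings(doc_ids):
        # Document numbers are assumed to start at 1.  Storing gaps between
        # sorted numbers instead of the numbers themselves means a word that
        # appears in many documents has small gaps and therefore very short
        # codes -- which is how the average drops below one byte per pointer.
        doc_ids = sorted(doc_ids)
        gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
        return "".join(gamma_encode(g) for g in gaps)

The per-document appearance count mentioned above can be coded the same
way, appended after each gap.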
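For the dynamic-collection part, the idea is that a token missing from the
main dictionary is signalled with a reserved escape code and then spelled
out; a sketch (again my own, with a deliberately crude fallback encoding):

    ESCAPE = "\x00ESC"   # reserved lexicon entry meaning "novel token follows"

    def encode_token(token, code):
        # Tokens in the main dictionary get their Huffman code.  A new
        # token is sent as the escape code followed by a spelled-out form
        # (here a 16-bit length plus raw bytes); a real system would also
        # record it in an auxiliary dictionary/index so later occurrences
        # get a short code.
        if token in code:
            return code[token]
        raw = token.encode("utf-8")
        return (code[ESCAPE]
                + format(len(raw), "016b")
                + "".join(format(b, "08b") for b in raw))

This is also why the 1/8-sampling figure above doesn't carry over to source
code: the sample never saw the new function names that the rest of the
material then goes on to use heavily.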
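And the per-class dictionary idea, with its classification cost, might look
something like this (class names and dictionary contents are placeholders):

    import os

    # Hypothetical pre-built dictionaries, one per document class.
    CLASS_DICTS = {
        "c":    {"int ": "00", "return ": "01"},   # placeholder entries
        "perl": {"my ": "00", "sub ": "01"},
        "text": {"the ": "00", "and ": "01"},
    }

    def classify(path):
        # The extra work that drives up compression cost: deciding which
        # class a file belongs to (file suffix here; content sniffing in
        # harder cases).
        ext = os.path.splitext(path)[1].lstrip(".").lower()
        return {"c": "c", "h": "c", "pl": "perl", "pm": "perl"}.get(ext, "text")

    def dictionary_for(path):
        return CLASS_DICTS[classify(path)]

Choosing the class is much the same guess as choosing a good delta base:
cheaply picking the existing data the new file will share the most with.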