> I explained our situation to Mark Adler, zlib author, and he recommended
> checking out this book for techniques that can be used.
>
> http://www.cs.mu.oz.au/mg/

I was about to suggest the same thing!  It's an excellent book.  It's about
a piece of software and the choices made there, but it explains in detail
many alternatives, why they weren't chosen for the mg software, and what
they might be useful for instead.

The mg software itself is designed for human-language text, and does
indexing as well.  So it does a fixed breakdown into alternating word and
non-word tokens, builds a lexicon with frequency tables, then uses the
frequencies to build Huffman trees, and Huffman-compresses each token.
The "word" dictionary is also used as part of the search index: each word
carries a (very cleverly compressed, to less than one byte on average)
pointer to each document it appears in, with a count of the number of
appearances.  The word encoding is straight zero-order Huffman, so
inter-word correlations are not used.

For software, with lots of punctuation and multi-word idioms
("for (i = 0; i < n; i++) {"), the basic design is not very well suited.
(To say nothing of binary data, like some people who want to check images
or multi-megabyte video files into git.)  But there is a good description
of many possible algorithms, with lots of emphasis on practicalities.

And the software *does* support dynamic collections, via auxiliary indexes
and escape codes for any new words not found in the main dictionary.  In
fact, generating the Huffman tree from as little as 1/8 of the material to
be compressed only loses you 5.7% of the compression.  (Table 9011,
p. 360.)  However, that is for English text, with a pretty Zipf-like
distribution.  In code, we generate new words (new function names)
frequently, and then proceed to use them heavily.

It's worth noting the similarity between generating a good base dictionary
and finding a good base version of a file for delta-encoding.  You may end
up having to divide the documents into classes (different source languages,
for example: C vs. asm vs. perl vs. python vs. docs vs. GIFs), and use a
different dictionary per class.  But that drives up the cost of compression
(you have to see which class the file to be compressed falls into).

Anyway, highly recommended.

(A few rough sketches of the mechanisms described above follow.)
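First, the alternating word/non-word tokenization with a zero-order Huffman
code per lexicon.  This is my own illustration under simple assumptions (an
ASCII-alphanumeric definition of "word", bit-strings instead of a packed
bitstream), not mg's actual code:

    import heapq
    import re
    from collections import Counter

    WORD = re.compile(r"[A-Za-z0-9]+")

    def tokenize(text):
        # Fixed breakdown into alternating word and non-word tokens.
        return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]+", text)

    def huffman_code(freqs):
        # Textbook Huffman construction over a {token: count} table,
        # returning {token: bit-string}.
        if len(freqs) < 2:
            return {tok: "0" for tok in freqs}
        heap = [(n, i, {tok: ""}) for i, (tok, n) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            n1, _, a = heapq.heappop(heap)
            n2, _, b = heapq.heappop(heap)
            merged = {t: "0" + c for t, c in a.items()}
            merged.update({t: "1" + c for t, c in b.items()})
            heapq.heappush(heap, (n1 + n2, tie, merged))
            tie += 1
        return heap[0][2]

    def compress(text):
        tokens = tokenize(text)

        def is_word(t):
            return WORD.fullmatch(t) is not None

        # Two lexicons, one per token class, each with its own zero-order
        # Huffman code; every token is coded independently of its neighbours.
        codes = {
            True: huffman_code(Counter(t for t in tokens if is_word(t))),
            False: huffman_code(Counter(t for t in tokens if not is_word(t))),
        }
        return "".join(codes[is_word(t)][t] for t in tokens)

Because every token is coded independently against its class's frequency
table, the idiom problem above is visible directly: "for (i = 0; i < n;
i++) {" costs a dozen or so separate codes even though, in source code, it
is effectively one "word".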
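The sub-byte document pointers come from storing each inverted list as gaps
between sorted document numbers and coding the gaps with a variable-length
integer code; the book discusses several such codes (unary, Elias
gamma/delta, Golomb).  Here is a sketch using Elias gamma, which is not
necessarily the exact code mg uses:

    def gamma_encode(n):
        # Elias gamma code for a positive integer: a unary prefix giving
        # the number of bits in n, then n's binary digits minus the
        # leading 1 bit.
        assert n >= 1
        body = bin(n)[3:]              # strip the "0b1" prefix
        return "0" * len(body) + "1" + body

    def encode_postings(doc_ids):
        # Document numbers are assumed to start at 1.  Storing gaps between
        # sorted numbers instead of the numbers themselves means a word that
        # appears in many documents has small gaps and therefore very short
        # codes -- which is how the average drops below one byte per pointer.
        doc_ids = sorted(doc_ids)
        gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
        return "".join(gamma_encode(g) for g in gaps)

The per-document appearance count mentioned above can be coded the same
way, appended after each gap.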
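For the dynamic-collection part, the idea is that a token missing from the
main dictionary is signalled with a reserved escape code and then spelled
out; a sketch (again my own, with a deliberately crude fallback encoding):

    ESCAPE = "\x00ESC"   # reserved lexicon entry meaning "novel token follows"

    def encode_token(token, code):
        # Tokens in the main dictionary get their Huffman code.  A new
        # token is sent as the escape code followed by a spelled-out form
        # (here a 16-bit length plus raw bytes); a real system would also
        # record it in an auxiliary dictionary/index so later occurrences
        # get a short code.
        if token in code:
            return code[token]
        raw = token.encode("utf-8")
        return (code[ESCAPE]
                + format(len(raw), "016b")
                + "".join(format(b, "08b") for b in raw))

This is also why the 1/8-sampling figure above doesn't carry over to source
code: the sample never saw the new function names that the rest of the
material then goes on to use heavily.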
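And the per-class dictionary idea, with its classification cost, might look
something like this (class names and dictionary contents are placeholders):

    import os

    # Hypothetical pre-built dictionaries, one per document class.
    CLASS_DICTS = {
        "c":    {"int ": "00", "return ": "01"},   # placeholder entries
        "perl": {"my ": "00", "sub ": "01"},
        "text": {"the ": "00", "and ": "01"},
    }

    def classify(path):
        # The extra work that drives up compression cost: deciding which
        # class a file belongs to (file suffix here; content sniffing in
        # harder cases).
        ext = os.path.splitext(path)[1].lstrip(".").lower()
        return {"c": "c", "h": "c", "pl": "perl", "pm": "perl"}.get(ext, "text")

    def dictionary_for(path):
        return CLASS_DICTS[classify(path)]

Choosing the class is much the same guess as choosing a good delta base:
cheaply picking the existing data the new file will share the most with.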