Re: Compression and dictionaries

"Jon Smirl" <jonsmirl@xxxxxxxxx> · Tue, 15 Aug 2006 09:29:28 -0400

On 15 Aug 2006 04:33:03 -0400, linux@xxxxxxxxxxx <linux@xxxxxxxxxxx> wrote:
Just to inform this discussion, a brief description of the zlib
"deflate" algorithm.  A full description is in RFC1951.

This is LZ77 compression, meaning that what is actually encoded is
a series of instructions to either
- Insert a specified byte, or
- Copy a run of n bytes, starting m bytes ago.
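
To make the two instruction types concrete, here is a minimal Python
sketch of a decoder that replays such an instruction stream. The tuple
layout ("lit"/"copy") and the function name are made up for
illustration; real deflate packs these instructions into Huffman-coded
bits as described in RFC1951, which this does not attempt.

```python
def lz77_decode(instructions):
    """Replay LZ77-style instructions into bytes.

    Each instruction is a hypothetical tuple (not the deflate bit format):
      ("lit", byte_value)         insert one specified byte
      ("copy", length, distance)  copy `length` bytes, starting `distance` bytes ago
    """
    out = bytearray()
    for ins in instructions:
        if ins[0] == "lit":
            out.append(ins[1])
        else:
            _, length, distance = ins
            if distance < 1 or distance > len(out):
                raise ValueError("copy reaches before the start of the output")
            start = len(out) - distance
            # Copy one byte at a time so a run may overlap the bytes it is
            # producing (length > distance repeats the recent pattern).
            for i in range(length):
                out.append(out[start + i])
    return bytes(out)


# Two literals, then "copy 4 bytes starting 2 bytes ago".
decoded = lz77_decode([("lit", ord("a")), ("lit", ord("b")), ("copy", 4, 2)])
```

Note that a copy can only reach back into bytes already produced, which
is why the very start of a stream has nothing to copy from.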

Shawn and I did some testing overnight. A simple 4KB preload table
yielded a 4% size reduction in my 850MB pack. I'm still hoping for 10%
using an optimal 32K table.
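
For anyone who wants to play with the idea, here is a rough Python
sketch of compressing with a preload table, using the preset dictionary
support in Python's zlib module (the zdict argument). The preload
contents and the sample blob below are invented for illustration; this
is not the table or the code we used in the test above.

```python
import zlib


def compress_with_dict(data, preload):
    # zdict primes the LZ77 window, so copies can point into the preload table.
    c = zlib.compressobj(level=9, zdict=preload)
    return c.compress(data) + c.flush()


def decompress_with_dict(blob, preload):
    # The decompressor must be given exactly the same preload table.
    d = zlib.decompressobj(zdict=preload)
    return d.decompress(blob) + d.flush()


# Hypothetical preload table: strings expected to show up in many blobs.
preload = b"#include <linux/kernel.h>\n#include <linux/module.h>\nstatic int "
# Hypothetical small blob that shares some text with the table.
blob = b"#include <linux/module.h>\nstatic int foo(void)\n{\n\treturn 0;\n}\n"

plain = zlib.compress(blob, 9)
primed = compress_with_dict(blob, preload)
restored = decompress_with_dict(primed, preload)
```

Comparing len(plain) against len(primed) over a set of real blobs gives
a quick feel for how much a given table helps. The gain should be
largest on small objects, since they have little history of their own
to copy from.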

Paper on computing the optimal preload table:
http://www.eecs.harvard.edu/~michaelm/postscripts/dcc2001a.pdf

If anyone is interested in doing some research, it would be
interesting to build a pack engine based on full-text search
technology, for example cLucene:
http://clucene.sourceforge.net/index.php/Main_Page
A full-text search app already contains the code for extracting the
dictionary and then encoding all of the blobs using it.

My prediction is that a scheme based on a large global dictionary will
compress significantly better than zlib is capable of.

--
Jon Smirl
jonsmirl@xxxxxxxxx