Re: Compression and dictionaries

Just to inform this discussion, a brief description of the zlib
"deflate" algorithm.  A full description is in RFC1951.

This is LZ77 compression, meaning that what is actually encoded is
a series of instructions to either
- Insert a specified byte, or
- Copy a run of n bytes, starting m bytes ago.

I.e. roughly

#include <stdbool.h>
#include <stdint.h>

union instruction {
	struct { bool is_literal; char literal; } lit;
	struct { bool is_literal; uint16_t length; uint16_t distance; } copy;
};

Note that 100 repetitions of a character can be unambiguously encoded
as the character, followed by "copy 99 chars starting at offset -1".
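
Just as a sketch of the decoder's side of that (the function name is mine,
not zlib's): copying one byte at a time lets output written earlier in the
same copy feed the later bytes, which is why the length may exceed the
distance.

#include <stddef.h>

/* Expand "copy `length` bytes from `distance` bytes back" at out[*pos]. */
static void lz77_copy(unsigned char *out, size_t *pos,
                      unsigned distance, unsigned length)
{
	while (length--) {
		out[*pos] = out[*pos - distance];
		(*pos)++;
	}
}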

There are also 257 possible literals: the 256 byte values plus a special end-of-block token.

The valid copy lengths are 3..258, and the valid distances are 1..32768.

Anyway, the deflate algorithm operates in a few steps:

- Translate the input into an (uncompressed) series of literals and
  copies.
- Take the literals and the lengths and put them into one big Huffman tree.
- Take the distances (actually, the simplified distances; large
  distances are grouped into ranges, which are followed by a binary suffix)
  and put them in a separate Huffman tree.
- A Huffman tree can be transmitted by encoding the number of bits assigned
  to each symbol, with a special value (infinity is mathematically
  the most logical but deflate calls it zero) for a symbol that does
  not appear in the tree at all; rebuilding the codes from those lengths
  is sketched just after this list.
- The way that deflate sends the lengths involves run-length encoding
  and a third Huffman tree, but I won't go into it now.
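
For what it's worth, bit lengths alone are enough because the codes are
canonical; RFC1951 (section 3.2.2) rebuilds them from the lengths roughly
like this (a sketch; 15 is deflate's maximum code length):

#define MAX_BITS 15

/* len[i] is the code length of symbol i (0 = symbol absent);
   code[i] receives its canonical code. */
static void lengths_to_codes(const unsigned char *len, unsigned nsym,
                             unsigned *code)
{
	unsigned bl_count[MAX_BITS + 1] = {0};
	unsigned next_code[MAX_BITS + 1];
	unsigned bits, i, c = 0;

	for (i = 0; i < nsym; i++)
		bl_count[len[i]]++;
	bl_count[0] = 0;			/* absent symbols get no code */

	for (bits = 1; bits <= MAX_BITS; bits++) {
		c = (c + bl_count[bits - 1]) << 1;
		next_code[bits] = c;
	}
	for (i = 0; i < nsym; i++)
		if (len[i] != 0)
			code[i] = next_code[len[i]]++;	/* same-length codes are consecutive */
}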

Finding the copies is the most expensive part.  The minimum
copy length is 3 bytes, so the 3 bytes starting at each position are
hashed and the result put in a hash table, then you search the hash
chain looking for the longest match.  There are various heuristics
for aborting the search early, which the compression level adjusts.
-1 doesn't look very hard at all, while -9 always does exhaustive search.

(For example, when searching at all aggressively, once you've found a
match you also search starting at the next byte position in the input.
If that produces a longer match, you emit the skipped byte as a literal,
take the longer match, and start looking for an improvement on it.)
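
A rough sketch of that hash-chain search (the table sizes, the hash mix
and the names are mine, not zlib's; head[] starts out filled with -1, and
a position is inserted only after it has been searched from):

#define WSIZE     32768u	/* matches the 32K distance limit */
#define HSIZE     (1u << 15)	/* hash table size: a guess, not zlib's tuning */
#define MIN_MATCH 3
#define MAX_MATCH 258

static long head[HSIZE];	/* most recent position with each hash, or -1 */
static long chain[WSIZE];	/* previous position with the same hash */

static unsigned hash3(const unsigned char *p)
{
	return ((p[0] << 10) ^ (p[1] << 5) ^ p[2]) & (HSIZE - 1);
}

static void insert_position(const unsigned char *in, long pos)
{
	unsigned h = hash3(in + pos);
	chain[pos & (WSIZE - 1)] = head[h];
	head[h] = pos;
}

/* Walk at most max_chain candidates; a return value below MIN_MATCH
   means "emit a literal instead". */
static unsigned longest_match(const unsigned char *in, long pos, long end,
                              unsigned max_chain, long *match_pos)
{
	unsigned best = 0;
	long cand = head[hash3(in + pos)];

	while (cand >= 0 && pos - cand <= (long)WSIZE && max_chain--) {
		unsigned len = 0;
		while (pos + len < end && len < MAX_MATCH &&
		       in[cand + len] == in[pos + len])
			len++;
		if (len > best) {
			best = len;
			*match_pos = cand;
		}
		cand = chain[cand & (WSIZE - 1)];
	}
	return best;
}

The compression levels map, roughly, onto how large max_chain is and how
good a match has to be before the search is cut short.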


Anyway, input is divided into blocks.  Each block has a 2-bit
prefix that encodes one of three possibilities:
- Send uncompressed
- Send compressed, with fixed (specified in the standard) literal/length &
  distance Huffman code.  This "static Huffman tree" case is useful for
  short blocks where the overhead of encoding the Huffman tree is
  significant.
- Send compressed, using dynamic (encoded in the file) Huffman codes.
  This is the most commonly used case for large, compressible files.
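
For reference, RFC1951 actually puts a 1-bit "final block" flag in front
of that 2-bit type.  A sketch (read_bits() here is a hypothetical
LSB-first bit reader, not a real zlib call):

enum btype { BTYPE_STORED = 0, BTYPE_FIXED = 1, BTYPE_DYNAMIC = 2 };	/* 3 is reserved */

extern unsigned read_bits(unsigned nbits);

static void read_block_header(int *bfinal, enum btype *type)
{
	*bfinal = read_bits(1);			/* set on the stream's last block */
	*type   = (enum btype)read_bits(2);	/* how this block's data is coded */
}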

Note that "copy" intructions can span blocks; the block boundaries are
only opportunities to change the way that the copy instructions are
coded if the file changes character (like text vs. data parts of an
executable).  Deciding when to end one block and start a new one is
another set of heuristics.


The only thing like a "dictionary" is the set of codes used for Huffman
compression.

Now, what you *could* do is pre-load the "copy from this text" buffer
(hash table) with up to 32K of example data before you start encoding
data for real.  Basically, run some example text through the compressor
to warm it up, but throw away the compressed form.

This isn't a "dictionary" in the LZ78 sense (to start with, there are
no boundaries between entries), but serves the same purpose.  Note that
since the distance to copy from is limited to 32768, there's no point
to having more warm-up text than that.
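
zlib exposes exactly this as deflateSetDictionary() (with
inflateSetDictionary() on the receiving end).  A rough sketch of the
compression side, with error handling mostly trimmed and the buffer
handling simplified:

#include <string.h>
#include <zlib.h>

/* Compress in[0..in_len) into out[], after warming the window with up to
   32K of shared sample text.  Returns 0 on success. */
static int compress_with_dictionary(const unsigned char *dict, unsigned dict_len,
                                    const unsigned char *in, unsigned in_len,
                                    unsigned char *out, unsigned out_len,
                                    unsigned long *out_used)
{
	z_stream strm;
	memset(&strm, 0, sizeof strm);

	if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)
		return -1;
	/* Must come after deflateInit() and before the first deflate();
	   only the last 32K of the dictionary can actually be used. */
	if (deflateSetDictionary(&strm, dict, dict_len) != Z_OK) {
		deflateEnd(&strm);
		return -1;
	}

	strm.next_in   = (Bytef *)in;
	strm.avail_in  = in_len;
	strm.next_out  = out;
	strm.avail_out = out_len;
	if (deflate(&strm, Z_FINISH) != Z_STREAM_END) {
		deflateEnd(&strm);
		return -1;
	}
	*out_used = strm.total_out;
	deflateEnd(&strm);
	return 0;
}

The decompressor has to be handed the same sample: inflate() stops with
Z_NEED_DICT, at which point you call inflateSetDictionary() and continue.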