Re: Compression and dictionaries

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Mon, 14 Aug 2006 14:33:39 +0200 (CEST)

Hi,

On Sun, 13 Aug 2006, Jon Smirl wrote:

> > From what I remember from long ago most compression schemes build
> dictionaries as a way of achieving significant compression. If so,
> since we zlib compress each entry in a pack individually, are there
> many copies of very similar dictionaries in the pack?

No, there are no dictionaries in the pack. At least no explicit ones.

The trick is that the dictionary builds up with the data incrementally: 
both the encoding process and the decoding process construct it 
identically, on the fly.

Example: A very primitive example of compression is the arithmetic coder 
of single characters. Given a probability distribution of the characters, 
each character is encoded such that the length of the encoding of one 
character is reciprocal to its probability *Footnote 1*. If you want to 
compress context-free, i.e. without looking at the characters before or 
after the current character when encoding or decoding, this is provably 
optimal.

Now, if you do not have a probability distribution (because you haven't 
looked at the file yet), you can build an histogram as approximation on 
the fly. Whenever a character is encoded, you take the current histogram 
to approximate the probability distribution, and adjust the histogram for 
that character.

For zlib, it is a little more involved (it is not a simple histogram), but 
the principle holds: if you do not have a dictionary, you just build one 
from the data seen so far.

BTW I doubt that an explicit dictionary would be good: Either you 
distribute it with git, and have many people complaining that it is either 
too small, or too big, and in most cases does not fit their data well, or 
you have to create it for each local repository, which takes time.

Further, if the pack-file becomes corrupt, you usually still have the 
pack index, or the start of the pack-file, and can reconstruct most of the 
objects. If you use a dictionary, and just one bit flips in it, you're 
screwed.

Ciao,
Dscho

Footnote 1: If the probabilities are all powers of two (with negative 
exponent), the encodings can be fixed bit strings (because the lengths are 
integer numbers); if that is not the case, it gets a little more 
complicated; basically, you store a fractional bit as a state; that is why 
it is called arithmetic coding.
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html