Re: Compression and dictionaries

On 8/14/06, Johannes Schindelin <Johannes.Schindelin@xxxxxx> wrote:
> I still think that this is important to think through: Is it worth a
> couple of kilobytes (I doubt that it would be as much as 1MB in _total_),
> and be on the unsafe side?

The maps look something like this:
void => 10101010
char => 111101
int => 1110101
tree => 11010111
commit => 101011001

These maps are repeated in every one of my 1M revisions, including the
deltas. I have 1GB pack files with 1M entries in them, roughly 1K per
entry. Each byte saved out of a zlib entry takes 1MB off my pack.

Note that the current internal maps aren't the same in each of the zlib
blobs, since the algorithm that builds the internal maps depends on the
order the identifiers were encountered.

If the git tools add a global dictionary, the tools would still be able
to read existing packs. If old tools try to read a new dictionary-based
pack, they will get the zlib Z_NEED_DICT error.
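
To make that concrete, here is a minimal zlib sketch (not git's code;
the dictionary contents below are made up) of both sides: the writer
calls deflateSetDictionary() before compressing, an old reader that
never calls inflateSetDictionary() stops at Z_NEED_DICT, and a new
reader supplies the dictionary and continues.

/* Minimal sketch of preset-dictionary use with zlib; the dictionary
 * string here is a hypothetical stand-in for a real git word list. */
#include <string.h>
#include <zlib.h>

static const Bytef global_dict[] =
    "void char int unsigned struct static return tree commit blob";

/* Writer side: the dictionary must be set before the first deflate(). */
int compress_with_dict(const Bytef *in, uLong in_len,
                       Bytef *out, uLong *out_len)
{
    z_stream s;
    memset(&s, 0, sizeof(s));
    if (deflateInit(&s, Z_BEST_COMPRESSION) != Z_OK)
        return -1;
    deflateSetDictionary(&s, global_dict, sizeof(global_dict) - 1);
    s.next_in = (Bytef *)in;  s.avail_in = in_len;
    s.next_out = out;         s.avail_out = *out_len;
    int ret = deflate(&s, Z_FINISH);
    *out_len = s.total_out;
    deflateEnd(&s);
    return ret == Z_STREAM_END ? 0 : -1;
}

/* Reader side: an old tool errors out at Z_NEED_DICT; a new tool
 * supplies the dictionary and keeps going. */
int decompress_with_dict(const Bytef *in, uLong in_len,
                         Bytef *out, uLong *out_len)
{
    z_stream s;
    memset(&s, 0, sizeof(s));
    if (inflateInit(&s) != Z_OK)
        return -1;
    s.next_in = (Bytef *)in;  s.avail_in = in_len;
    s.next_out = out;         s.avail_out = *out_len;
    int ret = inflate(&s, Z_FINISH);
    if (ret == Z_NEED_DICT) {
        inflateSetDictionary(&s, global_dict, sizeof(global_dict) - 1);
        ret = inflate(&s, Z_FINISH);
    }
    *out_len = s.total_out;
    inflateEnd(&s);
    return ret == Z_STREAM_END ? 0 : -1;
}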

If the entire file were one big zlib blob there would be only one
dictionary, and adding a fixed dictionary wouldn't make any difference.
But since it is 1M little zlib blobs, it has 1M dictionaries.

The only "unsafe" aspect I see to this is if the global dictionary
doesn't contain any of the words in the documents being encoded. In
that case the global dictionary will occupy the short huffman keys
forcing longer internal keys.  The keys for the words in the document
would be longer by a about a bit on average.

A solution for making this work over time would be to store the global
dictionary at the front of the pack file and have the unpack tools use
the stored copy. This would let us change the global dictionary in the
pack tool with no downside; you could even support multiple
dictionaries in the pack tool.
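
To illustrate, here is a rough reader-side sketch; the layout (a
big-endian length-prefixed dictionary placed right after the existing
12-byte pack header) is purely hypothetical and not the current pack
format.

/* Hypothetical layout: "PACK" header, then a length-prefixed dictionary.
 * The unpack tool loads it once and hands it to inflateSetDictionary()
 * whenever inflate() returns Z_NEED_DICT for an object. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

unsigned char *read_pack_dictionary(FILE *pack, uint32_t *dict_len)
{
    unsigned char hdr[12];     /* "PACK", version, object count */
    unsigned char len_buf[4];  /* big-endian dictionary length */

    if (fread(hdr, 1, sizeof(hdr), pack) != sizeof(hdr))
        return NULL;
    if (fread(len_buf, 1, sizeof(len_buf), pack) != sizeof(len_buf))
        return NULL;
    *dict_len = ((uint32_t)len_buf[0] << 24) | (len_buf[1] << 16) |
                (len_buf[2] << 8) | len_buf[3];

    unsigned char *dict = malloc(*dict_len);
    if (!dict || fread(dict, 1, *dict_len, pack) != *dict_len) {
        free(dict);
        return NULL;
    }
    return dict;
}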

If someone wants to get fancy, they could write a tool that scans a
pack file and computes an optimal fixed dictionary, stores it at the
front of the pack file, and repacks using it.
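
A very naive version of such a scanner could look like the sketch
below; it just counts identifier-like tokens from stdin with a linear
search and keeps the most frequent ones (DICT_WORDS is an arbitrary
cutoff, not an optimal choice). zlib's documentation suggests putting
the most frequently used strings toward the end of the dictionary, so
they are printed last.

/* Naive dictionary builder sketch: feed it sample blob data on stdin. */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORDS  65536
#define MAX_LEN    64
#define DICT_WORDS 256   /* arbitrary cutoff for this sketch */

struct entry { char word[MAX_LEN]; long count; };
static struct entry table[MAX_WORDS];
static int nwords;

static void count_word(const char *w)
{
    for (int i = 0; i < nwords; i++)
        if (!strcmp(table[i].word, w)) { table[i].count++; return; }
    if (nwords < MAX_WORDS) {
        strcpy(table[nwords].word, w);
        table[nwords++].count = 1;
    }
}

static int by_count(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    return (y->count > x->count) - (y->count < x->count);
}

int main(void)
{
    char w[MAX_LEN];
    int len = 0, c;

    while ((c = getchar()) != EOF) {
        if (isalnum(c) || c == '_') {
            if (len < MAX_LEN - 1)
                w[len++] = c;
        } else if (len) {
            w[len] = '\0';
            count_word(w);
            len = 0;
        }
    }

    /* Sort descending by count, then emit the kept words with the most
     * frequent ones last, as zlib prefers. */
    qsort(table, nwords, sizeof(table[0]), by_count);
    int keep = nwords < DICT_WORDS ? nwords : DICT_WORDS;
    for (int i = keep - 1; i >= 0; i--)
        printf("%s ", table[i].word);
    putchar('\n');
    return 0;
}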

Global dictionaries are common in full-text searching. I seem to
recall an article stating that Google's global dictionary has about
250K entries in it. If git packs switch to a global dictionary model,
it's not a big leap to add a full-text search index. You just need
objects for each word in the dictionary pointing to the revisions that
contain it.

--
Jon Smirl
jonsmirl@xxxxxxxxx
