Re: A look at some alternative PACK file encodings

On 9/6/06, A Large Angry SCM <gitzilla@xxxxxxxxx> wrote:
Jon Smirl wrote:
> On 9/6/06, A Large Angry SCM <gitzilla@xxxxxxxxx> wrote:
>> TREE objects do not delta or deflate well.
>
> I can understand why they don't deflate; the path names are pretty
> much unique and the sha1s are incompressible. But why don't they delta
> well? Does sorting them by size mess up the delta process?

My guess would be that the TREEs would only delta well against other
TREE versions for the same path.

I would have thought that the TREE delta code would have already taken
this into account. Is there enough info around to match up all of the
TREEs for a specific path into a delta chain?

The repack code could build a model of the tree as it is repacking;
that is what fast-import does. If you have a model of the tree, then
when you change a TREE node you track the last sha1 that corresponded
to that directory path. Now you know what to delta against.
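
Roughly something like this (a sketch only, not actual git or
fast-import code; the (path, sha1) history stream below is made up):

  # Sketch: while repacking, remember the last TREE written for each
  # directory path so the next TREE at that path can be deltified
  # against it.
  last_tree_for_path = {}   # directory path -> sha1 of previous TREE there

  def pick_delta_base(path, sha1):
      """Return the sha1 to delta against, or None for a full object."""
      base = last_tree_for_path.get(path)
      last_tree_for_path[path] = sha1
      return base

  # Hypothetical stream of (path, sha1) pairs in commit order:
  history = [("src", "aaa111"), ("src", "bbb222"), ("src", "ccc333")]
  for path, sha1 in history:
      print(path, sha1, "delta base:", pick_delta_base(path, sha1))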

> Shawn is doing some prototype work on true dictionary based
> compression. I don't know how far along he is but it has potential for
> taking 30% off the Mozilla pack.

Just looking at the structures in non-BLOBs, I see a lot of potential
for the use of one set of dictionaries when deflating TREEs and another
set of dictionaries when deflating COMMITs and TAGs. The low-hanging
fruit is to create dictionaries of the most referenced IDs across all
TREE or COMMIT/TAG objects.
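
A minimal illustration of the preset-dictionary idea using zlib (the
IDs and tree bytes below are made up; a real dictionary would be built
from the most referenced sha1s in the repository):

  import zlib

  # Stand-in for "the most referenced IDs across all TREE objects".
  most_referenced_ids = [bytes.fromhex("aa" * 20), bytes.fromhex("bb" * 20)]
  preset = b"".join(most_referenced_ids)

  raw_tree = b"100644 README\x00" + bytes.fromhex("aa" * 20)

  # Seed deflate with the preset dictionary; matches against it cost a
  # back-reference instead of literal bytes.
  comp = zlib.compressobj(zdict=preset)
  packed = comp.compress(raw_tree) + comp.flush()

  # The reader has to supply the same dictionary.
  decomp = zlib.decompressobj(zdict=preset)
  assert decomp.decompress(packed) == raw_tree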

Doing this will get 4-7% according to our small tests; to get more you
need a different type of compression algorithm. For true
dictionary-based compression you parse all of the input documents into
words. The words are put into a single giant Huffman table, and the
Huffman codes for the words are concatenated together to form the
compressed document. This is one of the most effective compression
methods since it exploits the fact that you are compressing something
human readable. gzip works on a similar principle, but its dictionary
is limited to a 32KB sliding window.
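
A toy version of that word-level scheme (the documents are made up;
a real implementation would also have to store the dictionary and
bit-pack the codes):

  import heapq
  from collections import Counter
  from itertools import count

  docs = ["tree adds blob", "tree removes blob", "commit points at tree"]

  # One global word-frequency table across all input documents.
  freq = Counter(w for d in docs for w in d.split())

  # Standard Huffman construction, but over whole words instead of bytes.
  tie = count()
  heap = [(f, next(tie), w) for w, f in freq.items()]
  heapq.heapify(heap)
  while len(heap) > 1:
      f1, _, a = heapq.heappop(heap)
      f2, _, b = heapq.heappop(heap)
      heapq.heappush(heap, (f1 + f2, next(tie), (a, b)))

  codes = {}
  def walk(node, prefix=""):
      if isinstance(node, tuple):          # internal node
          walk(node[0], prefix + "0")
          walk(node[1], prefix + "1")
      else:                                # leaf: a word
          codes[node] = prefix or "0"
  walk(heap[0][2])

  # Each compressed document is just its words' codes concatenated.
  compressed = ["".join(codes[w] for w in d.split()) for d in docs]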

Search engines use true dictionary compression since it makes it easy
to add search indices on top of the compressed data. For a search
index you make a bit vector for each word in the dictionary and set a
bit if the corresponding document number contains the word. The
vectors and the dictionary are stored compressed, and there are
algorithms for doing and/or directly on the compressed vectors. The
and/or stage identifies likely hits; you retrieve those and then do an
exact match on the search terms to be sure.
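
For example (reusing the made-up documents from above, with Python
ints standing in for the bit vectors):

  docs = ["tree adds blob", "tree removes blob", "commit points at tree"]

  # One bit vector per dictionary word; bit i is set if document i
  # contains that word.
  vectors = {}
  for i, d in enumerate(docs):
      for w in set(d.split()):
          vectors[w] = vectors.get(w, 0) | (1 << i)

  # AND the vectors for the query "tree blob" to get likely hits...
  hits = vectors["tree"] & vectors["blob"]

  # ...then confirm each candidate with an exact match on its text.
  matches = [d for i, d in enumerate(docs)
             if (hits >> i) & 1 and "tree" in d and "blob" in d]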

--
Jon Smirl
jonsmirl@xxxxxxxxx
