Re: A look at some alternative PACK file encodings

Shawn Pearce <spearce@xxxxxxxxxxx> · Thu, 7 Sep 2006 01:34:24 -0400

A Large Angry SCM <gitzilla@xxxxxxxxx> wrote:
> Just looking at the structures in non-BLOBS, I see a lot of potential
> for the use of a set of dictionaries when deflating TREEs and another set
> of dictionaries when deflating COMMITs and TAGs. The low hanging fruit
> is to create dictionaries of the most referenced IDs across all TREE or
> COMMIT/TAG objects.

The most referenced IDs should already be getting reused through
deltas.  That is, IDs that are highly referenced are probably
referenced in the same tree across many versions of that tree.
Since that data isn't changing, it should be getting copied by a
delta copy command rather than appearing as a literal.
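
To illustrate the copy-vs-literal point, here is a rough sketch in
Python.  It is not git's actual delta format or its delta algorithm.
It uses difflib from the standard library as a stand-in matcher, and
make_delta/apply_delta are names made up for this example.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Build a toy delta: ("copy", offset, length) or ("insert", data).

    Hypothetical helper for illustration only, not git's encoder.
    """
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged bytes (e.g. an unchanged ID) become a copy
            # from the base instead of being stored again.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Only bytes absent from the base are stored as literals.
            ops.append(("insert", target[j1:j2]))
        # "delete" emits nothing: those base bytes are simply skipped.
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target from the base and the toy delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)
```

For a tree where one entry's ID changes between versions, the only
literal in such a delta is (at most) the 20 bytes of the new ID.
Every other ID comes through as part of a copy.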

The Mozilla pack appears to have the bulk of its storage taken up
by blobs (both bases and deltas).  I suspect this is because the
bases compress to approx. 50% of their original size but share a
lot of common tokens, and those common tokens are being repeated
in every blob's private zlib dictionary.  The same thing happens
in a delta.  There we are probably copying a lot from the base, so
the average size is greatly reduced, but we are still repeating
tokens in the zlib dictionary for anything that is a literal in
the delta (as it didn't appear in the base).

A large dictionary containing all tokens for the project should
greatly reduce the size of each blob, base and delta alike.  It also
lends itself to creating an efficient full text index.
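
A small sketch of the effect, using zlib's preset dictionary support
as exposed by Python's zlib module (the zdict argument).  This is only
meant to show the idea, not a proposed pack format.  The deflate and
inflate helper names are made up for the example.

```python
import zlib


def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress data, optionally priming zlib with a shared dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def inflate(blob: bytes, zdict: bytes = b"") -> bytes:
    """Decompress; the same dictionary must be supplied again."""
    if zdict:
        decomp = zlib.decompressobj(zdict=zdict)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()
```

Compressing many small, similar blobs with and without a dictionary
built from their common tokens should show the difference.  Note that
deflate only looks back 32 KiB, so a zlib preset dictionary cannot
hold more than that.  A dictionary covering all tokens in a large
project would need some other mechanism.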

-- 
Shawn.