Re: Computing delta sizes in pack files

Shawn Pearce <spearce@xxxxxxxxxxx> · Sat, 25 Nov 2006 02:33:38 -0500

Jonas Fonseca <jonas.fonseca@xxxxxxxxx> wrote:
> I will not post the numbers here. They are available in
> http://jonas.nitro.dk/tmp/stats.pdf for those interested. The following
> is my "analysis" of the numbers.

Thanks, this was interesting stuff.

> As expected, the randomness of the content of both commit and tag objects
> results in a very poor packing performance of only 2%.

This is one reason why Jon Smirl was pushing the idea of dictionary-based
compression.  git.git has only 276 unique author lines, yet only 37 of
them belong to the top committers.  Not surprisingly, Junio C Hamano
leads the pack with 3529+ commits...  :-)

Dictionary-based compression would let us factor Junio's authorship
line out of those 3529+ commits into a single shared string, giving
much better compression on the commits.
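
To make the idea concrete, here is a rough sketch using zlib's preset
dictionary support from Python.  This is only an illustration of the
technique, not how git stores objects: the commit text, the e-mail
address, and the helper names (fake_commit, deflate, inflate) are all
made up for the example.

```python
import hashlib
import zlib

# Hypothetical identity lines; the address is a placeholder.
AUTHOR = b"author Junio C Hamano <junio@example.com>"
COMMITTER = b"committer Junio C Hamano <junio@example.com>"

# Preset dictionary: strings we expect to recur in many commit objects.
DICTIONARY = AUTHOR + b"\n" + COMMITTER


def fake_commit(n: int) -> bytes:
    """Build a commit-like payload with pseudo-random tree/parent ids."""
    tree = hashlib.sha1(b"tree %d" % n).hexdigest().encode()
    parent = hashlib.sha1(b"parent %d" % n).hexdigest().encode()
    stamp = b" %d -0500\n" % (1164440000 + n)
    return b"".join([
        b"tree ", tree, b"\n",
        b"parent ", parent, b"\n",
        AUTHOR, stamp,
        COMMITTER, stamp,
        b"\n",
        b"Fix thing number %d\n" % n,
    ])


def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress one object on its own, optionally priming zlib with zdict."""
    if zdict:
        comp = zlib.compressobj(9, zdict=zdict)
    else:
        comp = zlib.compressobj(9)
    return comp.compress(data) + comp.flush()


def inflate(blob: bytes, zdict: bytes) -> bytes:
    """Decompress an object that was compressed with the same dictionary."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


commits = [fake_commit(n) for n in range(100)]
plain_total = sum(len(deflate(c)) for c in commits)
dict_total = sum(len(deflate(c, DICTIONARY)) for c in commits)

print("raw bytes:        ", sum(len(c) for c in commits))
print("zlib, no dict:    ", plain_total)
print("zlib, with dict:  ", dict_total)
```

Each object is still compressed independently, so random access is
preserved; the only shared state is the dictionary, which the reader
must have on hand to inflate anything.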

In trees this may work very well too for very common file names, e.g.
"Makefile".  Yes, each tree delta compresses very well against its
base (and likely copies the file name from the base even when the
SHA1 changed), but if the bases were able to use a common dictionary
that would help even more.

> The data show that for minimal index files, the packs need to contain
> more than 2500 objects. The 24 bytes per-object for the optimal case
> includes 20-bytes for the object SHA1, and thus cannot be expected to
> become lower.

This is just a fundamental property of the pack index file format.
The file *MUST* have 1064 bytes of fixed overhead, plus 24 bytes of
data per object indexed.  So the fixed overhead amortizes very
quickly over the individual object entries, at which point it's
effectively 24 bytes per entry.  This all of course assumes a 32-bit
index (which is the current format).
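
The arithmetic is easy to check.  The sketch below assumes my reading
of the current index layout, which is not spelled out above: a
256-entry fan-out table of 32-bit counts, one (32-bit offset, 20-byte
SHA1) pair per object, and two trailing 20-byte SHA1 checksums.  The
function names are made up for the example.

```python
FANOUT_BYTES = 256 * 4       # fan-out table: one 32-bit count per first byte
TRAILER_BYTES = 20 + 20      # SHA1 of the pack + SHA1 of the index itself
FIXED_OVERHEAD = FANOUT_BYTES + TRAILER_BYTES   # the 1064 bytes quoted above
ENTRY_BYTES = 4 + 20         # 32-bit pack offset + object SHA1


def index_size(n_objects: int) -> int:
    """Total size in bytes of an index covering n_objects objects."""
    return FIXED_OVERHEAD + ENTRY_BYTES * n_objects


def bytes_per_object(n_objects: int) -> float:
    """Index bytes attributable to each object, fixed overhead included."""
    return index_size(n_objects) / n_objects


for n in (1, 10, 100, 2500, 100000):
    print(f"{n:>7} objects: {index_size(n):>9} bytes, "
          f"{bytes_per_object(n):8.2f} bytes/object")
```

By 2500 objects the fixed part contributes less than half a byte per
object, which lines up with the threshold Jonas saw in his numbers.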

The thing is the Mozilla index is 44 MiB.  That's roughly 1.9 million
objects.  The index itself is larger than the entire git.git pack.
On a large repository the index ain't trivial...  yet it's essential
to performance!
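
Going the other way, from an index size back to an object count, is
the same formula inverted.  A small sketch, taking the 44 MiB figure
at face value (it is a rounded number, so the result is only an
estimate); objects_in_index is a made-up helper name.

```python
FIXED_OVERHEAD = 1064   # bytes of fixed overhead in the index
ENTRY_BYTES = 24        # bytes per indexed object
MIB = 1024 * 1024


def objects_in_index(index_bytes: int) -> int:
    """Estimate how many objects an index of the given size covers."""
    return (index_bytes - FIXED_OVERHEAD) // ENTRY_BYTES


estimate = objects_in_index(44 * MIB)
print(f"44 MiB index covers about {estimate / 1e6:.2f} million objects")
```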

On the other hand the 1064 bytes of fixed overhead in the index
is nothing compared to the overhead in, say, an RCS file.  Or an
SVN repository...  :-)

What I failed to point out in my script (or in my email) is that
the 24 bytes of index entry cannot be eliminated, and thus must
be added to the "revision cost".  In some cases it's about the same
size as the deltified revision in the pack file.  :-(
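
Put as arithmetic, with purely illustrative delta sizes (these are
not measurements from any repository, and the helper names are made
up):

```python
INDEX_ENTRY_BYTES = 24   # per-object cost in the pack index


def revision_cost(delta_bytes: int) -> int:
    """Bytes one revision really costs: its packed delta plus its index entry."""
    return delta_bytes + INDEX_ENTRY_BYTES


def index_share(delta_bytes: int) -> float:
    """Fraction of the revision cost that is index entry rather than delta."""
    return INDEX_ENTRY_BYTES / revision_cost(delta_bytes)


# Hypothetical packed delta sizes, in bytes.
for delta in (24, 100, 1000):
    print(f"delta {delta:>5} B -> cost {revision_cost(delta):>5} B, "
          f"index share {index_share(delta):.0%}")
```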

-- 
Shawn.