On Mon, 21 Aug 2006, Jon Smirl wrote:

> How about making the length of delta chains an exponential function of
> the number of revs? In Mozilla, configure.in has 1,700 revs and it is
> a 250K file. If you store a full copy every 10 revs that is 43MB
> (prezip) of data that almost no one is going to look at. The chain
> lengths should reflect the relative probability that someone is going
> to ask to see the revs. That is not at all a uniform function.

1) You can do that already with stock git-repack (see the example at
   the end of this mail).

2) The current git-pack-objects code can and does produce a delta
   "tree" off of a single base object. It doesn't have to be a linear
   list. Therefore, even if the default depth is 10, you may as well
   have many deltas pointing to the _same_ base object, effectively
   making the delta-to-base ratio much better than 10 to 1. If for
   example each object has 2 delta children, and each of those deltas
   also has 2 delta children, you could have up to 39366 delta objects
   attached to a _single_ undeltified base object. And of course the
   git-pack-objects code doesn't limit the number of delta children in
   any way, so this could theoretically be unbounded even though the
   max depth is 10. OK, the delta matching window limits that somewhat,
   but nothing prevents you from repacking with a larger window since
   that parameter has no penalty on the reading of objects out of the
   pack.

> Personally I am still in favor of a two pack system. One archival
> pack stores everything in a single chain and size, not speed, is its
> most important attribute. It is marked readonly and only functions as
> an archive; git-repack never touches it. It might even use a more
> compact compression algorithm.
>
> The second pack is for storing more recent revisions. The archival
> pack would be constructed such that none of the files needed for the
> head revisions of any branch are in it. They would all be in the
> second pack.

Personally I'm still against that. All arguments put forward for a
different or multiple packing system are based on unproven assumptions
so far, and none of them presents a significant advantage over the
current system. For example, I was really intrigued at first by the
potential of grouping objects into a single zlib stream, but it turned
out not to be so great after actual testing.

I still think that a global zlib dictionary is a good idea. Not because
it looks like it'll make packs enormously smaller, but rather because
it imposes no performance regression over the current system (see the
sketch at the end of this mail). And I strongly believe that you need a
really strong case, something like packs over 25% smaller, before
promoting a solution that carries a performance regression. So far that
hasn't happened.

> This may be a path to partial repositories. Instead of downloading
> the real archival pack I could download just an index for it. The
> index entries would be marked to indicate that these objects are
> valid but not present.

An index without the actual objects is simply useless. You could do
without the index entirely in that case anyway, since the mere presence
of an entry in the index doesn't give you anything really useful.


Nicolas
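
P.S. To make point 1) concrete: the chain depth and the delta matching
window are both ordinary git-repack parameters, so giving history
longer chains is just a repack away. The numbers below are purely
illustrative, pick whatever suits your repository:

	git repack -a -d --depth=50 --window=50

As said above, a larger --window only costs time during the repack
itself; it has no impact on how fast objects are read back out of the
resulting pack.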
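
P.P.S. Since the global zlib dictionary keeps coming up, here is a
minimal sketch of the idea against the stock zlib API, and nothing more
than a sketch: "dict" is a hypothetical precomputed dictionary blob
that every pack reader would have to share, and the reading side would
call inflateSetDictionary() with the same bytes once inflate() returns
Z_NEED_DICT.

#include <string.h>
#include <zlib.h>

/*
 * Deflate "in" into "out" against a shared preset dictionary.
 * "dict" is a hypothetical precomputed dictionary blob, not anything
 * that exists in git today.
 */
static int deflate_with_dict(const unsigned char *in, uInt in_len,
			     unsigned char *out, uInt out_len,
			     const unsigned char *dict, uInt dict_len)
{
	z_stream s;
	int ret;

	memset(&s, 0, sizeof(s));
	if (deflateInit(&s, Z_BEST_COMPRESSION) != Z_OK)
		return -1;

	/* must be called after deflateInit() and before the first deflate() */
	if (deflateSetDictionary(&s, dict, dict_len) != Z_OK) {
		deflateEnd(&s);
		return -1;
	}

	s.next_in = (unsigned char *)in;
	s.avail_in = in_len;
	s.next_out = out;
	s.avail_out = out_len;
	ret = deflate(&s, Z_FINISH);
	deflateEnd(&s);

	return ret == Z_STREAM_END ? 0 : -1;
}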