Re: [RFC] Add --create-cache to repack

Shawn Pearce <spearce@xxxxxxxxxxx> · Sun, 30 Jan 2011 11:29:49 -0800

On Sat, Jan 29, 2011 at 22:51, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Shawn Pearce <spearce@xxxxxxxxxxx> writes:
>
>> I fully implemented the reuse of a cached pack behind a thin pack idea
>> I was trying to describe in this thread.  It saved 1m7s off the JGit
>> running time, but increased the data transfer by 25 MiB.  I didn't
>> expect this much of an increase; I honestly expected the thin pack
>> portion to be, well, thinner.  The issue is the thin pack cannot delta
>> against all of the history; it's only delta compressing against the tip
>> of the cached pack.  So long-lived side branches that forked off an
>> older part of the history aren't delta compressing well, or at all,
>> and that is significantly bloating the thin pack.  (It's also why that
>> "newer" pack is 57M, but should be 14M if correctly combined with the
>> cached pack.)  If I were to consider all of the objects in the cached
>> pack as potential delta base candidates for the thin pack, the entire
>> benefit of the cached pack disappears.
>
> What if you instead use the cached pack this way?
>
>  0. You perform the proposed pre-traversal until you hit the tip of cached
>    pack(s), and realize that you will end up sending everything.
>
>  1. Instead of sending the new part of the history first and then sending
>    the cached pack(s), you send the contents of cached pack(s), but also
>    note what objects you sent;

This is the part I was trying to avoid.  Making this list of objects
from the cached pack(s) costs working set inside of the pack-objects
process.  I had hoped that the cached packs would let me skip this
step.

But let's say that's an acceptable cost.  Even then, we cannot
efficiently make a useful list of objects from the pack.  Scanning the
.idx file only tells us the SHA-1 (and pack offset) of each object.  It
does not tell us the type, nor does it tell us what the path hash code
would be for the object if it were a tree or blob.  So we cannot
efficiently use this pack listing to construct the delta window.
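
To make the limitation concrete, here is a rough Python sketch of what
a scan of a version-2 .idx yields.  The names read_idx_v2 and
make_idx_v2 are made up for this illustration (they are not git or
JGit functions), and the toy index built by make_idx_v2 leaves out the
trailing checksums.  The only per-object fields to read are the SHA-1,
a CRC32 and the pack offset; there is no type and no path anywhere.

```python
import struct

IDX_MAGIC = b"\xfftOc"  # signature of a version-2 pack index


def read_idx_v2(data):
    """Return [(sha1_hex, pack_offset)] from a version-2 .idx image."""
    if data[:4] != IDX_MAGIC or struct.unpack(">I", data[4:8])[0] != 2:
        raise ValueError("not a version-2 pack index")
    fanout = struct.unpack(">256I", data[8:8 + 1024])
    count = fanout[255]             # last fanout slot is the object count
    sha_at = 8 + 1024               # sorted 20-byte object names
    crc_at = sha_at + 20 * count    # one CRC32 per object
    ofs_at = crc_at + 4 * count     # one 4-byte offset per object
    big_at = ofs_at + 4 * count     # 8-byte offsets for very large packs
    entries = []
    for i in range(count):
        sha = data[sha_at + 20 * i:sha_at + 20 * (i + 1)]
        (ofs,) = struct.unpack(">I", data[ofs_at + 4 * i:ofs_at + 4 * i + 4])
        if ofs & 0x80000000:        # MSB set: index into the 8-byte table
            j = big_at + 8 * (ofs & 0x7FFFFFFF)
            (ofs,) = struct.unpack(">Q", data[j:j + 8])
        entries.append((sha.hex(), ofs))
    return entries                  # note: no type, no path hash


def make_idx_v2(entries):
    """Hypothetical demo helper: toy idx without trailer checksums."""
    entries = sorted(entries)       # [(sha1_bytes, offset)], by name
    fanout = [0] * 256
    for sha, _ in entries:
        for b in range(sha[0], 256):
            fanout[b] += 1          # cumulative count by first byte
    out = IDX_MAGIC + struct.pack(">I", 2) + struct.pack(">256I", *fanout)
    out += b"".join(sha for sha, _ in entries)
    out += b"\0" * (4 * len(entries))   # CRC32 slots, zeroed in this toy
    out += b"".join(struct.pack(">I", ofs) for _, ofs in entries)
    return out
```

Anything the delta search needs beyond the name and offset has to come
from opening the pack itself, which is the cost I was hoping to avoid.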

>  2. Then you send the new part of the history, taking full advantage of
>    what you have already sent, perhaps doing only half of the reuse-delta
>    logic (i.e. you reuse what you can reuse, but you do _not_ punt on an
>    object that is not a delta in an existing pack).

Well, I guess we could go half-way.  We could try to use only
non-delta objects from the cached pack as potential delta bases for
this delta search.

To do that we would build the reverse index for the cached pack, then
check each object's type code just before we send that part of the
cached pack.  If it's non-delta, we can get its SHA-1 from the reverse
index, toss the object into the delta search list, and copy out the
object's bytes up to the point where the next object starts.
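
A rough Python sketch of that walk, under some assumptions: the names
object_type and non_delta_bases are invented for this illustration, the
index is already available as (sha1, offset) pairs, and the whole pack
is held in memory as bytes.  The type code lives in bits 6..4 of the
first header byte of each object; codes 1-4 are commit, tree, blob and
tag, while 6 and 7 are the two delta forms.

```python
NON_DELTA_TYPES = (1, 2, 3, 4)   # commit, tree, blob, tag
PACK_TRAILER_LEN = 20            # SHA-1 checksum at the end of the pack


def object_type(pack, offset):
    """Type code stored in bits 6..4 of the first object header byte."""
    return (pack[offset] >> 4) & 0x07


def non_delta_bases(pack, idx_entries):
    """Return [(sha1, offset, length)] for whole (non-delta) objects.

    idx_entries is [(sha1, offset)] as read from the .idx; sorting it
    by offset is the reverse index.
    """
    rev = sorted(idx_entries, key=lambda e: e[1])
    end = len(pack) - PACK_TRAILER_LEN
    bases = []
    for i, (sha, ofs) in enumerate(rev):
        # An object runs until the next object starts (or the trailer).
        next_ofs = rev[i + 1][1] if i + 1 < len(rev) else end
        if object_type(pack, ofs) in NON_DELTA_TYPES:
            bases.append((sha, ofs, next_ofs - ofs))
    return bases
```

Note this only recovers the SHA-1, type and on-disk length.  It still
gives no path, so no path hash code for ordering the delta window.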

However... I suspect our delta results would be the same as in the
"thin pack before cached pack" test I did earlier.  The objects that
are non-delta in the cached pack are (in theory) approximately the
objects immediately reachable from the cached pack's tip.  Those were
already put into the delta window as the base candidates for the thin
pack.
This may be a faster way to find that thin pack edge, but the data
transfer will still be sub-optimal because we cannot consider deltas
as bases.

-- 
Shawn.