Re: Pack v4 again..

Shawn Pearce <spearce@xxxxxxxxxxx> · Mon, 16 Feb 2015 20:12:33 -0800

On Sun, Feb 15, 2015 at 10:45 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Sun, Feb 15, 2015 at 11:59:02PM -0500, Nicolas Pitre wrote:
>
>> Yet, I think the biggest problem with pack v4 at the moment is the
>> packing algorithm for tree objects.  We are piggy-backing on the pack v2
>> object delta compression sorting and that produces suboptimal results
>> due to deep recursions.  And it is the recursion that kills us. The pack
>> v4 requires a new packing algorithm for its tree objects.
>>
>> What I imagined is something like this:
>>
>> - Each canonical tree entry is made of a SHA1, mode and path.  Let's
>>   assume this is hashed into a 24-bit value.
>>
>> - Each tree object can therefore be represented as a string of 24-bit
>>   "characters".
>>
>> - Delta-compressing a tree object becomes a substring search where we
>>   try to replace a sequence of "characters" with the longest "string"
>>   possible from another object.  Repeat with the remaining sequences.
>
> Somewhat related to this, I was playing this weekend with the idea of
> generating fast tree diffs from our on-disk deltas. That is, given a
> base tree and a binary delta against it, could I reliably reproduce a
> diff (one way or the other) in O(size of diff), rather than O(size of
> tree)?

Yes, if you always make the tree diff *search* on entry boundaries.

> The conclusion I came to was "no, you cannot do it in the general case
> of byte-wise binary diffs"[2].

This is also why you cannot binary search inside of the canonical tree
format. :(

> If we knew that our deltas were always produced on entry-boundaries (a
> "character" in your description above), this would be much simpler.

Eons ago Nico and I were of the opinion that pack v4 trees could use
the existing byte based delta format on disk, but the delta
search/encoder would always align to fixed width entry boundaries.
That gives you deltas that are understandable by the current decoder,
but are also trivially processed in delta format as insertions and
copy runs always cover complete entries and are never a partial entry.

It was all theory; we never actually wrote a prototype of that.

> [1] Of course there are other reachability checks besides packing, like
>     git-prune. But all of those are even better sped up by using
>     reachability bitmaps. Packing, as the place where we generate the
>     bitmaps, and which is sensitive to things like traversal order,
>     remains the place where we will always need to actually walk.

`git log -- $path` isn't trivially improved with reachability bitmaps.
Its something we have been pondering a lot at $DAY_JOB and haven't
found a magic bullet solution for yet. Until someone comes up with
another chunk of magic, we need tree diffs for a lot of operations.

> [2] One option, of course, is to generate byte-wise deltas, but with a
>     promise to always align them on entry boundaries. I'm tempted by
>     this, because the result would be readable by existing packv2
>     readers. We'd have to set a flag somewhere that indicates the pack
>     was written with this property, though.

Yes, that was always something we wanted to look at doing.

> [3] I suspect you could come up with some heuristic that finds tree
>     entry boundaries with high probability, and in the low probability
>     case does not produce a wrong answer, but instead has to walk all
>     the way back to the beginning of the tree. That would be fine here.
>     But frankly, this "walk backwards" thing was just the straw that
>     broke the camel's back for me in working on this. Handling all the
>     possible cases was ending up quite complicated.

No, I tried this in JGit once. You can't do it reliably enough.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html