Re: Pack v4 again..

Duy Nguyen <pclouds@xxxxxxxxx> · Mon, 16 Feb 2015 17:11:18 +0700

On Mon, Feb 16, 2015 at 11:59 AM, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:
>> I think pack v4 does not deliver its best promise that walking a tree
>> is simply following pointers and jumping from place to place. When we
>> want to copy from the middle of another tree, we need to scan from the
>> beginning of the tree. Tree offset cache helps, but the problem
>> remains. What do you think about an alternative format that each
>> "copy" instruction includes both index of the tree entry to copy from
>> (i.e. what we store now)  _and_ the byte offset from the beginning of
>> the tree? With this byte offset, we know exactly where to start
>> copying without scanning from the beginning. It will be a bit(?)
>> bigger, but it's also faster.
>
> That would make the format inflexible.  If we want to do partial
> repacking by, say, copying some objects and repacking others (some
> objects might need repacking because the objects they refer to are
> omitted from the repack) then if those repacked objects are themselves
> referred to by byte offset then we lose as the offset is no longer
> valid.

When generate new packs, we could rip out these offsets, but it
depends on whether we can reuse v4 trees unchanged like we do with v2
deltas. If we can, ripping is not an option because then we need to
parse more.

>> I imagine this is an optimization that can be done locally. The pack
>> transferred over network does not have these byte offsets. After the
>> pack is stored and verified by index-pack, we can rewrite it and add
>> this info. The simplest way is use a fixed size for this offset (e.g.
>> uint16_t or even uint8_t), add the place holder in copy instructions
>> of all v4 trees. After that object offsets will not change again and
>> we can start filling real offsets to placeholders.
>
> Having a local extra index is fine.  Just like the pack index which is
> always created locally and can be recreated at any time.  Some tree
> offset cache might be beneficial, but I'd avoid making it into the pack
> file itself.

Hm.. right.

> Yet, I think the biggest problem with pack v4 at the moment is the
> packing algorithm for tree objects.  We are piggy-backing on the pack v2
> object delta compression sorting and that produces suboptimal results
> due to deep recursions.  And it is the recursion that kills us. The pack
> v4 requires a new packing algorithm for its tree objects.

Yep. I made a conversion tool a few days ago to "flatten" v4 trees. So
if tree A copies some entries from B, but then B copies from C, tree A
could copy directly from C. Performance improves significantly (close
to v2 with rev-list, but still slower). But pack size doubles because
copy sequences are fragmented and we're duplicating same copy patterns
over and over again. All because we follow the single delta chain
decided since v2 time so we only have one tree with no copy sequences
(best to copy from).

> What I imagined is something like this:
>
> - Each canonical tree entry is made of a SHA1, mode and path.  Let's
>   assume this is hashed into a 24-bit value.
>
> - Each tree object can therefore be represented as a string of 24-bit
>   "characters".
>
> - Delta-compressing a tree object becomes a substring search where we
>   try to replace a sequence of "characters" with the longest "string"
>   possible from another object.  Repeat with the remaining sequences.
>
> Having a 24-bit hash value is totally arbitrary.  It could be 16 bits
> with more collisions but much faster search and less memory usage.  The
> optimal value would need to be determined after some experimentation.
>
> Algorithms for the longest common substring problem already exist.  So
> one of the classical algorithms could probably be adapted here.
>
> This would allow for exploiting the provision in pack v4 to copy from
> more than one tree object.  And this would also favor shallower
> recursions and even smaller packs.  Imposing a minimum substring length
> (rather than a maximum delta depth) would determine the runtime
> performance when using the pack afterwards.
>
> If you have enough free cycles to work on this, that's what I'd suggest
> you explore at this point. I wish I could myself as I think this ought
> to be rather cool work.

I'm getting there. I'm writing an alternative implementation to your
pv4_encode_tree() that takes multiple base trees instead of just one,
finding all copy sequences from these bases, then somehow pick the
best ones. After trees are sorted by similarity in pack-objects, we
preserve n trees as base trees (no copy sequences). There is a window
to feed some of these base trees to this encode function, like how we
do it in try_delta().

Your encoding tree entries as strings would be faster, but that's not
the immediate problem. Using the longest common substring is kinda
greedy (exactly what I'm thinking to do), but it probably produces a
suboptimal copy sequences. Maybe using two shorter copy sequences
would reduce fragmentation than one large copy sequence and a bunch of
individual tree entries, that sort of thing. Haven't worked it out
yet..
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html