Re: Pack v4 again..

Nicolas Pitre <nico@xxxxxxxxxxx> · Sun, 15 Feb 2015 23:59:02 -0500 (EST)

On Fri, 13 Feb 2015, Duy Nguyen wrote:

> After taking 1.5 years "vacation" from pack v4, I plan to do something
> about it again. Will post more when I have some patches to discuss.
> Only one question for now (forgive me if I asked already, it's been
> quite some time)

Yeah.  I had to re-study my own code before replying.

> I think pack v4 does not deliver its best promise that walking a tree
> is simply following pointers and jumping from place to place. When we
> want to copy from the middle of another tree, we need to scan from the
> beginning of the tree. Tree offset cache helps, but the problem
> remains. What do you think about an alternative format that each
> "copy" instruction includes both index of the tree entry to copy from
> (i.e. what we store now)  _and_ the byte offset from the beginning of
> the tree? With this byte offset, we know exactly where to start
> copying without scanning from the beginning. It will be a bit(?)
> bigger, but it's also faster.

That would make the format inflexible.  If we want to do partial 
repacking by, say, copying some objects and repacking others (some 
objects might need repacking because the objects they refer to are 
omitted from the repack) then if those repacked objects are themselves 
referred to by byte offset then we lose as the offset is no longer 
valid.

> I imagine this is an optimization that can be done locally. The pack
> transferred over network does not have these byte offsets. After the
> pack is stored and verified by index-pack, we can rewrite it and add
> this info. The simplest way is use a fixed size for this offset (e.g.
> uint16_t or even uint8_t), add the place holder in copy instructions
> of all v4 trees. After that object offsets will not change again and
> we can start filling real offsets to placeholders.

Having a local extra index is fine.  Just like the pack index which is 
always created locally and can be recreated at any time.  Some tree 
offset cache might be beneficial, but I'd avoid making it into the pack 
file itself.

Yet, I think the biggest problem with pack v4 at the moment is the 
packing algorithm for tree objects.  We are piggy-backing on the pack v2 
object delta compression sorting and that produces suboptimal results 
due to deep recursions.  And it is the recursion that kills us. The pack 
v4 requires a new packing algorithm for its tree objects.

What I imagined is something like this:

- Each canonical tree entry is made of a SHA1, mode and path.  Let's 
  assume this is hashed into a 24-bit value.

- Each tree object can therefore be represented as a string of 24-bit 
  "characters".

- Delta-compressing a tree object becomes a substring search where we 
  try to replace a sequence of "characters" with the longest "string" 
  possible from another object.  Repeat with the remaining sequences.

Having a 24-bit hash value is totally arbitrary.  It could be 16 bits 
with more collisions but much faster search and less memory usage.  The 
optimal value would need to be determined after some experimentation.

Algorithms for the longest common substring problem already exist.  So 
one of the classical algorithms could probably be adapted here.

This would allow for exploiting the provision in pack v4 to copy from 
more than one tree object.  And this would also favor shallower 
recursions and even smaller packs.  Imposing a minimum substring length 
(rather than a maximum delta depth) would determine the runtime 
performance when using the pack afterwards.

If you have enough free cycles to work on this, that's what I'd suggest 
you explore at this point. I wish I could myself as I think this ought 
to be rather cool work.

Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html