Git pack v4: next step, help required

Nicolas Pitre <nico@xxxxxxxxxxx> · Wed, 18 Sep 2013 13:40:23 -0400 (EDT)

I think the pack v4 format and compatibility code is pretty stable now, 
and probably as optimal as it can reasonably be.

@Junio: might be a good idea to refresh your pu branch again.

Now the biggest problem to solve is the actual tree 
"deltification".  I don't have the time to spend on this otherwise very 
interesting problem (IMHO at least) so I'm sending this request for help 
in the hope that more people would be keen to contribute their computing 
skills to solve this challenge.

I'll make a quick description of the pv4 tree encoding first and explain 
the problem to solve afterwards.

A pv4 tree is comprised of two types of records: immediate tree entry 
and tree sequence copy.

1) Immediate Tree Entry

This is made of the following tuple:

	<path component index>,{SHA1 reference index>

The path component index refers to the path dictionary table where path 
strings and file modes are uniquely stored.  The SHA1 index refers to 
the sorted SHA1 table as previously found in the pack index but which is 
now part of the pack file itself.  So on average a single tree entry may 
take between 1 to 2 bytes for the path component index and 1 to 3 bytes 
for the SHA1 index.

2) Tree Sequence Copy

This is used to literally copy a contiguous list of tree entries from 
another tree object.  This goes as follows:

	<tree entry where to start>,<number of entries to copy>
	[,<SHA1 index of the object to copy from>]

So instead of having arbitrary copying of data like in delta objects, we 
refer directly to tree entries in another object.  The big advantage 
here is that the tree walker may directly jump to the copied object 
without having to do any delta patching and caching. The SHA1 index is 
optional if it refers to the same copied object as the previous tree 
sequence copy record in this tree object.  And another possible 
optimization for the tree walker when enumerating objects is to skip the 
copy entry entirely if the object being copied from has already been 
enumerated.

The size of this entry is more variable and is typically between 1 to 2 
bytes for the start index, 1 to 2 bytes for the copy size, and 0 to 3 
bytes for the SHA1 index.

So... what to do with this?

Currently the "deltification" is achieved using a pretty dumb heuristic 
which is to simply take each tree object in a pack v2 with their 
corresponding base delta object and perform a straight conversion into 
pack v4 tree format i.e. use copy records whenever possible as long as 
they represent a space saving over the corresponding immediate tree 
entries (which is normally the case).

However this is rather sub-optimal for two reasons:

1) This doesn't benefit from the ability to use multiple base objects to 
   copy from and is potentially missing on additional opportunities for 
   copy sequences which are not possible in the selected base object 
   from the pack v2 delta base selection. Pack v4 is already quite 
   smaller than pack v2 yet it might possibly be smaller still.

2) This makes deep delta chains very inefficient at runtime when the 
   pack is subsequently accessed.

Let's consider this example to fully illustrate #2.

	Tree A:
	entry 0:	"foo.txt"
	entry 1-3:	copy from tree B: start=2 count=3

	Tree B:
	entry 0-5:	copy from tree C: start=2 count=5
	entry 6:	"bar.txt"

	Tree C:
	entry 0:	"file_a"
	entry 1:	"file_b"
	entry 2:	"file_c"
	entry 3:	"file_D"
	entry 4:	"file_E"
	entry 5:	"file_F"
	entry 6:	"file_G"

This is a good example of what typically happens when the above heuristic 
is applied.  And it is not uncommon to see a long indirection chain of 
"copy those 2 entries from that other object" sometimes reaching 50 
levels deep where the same 2 (or more) entries require up to 50 object 
hops before they can be obtained.

Obviously here the encoding should be optimized to reduce the chaining 
effect simply by referring to the final object directly.  Hence tree A 
could be encoded as follows:

	Tree A:
	entry 0:	"foo.txt"
	entry 1-3:	copy from tree C: start=4 count=3

The on-disk encoding is likely to be the same size but the runtime 
access is greatly optimized.

Now... instead of trying to do reference simplification by factoring out 
the chain effect, I think a whole new approach to tree delta compression 
for pack v4 should be developed which would also address issue #1 above.

For example, we may try to make each canonical tree objects into 
sequences of hashed tree entries in memory where each tree entry would 
be reduced down to a single CRC32 value (or even Adler32 for speed).  
Each tree object would then be represented by some kind of 32-bit 
character "string".  Then it is just a matter of applying substring 
matching algorithms to find the best copy sequences without creating 
reference cycles, weighted by the number of indirections involved when 
a referred object already has a copy sequence covering part or all of 
the matched string. Etc.

Or that can be something else entirely.  Certainly there are similar 
problems already solved in the literature somewhere.

And it has to be _fast_.

So if someone is on the lookout for a nice algorithmic and coding 
challenge then here's your opportunity!

Nicolas

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html