Re: [PATCH] diff-delta: produce optimal pack data

On Fri, 24 Feb 2006, Junio C Hamano wrote:

> I haven't looked at Nico's original or updated code closely at
> all, but two things come to mind.
> 
> (1) if we could tell the particular data is intrinsically
>     diff_delta unfriendly and diff_delta would waste too much
>     time when tried to delta against almost _any_ other blob,
>     then it might help to give an interface in diff-delta.c for
>     the caller to check for such a blob without even trying
>     diff_delta.
> 
> (2) otherwise, if diff_delta could detect it would spend too
>     many cycles to finish its work for a particular input early
>     on, we might want it to bail out instead of trying a
>     complete job.

I have a patch that implements a hybrid approach.

Currently, diff-delta takes blocks of data from the reference file and
hashes them.  When the target file is scanned, the hash table is used to
match blocks of the target file against the reference file.

If blocks hash evenly, the cost of producing a delta is at most
O(n+m), where n and m are the sizes of the reference and target files
respectively.  In other words, with a good data set the cost is linear.

But if many blocks from the reference buffer hash to the same bucket,
then each block in the target file has to be tested against many blocks
from the reference buffer, making the cost tend towards O(n*m), i.e.
quadratic behavior.

The solution I'm investigating is to put a limit on the number of
entries in the same hash bucket so as to bring the cost back to
something more linear.  That means the delta might miss better matches
that have not been hashed, but it still benefits from a limited set.
Experience seems to show that the time to deltify the first two blobs
you found to be problematic can be reduced by two orders of magnitude
with only about a 10% increase in the resulting delta size, which is
still nearly 40% smaller than what the current delta code produces.

The question is how to determine the best limit on the number of entries 
in the same hash bucket.


Nicolas
