Re: [PATCH] diff-delta: produce optimal pack data

Linus Torvalds <torvalds@xxxxxxxx> · Fri, 24 Feb 2006 16:45:23 -0800 (PST)

On Fri, 24 Feb 2006, Nicolas Pitre wrote:
> 
> Currently, diff-delta takes blocks of data in the reference file and 
> hashes them.  When the target file is scanned, it uses the hash to match 
> blocks from the target file with the reference file.
> 
> If blocks are hashed evenly the cost of producing a delta is at most 
> O(n+m) where n and m are the size of the reference and target files 
> respectively.  In other words, with a good data set the cost is linear.

Assuming the hash is good, of course.

I think this was the problem with you trying something simpler than 
adler32..

> But if many blocks from the reference buffer hash to the same bucket, 
> then each block in the target file has to be tested against many blocks 
> from the reference buffer, making the cost tend towards O(n*m), which is 
> quadratic rather than linear.
> 
> The solution I'm investigating is to put a limit on the number of 
> entries in the same hash bucket so to bring the cost back to something 
> more linear.  That means the delta might miss on better matches that 
> have not been hashed but still benefit from a limited set.

Sounds fair enough.
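
To make the bucket-limit idea concrete, here is a rough sketch in Python 
(not the actual diff-delta C code). The block size, the cap of 4 entries 
and the helper names build_index() and candidates() are all made up for 
illustration; adler32 is used only because it is the checksum mentioned 
above.

```python
import zlib
from collections import defaultdict

BLOCK_SIZE = 16          # hypothetical block size, not taken from diff-delta
MAX_BUCKET_ENTRIES = 4   # hypothetical cap on entries per hash bucket


def build_index(reference, block_size=BLOCK_SIZE, max_entries=MAX_BUCKET_ENTRIES):
    """Hash fixed-size blocks of the reference, keeping at most
    max_entries offsets per bucket."""
    index = defaultdict(list)
    for offset in range(0, len(reference) - block_size + 1, block_size):
        key = zlib.adler32(reference[offset:offset + block_size])
        bucket = index[key]
        # Once a bucket is full, later blocks with the same hash are
        # dropped.  A better match may be missed, but the work done per
        # target block is bounded by max_entries.
        if len(bucket) < max_entries:
            bucket.append(offset)
    return index


def candidates(index, target, pos, block_size=BLOCK_SIZE):
    """Return the reference offsets worth comparing against target[pos:]."""
    key = zlib.adler32(target[pos:pos + block_size])
    return index.get(key, [])
```

With a degenerate reference such as 1600 identical bytes, every block 
lands in the same bucket, and the cap means each target block is compared 
against at most 4 reference blocks instead of all 100.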

However, you might also want to consider another approach..

One of the biggest costs for the xdelta algorithm is probably just the 
"delta_prepare()", but at the same time, that is constant wrt the source 
buffer.

Now, the sad part is that when I wrote pack-objects, I didn't really 
understand the diff-delta algorithm, I just plugged it in. Which means 
that when I did it, I made the (obvious and simple) decision to keep the 
_result_ that we are looking at constant, and try to delta against 
different sources.

HOWEVER.

I suspect you already see where this is going..

We _could_ switch the "pack-objects" window handling around, and instead 
of looking at the object we want to pack, and looking at the ten (or 
"window") previous objects to delta against, we could do it the other way 
around: keep the object we delta against constant, and see what deltas we 
could prepare for the ten next objects.

And since the source would now be constant, you'd need to do the 
"delta_prepare()" just _once_ per window, instead of every single time.
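
As a toy illustration of the difference (Python, not the pack-objects 
code), the sketch below counts how many times the index has to be built 
under each window ordering. toy_prepare() and toy_delta() are hypothetical 
stand-ins for the real prepare and delta steps, and the block size is 
arbitrary.

```python
BLOCK = 4  # arbitrary block size for the toy index


def toy_prepare(source):
    """Stand-in for the expensive step: index the blocks of the source."""
    return {source[i:i + BLOCK]: i
            for i in range(0, len(source) - BLOCK + 1, BLOCK)}


def toy_delta(index, target):
    """Stand-in for the delta step: count target blocks found in the index."""
    return sum(1 for i in range(0, len(target) - BLOCK + 1, BLOCK)
               if target[i:i + BLOCK] in index)


def target_fixed(objects, window):
    """Current ordering: fix the target, try the previous `window` sources."""
    prepares = 0
    for i, target in enumerate(objects):
        for source in objects[max(0, i - window):i]:
            index = toy_prepare(source)   # rebuilt for every (source, target) pair
            prepares += 1
            toy_delta(index, target)
    return prepares


def source_fixed(objects, window):
    """Inverted ordering: fix the source, try the next `window` targets."""
    prepares = 0
    for i, source in enumerate(objects):
        targets = objects[i + 1:i + 1 + window]
        if not targets:
            continue
        index = toy_prepare(source)       # built once, reused for the whole window
        prepares += 1
        for target in targets:
            toy_delta(index, target)
    return prepares
```

Both loops try exactly the same (source, target) pairs. The first one 
prepares an index once per pair, the second at most once per object, so 
the saving grows with the window size.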

Now, I haven't done any profiling on the diff-delta code, and maybe my 
guess that delta_prepare() is a pretty expensive part is wrong, and maybe 
it wouldn't help to switch the window probing around. But I thought I'd 
mention it as one thing to explore..

		Linus