Junio C Hamano <gitster@xxxxxxxxx> writes:

> Martin Koegler <mkoegler@xxxxxxxxxxxxxxxxx> writes:
>
>> Keeping the last used delta base object, if it would be dropped,
>> supports creating smaller packs with shorter delta chains.
>
> Instead of treating the "the last used one happened to be on the
> horizon -- try not to drop it" special case, I wonder if it makes
> sense to float the last used delta base object early in the window
> _after_ it is used.  Wouldn't we keep more than one very recently
> used delta base objects in the window that way?

Well, given the little amount of spare time I have for personal
projects, I should not go flaunting them too much, but anyway...

I am drafting an implementation of a delta-diff size estimator: for
every hash value it stores not hash chains or pointers to memory
images, but just a bit map of at most 32 delta window files (possibly
64 where unsigned long delivers that many bits, possibly just 16 when
one restricts oneself to 16-element windows in order to save memory),
telling which of the files contain that hash value anywhere.

When a file is considered for deltaing, it is run once against this
estimator, which does not even look at the candidate files again.
This delivers a lower bound for the size a delta against each of the
delta candidates can take.  The candidates are then sorted by
increasing estimated delta size and considered in turn.  Once a
computed delta is smaller than the lower bound of every remaining
candidate, no further candidates need to be considered.  The
calculation of a delta can also be aborted once it exceeds the size
of a previously found delta.  Somewhat more complex and error-prone
would be to keep correcting the estimated lower bound while a delta
is being created, dropping out as soon as it becomes impossible to
beat a previous delta.

The nice thing about this is that one can start out with a simplistic
estimator, as long as it is guaranteed never to deliver an estimate
that is too high.  As long as that is the case, the packs will never
get worse than they are now: a bad estimator just means that more
elements of the window need to be considered before the conclusion is
reached.

Since the estimator data structure does not point into any
memory-mapped files or hash chains scattered through memory, it
should be comparatively light on page thrashing (in comparison to an
actual delta run), which appears to me to be one of the main
performance problems, while delivering a reasonably useful lower
bound for all deltas in the window at once.

Its size needs to be large enough that the bit maps are dominated by
zeros: hash collisions that do not correspond to actually matching
data lead to underestimates, and if those become too many, the
pruning of comparisons becomes ineffective.  A match implies that at
least the next 2*RABIN_SIZE-1 (31) bytes could possibly be expressed
as a delta, and every further match after another RABIN_SIZE bytes
implies that another RABIN_SIZE bytes might get folded into the
previous copy, so false positives seriously distort the estimate.
There is still a good chance that a good candidate will land early in
the sort order even with such a distorted estimate, and will thus cut
off the deltaing of other candidates early, but of course it is far
more effective if their deltaing does not even have to start.

-- 
David Kastrup
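
Below is a minimal sketch of the estimator described above; it is not
existing git code.  RABIN_WINDOW, ESTIMATOR_BITS, window_hash() and
the other names are assumptions made for illustration, and a real
implementation would reuse the Rabin fingerprint from diff-delta.c
rather than the stand-in hash used here.

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	#define RABIN_WINDOW   16          /* bytes hashed per fingerprint (assumed) */
	#define ESTIMATOR_BITS 18          /* 2^18 hash buckets (assumed) */
	#define ESTIMATOR_SIZE (1u << ESTIMATOR_BITS)
	#define MAX_CANDIDATES 32          /* one bit per delta window slot */

	struct estimator {
		/* bit i of bucket h: candidate i contains a block hashing to h */
		uint32_t seen[ESTIMATOR_SIZE];
	};

	static void estimator_init(struct estimator *e)
	{
		memset(e, 0, sizeof(*e));
	}

	/* stand-in for the Rabin fingerprint over RABIN_WINDOW bytes */
	static uint32_t window_hash(const unsigned char *p)
	{
		uint32_t h = 5381;
		for (int i = 0; i < RABIN_WINDOW; i++)
			h = h * 33 + p[i];
		return h & (ESTIMATOR_SIZE - 1);
	}

	/* record every RABIN_WINDOW-sized block of candidate 'slot' (slot < 32) */
	static void estimator_add(struct estimator *e,
				  const unsigned char *buf, size_t len,
				  unsigned slot)
	{
		for (size_t i = 0; i + RABIN_WINDOW <= len; i += RABIN_WINDOW)
			e->seen[window_hash(buf + i)] |= 1u << slot;
	}

	/*
	 * One pass over the target: for each candidate, optimistically assume
	 * every hash hit lets a copy cover the next 2*RABIN_WINDOW-1 bytes.
	 * The estimate can only err on the low side, never the high side,
	 * so it remains a valid lower bound for a delta against that candidate.
	 */
	static void estimator_lower_bounds(const struct estimator *e,
					   const unsigned char *buf, size_t len,
					   unsigned ncand, size_t bound[])
	{
		size_t covered[MAX_CANDIDATES] = { 0 };
		size_t covered_to[MAX_CANDIDATES] = { 0 };

		for (size_t i = 0; i + RABIN_WINDOW <= len; i++) {
			uint32_t hits = e->seen[window_hash(buf + i)];
			while (hits) {
				unsigned c = (unsigned)__builtin_ctz(hits);
				size_t end = i + 2 * RABIN_WINDOW - 1;
				size_t from = i > covered_to[c] ? i : covered_to[c];

				hits &= hits - 1;
				if (end > from) {
					covered[c] += end - from;
					covered_to[c] = end;
				}
			}
		}
		for (unsigned c = 0; c < ncand; c++)
			bound[c] = covered[c] < len ? len - covered[c] : 0;
	}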
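
And a sketch of how such bounds might drive the window scan, this
time written against git's existing create_delta() interface from
delta.h, whose max_delta_size argument already gives the "abort once
it exceeds the size of a previous delta" behaviour.  struct candidate
and best_delta() are hypothetical names, not code that exists in
builtin-pack-objects.c.

	#include <stdlib.h>
	#include "delta.h"

	struct candidate {
		struct delta_index *index;   /* delta index of the base object */
		size_t lower_bound;          /* from estimator_lower_bounds() */
	};

	static int by_bound(const void *a, const void *b)
	{
		const struct candidate *x = a, *y = b;
		return (x->lower_bound > y->lower_bound) -
		       (x->lower_bound < y->lower_bound);
	}

	static void *best_delta(struct candidate *cand, unsigned ncand,
				const void *buf, unsigned long size,
				unsigned long *best_size)
	{
		void *best = NULL;

		*best_size = size;	/* never accept a delta larger than the data */
		qsort(cand, ncand, sizeof(*cand), by_bound);

		for (unsigned i = 0; i < ncand; i++) {
			unsigned long delta_size;
			void *delta;

			/* no remaining candidate can beat what we already have */
			if (*best_size <= cand[i].lower_bound)
				break;
			delta = create_delta(cand[i].index, buf, size,
					     &delta_size, *best_size);
			if (!delta)
				continue;	/* grew past the limit, aborted */
			free(best);
			best = delta;
			*best_size = delta_size;
		}
		return best;
	}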