Re: [PATCH] Keep last used delta base in the delta window

David Kastrup <dak@xxxxxxx> · Mon, 27 Aug 2007 23:04:20 +0200

mkoegler@xxxxxxxxxxxxxxxxx (Martin Koegler) writes:

> On Mon, Aug 27, 2007 at 12:07:58PM +0200, David Kastrup wrote:
>> Well, given the little amount of spare time I have for personal
>> projects, I should not go flaunting them too much, but anyway...
>> 
>> I am just drafting implementing a delta-diff size estimator: for every
>> hash value it does not store hash chains or pointers to memory images,
>> but just a bit map of at most 32 (possibly 64 where ulong delivers it,
>> possibly even just 16 bits when one restricts oneself to 16 element
>> windows in order to save memory) delta window files, telling which of
>> the files show such a hash value anywhere. 
>
> How do you want to select these few hash values?

Hm?  Are you familiar with diff-delta.c?  I was not going to change
the hash value computation in there.  It is a 32bit CRC over 16byte
passages: the delta source is checksummed with a stride of 16 bytes
and the resulting CRC values are masked to a convenient number of bits
and used as index into a hash table with disambiguation to actual code
passages using linked lists.  Then the destination delta is
checksummed in the same manner but with a stride of 1, and the sums
are looked up in the hash until a matching prefix is found.

So I was going to do the same calculation, but look up a bit mask
rather than a linked list (namely calculating under the assumption
that a hash hit implies an actual match).

> I eg. dump database tables as SQL statements in per table files and
> commit all changes. All lines in each file share a common prefix
> (INSERT INTO tablename (COL1, COL2, ...) VALUES). Statistics showed,
> that up to 50% of the hash elements of the delta index were dropped,
> because they had equal hash values.

Yes, I am aware of that.  One somewhat crazy consequence of that is
that git's deltaing works better on Linux style indentation than on
GNU style indentation.

> With you apropach:
>
> If rare used values are selected as the 32 hash values, you will
> probably regard files, which contain different tables (and therefore
> a different prefix), as "similar", although the files have not very
> much similarity otherwise.

It's statistics.  Random matches will tend to level out.

> If you select the hash values of the prefixes, you will create for
> each table one cluster (if the table count is < 32/64). This already
> happens by using the hash value of the path name in the sorting.
>
> I don't belive, that your idea will improve very much in this case.
> I fear, that it will be even worser than the current algorithm.

Huh?  I am not replacing the current algorithm.  I am doing some
upfront estimates for getting a good order for doing the current
algorithm, so that I might abort parts of it early when I know they
can't improve things.  If the estimates are completely random and/or
useless, it will mean that you get a slowdown by a comparatively small
constant factor.

>> When a file is considered for deltaing, it is run once against this
>> delta-diff estimator which does not even look at the files again.
>> This delivers a lower bound for the size a delta towards each of
>> the delta candidates can take.  The candidates are then sorted in
>> increasing size of expected delta size and are considered in turn.
>> Once a delta has a lower size than the lowest bound for further
>> delta candidates, no further candidates need to get considered.
>
> So this means, that we have to hash the target entry too (which
> could be merged with creating the delta index).

Huh?  I don't see what you are getting at.

> We currently only hash an entry, if it is really needed as source in
> a delta-diff.

Why would you want to hash a non-candidate?

>> Also the calculation of a delta can get aborted once it exceeds the
>> size of a previous delta.
>
> This already happens.

Ah yes.  Overlooked this.  Presorting the candidates should be helpful
right away then.

> Do you want to do your estimations only against a small window of
> objects or all objects?

The estimates are made against the candidates in the window.  When a
candidate leaves the window, its bit in the bit masks gets reused for
the next candidate entering the window.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html