Re: small question about the repack algorithm

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 07 Feb 2008 23:12:44 -0800

Pierre Habouzit <madcoder@xxxxxxxxxx> writes:

>   I've trying to see if that optimization was used but I was somehow
> unable to find if it was the case, as the code is a bit tough :)
>
>   I was wondering if the repacking window was using triangle inequality
> to discard trying some costly deltas (I assume that what costs the most
> in the repacking is computing the delta). I mean, if you consider the
> "size" of a delta, I'm almost sure that it's very near a distance.
>
>   So assuming that we know the delta sizes between any pair of reference
> objects in the window, well, if an object we want to delta against the
> window Od are near one reference O1 enough, for each Oi in the window
> that holds: len(δ(O1, Oi)) > 2 * len(δ(Od, O1)), then it's not worth
> investigating.

We do not keep track of the delta size matrix between delta-base
candidates in the window, but I presume we could.  The storage
cost for doing so is very cheap (window^2 * size_t).  But we do
not even compute the distance matrix fully (I'll mention the
reason why the above is not (window^2 * size_t / 2) later).

    1-----------------------------i   <= delta-base candidates
     \
      \ 
       D <-- the target we are considering

Your idea is that if we want to find the cheapest delta-base,
and after we find out that candidate #1 is close to our target D
and candidate #i is very far from candidate #1, then delta to
create D using candidate #i as the base would be much bigger.
If the distance space is Euclidean, that would be a nice
optimization.

However I do not think deltification would work that way, for
two reasons.

The deltified representation of an object is a sequence of two
kinds of opcodes:

 (1) the N bytes in the target from the current point is a copy
     of the source at offset M;

 (2) the N bytes in the target from the current point are the
     following literal string.

The first form is two integers (and we represent short integers
efficiently), and the second form is one integer plus N bytes
literal.

An efficient delta candidate is the one with most common
contents with the target image.  Saying "copy that part" is
cheaper than "Here is the contents.....".

Notice that it does not matter how much other cruft a target
contains.  In an extreme case, if the base candidate #1 is a
prefix of the candidate #i, and the difference does not have any
commonality with target D, then the delta to represent D using
the base #1 and the delta using the base #i would be of the same
size and the same contents (as all offsets from the source image
of copied parts would be the same).  The tail part of #i would
not participate reconstruction of D at all.

In such a case, you would have a long distance between #1 and
#i, but that is because #i has very many unrelated contents that
are not shared with #1 (so the difference has to be represented
as "Append these bytes to represent #i, as we cannot copy from #1".

But that does not mean #i has less common contents than #1 has
with D.  In fact, we might even find common contents between the
tail part of #i and D that the delta using base #1 needed to
represent as literals.  In other words, #i could well be a
better candidate than #1.

The second reason is that the deltification is not symmetric.
If you define the "distance" between #1 and #i as "the size of
delta to reproduce #i using #1 as base", the distance between #1
and #i is very different from the distance between #i and #1.
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html