Pierre Habouzit <madcoder@xxxxxxxxxx> writes:

> I've been trying to see if that optimization was used, but I was
> somehow unable to find out whether it was the case, as the code is a
> bit tough :)
>
> I was wondering if the repacking window was using the triangle
> inequality to avoid trying some costly deltas (I assume that what
> costs the most in the repacking is computing the deltas). I mean, if
> you consider the "size" of a delta, I'm almost sure that it's very
> near a distance.
>
> So assuming that we know the delta sizes between any pair of
> reference objects in the window: if an object Od that we want to
> delta against the window is near enough one reference O1, then any
> Oi in the window for which len(δ(O1, Oi)) > 2 * len(δ(Od, O1)) holds
> is not worth investigating.

We do not keep track of the delta-size matrix between delta-base
candidates in the window, but I presume we could. The storage cost for
doing so is very cheap (window^2 * size_t). But we do not even compute
the distance matrix fully (I'll mention later why the above is not
window^2 * size_t / 2).

    1-----------------------------i   <= delta-base candidates
     \                             \
      D   <-- the target we are considering

Your idea is that if we want to find the cheapest delta-base, and we
find out that candidate #1 is close to our target D while candidate #i
is very far from candidate #1, then a delta to create D using
candidate #i as the base would be much bigger. If the distance space
were Euclidean, that would be a nice optimization. However, I do not
think deltification works that way, for two reasons.

The deltified representation of an object is a sequence of two kinds
of opcodes:

 (1) the N bytes in the target from the current point are a copy of
     the source at offset M;

 (2) the N bytes in the target from the current point are the
     following literal string.

The first form is two integers (and we represent short integers
efficiently); the second form is one integer plus N literal bytes.
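To make the two opcode forms concrete, here is a toy sketch in Python.
It is only an illustration of the idea, not git's actual delta code:
difflib stands in for the real matching algorithm, and the made-up
`toy_delta_size` uses a crude cost model (two integers per copy, one
integer plus N bytes per literal run), not git's real encoding.

```python
# Toy model of a deltified representation: a sequence of "copy" and
# "insert" opcodes.  NOT git's xdelta code; difflib is a stand-in.
from difflib import SequenceMatcher

def toy_delta(base: bytes, target: bytes):
    """Return opcodes that rebuild `target` from `base`."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, base, target).get_opcodes():
        if tag == "equal":                  # form (1): copy from source offset
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):  # form (2): literal bytes
            ops.append(("insert", target[j1:j2]))
        # "delete": bytes only in the base; nothing to emit for the target
    return ops

def toy_delta_size(ops):
    """Rough cost: ~2 integers per copy, 1 integer + N bytes per literal."""
    return sum(2 if op[0] == "copy" else 1 + len(op[1]) for op in ops)

base   = b"the quick brown fox jumps over the lazy dog"
target = b"the quick red fox jumps over the lazy dog again"
ops = toy_delta(base, target)
print(ops)
print(toy_delta_size(ops))
```

With mostly-shared content, nearly everything becomes cheap copy
opcodes, and the delta size is dominated by the literal runs.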
An efficient delta candidate is the one with the most content in
common with the target image. Saying "copy that part" is cheaper than
"here are the contents: ...". Notice that it does not matter how much
other cruft a candidate contains.

In an extreme case, if base candidate #1 is a prefix of candidate #i,
and the difference between the two has no commonality with target D,
then the delta representing D using base #1 and the delta using base
#i would have the same size and the same contents (as all offsets into
the source image for the copied parts would be the same). The tail
part of #i would not participate in the reconstruction of D at all.

In such a case you would measure a long distance between #1 and #i,
but only because #i contains very many unrelated contents that are not
shared with #1 (the difference has to be represented as "append these
bytes to reproduce #i, as we cannot copy them from #1"). That does not
mean #i has less content in common with D than #1 does. In fact, we
might even find common content between the tail part of #i and D that
the delta using base #1 had to represent as literals. In other words,
#i could well be a better candidate than #1.

The second reason is that deltification is not symmetric. If you
define the "distance" between #1 and #i as "the size of the delta that
reproduces #i using #1 as the base", then the distance from #1 to #i
is very different from the distance from #i to #1 (which is why the
matrix mentioned earlier would cost window^2 * size_t, not half of
that).
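That asymmetry is easy to see even in a toy model (again an
illustration using Python's difflib as a stand-in for git's actual
deltification; `toy_delta_size` and its cost accounting are made up):
dropping the tail of a long base costs almost nothing, while
recreating that same tail from a short base costs its full length in
literals.

```python
# Toy demonstration that delta "distance" is not symmetric.
# NOT git's delta code; difflib stands in for the real matcher.
from difflib import SequenceMatcher

def toy_delta_size(src: bytes, dst: bytes) -> int:
    """Rough delta cost: ~2 integers per copy opcode, 1 integer plus
    N bytes for each stretch of `dst` not copyable from `src`."""
    size = 0
    for tag, i1, i2, j1, j2 in SequenceMatcher(
            None, src, dst, autojunk=False).get_opcodes():
        if tag == "equal":
            size += 2                # copy: offset + length
        elif tag in ("replace", "insert"):
            size += 1 + (j2 - j1)    # literal: length + the bytes
        # "delete": bytes only in src cost nothing in the delta
    return size

short = b"common header\n" * 4
long_ = short + b"a large trailing section\n" * 40

# short -> long_ must carry the tail as literals;
# long_ -> short only needs copy opcodes.
print(toy_delta_size(short, long_), toy_delta_size(long_, short))
```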