On Fri, 24 Feb 2006, Linus Torvalds wrote:

> On Fri, 24 Feb 2006, Nicolas Pitre wrote:
> >
> > Currently, diff-delta takes blocks of data in the reference file and
> > hashes them.  When the target file is scanned, it uses the hash to match
> > blocks from the target file with the reference file.
> >
> > If blocks are hashed evenly, the cost of producing a delta is at most
> > O(n+m) where n and m are the size of the reference and target files
> > respectively.  In other words, with a good data set the cost is linear.
>
> Assuming the hash is good, of course.
>
> I think this was the problem with you trying something simpler than
> adler32..

Well, that's the compromise to make.

By default the version with adler32 used 16-byte blocks to index the
reference buffer.  That means you can match target data against the
reference only if whole 16-byte blocks match.  So if you fix a typo in
the target buffer you'll inevitably need 16 literal bytes in the delta
instead of only one, because you won't be able to resynchronize with the
reference buffer until the next 16-byte block.

What I've done in my last delta patch is to reduce that 16-byte block to
only 3 bytes.  Why 3 bytes?  Because anything shorter produces smaller
delta data if done with literal bytes directly, and 3 bytes provide
enough bits to hash.  I also made those 3-byte blocks overlap so that
indexing can start at any offset, with byte precision.  This really
allows for optimal deltas, so that they cannot be smaller.

Now the problem comes when indexing a reference file full of:

	0x46f8, 0x000b, 0x42fe, 0x0000, 0xffc0, 0x0001, 0xff00, 0x0008,
	0x03e0, 0x0009, 0x0f01, 0x0003, 0x8072, 0x0000, 0x0400, 0x0000,
	0x0046, 0x0003, 0x9180, 0x0001, 0x0003, 0x0008, 0x02eb, 0x0003,
	0x8072, 0x0000, 0x0400, 0x0000, 0x8010, 0x0008, 0x0010, 0x0000,
	0x0361, 0x0003, 0x037e, 0x0004, 0x3941, 0x0002, 0x0b0f, 0x0003,
	0x8072, 0x0000, 0x0400, 0x0000, 0x000a, 0x000b, 0x0346, 0x000c,
	0x11fe, 0x0000, 0x3717, 0x0003, 0x8072, 0x0000, 0x0400, 0x0000,
	0x8010, 0x0008, 0x000e, 0x0000, 0x0361, 0x0003, 0x8060, 0x0000,

There is a bunch of ", 0x" substrings that all hash to the same thing.
And in the second phase, i.e. when trying to find the best match in the
reference buffer for each of the many occurrences of ", 0x" in the
target buffer, you get a combinatorial explosion.

The adler32 version made that particular example a non-issue, since the
likelihood of many 16-byte blocks being identical is pretty low in this
case.  But the flaw remains if, for example, there are lots of similar
16-byte blocks, as in a binary file with lots of zeroes.  In fact, the
performance problem Carl is having occurs with the diff-delta version
still using adler32.

> > But if many blocks from the reference buffer do hash to the same bucket
> > then for each block in the target file many blocks from the reference
> > buffer have to be tested against, making it tend towards O(n^m) which is
> > pretty highly exponential.
> >
> > The solution I'm investigating is to put a limit on the number of
> > entries in the same hash bucket so as to bring the cost back to something
> > more linear.  That means the delta might miss on better matches that
> > have not been hashed but still benefit from a limited set.
>
> Sounds fair enough.

Testing appears to show that this is a worthwhile safety valve.  And in
most cases that safety valve should not be activated at all.
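To make that more concrete, here is a minimal sketch of the kind of cap
I have in mind, run over the hash table once the index has been
populated.  The struct layout, the HASH_LIMIT value and the plain chain
truncation are only illustrative, not the actual patch:

	struct index_entry {
		const unsigned char *ptr;
		struct index_entry *next;
	};

	#define HASH_LIMIT 64	/* arbitrary cap, for illustration */

	static void limit_hash_buckets(struct index_entry **hash,
				       unsigned int hash_size)
	{
		unsigned int i;

		for (i = 0; i < hash_size; i++) {
			struct index_entry *entry = hash[i];
			unsigned int count = 0;

			while (entry) {
				if (++count == HASH_LIMIT) {
					/*
					 * Cut the chain here: entries beyond
					 * this point are simply no longer
					 * indexed, so the match phase never
					 * walks more than HASH_LIMIT entries
					 * per bucket.  (Assuming the entries
					 * come from one array allocation,
					 * unlinking is enough and nothing
					 * needs to be freed.)
					 */
					entry->next = NULL;
					break;
				}
				entry = entry->next;
			}
		}
	}

Keeping entries spread over the whole buffer instead of just the head of
each chain would probably give better matches, but the point is only
that the lookup cost per target block becomes bounded.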
> However, you might also want to consider another approach..
>
> One of the biggest costs for the xdelta algorithm is probably just the
> "delta_prepare()", but at the same time, that is constant wrt the source
> buffer.

Actually it is not that costly.  Much, much less than computing the sha1
of the same buffer, for example.

> Now, the sad part is that when I wrote pack-objects, I didn't really
> understand the diff-delta algorithm, I just plugged it in.  Which means
> that when I did it, I made the (obvious and simple) decision to keep the
> _result_ that we are looking at constant, and try to delta against
> different sources.
>
> HOWEVER.
>
> I suspect you already see where this is going..
>
> We _could_ switch the "pack-objects" window handling around, and instead
> of looking at the object we want to pack, and looking at the ten (or
> "window") previous objects to delta against, we could do it the other way
> around: keep the object we delta against constant, and see what deltas we
> could prepare for the ten next objects.
>
> And since the source would now be constant, you'd need to do the
> "delta_prepare()" just _once_ per window, instead of every single time.

Might be worth trying.

Actually, this can be tested without even changing the window handling
just yet, since diff-delta() could return the index data instead of
freeing it, and pack-objects could store it alongside the object data it
tries to delta against.  That wouldn't be memory efficient, but at least
it would give an idea of the magnitude of the saving in CPU time.  But I
really doubt that'll save more than a few percent.

> Now, I haven't done any profiling on the diff-delta code, and maybe my
> guess that delta_prepare() is a pretty expensive part is wrong, and maybe
> it wouldn't help to switch the window probing around.  But I thought I'd
> mention it as one thing to explore..

Just to give you an idea, the bulk of my current "prepare" code looks
like this:

	/* then populate the index */
	data = buf + bufsize - 2;
	while (data > buf) {
		entry->ptr = --data;
		i = (data[0] << hshift) ^ data[1];
		i ^= (i << hshift) ^ data[2];
		entry->next = hash[i];
		hash[i] = entry++;
	}

As you can see it is pretty lightweight.  But that would probably be a
worthwhile optimization to have, even if it only saves 10% of CPU time.


Nicolas
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html