On Fri, 24 Feb 2006, Nicolas Pitre wrote:
>
> Well, that's the compromise to make. By default the version with
> adler32 used 16 byte blocks to index the reference buffer. That means
> you can match target data against the reference only if whole 16 byte
> blocks match. Then, if you fix a typo in the target buffer then you'll
> inevitably need 16 literal bytes in the delta instead of only one,
> because you won't be able to resynchronize with the reference buffer
> until the next 16 byte block.

That shouldn't be true. Once you find a 16-byte match, you can search
forward (in theory you should be able to search backwards too, but by
that time we've already expanded the non-matching previous bytes, I
think). But I'm no xdelta expert, so I might be wrong. (Rough sketch of
what I mean below.)

> What I've made in my last delta patch is to reduce that 16 byte block
> to only 3 bytes. Why 3 bytes? Because less than that produces smaller
> delta data if done with literal bytes directly, and 3 bytes provided
> enough bits to hash.

On the other hand, the cost is that your lookups _are_ going to be more
expensive. Regardless of how good the hash is, you basically have 16/3
times as many hash entries to look up, so you've made compression more
expensive in footprint, at least (I assume you've made the hash
appropriately larger).

Also, at 3 bytes, insertion is at least equally dense (three bytes of
data vs three bytes of offset into the source), and can be worse (the
offset might be 5 bytes, no?). So it would seem like you'd be better
off with 4+ bytes, at which point the delta should be a win.

Have you tried some half-way point, like ~8 bytes?

> Now the problem comes when indexing a reference file full of:
>
>	0x46f8, 0x000b, 0x42fe, 0x0000, 0xffc0, 0x0001, 0xff00, 0x0008, ...
>
> There is a bunch of ", 0x" that get hashed to the same thing.

You'll find a lot of that in any file: three or four bytes of
similarity just doesn't sound worthwhile to go digging after.

> The adler32 made that particular example a non issue since the
> likelihood of many 16 byte blocks being the same is pretty low in this
> case. But the flaw remains if there are lots of similar 16 byte
> blocks, like a binary file with lots of zeroes for example. In fact,
> the performance problem Carl is having does use the diff-delta
> version still using adler32.

Agreed. I think limiting the hash chain length is a fine idea
regardless, I just think it sounds dangerous with the three-byte thing
where a lot of matches should be expected (never mind ", 0x", just
things like newlines and tabs in source code).

Only considering 16-byte sequences of similarities is a trade-off of
packing cost vs win.. Not saying other block sizes aren't worth
testing, but I suspect trying too hard is going to be too costly.
Especially as deltas compress _worse_ the smaller they are. Bigger
"insert" chunks probably compress a lot better than a copy chunk.

Have you looked at the delta size vs compression?

		Linus
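
To illustrate the resynchronization point, something like this is what
I'm thinking of. Totally untested sketch, and none of these names are
from the actual diff-delta code:

/*
 * Once a hashed 16-byte block matches, keep comparing bytes past the
 * block, so a match isn't limited to whole block boundaries.
 */
#include <stddef.h>

static size_t extend_match(const unsigned char *ref, size_t ref_size,
			   size_t ref_pos,
			   const unsigned char *tgt, size_t tgt_size,
			   size_t tgt_pos, size_t block_len)
{
	size_t len = block_len;	/* the block the hash already matched */

	/* extend the match forward, byte by byte */
	while (ref_pos + len < ref_size && tgt_pos + len < tgt_size &&
	       ref[ref_pos + len] == tgt[tgt_pos + len])
		len++;

	return len;
}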
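
And this is the kind of indexing a 3-byte block implies - again just a
sketch with a made-up hash function, to show where the 16/3 blowup in
index entries comes from:

#include <stdlib.h>

struct index_entry {
	const unsigned char *ptr;	/* position in the reference */
	struct index_entry *next;	/* hash chain */
};

/* 3 bytes give 24 input bits; fold them down to the table size */
static unsigned int hash3(const unsigned char *p, unsigned int mask)
{
	unsigned int v = (p[0] << 16) | (p[1] << 8) | p[2];
	return (v * 2654435761u >> 8) & mask;
}

/* one entry per byte offset: ~16/3 times as many as 16-byte blocks */
static void index_ref(const unsigned char *ref, size_t size,
		      struct index_entry *entries,
		      struct index_entry **table, unsigned int mask)
{
	size_t i;

	for (i = 0; i + 3 <= size; i++) {
		unsigned int h = hash3(ref + i, mask);

		entries[i].ptr = ref + i;
		entries[i].next = table[h];
		table[h] = &entries[i];
	}
}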
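
The insert-vs-copy density argument in numbers, with a cost model
loosely shaped like the delta encoding (one opcode byte, then only the
non-zero offset/size bytes for a copy; one length byte plus the
literals for an insert). The exact counts depend on the real encoding,
so this is back-of-the-envelope only:

static unsigned int copy_cost(unsigned int offset, unsigned int size)
{
	unsigned int cost = 1;				/* opcode byte */

	while (offset) { cost++; offset >>= 8; }	/* offset bytes */
	while (size)   { cost++; size >>= 8; }		/* size bytes */
	return cost;
}

static unsigned int insert_cost(unsigned int nbytes)
{
	return 1 + nbytes;		/* length byte + the literals */
}

A 3-byte copy from a megabyte into the reference costs
copy_cost(0x100000, 3) = 5 bytes vs insert_cost(3) = 4 for just
emitting the literals, while at 4 bytes a copy breaks even or wins for
small offsets. That's the "better off with 4+ bytes" point above.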
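
And limiting the hash chain length could be as simple as capping the
chains while indexing (this reuses the index_entry struct from the
sketch above; the cutoff is a number I pulled out of thin air):

#define MAX_CHAIN 64	/* made-up cutoff */

/*
 * Drop the tail of an over-long bucket so degenerate input (", 0x",
 * long runs of zeroes) can't blow up the time spent walking chains.
 */
static void trim_chain(struct index_entry **bucket)
{
	struct index_entry **p = bucket;
	unsigned int n = 0;

	while (*p) {
		if (++n > MAX_CHAIN) {
			*p = NULL;	/* truncate the chain here */
			break;
		}
		p = &(*p)->next;
	}
}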