Re: RFC: New diff-delta.c implementation

Geert Bosch <bosch@xxxxxxxxxxx> · Mon, 24 Apr 2006 11:19:01 -0400

On Mon, Apr 24, 2006 at 01:27:07AM -0400, Nicolas Pitre wrote:
> I took the liberty of reworking the coding style, as well as simplifying 
> some constructs a bit.
> 
> I also fixed all the format encoding and size computations bugs the 
> original code has so it works perfectly now.
Thanks for fixing that! I'll have a more detailed look at your
changes later, as not all changes are immediately obvious since the
formatting is different.
> Note that the GIT encoding is itself different (a bit denser) from the 
> xdiff encoding.
OK, I mostly looked a patch-delta.c to reverse engineer the format,
and mainly looked at libxdiff in the hope to find some format description
there.
> Please just completely drop that word fetch "optimization".  Not only 
> does it produce worse assembly code in the end due to the extra loop 
> counter and extra shifting in turn increasing register pressure, but it 
> also assumes that the target data buffer is always word aligned which is 
> not guaranteed (casting a char pointer to an unsigned and dereferencing 
> it is not portable since on some architecture such misaligned accesses 
> are not allowed).  And it even has worse performance.  For example, with 
> word fetch and shift code, packing the Linux kernel archive on my P4 at 
> 3GHz:

> user    6m33.797s
[vs.]
> user    6m13.911s
> So you must be extremely careful when trying to implement such kind of 
> optimizations.

As a compiler writer, I am more than aware of this. 
Your change results in a 70% slowdown on PowerPC. On many targets
byte-accesses are relatively slow. Indeed, I had expect this to be
mostly neutral on x86, and the 5% slowdown you saw is not unexpected.
Note that in this case aligned accesses are guaranteed, because the
index_step is guaranteed to be a multiple of the word size.

> But here comes the sad part.  Even after simplifying the code as much as 
> I could, performance is still significantly worse than the current 
> diff-delta.c code.  Repacking again the same Linux kernel repository 
> with the current code:
That's unexpected, but I can see how this could be if most files have
very few differences and are relatively small. For such cases, almost
any hash will do, and the more complicated hashing will be more compute
intensive.

I have benchmarked my original diff code on a set of large files with
lots of changes. These are hardest to get right, and hardest to get
good performance with. Just try diffing any two large (uncompressed)
tar files, and you'll see. On many of such large files, the new code
is orders of magnitude faster. On these cases, the resulting deltas
are also much smaller.

The comparison is a bit between a O(n^2) sort that is fast on small
or mostly sorted inputs (but horrible on large ones) and a more
complex O(nlogn) algorithm that is a bit slower for the simple
cases, but far faster for more complex cases.

> In the mean time I'll try replacing the adler32 hashing with your Rabin 
> polynomial hashing into the current code just to see if it helps (and I 
> think it will help quite a bit).  The best solution might be a mix of 
> both approaches.
I think the most interesting part of my code is in process_copies.
This optimization results in elimination of about half the commands
in large test cases, while taking virtually no time at all.
Also, I believe that any form of hash chaining of secondary hashing
is useless. I have done many benchmarks, and in all cases a slight
increase of hash table size was both more effective for pack size and
faster due to less cache misses.
> Of course there is the issue of reversing the object matching window 
> logic in pack-object.c to further improve things, but I consider that an 
> orthogonal improvement at this point which both approaches will benefit 
> from anyway.

Right, this will be the most important one for performance for now.

Thanks again for taking the time to look at my code and fixing the
errors in handling the git file format.

  -Geert

PS. Somehow your code had "double line spacing" :)
-
: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html