On Wed, Nov 17, 2021 at 10:55:02AM -0500, Derrick Stolee wrote:

> > diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
> > index e694bfd9e31..6c1c88a69a1 100644
> > --- a/xdiff/xhistogram.c
> > +++ b/xdiff/xhistogram.c
> > @@ -91,9 +91,8 @@ struct region {
> >  static int cmp_recs(xpparam_t const *xpp,
> >  		    xrecord_t *r1, xrecord_t *r2)
> >  {
> > -	return r1->ha == r2->ha &&
> > -		xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size,
> > -			xpp->flags);
> > +	return r1->ha == r2->ha;
> > +
> nit: stray newline.
>
> The only meaningful change here is that you are relying entirely on
> the hash and not checking the content again. This means that hash
> collisions on this 32-bit hash could start introducing different
> results. Are we worried about that?

I had the same thought. But running "git log --histogram -p" on git.git
does not seem to produce any differences between the two. So perhaps
collisions are removed elsewhere. It would be nice to have a better
understanding of this before proceeding with this change.

Curiously, I got a much smaller improvement in my test, which did:

  git log --no-merges -p --histogram :^po

My assumption being that "po/" diffs are big and uninteresting and so
bloat the output. But that turned out to be interesting timing-wise.
Excluding "po/" means that patch produces only a 0.6% improvement in
speed. But _just_ running the diffs for po shows a 24% speedup!

I guess this is just because those files are much larger than average
(and changed in fewer commits), meaning that the time spent hashing and
comparing lines will show up as a larger percentage of the total work.

But I wondered...

> The existence of these conditions gave me pause, so I went to look for where they
> were inserted. They were made in 9f37c27 (xdiff/xprepare: skip classification,
> 2011-07-12) with the justification that
>
> We don't need any of that in histogram diff, so we omit calls to these
> functions. We also skip allocating memory to the hash table, rhash, as
> it is no longer used.
>
> This gives us a small boost in performance.
>
> But you are actually _using_ these preparation steps, which means you are
> re-adding the cost of hashing but overall improving because you use the
> data correctly. Excellent.

Are we making a tradeoff here based on the data patterns? That is, it
seems like we are spending extra time upfront to do classification in
order to get quicker comparisons later. Presumably the upfront work is
O(n) in the length of the file. How many comparisons do we expect to
save? Is it also linear in the number of lines, or could it be super-
or sub-linear depending on the actual diff?

-Peff
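
P.S. One possible answer to my own collision question: once these
preparation steps run, the classifier in xdiff/xprepare.c should already
have done the expensive content comparison for us. If I'm reading
xdl_classify_record() correctly, it buckets each line by doing a full
xdl_recmatch() against the existing classes and then overwrites rec->ha
with the class index, roughly like this (a paraphrased sketch from
memory, not the exact code):

  /* paraphrased from my reading of xdl_classify_record(); not verbatim */
  for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
          if (rcrec->ha == rec->ha &&
              xdl_recmatch(rcrec->line, rcrec->size,
                           rec->ptr, rec->size, cf->flags))
                  break;
  ...
  /* "ha" now holds a per-content class id, not the raw hash */
  rec->ha = (unsigned long) rcrec->idx;

If that is indeed what happens, then after classification "r1->ha ==
r2->ha" is an exact content-equality test rather than a probabilistic
one, which would explain why I could not find any output differences.
But somebody who knows this code better should confirm.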