On Wed, Nov 17, 2021 at 10:55:02AM -0500, Derrick Stolee wrote:

> > diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
> > index e694bfd9e31..6c1c88a69a1 100644
> > --- a/xdiff/xhistogram.c
> > +++ b/xdiff/xhistogram.c
> > @@ -91,9 +91,8 @@ struct region {
> >  static int cmp_recs(xpparam_t const *xpp,
> >  		    xrecord_t *r1, xrecord_t *r2)
> >  {
> > -	return r1->ha == r2->ha &&
> > -		xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size,
> > -			xpp->flags);
> > +	return r1->ha == r2->ha;
> > +
> nit: stray newline.
>
> The only meaningful change here is that you are relying entirely on
> the hash and not checking the content again. This means that hash
> collisions on this 32-bit hash could start introducing different
> results. Are we worried about that?

I had the same thought. But running "git log --histogram -p" on git.git
does not seem to produce any differences between the two. So perhaps
collisions are removed elsewhere. It would be nice to have a better
understanding of this before proceeding with this change.

Curiously, I got a much smaller improvement in my test, which did:

  git log --no-merges -p --histogram :^po

My assumption being that "po/" diffs are big and uninteresting and so
bloat the output. But that turned out to be interesting timing-wise.
Excluding "po/" means that patch produces only a 0.6% improvement in
speed. But _just_ running the diffs for po shows a 24% speedup!

I guess this is just because those files are much larger than average
(and changed in fewer commits), meaning that the time spent hashing and
comparing lines will show up as a larger percentage of the total work.

But I wondered...

> The existence of these conditions gave me pause, so I went to look for where they
> were inserted. They were made in 9f37c27 (xdiff/xprepare: skip classification,
> 2011-07-12) with the justification that
>
> We don't need any of that in histogram diff, so we omit calls to these
> functions. We also skip allocating memory to the hash table, rhash, as
> it is no longer used.
>
> This gives us a small boost in performance.
>
> But you are actually _using_ these preparation steps, which means you are
> re-adding the cost of hashing but overall improving because you use the
> data correctly. Excellent.

Are we making a tradeoff here based on the data patterns? That is, it
seems like we are spending extra time upfront to do classification in
order to get quicker comparisons later. Presumably the upfront work is
O(n) in the length of the file. How many comparisons do we expect to
save? Is it also linear in the number of lines, or could it be super-
or sub-linear depending on the actual diff?

-Peff
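
P.S. One possible answer to my own collision question: once these
preparation steps run, the classifier in xdiff/xprepare.c should already
have done the expensive content comparison for us. If I'm reading
xdl_classify_record() correctly, it buckets each line by doing a full
xdl_recmatch() against the existing classes and then overwrites rec->ha
with the class index, roughly like this (a paraphrased sketch from
memory, not the exact code):

  /* paraphrased from my reading of xdl_classify_record(); not verbatim */
  for (rcrec = cf->rchash[hi]; rcrec; rcrec = rcrec->next)
          if (rcrec->ha == rec->ha &&
              xdl_recmatch(rcrec->line, rcrec->size,
                           rec->ptr, rec->size, cf->flags))
                  break;
  ...
  /* "ha" now holds a per-content class id, not the raw hash */
  rec->ha = (unsigned long) rcrec->idx;

If that is indeed what happens, then after classification "r1->ha ==
r2->ha" is an exact content-equality test rather than a probabilistic
one, which would explain why I could not find any output differences.
But somebody who knows this code better should confirm.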