Hi, On Wed, 17 Nov 2021, Phillip Wood wrote: > On 17/11/2021 15:55, Derrick Stolee wrote: > > > > On 11/17/2021 6:20 AM, Phillip Wood via GitGitGadget wrote: > > > From: Phillip Wood <phillip.wood@xxxxxxxxxxxxx> > > > > > > Histogram is the only diff algorithm not to call > > > xdl_classify_record(). xdl_classify_record() ensures that the hash > > > values of two strings that are not equal differ which means that it is > > > not necessary to use xdl_recmatch() when comparing lines, all that is > > > necessary is to compare the hash values. This gives a 7% reduction in > > > the runtime of "git log --patch" when using the histogram diff > > > algorithm. > > > > > > Test HEAD^ HEAD > > > ----------------------------------------------------------------------------- > > > 4000.1: log -3000 (baseline) 0.18(0.14+0.04) 0.19(0.17+0.02) > > > +5.6% > > > 4000.2: log --raw -3000 (tree-only) 0.99(0.77+0.21) 0.98(0.78+0.20) > > > -1.0% > > > 4000.3: log -p -3000 (Myers) 4.84(4.31+0.51) 4.81(4.15+0.64) > > > -0.6% > > > 4000.4: log -p -3000 --histogram 6.34(5.86+0.46) 5.87(5.19+0.66) > > > -7.4% > > > 4000.5: log -p -3000 --patience 5.39(4.60+0.76) 5.35(4.60+0.73) > > > -0.7% > > > > > > Signed-off-by: Phillip Wood <phillip.wood@xxxxxxxxxxxxx> > > > --- > > > xdiff/xhistogram.c | 5 ++--- > > > xdiff/xprepare.c | 24 ++++++++---------------- > > > 2 files changed, 10 insertions(+), 19 deletions(-) > > > > > > diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c > > > index e694bfd9e31..6c1c88a69a1 100644 > > > --- a/xdiff/xhistogram.c > > > +++ b/xdiff/xhistogram.c > > > @@ -91,9 +91,8 @@ struct region { > > > static int cmp_recs(xpparam_t const *xpp, > > > xrecord_t *r1, xrecord_t *r2) > > > { > > > - return r1->ha == r2->ha && > > > - xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size, > > > - xpp->flags); > > > + return r1->ha == r2->ha; > > > + > > > > nit: stray newline. > > > > The only meaningful change here is that you are relying entirely on > > the hash and not checking the content again. This means that hash > > collisions on this 32-bit hash could start introducing different > > results. Are we worried about that? > > xdiff-interface.c limits the size of the file that can be passed to just below > 1GB so we are safe. The other diff algorithms are already using this > optimization. (the hash is 64 bits on most platforms, the xdiff code could > really benefit from a unsigned long -> size_t cleanup) I think the really important thing to point out is that `xdl_classify_record()` ensures that the `ha` attribute is different for different text. AFAIR it even "linearizes" the `ha` values, i.e. they won't be all over the place but start at 0 (or 1). So no, I'm not worried about collisions. That would be a bug in `xdl_classify_record()` and I think we would have caught this bug by now. Ciao, Dscho