Re: [PATCH 1/3] diff histogram: intern strings

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Thu, 18 Nov 2021 16:35:48 +0100 (CET)

Hi,

On Wed, 17 Nov 2021, Phillip Wood wrote:

> On 17/11/2021 15:55, Derrick Stolee wrote:
> >
> > On 11/17/2021 6:20 AM, Phillip Wood via GitGitGadget wrote:
> > > From: Phillip Wood <phillip.wood@xxxxxxxxxxxxx>
> > >
> > > Histogram is the only diff algorithm not to call
> > > xdl_classify_record(). xdl_classify_record() ensures that the hash
> > > values of two strings that are not equal differ which means that it is
> > > not necessary to use xdl_recmatch() when comparing lines, all that is
> > > necessary is to compare the hash values. This gives a 7% reduction in
> > > the runtime of "git log --patch" when using the histogram diff
> > > algorithm.
> > >
> > > Test                                  HEAD^             HEAD
> > > -----------------------------------------------------------------------------
> > > 4000.1: log -3000 (baseline)          0.18(0.14+0.04)   0.19(0.17+0.02)
> > > +5.6%
> > > 4000.2: log --raw -3000 (tree-only)   0.99(0.77+0.21)   0.98(0.78+0.20)
> > > -1.0%
> > > 4000.3: log -p -3000 (Myers)          4.84(4.31+0.51)   4.81(4.15+0.64)
> > > -0.6%
> > > 4000.4: log -p -3000 --histogram      6.34(5.86+0.46)   5.87(5.19+0.66)
> > > -7.4%
> > > 4000.5: log -p -3000 --patience       5.39(4.60+0.76)   5.35(4.60+0.73)
> > > -0.7%
> > >
> > > Signed-off-by: Phillip Wood <phillip.wood@xxxxxxxxxxxxx>
> > > ---
> > >   xdiff/xhistogram.c |  5 ++---
> > >   xdiff/xprepare.c   | 24 ++++++++----------------
> > >   2 files changed, 10 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/xdiff/xhistogram.c b/xdiff/xhistogram.c
> > > index e694bfd9e31..6c1c88a69a1 100644
> > > --- a/xdiff/xhistogram.c
> > > +++ b/xdiff/xhistogram.c
> > > @@ -91,9 +91,8 @@ struct region {
> > >   static int cmp_recs(xpparam_t const *xpp,
> > >   	xrecord_t *r1, xrecord_t *r2)
> > >   {
> > > -	return r1->ha == r2->ha &&
> > > -		xdl_recmatch(r1->ptr, r1->size, r2->ptr, r2->size,
> > > -			    xpp->flags);
> > > +	return r1->ha == r2->ha;
> > > +
> >
> > nit: stray newline.
> >
> > The only meaningful change here is that you are relying entirely on
> > the hash and not checking the content again. This means that hash
> > collisions on this 32-bit hash could start introducing different
> > results. Are we worried about that?
>
> xdiff-interface.c limits the size of the file that can be passed to just below
> 1GB so we are safe. The other diff algorithms are already using this
> optimization. (the hash is 64 bits on most platforms, the xdiff code could
> really benefit from a unsigned long -> size_t cleanup)

I think the really important thing to point out is that
`xdl_classify_record()` ensures that the `ha` attribute is different for
different text. AFAIR it even "linearizes" the `ha` values, i.e. they
won't be all over the place but start at 0 (or 1).

So no, I'm not worried about collisions. That would be a bug in
`xdl_classify_record()` and I think we would have caught this bug by now.

Ciao,
Dscho