Re: [PATCH 1/3] diff histogram: intern strings

Jeff King <peff@xxxxxxxx> · Fri, 19 Nov 2021 09:45:26 -0500

On Fri, Nov 19, 2021 at 10:05:32AM +0000, Phillip Wood wrote:

> On 18/11/2021 15:42, Jeff King wrote:
> > On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote:
> > 
> > > I think the really important thing to point out is that
> > > `xdl_classify_record()` ensures that the `ha` attribute is different for
> > > different text. AFAIR it even "linearizes" the `ha` values, i.e. they
> > > won't be all over the place but start at 0 (or 1).
> > > 
> > > So no, I'm not worried about collisions. That would be a bug in
> > > `xdl_classify_record()` and I think we would have caught this bug by now.
> > 
> > Ah, thanks for that explanation. That addresses my collision concern from
> > earlier in the thread completely.
> 
> Yes, thanks for clarifying. I should have been clearer in my reply to Stolee.
> The reason I was waffling on about file sizes is that there can only be a
> collision if there are more than 2^32 unique lines. I think the minimum file
> size where that happens is just below 10GB when one side of the diff has
> 2^31 lines and the other has 2^31 + 1 lines and all the lines are unique.

Right, that makes more sense (and we are not likely to lift the 1GB
limit anytime soon; there are tons of 32-bit variables and potential
integer overflows all through the xdiff code).
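
For anyone who wants to check the arithmetic, here is a
back-of-the-envelope sketch in Python. It assumes every line ends in a
newline and may contain any of the other 255 byte values, and it always
uses the shortest lines still available. `min_total_bytes` is a made-up
helper for illustration, not anything in xdiff.

```python
def min_total_bytes(unique_lines):
    """Smallest total size, in bytes, of `unique_lines` distinct lines.

    Assumption: each line is k content bytes (anything but newline)
    followed by a newline, so there are 255**k distinct lines with k
    content bytes. The shortest lines are used up first.
    """
    total = 0
    remaining = unique_lines
    k = 0
    while remaining > 0:
        count = min(remaining, 255 ** k)
        total += count * (k + 1)  # k content bytes plus the newline
        remaining -= count
        k += 1
    return total


# More than 2^32 unique lines are needed before two of them must
# share a 32-bit value.
needed = 2**32 + 1
total = min_total_bytes(needed)
print(f"{total} bytes across both sides, {total / 2 / 1e9:.1f}e9 per side")
```

Under those assumptions the per-side size lands in the same ballpark as
the ~10GB figure above, i.e. an order of magnitude past the 1GB limit.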

It's probably worth explaining this a bit in the commit message.
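
Something like the following sketch might help frame the explanation.
It is Python, not the real C, and `classify_records` is a hypothetical
name; it only illustrates the property Dscho describes (distinct text
gets distinct, densely numbered values starting at 0) using exact byte
comparison, and ignores things like whitespace-ignoring flags.

```python
def classify_records(lines_a, lines_b):
    """Give every distinct line a small integer ID, shared by both sides."""
    ids = {}  # line content -> ID, handed out in order of first appearance

    def classify(lines):
        # setdefault() returns the existing ID, or stores and returns
        # the next unused one (the current number of known lines).
        return [ids.setdefault(line, len(ids)) for line in lines]

    return classify(lines_a), classify(lines_b)


a = [b"int main(void)\n", b"{\n", b"\treturn 0;\n", b"}\n"]
b = [b"int main(void)\n", b"{\n", b"\treturn 1;\n", b"}\n"]
ha_a, ha_b = classify_records(a, b)

# Two records match exactly when their IDs are equal, so no further
# content comparison is needed at this point.
same = [x == y for x, y in zip(ha_a, ha_b)]
```

With IDs like these, comparing two records reduces to comparing two
integers, which is the point the commit message should get across.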

I also, FWIW, found the subject confusing. I expected "intern" to refer
to keeping a single copy of some strings. Maybe:

  Subject: diff histogram: skip xdl_recmatch for comparing records

or something?

-Peff