On Fri, Nov 19 2021, Jeff King wrote:

> On Fri, Nov 19, 2021 at 10:05:32AM +0000, Phillip Wood wrote:
>
>> On 18/11/2021 15:42, Jeff King wrote:
>> > On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote:
>> >
>> > > I think the really important thing to point out is that
>> > > `xdl_classify_record()` ensures that the `ha` attribute is different for
>> > > different text. AFAIR it even "linearizes" the `ha` values, i.e. they
>> > > won't be all over the place but start at 0 (or 1).
>> > >
>> > > So no, I'm not worried about collisions. That would be a bug in
>> > > `xdl_classify_record()` and I think we would have caught this bug by now.
>> >
>> > Ah, thanks for that explanation. That addresses my collision concern from
>> > earlier in the thread completely.
>>
>> Yes, thanks for clarifying I should have been clearer in my reply to Stolee.
>> The reason I was waffling on about file sizes is that there can only be a
>> collision if there are more than 2^32 unique lines. I think the minimum file
>> size where that happens is just below 10GB when one side of the diff has
>> 2^31 lines and the other has 2^31 + 1 lines and all the lines are unique.
>
> Right, that makes more sense (and we are not likely to lift the 1GB
> limit anytime soon; there are tons of 32-bit variables and potential
> integer overflows all through the xdiff code).
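
As a rough sanity check of the file-size estimate quoted above, here is a
minimal sketch that counts the fewest bytes needed to hold a given number of
distinct lines. It is not xdiff code: the helper name
min_bytes_for_unique_lines() is made up for illustration, and it assumes every
line ends in a newline and may contain any of the 255 other byte values.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of xdiff): the fewest bytes needed to
 * store n distinct newline-terminated lines, assuming a line may hold
 * any of the 255 non-newline byte values. Shortest lines are used
 * first. Intended for n around 2^32; much larger n would overflow.
 */
uint64_t min_bytes_for_unique_lines(uint64_t n)
{
	uint64_t total = 0;
	uint64_t available = 1;	/* distinct lines with `len` content bytes: 255^len */
	uint64_t len = 0;

	while (n > 0) {
		uint64_t take = n < available ? n : available;

		total += take * (len + 1);	/* +1 for the trailing newline */
		n -= take;
		available *= 255;
		len++;
	}
	return total;
}
```

Calling it with (1ULL << 32) + 1 gives the combined minimum for both sides of
the diff; halving that gives a per-file figure in the same ballpark as the
estimate above, under these assumptions.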
Interestingly:

    $ du -sh 8gb*
    8.1G    8gb
    8.1G    8gb.cp
    $ ~/g/git/git -P -c core.bigFileThreshold=10g diff -U0 --no-index --no-color-moved 2gb 2gb.cp
    diff --git a/8gb b/8gb.cp
    index a886cdfe5ce..4965a132d44 100644
    --- a/8gb
    +++ b/8gb.cp
    @@ -17,0 +18 @@ more
    +blah

And the only change I made was:

    diff --git a/xdiff-interface.c b/xdiff-interface.c
    index 75b32aef51d..cb8ca5f5d0a 100644
    --- a/xdiff-interface.c
    +++ b/xdiff-interface.c
    @@ -117,9 +117,6 @@ int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t co
            mmfile_t a = *mf1;
            mmfile_t b = *mf2;

    -       if (mf1->size > MAX_XDIFF_SIZE || mf2->size > MAX_XDIFF_SIZE)
    -               return -1;
    -
            if (!xecfg->ctxlen && !(xecfg->flags & XDL_EMIT_FUNCCONTEXT))
                    trim_common_tail(&a, &b);

Perhaps we're being overly conservative with these hardcoded limits, at
least on some platforms? This is Linux x86_64.

I understand from skimming the above that it's about the pathological
case; these two files are the same except for a trailer at the end.

I wonder how far you could get with #define int size_t & the like ... :)
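
To make the "32-bit variables" concern a bit more concrete, here is a small
sketch of how one might check whether a byte count fits in the integer types
of the current platform. The helper names are made up for illustration, and
nothing here is taken from the xdiff code itself; which types xdiff actually
uses for a given size or line count would need to be checked case by case.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: can a byte count be held in a plain `int`? */
int fits_in_int(uint64_t nbytes)
{
	return nbytes <= (uint64_t)INT_MAX;
}

/*
 * Hypothetical helper: can it be held in a `long`? The answer depends
 * on the platform, since `long` is 64 bits wide on Linux x86_64 but
 * only 32 bits wide on 64-bit Windows.
 */
int fits_in_long(uint64_t nbytes)
{
	return nbytes <= (uint64_t)LONG_MAX;
}
```

A size just under 1GB passes fits_in_int(), while a file in the 8GB range
does not, so any size, offset or line count held in an `int` would overflow
once the hardcoded check is removed and the differing region is large enough
to matter.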