Re: [PATCH 1/3] diff histogram: intern strings

On Fri, Nov 19 2021, Jeff King wrote:

> On Fri, Nov 19, 2021 at 10:05:32AM +0000, Phillip Wood wrote:
>
>> On 18/11/2021 15:42, Jeff King wrote:
>> > On Thu, Nov 18, 2021 at 04:35:48PM +0100, Johannes Schindelin wrote:
>> > 
>> > > I think the really important thing to point out is that
>> > > `xdl_classify_record()` ensures that the `ha` attribute is different for
>> > > different text. AFAIR it even "linearizes" the `ha` values, i.e. they
>> > > won't be all over the place but start at 0 (or 1).
>> > > 
>> > > So no, I'm not worried about collisions. That would be a bug in
>> > > `xdl_classify_record()` and I think we would have caught this bug by now.
>> > 
>> > Ah, thanks for that explanation. That addresses my collision concern from
>> > earlier in the thread completely.
>> 
>> Yes, thanks for clarifying I should have been clearer in my reply to Stolee.
>> The reason I was waffling on about file sizes is that there can only be a
>> collision if there are more than 2^32 unique lines. I think the minimum file
>> size where that happens is just below 10GB when one side of the diff has
>> 2^31 lines and the other has 2^31 + 1 lines and all the lines are unique.
>
> Right, that makes more sense (and we are not likely to lift the 1GB
> limit anytime soon; there are tons of 32-bit variables and potential
> integer overflows all through the xdiff code).
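
For what it's worth, here is a back-of-the-envelope check of that size estimate. It is a sketch under my own assumptions, not anything xdiff does: every line ends in a newline, a line may contain any of the other 255 byte values, and we pick the shortest possible distinct lines first. The function name is made up for illustration.

```c
#include <stdint.h>

/*
 * Smallest total size, in bytes, of n distinct lines, assuming each line
 * is terminated by '\n' and may contain any of the other 255 byte values.
 * Greedy: use up all lines with 0 content bytes, then 1, then 2, ...
 * (Only meant for n around 2^32; "avail" would wrap for far larger n.)
 */
static inline uint64_t min_bytes_for_unique_lines(uint64_t n)
{
	uint64_t total = 0;
	uint64_t avail = 1;	/* distinct lines with "len" content bytes: 255^len */
	uint64_t len = 0;

	while (n > 0) {
		uint64_t take = n < avail ? n : avail;

		total += take * (len + 1);	/* +1 for the newline */
		n -= take;
		len++;
		avail *= 255;
	}
	return total;
}
```

Called with 2^32 + 1 lines this comes out at about 21.5GB in total across both sides, so roughly 10GB per side, which is in the same ballpark as the figure quoted above.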

Interestingly:
    
    $ du -sh 8gb*
    8.1G    8gb
    8.1G    8gb.cp
    $ ~/g/git/git -P -c core.bigFileThreshold=10g diff -U0 --no-index --no-color-moved 8gb 8gb.cp
    diff --git a/8gb b/8gb.cp
    index a886cdfe5ce..4965a132d44 100644
    --- a/8gb
    +++ b/8gb.cp
    @@ -17,0 +18 @@ more
    +blah

And the only change I made was:
    
    diff --git a/xdiff-interface.c b/xdiff-interface.c
    index 75b32aef51d..cb8ca5f5d0a 100644
    --- a/xdiff-interface.c
    +++ b/xdiff-interface.c
    @@ -117,9 +117,6 @@ int xdi_diff(mmfile_t *mf1, mmfile_t *mf2, xpparam_t const *xpp, xdemitconf_t co
            mmfile_t a = *mf1;
            mmfile_t b = *mf2;
     
    -       if (mf1->size > MAX_XDIFF_SIZE || mf2->size > MAX_XDIFF_SIZE)
    -               return -1;
    -
            if (!xecfg->ctxlen && !(xecfg->flags & XDL_EMIT_FUNCCONTEXT))
                    trim_common_tail(&a, &b);

Perhaps we're being overly conservative with these hardcoded limits, at
least on some platforms? This is Linux x86_64.

I understand from skimming the above that the limit is about the
pathological case; these two files are the same except for a trailer at
the end.

I wonder how far you could get with #define int size_t & the like ... :)
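
To make the overflow concern concrete, here is a small sketch (mine, not xdiff code) of why a size like the 8.1G above cannot live in a 32-bit signed field. SKETCH_MAX_XDIFF_SIZE mirrors what I believe MAX_XDIFF_SIZE is defined as in xdiff-interface.h (1023MB); treat that value and both helper names as assumptions.

```c
#include <stdint.h>

/* Assumed to mirror git's MAX_XDIFF_SIZE (just under 1GB); check xdiff-interface.h. */
#define SKETCH_MAX_XDIFF_SIZE (1024ULL * 1024 * 1023)

/* Hypothetical helper: does a byte count fit in a 32-bit signed field? */
static inline int fits_in_int32(uint64_t nbytes)
{
	return nbytes <= (uint64_t)INT32_MAX;
}

/* Hypothetical helper: what survives of a byte count squeezed into 32 bits? */
static inline uint32_t truncate_to_32(uint64_t nbytes)
{
	return (uint32_t)(nbytes & 0xffffffffu);
}
```

Anything up to SKETCH_MAX_XDIFF_SIZE passes fits_in_int32(), whereas an 8GB size (8ULL << 30) fails it and truncate_to_32() reduces it to 0. That kind of silent wraparound is presumably what the size check is guarding against, so removing it is only safe on inputs that never push an intermediate value past 32 bits.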


