> On 05/05/2022 22:04, René Scharfe wrote:
>> Am 04.05.22 um 19:47 schrieb Jason Hatton:
>>>>> The condition sd_size==0 is used as a signal for "no, we really need
>>>>> to compare the contents", and causes the contents to be hashed, and
>>>>> if the contents match the object name recorded in the index, the
>>>>> on-disk size is stored in sd_size and the entry is marked as
>>>>> CE_UPTODATE. Alas, if the truncated st_size is 0, the resulting
>>>>> entry would have sd_size==0 again, so a workaround like what you
>>>>> outlined is needed.
>>>> Junio C Hamano <gitster@xxxxxxxxx> writes:
>>>>
>>>> This is of secondary importance, but the fact that Jason observed
>>>> 8 GiB files getting hashed over and over unnecessarily means that we
>>>> would do the same for an empty file, opening, reading 0 bytes,
>>>> hashing, and closing, without taking advantage of the fact that the
>>>> CE_UPTODATE bit says the file contents should be up to date with
>>>> respect to the cached object name, doesn't it?
>>>>
>>>> Or do we have "if st_size == 0 and sd_size == 0 then we know what it
>>>> hashes to (i.e. EMPTY_BLOB_SHA*) and there is no need to do the
>>>> usual open-read-hash-close dance" logic (I didn't check)?
>>> Junio C Hamano
>>>
>>> As best as I can tell, it rechecks the zero-sized files. My Linux box
>>> can run "git ls" in .006 seconds with 1000 zero-sized files in the
>>> repo. Rehashing every file that is a multiple of 2^32 with every
>>> "git ls", on the other hand...
>>>
>>> I managed to actually compile git with the proposed changes.
>> Meaning that file sizes of n * 2^32 bytes get recorded as 1 byte
>> instead of 0 bytes? Why 1 and not e.g. 2^32-1 or 2^31 (or 42)?
>
> My thought on this, after considering a few options, would be that the
> 'sign bit' of the uint32_t size should be set to 1 when the high word
> of the 64-bit file size value is non-zero.
>
> This would result in file sizes of 0 to 4GiB-1 retaining their
> existing values, and those from 4GiB onward producing down-folded
> values in the 2GiB to 4GiB-1 range.

I believe it would be best to only change the behavior of files whose
sizes are exact multiples of 2^32. Changing the behavior of all files
larger than 4 GiB may not be good. I like the idea of using 0x80000000
instead of 1.

> This would mean that we are able to detect almost all incremental and
> decremental changes in file sizes, as well as retaining the 'zero is
> racy' flag aspect.

>>> It seems to correct the problem and "make test" passes. If upgrading
>>> to the patched version of git, git will rehash the 8 GiB files once
>>> and then work normally. If downgrading to an unpatched version, git
>>> will perceive that the 8 GiB files have changed. This needs to be
>>> corrected with "git add" or "git checkout".
>> Not nice, but safe. Can there be an unsafe scenario as well? Like if a
>> 4GiB file gets added to the index by the new version, which records a
>> size of 1, then the file is extended by one byte while mtime stays the
>> same and then an old git won't detect the change?
>
> There is still some potential for different Git versions to be
> 'confused' by these very large files, but I feel that it's relatively
> safe (no worse than the 'set to unity' idea). For large files we will
> always have that loss of precision at the 32-bit rollover. It's just a
> case of choosing the least worst.
>
> I haven't considered whether my proposed 'truncation' overhead would
> be fast code.

>>> If you people are interested, I may be able to find a way to send a
>>> patch to the list or put it on github.
>> Patches are always welcome, they make discussions and testing easier.
>>
>> René
> Philip

I have a patch file, but I'm not sure how to actually submit it. I'm
going to attempt using Outlook.

Jason
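For concreteness, the two foldings discussed above could be sketched
roughly like this (hypothetical helper names, not the actual patch):

```c
#include <stdint.h>
#include <assert.h>

/*
 * Hypothetical sketches, not the real git code: two ways to fold a
 * 64-bit on-disk size into the 32-bit sd_size field while keeping
 * sd_size == 0 reserved as the "racy, recheck contents" marker.
 */

/*
 * Variant 1: remap only exact multiples of 2^32, which would
 * otherwise truncate to 0 and be mistaken for an empty file.
 * All other sizes keep their truncated low 32 bits.
 */
static uint32_t fold_st_size(uint64_t st_size)
{
	uint32_t sd_size = (uint32_t)st_size;

	if (!sd_size && st_size)
		sd_size = 0x80000000u;
	return sd_size;
}

/*
 * Variant 2 (the 'sign bit' idea): set the top bit whenever the
 * high word is non-zero, so sizes below 4GiB keep their exact
 * values and everything from 4GiB up folds into 2GiB..4GiB-1.
 */
static uint32_t fold_high_word(uint64_t st_size)
{
	uint32_t sd_size = (uint32_t)st_size;

	if (st_size >> 32)
		sd_size |= 0x80000000u;
	return sd_size;
}
```

Either way a non-empty file can never produce sd_size == 0, at the cost
of some aliasing among very large sizes at the 32-bit rollover.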