Re: [PATCH v2 2/2] Prevent git from rehashing 4GiB files

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2023-10-12 at 17:58:42, Junio C Hamano wrote:
> "brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> writes:
> 
> > An example would be to have a 2^32 sized file in the index of
> > patched git. Patched git would save the file as 2^31 in the cache.
> > An unpatched git would very much see the file has changed in size
> > and force it to rehash the file, which is safe.
> 
> The reason why this is "safe" is because an older Git will would
> keep rehashing whether 2^31 or 0 is stored as its sd_size, so the
> change is not making things worse?  With older git, "git diff-files"
> will report that such a file is not up to date, and then the user
> will refresh the index, which will store 0 as its sd_file, so
> tentatively "git status" may give a wrong information, but we
> probalby do not care?  Is that how the reasoning goes?

Correct.  An older Git will rehash it either way, on every git status.
If you run git diff-files on an older Git, it will always show it as
modified (hence why I used that in the tests).  The git status will
rehash it and then realize that it isn't modified, not printing any
changes, so the behaviour is not wrong, it's just extremely slow (SHA-1
DC on 4 GiB of data).

If you use a new Git with an old index, git status will still rehash it
the first time and correctly determine that it isn't changed, then write
the 0x80000000 to the index.  It just won't rehash it on subsequent
calls to git status.

The only incorrect behaviour (assuming that slow is not incorrect) that
I'm aware of on older Git is that git diff-files marks it as modified
when it's not.  There is no incorrect behaviour on a fixed Git.

> > +/*
> > + * Munge st_size into an unsigned int.
> > + */
> > +static unsigned int munge_st_size(off_t st_size) {
> > +	unsigned int sd_size = st_size;
> > +
> > +	/*
> > +	 * If the file is an exact multiple of 4 GiB, modify the value so it
> > +	 * doesn't get marked as racily clean (zero).
> > +	 */
> > +	if (!sd_size && st_size)
> > +		return 0x80000000;
> > +	else
> > +		return sd_size;
> > +}
> 
> This assumes typeof(sd_size) aka "unsigned int" is always 32-bit,
> which does not sound reasonable.  Reference to 4GiB, 2^32 and 2^31
> in the code and the proposed commit log message need to be qualified
> with "on a platform whose uint is 32-bit" or something, or better
> yet, phrased in a way that is agnostic to the integer size.  At
> the very least, the hardcoded 0x80000000 needs to be rethought, I
> am afraid.

unsigned int is 32-bit on every modern 32- or 64-bit platform known to
exist today.  I don't believe we support MS-DOS or systems of its era,
nor should we support 8- or 16-bit systems.  If you'd prefer "uint32_t",
I can do that.

Note that the problem is reproducible on a stock Ubuntu amd64 system, so
it is definitely a problem on all modern systems.

> Other than that, the workaround for the racy-git avoidance code does
> sound good.  I actually wonder if we should always use 1 regardless
> of integer size.

I think using 2^31 is better because it's far away from very small
values and very large values, which means that it's a hard to modify a
file which used to have its value as a multiple of 4 GiB and
accidentally make Git think it was unchanged.  Using 1 would make a
simple "echo > foo" possibly think things were unchanged in some cases,
which we should avoid.
-- 
brian m. carlson (he/him or they/them)
Toronto, Ontario, CA

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux