On 14/08/17 18:08, Junio C Hamano wrote:
> Junio C Hamano <gitster@xxxxxxxxx> writes:
>
>> One interesting question is which of these two types we should use
>> for the size of objects Git uses.
>>
>> Most of the "interesting" operations done by Git require that the
>> thing is in core as a whole before we can do anything (e.g. compare
>> two such things to produce a delta, or have one in core and apply a
>> patch), so it is tempting to deal in size_t; but at the lowest
>> level of an SCM, i.e. recording the state of a file at each
>> version, we actually should be able to exceed the in-core
>> limit---both "git add" of a huge file whose contents would not fit
>> in core and "git checkout" of a huge blob whose inflated contents
>> would not fit in core should (in theory, modulo bugs) be able to
>> exercise the streaming interface to handle such a case without
>> holding everything in core at once. So from that point of view,
>> even size_t may not be the "correct" type to use.
>
> A few additions to the above observations.
>
>  - We have a varint that encodes how far a delta representation of
>    an object is from its base object in the packfile. Both the
>    encoding and decoding sides in the current code use off_t to
>    represent this offset, so we can already reference a base object
>    that is far away in the same packfile.
>
>  - I think it is OK in practice to limit the size of individual
>    objects to size_t (i.e. on a 32-bit arch, you cannot interact
>    with a repository containing an object whose size exceeds 4GB).
>    Using off_t would allow occasional ultra-huge objects that can
>    only be added and checked out via the streaming API on such a
>    platform, but I suspect that it may become too much of a hassle
>    to maintain.

In a previous comment, I said that (on 32-bit Linux) it was likely that an object of >4GB could not be handled correctly anyway (more likely >2GB). This was based on the code from (quite some) years ago.
In particular, it predates your addition of the "streaming API". So, maybe a 32-bit arch _should_ be able to handle objects as large as the LFS API allows. (Ignoring, for the moment, that I think anybody who puts files of that size into an SCM probably gets what they deserve. :-P )

The two patches I commented on, however, changed the type of some variables from off_t to size_t. In general, the patches did not seem to make anything worse, but these particular type changes could potentially do harm; hence my comment.

(I still haven't tried the patches on my 32-bit Linux system. I only boot it up about once a week, and I would rather wait until the patches are in the 'pu' branch before testing.)

ATB,
Ramsay Jones