Junio C Hamano <gitster@xxxxxxxxx> writes:

> One interesting question is which of these two types we should use
> for the size of objects Git uses.
>
> Most of the "interesting" operations done by Git require that the
> thing is in core as a whole before we can do anything (e.g. compare
> two such things to produce delta, have one in core and apply patch),
> so it is tempting that we deal with size_t, but at the lowest level
> to serve as a SCM, i.e. recording the state of a file at each
> version, we actually should be able to exceed the in-core
> limit---both "git add" of a huge file whose contents would not fit
> in-core and "git checkout" of a huge blob whose inflated contents
> would not fit in-core should (in theory, modulo bugs) be able to
> exercise the streaming interface to handle such case without holding
> everything in-core at once.  So from that point of view, even size_t
> may not be the "correct" type to use.

A few additions to the above observations.

 - We have a varint that encodes how far away, within the same
   packfile, the base object of a delta representation lies.  Both
   the encoding and the decoding sides in the current code use off_t
   to represent this offset, so we can already reference a base
   object that is far away in the same packfile (a rough decoding
   sketch is appended at the end of this message).

 - I think it is OK in practice to limit the size of individual
   objects to size_t (i.e. on a 32-bit arch, you cannot interact with
   a repository that has an object whose size exceeds 4GB).  Using
   off_t would allow occasional ultra-huge objects that can only be
   added and checked out via the streaming API on such a platform,
   but I suspect that it may become too much of a hassle to maintain.
   It may help reduce the maintenance burden if we introduced
   obj_size_t that is defined to be size_t for now, so that we can
   later swap it to ofs_t or some larger type when we know we do need
   to support objects whose size cannot be expressed in size_t (a
   sketch of what that could look like is also appended), but I do
   not offhand know what the pros-and-cons of such an approach would
   look like.

Thanks.
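
To make the size_t vs. off_t distinction concrete, here is a tiny
stand-alone C program (an illustration only, not Git code).  On a
typical 32-bit build with large-file support (_FILE_OFFSET_BITS=64)
it reports size_t as 4 bytes but off_t as 8 bytes, i.e. off_t can
describe data far larger than anything we could hold in core at once:

    #include <stdio.h>
    #include <stddef.h>
    #include <sys/types.h>

    int main(void)
    {
            /* size_t bounds what we can hold in core; off_t bounds
             * what we can seek to in a file such as a packfile. */
            printf("size_t: %zu bytes, off_t: %zu bytes\n",
                   sizeof(size_t), sizeof(off_t));
            return 0;
    }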
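
As for the first bullet, the base offset of an OFS_DELTA entry is
stored as a variable-length quantity: 7 bits per byte, the high bit
as a continuation flag, and an implicit +1 per continuation byte so
that longer encodings never overlap shorter ones.  The sketch below
mirrors what the decoder in packfile.c does, minus the overflow and
truncation checks the real code performs; the function name is made
up for illustration:

    #include <stddef.h>
    #include <sys/types.h>          /* off_t */

    /*
     * Decode the OFS_DELTA base-offset varint starting at buf.
     * Returns how far *back* from the delta's own position the base
     * object begins, and reports via *used how many bytes the
     * encoding consumed.
     */
    static off_t decode_ofs_delta_offset(const unsigned char *buf,
                                         size_t *used)
    {
            const unsigned char *p = buf;
            unsigned char c = *p++;
            off_t ofs = c & 0x7f;

            while (c & 0x80) {
                    ofs += 1;       /* implicit +1 per extra byte */
                    c = *p++;
                    ofs = (ofs << 7) + (c & 0x7f);
            }
            *used = p - buf;
            return ofs;
    }

The caller then computes base_offset = delta_offset - ofs; because
ofs is an off_t, the base can sit arbitrarily far back in the same
packfile even when size_t is only 32 bits wide.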
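
And for the obj_size_t idea, the minimal version could be nothing
more than a typedef plus the couple of helpers that depend on the
width, so that widening the type later is a change in one header.
All names below are hypothetical, not existing Git identifiers:

    #include <stddef.h>
    #include <stdint.h>             /* SIZE_MAX */

    /*
     * An object size is a size_t for now.  If we later need objects
     * whose size cannot be expressed in size_t, this typedef, the
     * maximum, and the printf format are the only things to change
     * (e.g. to uint64_t, SIZE_MAX -> UINT64_MAX, "zu" -> PRIu64).
     */
    typedef size_t obj_size_t;
    #define OBJ_SIZE_MAX    SIZE_MAX
    #define PRI_OBJ_SIZE    "zu"    /* printf("%" PRI_OBJ_SIZE, sz) */

    /* Add two object sizes, saturating at OBJ_SIZE_MAX on overflow
     * (git proper would more likely die() here instead). */
    static inline obj_size_t obj_size_add(obj_size_t a, obj_size_t b)
    {
            if (OBJ_SIZE_MAX - a < b)
                    return OBJ_SIZE_MAX;
            return a + b;
    }

Call sites would then consistently say obj_size_t (and use
PRI_OBJ_SIZE in format strings) instead of a mix of unsigned long and
size_t, and only that one header moves when the final width is
decided.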