On Thu, Aug 10, 2017 at 01:07:07PM -0700, Junio C Hamano wrote:
> > The current delta code produces incorrect pack objects for files > 4GB.
> >
> > Signed-off-by: Martin Koegler <martin.koegler@xxxxxxxxx>
>
> I am a bit torn on this change.
>
> Given that this is not merely a local storage format but it also is
> an interchange format, we would probably want to make sure that the
> receiving end (e.g. get_delta_hdr_size() that is used at the
> beginning of patch_delta()) on a platform whose size_t is smaller
> than that of a platform that produced the delta stream with this
> code behaves "sensibly".

Overflows would already be detected during unpack:

* Assuming size_t = uint32, the system can address at most 4GB of
  process memory, so loading any source blob larger than 4GB already
  fails.

* Assuming size_t = uint32 and a source blob size < 4GB, the target
  blob size would be read back truncated. apply_delta checks that the
  generated result matches the encoded size - this check would fail.
  (A rough sketch illustrating this is appended below.)

> If we replaced ulong we use in create/patch delta codepaths with
> uint32_t, that would be safer, just because the encoder would not be
> able to emit varint that is larger than the receivers to handle.
> But that defeats the whole point of using varint() to encode the
> sizes in the first place. It was partly done for space saving, but
> more for allowing larger sizes and larger ulong in the future
> without having to change the file format.

The on-disk format is able to handle larger sizes [at the cost of
slightly worse compression]. The current implementation is just buggy.

I would not move to uint32_t. The rest of git uses "unsigned long", so
the delta code could still be called with larger files.

We will also see more RAM as well as more CPU power - reducing the
limits just because of older platforms, which can't even handle such
large blobs, is the wrong way.

Regards,
Martin
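
To make the second point above concrete, here is a minimal, self-contained
sketch of the 7-bits-per-byte size header used for delta sizes. It is not
git's actual code: encode_size(), decode_size_32() and the 5GB example value
are made up for illustration, and the decoder deliberately uses a 32-bit
accumulator to mimic a 32-bit size_t. A 64-bit size survives the encoding,
but the 32-bit decoder sees a truncated value, so a "result matches encoded
size" check can never pass for such a delta:

  /*
   * Sketch only: delta size-header varint, 7 data bits per byte,
   * MSB set means "more bytes follow", least-significant group first.
   */
  #include <stdint.h>
  #include <stdio.h>

  /* Encode `size` into `buf`, return number of bytes written. */
  static int encode_size(uint64_t size, unsigned char *buf)
  {
          int n = 0;
          do {
                  unsigned char c = size & 0x7f;
                  size >>= 7;
                  if (size)
                          c |= 0x80;      /* continuation bit */
                  buf[n++] = c;
          } while (size);
          return n;
  }

  /* Decode with a 32-bit accumulator, mimicking size_t on a 32-bit host. */
  static uint32_t decode_size_32(const unsigned char *buf, int len)
  {
          uint32_t size = 0;
          int i, shift = 0;

          for (i = 0; i < len; i++) {
                  /* bits above 31 wrap away: the decoded size is truncated */
                  size |= (uint32_t)(buf[i] & 0x7f) << shift;
                  shift += 7;
                  if (!(buf[i] & 0x80))
                          break;
          }
          return size;
  }

  int main(void)
  {
          unsigned char buf[10];
          uint64_t real = 5ULL << 30;     /* a 5GB target size */
          int len = encode_size(real, buf);
          uint32_t seen = decode_size_32(buf, len);

          /* The result-size check catches the mismatch on the 32-bit side. */
          printf("encoded %llu, 32-bit decoder sees %u -> %s\n",
                 (unsigned long long)real, (unsigned)seen,
                 (uint64_t)seen == real ? "ok" : "size check fails, delta rejected");
          return 0;
  }

The point is only that the truncation cannot go unnoticed: the reconstructed
result can never match the truncated header size, so the receiving side bails
out instead of silently producing a wrong blob.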