Re: [PATCH 6/6] zlib: zlib can only process 4GB at a time

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 13 Jun 2011 04:56:56 -0700

Erik Faye-Lund <kusmabite@xxxxxxxxx> writes:

> On Sun, Jun 12, 2011 at 11:33 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>> Erik Faye-Lund <kusmabite@xxxxxxxxx> writes:
>>
>>> On Fri, Jun 10, 2011 at 10:15 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>>>> The size of objects we read from the repository and data we try to put
>>>> into the repository are represented in "unsigned long", so that on larger
>>>> architectures we can handle objects that weigh more than 4GB.
>>>
>>> shouldn't this be "size_t" instead of "unsigned long"?
>>
>> No, this must be unsigned long as that is the internal type we use.

There are two unrelated issues you have to address if your "unsigned long"
is 32-bit and you want to handle more than 4GB of data in git.

When git holds repository data in core, it has always represented it as a
pair of <pointer to the beginning of memory block that holds data, length>
where the length has been "unsigned long" from day one.  See
read_sha1_file() in read-cache.c that appears in e83c516 (Initial revision
of "git", the information manager from hell, 2005-04-07). This limits you
to 4GB if your "unsigned long" is 32-bit.

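As a rough illustration of that limit (a sketch only; the struct and
function names here are made up for this message and do not appear in
git), the pair and a check of what "unsigned long" can express might look
like:

```c
#include <assert.h>
#include <limits.h>

/* The C standard guarantees "unsigned long" has at least 32 bits. */
static_assert(ULONG_MAX >= 0xFFFFFFFFUL, "unsigned long is at least 32-bit");

/* Hypothetical rendering of the <pointer, length> pair described above. */
struct in_core_data {
	void *buf;          /* beginning of the memory block */
	unsigned long len;  /* number of bytes held in that block */
};

/* Returns 1 if "unsigned long" can express a length of 4GB or more. */
int ulong_can_exceed_4gb(void)
{
	return ULONG_MAX > 0xFFFFFFFFUL;
}
```

On platforms where this returns 0, no in-core length can reach 4GB,
whatever is done about zlib.
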
The right type to use in order to enable more platforms to go beyond 4GB
might be uintmax_t, but the series you are commenting on is not about
changing that.

We have another problem, stemming from the way we incorrectly used the
zlib API, that exists even on a platform where "unsigned long" is capable
of expressing sizes beyond 4GB. In many places, we set up the state object
used by the zlib API (i.e. z_stream) to point at the "pointer to the
beginning of memory block" with its "next_in" field, and store "length" in
its "avail_in" field, pass that object around in the callchain, and expect
that by making repeated calls to zlib, "next_in" would eventually progress
to the end of the data we have in core while "avail_in" would fall to zero
when all data is processed. The "avail_in" field the zlib API gives us,
however, is uInt, which is 32-bit, so this expectation is incorrect. If you
have 4G+32 bytes of data, for example, the length is truncated to 32, so
we only feed 32 bytes and stop, barfing on "corrupt" data.
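
The arithmetic behind the 4G+32 example can be sketched as follows. This
is only an illustration: naive_avail_in() is a made-up name, and the
sketch assumes a platform where "unsigned int" (which is what zlib's uInt
is a typedef for) is 32-bit.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* This sketch only makes sense where zlib's uInt would be 32-bit. */
static_assert(UINT_MAX == 0xFFFFFFFFu, "sketch assumes 32-bit unsigned int");

/*
 * Hypothetical helper: the value that ends up in a 32-bit "avail_in"
 * when a wider in-core length is stored into it directly.
 */
unsigned int naive_avail_in(uint64_t length)
{
	return (unsigned int)length; /* only the low 32 bits survive */
}
```

With length = 4G+32, i.e. ((uint64_t)1 << 32) + 32, this yields 32, which
is all zlib would ever be told about.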

That is the issue this series is about. The approach the series takes is
to wrap zlib's state object with our own, which has its own "avail_in"
field (by the way, the same issue exists in "next_out/avail_out" on the
output side) that uses the same type as the "length" used in other parts
of our system.
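
A minimal sketch of that idea, for the input side only, might look like
the following. Everything here is hypothetical and is not the code in the
series: fake_z_stream stands in for z_stream so that the example builds
without zlib, offsets are used instead of pointers so that 4G+32 bytes can
be simulated without allocating them, uint64_t stands in for our internal
"length" type, and CHUNK_CAP is an arbitrary value chosen to fit in 32
bits.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the part of z_stream discussed above. */
struct fake_z_stream {
	uint64_t next_in;   /* offset into the input, instead of a pointer */
	uint32_t avail_in;  /* 32-bit, like zlib's uInt */
};

/* Wrapper that keeps the full-width length next to zlib's state. */
struct wrapped_stream {
	struct fake_z_stream z;
	uint64_t next_in;
	uint64_t avail_in;
};

/* Most we hand to the inner stream per call; must fit in 32 bits. */
#define CHUNK_CAP ((uint64_t)1 << 30)

/* Copy our state into the inner stream, capping the length. */
void wrapped_begin(struct wrapped_stream *s)
{
	uint64_t chunk = s->avail_in < CHUNK_CAP ? s->avail_in : CHUNK_CAP;

	s->z.next_in = s->next_in;
	s->z.avail_in = (uint32_t)chunk;
}

/* Fold the progress made by the inner stream back into our state. */
void wrapped_end(struct wrapped_stream *s)
{
	uint64_t used = s->z.next_in - s->next_in;

	assert(used <= s->avail_in);
	s->next_in += used;
	s->avail_in -= used;
}

/* Stand-in for inflate()/deflate(): consumes all it was given. */
void fake_zlib_call(struct fake_z_stream *z)
{
	z->next_in += z->avail_in;
	z->avail_in = 0;
}

/* Feed "length" bytes through the wrapper; return how many were fed. */
uint64_t consume_all(uint64_t length)
{
	struct wrapped_stream s = { { 0, 0 }, 0, length };

	while (s.avail_in) {
		wrapped_begin(&s);
		fake_zlib_call(&s.z);
		wrapped_end(&s);
	}
	return s.next_in;
}
```

Callers look only at the wrapper's "avail_in", so the loop keeps going
until all 4G+32 bytes are fed, in chunks the inner 32-bit field can
express.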

The type of the "avail_in" and "avail_out" fields in the wrapper needs to
be updated to match when you address the "other" issue and change all the
internal "length" from "unsigned long" to "uintmax_t", but not before. And
updating the rest of the system to "uintmax_t" is outside the scope of
this series.