Re: [RFC] 'unsigned long' to 'size_t' conversion

Jeff King <peff@xxxxxxxx> · Wed, 6 Dec 2017 16:43:44 -0500

On Wed, Dec 06, 2017 at 10:08:23AM -0500, Derrick Stolee wrote:

> There are several places in Git where we refer to the size of an object by
> an 'unsigned long' instead of a 'size_t'. In 64-bit Linux, 'unsigned long'
> is 8 bytes, but in 64-bit Windows it is 4 bytes.
> 
> The main issue with this conversion is that large objects fail to load (they
> seem to hash and store just fine). For example, the following 'blob8gb' is
> an 8 GB file where the ith byte is equal to i % 256:

Yeah, I think there's widespread agreement that this needs fixing. It's
just big enough that nobody has tackled it. If you're willing, that
sounds great to me. ;)

> In my opinion, the correct thing to do would be to replace all 'unsigned
> long's that refer to an object size and replace them with 'size_t'. However,
> a simple "git grep 'unsigned long size'" reveals 194 results, and there are
> other permutations of names and pointer types all over.

Yep. I think it's actually even trickier than that, though. Something
like off_t is probably the right type for the abstract concept of an
object size, since objects can be larger than the address space (in
which case we'd generally hit streaming code paths when possible).

But then of course we'd want to use size_t for objects that we've
actually placed into a buffers. At first glance it might be OK to just
use a larger type everywhere, but I think that runs into some funny
situations. For instance, you would not want to pass an off_t to
xmalloc(), as it is truncated (which may lead to a too-small buffer and
then a heap-smashing overflow).

So in an ideal world it is something like:

  - abstract object types are always off_t

  - lower-level code that takes an object in a buffer always uses size_t

  - at the conversion point, we must always use xsize_t() to catch
    conversion problems.

  - Any time that xsize_t() might die is a place where the caller should
    probably figure out if it can use a streaming code path for large
    objects.

> This conversion would be a significant patch, so I wanted to get the
> community's thoughts on this conversion.
> 
> If there are small, isolated chunks that can be done safely, then this may
> be a good target for a first patch.

Yes, I agree this probably has to be done incrementally. The threads
Thomas linked might be good starting points. But do be careful with
incremental changes that you don't introduce any spots where we might
get a value mismatch that used to be consistently truncated. E.g., this
code is wrong but not a security problem:

  size_t full_len = strlen(src); /* ok */
  unsigned long len = full_len; /* possible truncation! */
  char *dst = xmalloc(len); /* buffer possibly smaller than full_len */
  memcpy(dst, src, len); /* copies truncated value */

But this has a vulnerability:

  memcpy(dst, src, full_len); /* whoops, buffer is too small! */

Obviously this is a stupid toy example, but in the real world things
like that "unsigned long" assignment happen via implicit conversion in
function parameters. Or we never have "full_len" in the first place, and
instead build it up over a series of operations that write into the
buffer. Etc.

I made a pass over the whole code-base a while back to fix any existing
problems and introduce helpers to make it harder to get this wrong
(e.g., by making sure a consistent value is used for allocating and
populating a buffer). So I _hope_ that it should be hard to accidentally
regress any code by switching out types. But it's something to be aware
of.

-Peff