Re: [PATCH] Fix in Git.pm cat_blob crashes on large files

Joshua Clayton <stillcompiling@xxxxxxxxx> · Thu, 21 Feb 2013 15:18:40 -0800

On Thu, Feb 21, 2013 at 2:43 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Thu, Feb 21, 2013 at 02:13:32PM -0800, Joshua Clayton wrote:
>
>> Greetings.
>> This is my first patch here. Hopefully I get the stylistic & political
>> details right... :)
>> Patch applies against maint and master
>
> I have some comments. :)
>
> The body of your email should contain the commit message (i.e., whatever
> people reading "git log" a year from now would see). Cover letter bits
> like this should go after the "---". That way "git am" knows which part
> is which.
>
>>         Developer's Certificate of Origin 1.1
>
> You don't need to include the DCO. Your "Signed-off-by" is an indication
> that you agree to it.
>
>> Affects git svn clone/fetch
>> Original code loaded entire file contents into a variable
>> before writing to disk. If the offset within the variable passed
>> 2 GiB, it becrame negative, resulting in a crash.
>
> Interesting. I didn't think perl had signed wrap-around issues like
> this, as its numeric variables are not strictly integers. But I don't
> have a 32-bit machine to test on (and numbers larger than 2G obviously
> work on 64-bit machines). At any rate, though:
>
>> On a 32 bit system, or a system with low memory it may crash before
>> reaching 2 GiB due to memory exhaustion.
>
> Yeah, it is stupid to read the whole thing into memory if we are just
> going to dump it to another filehandle.
>
>> @@ -949,13 +951,21 @@ sub cat_blob {
>>               last unless $bytesLeft;
>>
>>               my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
>> -             my $read = read($in, $blob, $bytesToRead, $bytesRead);
>> +             my $read = read($in, $blob, $bytesToRead, $blobSize);
>>               unless (defined($read)) {
>>                       $self->_close_cat_blob();
>>                       throw Error::Simple("in pipe went bad");
>>               }
>
> Hmph. The existing code already reads in 1024-byte chunks. For no
> reason, as far as I can tell, since we are just loading the blob buffer
> incrementally into memory, only to then flush it all out at once.
>
> Why do you read at the $blobSize offset? If we are just reading in
> chunks, we be able to just keep writing to the start of our small
> buffer, as we flush each chunk out before trying to read more.
>
> IOW, shouldn't the final code look like this:
>
>   my $bytesLeft = $size;
>   while ($bytesLeft > 0) {
>           my $buf;
>           my $bytesToRead = $bytesLeft < 1024 ? $bytesLeft : 1024;
>           my $read = read($in, $buf, $bytesToRead);
>           unless (defined($read)) {
>                   $self->_close_cat_blob();
>                   throw Error::Simple("unable to read cat-blob pipe");
>           }
>           unless (print $fh $buf) {
>                   $self->_close_cat_blob();
>                   throw Error::Simple("unable to write blob output");
>           }
>
>           $bytesLeft -= $read;
>   }
>
> By having the read and flush size be the same, it's much simpler.

My original bugfix did just read 1024, and write 1024. That works fine
and, yes, is simpler.
I changed it to be more similar to the original code in case there
were performance reasons for doing it that way.
That was the only reason I could think of for the design, and adding
the $flushSize variable means that
some motivated person could easily optimize it.

So far I have been too lazy to profile the two versions....
I guess I'll try a trivial git svn init; git svn fetch and check back in.
Its running now.

>
> Your change (and my proposed code) do mean that an error during the read
> operation will result in a truncated output file, rather than an empty
> one. I think that is OK, though. That can happen anyway in the original
> due to a failure in the "print" step. Any caller who wants to be careful
> that they leave only a full file in place must either:
>
>   1. Check the return value of cat_blob and verify that the result has
>      $size bytes, and otherwise delete it.
>
>   2. Write to a temporary file, then once success has been returned from
>      cat_blob, rename the result into place.
>
> Neither of which is affected by this change.
>
> -Peff

In git svn fetch (which is how I discovered it) the file being passed
to cat_blob is a temporary file, which is checksummed before putting
it into place.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html