Re: [PATCH 0/4] Report rejections over HTTP when the remote rejects during the transfer

On Thu, 2024-06-13 at 05:11 -0400, Jeff King wrote:
> On Wed, Jun 12, 2024 at 01:50:24PM +0200, Carlos Martín Nieto wrote:
> 
> > While investigating a difference between HTTP and SSH when rejecting a push due
> > to it being too large, I noticed that rejecting a push without receiving the
> > entire packfile causes git to print out the error message "pack exceeds maximum
> > allowed size" but it also shows "Everything up-to-date" instead of the rejection
> > of every ref update as the server specified.
> > 
> > This is the result of two issues in git, of which I aim to fix one here, namely
> > 
> >   1) when the server sends the response and closes the connection, remote-curl
> >   sees that as an error and stops processing the send-pack output, combined with
> > 
> >   2) git does not remember what it asked the remote helper to push so it cannot
> >   distinguish whether an empty report means "I had an error and did nothing" or
> >   "everything was up to date and I didn't have to do anything".
> > 
> > The latter issue is more complex so here I'm concentrating on the former, which
> > has a simple solution but a complex test. The solution is to read to the end
> > of what send-pack is telling us (it's creating the whole packfile that we're
> > throwing away anyway) so we can report back to the user.
> 
> Hmm. I think the second one is much more important, though. If the
> remote helper dies unexpectedly for any reason, shouldn't the parent
> git-push realize this and propagate the error? It sounds like you are
> fixing one _specific_ case where the remote helper dies so that we don't
> run into the problem caused by (2).

I am concentrating on this one bug that seems relatively self-contained
(other than the testing). The (2) bug requires a lot more reworking of
the remote helper mechanism. And fixing it would only address one
aspect, namely the missing ref report that gets replaced by "Everything
up-to-date". It still wouldn't display the message the server sent back
saying why the push was rejected (though as you later argue, it might
be correct to ignore that message when we send back the 200).

> 
> But if we fixed (2), then (1) would not be nearly as important anymore

It's not as confusing with regard to the up-to-date message, but (1)
would still hide the error message from the server, so we'd just show
some generic message about the helper dying instead. And over HTTP it
would still be different from what we see over SSH.

> (if at all). But I think there is a little more going on with this
> specific case...
> 
> > The testing however proved a bit complicated as this bug requires the server to
> > cut off the connection while git is uploading the packfile. The existing HTTP
> > tests use CGI and as far as I've been able to test, httpd buffers too much for
> > us to be able to replicate the situation.
> 
> It is not buffering that gets in your way, but rather that Apache will
> actually drain the remainder of the request (throwing away the extra
> bytes) if the CGI dies. You can see this by running your test against
> apache and strace-ing the apache process. It is in a read/write loop
> streaming the packfile from network to the CGI's stdin, it hits EPIPE on
> the write(), and then it switches to just read().
> 

Yes, that was bad wording on my side. I think the effect here was that
I did not see the same issues when remote-curl or send-pack managed to
get further into their own data stream than what we see with the 500
from the server that disconnects.

> Which makes sense, because you can't send back the HTTP 200 response
> until you've read the whole request. RFC 7230 does make an exception for
> error responses:
> 
>   A client sending a message body SHOULD monitor the network connection
>   for an error response while it is transmitting the request.  If the
>   client sees a response that indicates the server does not wish to
>   receive the message body and is closing the connection, the client
>   SHOULD immediately cease transmitting the body and close its side of
>   the connection.
> 
> That's just a "SHOULD" but I think curl does this properly. In the case
> of GitHub's servers, a fatal condition from index-pack seems to trigger
> an HTTP 500. So that is OK by the standard above, but it does mean we
> lose the opportunity to provide any error report at the Git protocol
> level. And so you end up with output like:
> 
>   error: RPC failed; HTTP 500 curl 22 The requested URL returned error: 500
>   send-pack: unexpected disconnect while reading sideband packet
>   fatal: the remote end hung up unexpectedly
> 
> whereas the same push over ssh produces[1]:
> 
>   remote: fatal: non-blob object size limit exceeded (commit ddc6d4e7091229184537d078666afb241ea8ef62 is 104857798 bytes)
>   error: remote unpack failed: index-pack failed
>   To github.com:peff/test.git
>    ! [remote rejected] main -> main (failed)
>   error: failed to push some refs to 'github.com:peff/test.git'
> 
> And even if we teach the remote helper to soak up the whole pack, we're
> still going to see the same useless output. So I would argue that the
> first fix here needs to be at the server level. It should be soaking up
> all of the request bytes if possible so that we can write back an HTTP
> 200 with the appropriate error report.
> 
> Of course it's kind of lame that we don't cut off the client and let it
> keep sending useless bytes (though really it is not any different than
> fsck errors, which we queue after reading the whole input).

Ideally we would be able to tell the client quickly too, but that
presents some challenges on both the server and the client.

> 
> It might be possible to send an application/x-git-receive-pack-result as
> the body of the 500 response. And then we could teach the client side to
> notice and handle that better. We could possibly even use a different
> code (413 Payload Too Large?), and I suspect existing clients would
> behave sensibly (and new ones could give a better message). We _might_
> even be able to get away with a "200" response here, though even if curl
> is OK with it, I think we are stretching the RFC a bit.

Yes, I had completely missed this before. I thought it was a curl
behaviour, but we actually just ignore any output from the server in
rpc_in if the status is >= 300. That makes sense, since we don't want
to try to parse a generic error page, but if the server manages to set
the content-type, it can tell git that the body is a report worth
parsing even though the request failed. A slight modification to my
server script suggests this does work, and it might have been a
significant part of what I was fighting against before.
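
For illustration, something along these lines is what I tried (a
heavily trimmed, hypothetical sketch rather than the script from this
series; the handler name, port and report contents are made up, and it
ignores side-band wrapping and the rest of the smart HTTP protocol):

    import http.server

    def pkt_line(s):
        # pkt-line framing: four hex digits of total length, then payload
        data = s.encode()
        return ("%04x" % (len(data) + 4)).encode() + data

    class Handler(http.server.BaseHTTPRequestHandler):
        def do_POST(self):
            # pretend receive-pack rejected the push early on
            body = (pkt_line("unpack pack exceeds maximum allowed size\n")
                    + pkt_line("ng refs/heads/main pack too large\n")
                    + b"0000")
            self.send_response(500)
            # this content-type is what tells git the body is parseable
            self.send_header("Content-Type",
                             "application/x-git-receive-pack-result")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    http.server.HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()

The only interesting part is that the error status and the
receive-pack-result content-type go out together, so the client still
gets a report it knows how to read.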

If git (send-pack here I think) can reliably detect the server sending
an error, then it might be able to stop sending the data early. That
would then work for any protocol.

> 
> [1] The keen-eyed observer may notice this is failing due to a different
>     reason than "pack exceeds maximum allowed size". I happen to know
>     that GitHub's version of index-pack will also reject commits over 100MB
>     in the same unceremonious way, so we can generate the same condition
>     without having to spend tons of bandwidth:
> 
>       # start with one small commit
>       git init
>       git commit --allow-empty -m foo
> 
>       # now make a too-big one. Note that it will compress well with
>       # zlib!
>       git cat-file commit HEAD >commit
>       perl -e 'print "a" x 79, "\n" for (1..1310720)' >>commit
>       commit=$(git hash-object -t commit -w commit)
>       git update-ref HEAD $commit
> 
>       # pushing just the commit above will not generate the issue,
>       # because we'll (racily) have sent the rest of the pack by
>       # the time the server notices the problem. So tack on a bunch of
>       # big (and uncompressible) irrelevant data.
>       dd if=/dev/urandom of=foo.rand bs=1M count=90
>       git add foo.rand
>       git commit -m 'big data'
> 
>     And now pushing over http will quickly-ish get you to the ugly
>     output (but over ssh is OK). But note this only works with GitHub!
>     Upstream Git does not have the same object-size check.
> 
> > This is why there's a python Git server in this patch series that doesn't rely
> > on CGI but streams the data both ways so it can close the stream as soon as
> > receive-pack exits. There's already some python tooling in the project and I'm
> > much more familiar with it than e.g. perl, so I hope that's fine. I tried to
> > make it as simple as possible while still being able to stream bidirectionally.
> 
> I admit that I don't love carrying a separate webserver implementation,
> but I doubt we can convince Apache not to do the draining behavior.
> Though based on what I wrote above, I'm not sure that non-draining is a
> behavior we care about testing that much (we _do_ care about getting an
> HTTP 500 and bailing from the helper, but that is much easier to test).

Coming at this from the side where you see it for a large packfile, I'm
partial to git working in such a way that we don't have to eat up the
entire rest of the packfile. It would also be nice if we could return
other errors to the user quickly for their own sake.

> 
> If we are going to carry some custom server code, python brings some
> complications. We don't currently depend on python at all in the test
> suite, outside of the git-p4 tests. And those are disabled if NO_PYTHON
> is set. We do assume that we have access to perl in the test suite even
> if NO_PERL is set, but we try to limit ourselves to a very vanilla
> subset that would work even on ancient versions of perl (so basically
> treating perl like sed/awk as being in the "it's everywhere" unix
> toolbox).
> 
> I don't know if we could do the same with python. I suspect there are
> systems without it, even these days. I'm also not sure what portability
> issues we might see on the specific components you're using. E.g., is
> http.server always available?

AFAICT I did not rely on anything that doesn't come with the standard
library, although I did write it for python 3, which may not be easy to
install everywhere git has ever run. It should also be workable with
python 2, but I appreciate that it's an extra requirement. I did see a
lot of python in git-p4, but if that's the only thing, then that's a
different support level from the core of git.
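
For reference, the interesting bit looks roughly like this (a heavily
trimmed, hypothetical sketch, not the actual script from the series; it
skips the info/refs advertisement, chunked and gzipped request bodies,
and all error handling, and the repository path is made up):

    import http.server
    import subprocess
    import threading

    class StreamingHandler(http.server.BaseHTTPRequestHandler):
        def do_POST(self):
            # run receive-pack and relay both directions; stop uploading
            # as soon as receive-pack closes its end of the pipe
            proc = subprocess.Popen(
                ["git", "receive-pack", "/srv/git/test.git"],  # made up
                stdin=subprocess.PIPE, stdout=subprocess.PIPE)

            self.send_response(200)
            self.send_header("Content-Type",
                             "application/x-git-receive-pack-result")
            self.end_headers()

            # relay receive-pack's report to the client in the background
            def pump_out():
                while True:
                    # read1 returns as soon as some data is available
                    data = proc.stdout.read1(8192)
                    if not data:
                        break
                    self.wfile.write(data)
            pump = threading.Thread(target=pump_out)
            pump.start()

            remaining = int(self.headers["Content-Length"])
            while remaining > 0:
                chunk = self.rfile.read(min(8192, remaining))
                if not chunk:
                    break
                remaining -= len(chunk)
                try:
                    proc.stdin.write(chunk)
                except BrokenPipeError:
                    break  # receive-pack exited early; stop feeding it
            try:
                proc.stdin.close()
            except BrokenPipeError:
                pass
            pump.join()

    server = http.server.HTTPServer(("127.0.0.1", 8000), StreamingHandler)
    server.serve_forever()

The point is just that nothing beyond http.server, subprocess and
threading is needed, so it stays within the standard library.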


Cheers,
   cmn




