On Fri, Sep 13, 2024 at 12:26:41PM -0700, Junio C Hamano wrote: > The last test in t5512 we recently added seems to be flaky. > Running > > $ make && cd t && sh ./t5512-ls-remote.sh --stress > > shows that "git ls-remote foo::bar" exited with status 141, which > means we got a SIGPIPE. This test piece was introduced by 9e89dcb6 > (builtin/ls-remote: fall back to SHA1 outside of a repo, 2024-08-02) > and is pretty much independent from all other tests in the script > (it can even run standalone with everything before it removed). > > The transport-helper.c:get_helper() function tries to write to the > helper. As we can see the helper script is very short and can exit > even before it reads anything, when get_helper() tries to give the > first command, "capabilities", the helper may already be gone. > > A trivial fix, presented here, os to make sure that the helper reads > the first command it is given, as what it writes later is a response > to that command. This analysis looks right, and I think the test change here is a good minimal fix. It does leave "real world" cases where we'd hit SIGPIPE vulnerable to the same problem, but I don't think that happens all that often (any helper that doesn't speak the protocol correctly and causes spurious pipe failures will annoy users and get fixed). So I think we should apply it as-is. But see below for some interesting historical context. > I however would wonder if the interactions with the helper initiated > by get_helper() should be done on a non-blocking I/O (we do check > the return value from our write(2) system calls, do we?). You still get SIGPIPE with non-blocking I/O. I assume you meant ignoring SIGPIPE? That's certainly an option, but unfortunately it does mean we'd ignore SIGPIPE everywhere. We do generally check the return value of write(), but not always for buffered I/O. So in something like: git ls-remote | head -1 we'd probably do a bunch of useless writes to the closed pipe. That's not the end of the world, given that we've done most of the work before generating the first line of output anyway, but that's the case that SIGPIPE is really supposed to be helping with. It's much worse on something like git-log, but if we are just talking about ignoring SIGPIPE in users of transport-helper.c, that's much more limited (though a confusing concept when you start thinking about libification, which may do many things in one process). I really wish there was a way to ignore SIGPIPE per-descriptor (or even tell which descriptor triggered it in a signal handler), but I don't think there is. Some environments like "go" claim to do this, but I think it's really that they ignore SIGPIPE, and then assume all writes are going through their wrapper, which converts EPIPE into the signal. That's a bit harder for us to do (even if we got xwrite(), you have to similarly wrap fprintf, and so on). Another option is to selectively disable/re-enable SIGPIPE for individual write() calls. That's not thread-safe, and I imagine may have some performance penalty in practice due to the extra syscalls. But it might make sense to do it selectively. And now we get into the historical bits. A long time ago, we noticed that a totally broken helper (i.e., one that exits immediately) would cause Git to exit silently, which could be confusing. So we added an error message when we saw an unexpected close of the pipe. But people found that annoying, because most of the time the helper already said something useful to stderr, and the user did not even care that a helper was in use. So we got rid of it. But we discussed the concept of complaining only about the initial "capabilities" request, which would catch helpers that failed to even start. There's some discussion in this thread: https://lore.kernel.org/git/20140117201325.GB775@xxxxxxxxxxxxxxxxxxxxx/ But the most interesting part to me is that on that same day over 10 years ago I wrote the patch below, but apparently never sent it. And I've been rebasing it forward ever since (I have several dozen of these "to look at and think about again" topic branches, and I guess I let this one slide for quite a while). Anyway, the interesting thing is that it does the exact "turn off SIGPIPE only for this call" strategy discussed above. And so I did not even see the race in t5512, because it was "fixed" in my fork. I use quotation marks there, though, because the test was sometimes racily succeeding for the wrong reason (because we got EPIPE, not because of the forbidden capability response). So even if we took the patch below, I think your test fix is the right thing to do. -- >8 -- From: Jeff King <peff@xxxxxxxx> Date: Fri, 17 Jan 2014 15:36:21 -0500 Subject: [PATCH] print an error when remote helpers die during capabilities The transport-helper code generally relies on the remote-helper to provide an informative message to the user when it encounters an error. In the rare cases where the helper does not do so, the output can be quite confusing. E.g.: $ git clone https://example.com/foo.git Cloning into 'foo'... $ echo $? 128 $ ls foo /bin/ls: cannot access foo: No such file or directory We tried to address this with 81d340d (transport-helper: report errors properly, 2013-04-10). But that makes the common case much more confusing. The remote helper protocol's method for signaling normal errors is to simply hang up. So when the helper does encounter a routine error and prints something to stderr, the extra error message is redundant and misleading. So we dropped it again in 266f1fd (transport-helper: be quiet on read errors from helpers, 2013-06-21). This puts the uncommon case right back where it started. We may be able to do a little better, though. It is common for the helper to die during a "real" command, like fetching the list of remote refs. It is not common for it to die during the initial "capabilities" negotiation, right after we start. Reporting failure here is likely to catch fundamental problems that prevent the helper from running (and reporting errors) at all. Anything after that is the responsibility of the helper itself to report. Signed-off-by: Jeff King <peff@xxxxxxxx> --- t/t5801-remote-helpers.sh | 11 +++++++++++ transport-helper.c | 16 +++++++++++++--- 2 files changed, 24 insertions(+), 3 deletions(-) diff --git a/t/t5801-remote-helpers.sh b/t/t5801-remote-helpers.sh index 20f43f7b7d..d21877150e 100755 --- a/t/t5801-remote-helpers.sh +++ b/t/t5801-remote-helpers.sh @@ -344,4 +344,15 @@ test_expect_success 'fetch tag' ' compare_refs local v1.0 server v1.0 ' +test_expect_success 'totally broken helper reports failure message' ' + write_script git-remote-broken <<-\EOF && + read cap_cmd + exit 1 + EOF + test_must_fail \ + env PATH="$PWD:$PATH" \ + git clone broken://example.com/foo.git 2>stderr && + grep aborted stderr +' + test_done diff --git a/transport-helper.c b/transport-helper.c index 09b3560ffd..98dbeb6d3d 100644 --- a/transport-helper.c +++ b/transport-helper.c @@ -89,11 +89,18 @@ static int recvline(struct helper_data *helper, struct strbuf *buffer) return recvline_fh(helper->out, buffer); } -static void write_constant(int fd, const char *str) +static int write_constant_gently(int fd, const char *str) { if (debug) fprintf(stderr, "Debug: Remote helper: -> %s", str); if (write_in_full(fd, str, strlen(str)) < 0) + return -1; + return 0; +} + +static void write_constant(int fd, const char *str) +{ + if (write_constant_gently(fd, str) < 0) die_errno(_("full write to remote helper failed")); } @@ -168,13 +175,16 @@ static struct child_process *get_helper(struct transport *transport) die_errno(_("can't dup helper output fd")); data->out = xfdopen(duped, "r"); - write_constant(helper->in, "capabilities\n"); + sigchain_push(SIGPIPE, SIG_IGN); + if (write_constant_gently(helper->in, "capabilities\n") < 0) + die("remote helper '%s' aborted session", data->name); + sigchain_pop(SIGPIPE); while (1) { const char *capname, *arg; int mandatory = 0; if (recvline(data, &buf)) - exit(128); + die("remote helper '%s' aborted session", data->name); if (!*buf.buf) break; -- 2.46.0.942.g308ef7fa08