On Mon, Nov 22, 2021 at 02:11:18PM -0500, Jeff King wrote: > On Sun, Nov 21, 2021 at 09:49:32PM -0800, Junio C Hamano wrote: > > > > That's a lot more tedious "if (!in_signal)" checks, but: > > > > > > - we don't have to duplicate any of the actual application logic > > > > > > - we'd now cover the extra cases for waitpid failing or returning the > > > wrong pid (previously if waitpid() failed we'd still look at status, > > > which could contain complete garbage!) > > > > Yeah, the repeated "if (!in_signal)" look a bit ugly, but fixing > > that "we only deal with ifexited in in_signal case" to do the right > > thing would make the code even more annoying and harder to maintain. > > OK. Let me see if I can clean this up into a full series that actually > fixes the race you saw, and breaks down the other fixes a bit more. Hmm. So I had thought to maybe split apart the earlier patches and interweave this one at the right spot. But really there is a mutual dependency in the test fixes: fixing the SIGPIPE issue means that this will start to fail (but only sometimes, if /bin/sh is bash!), but fixing the "exec" thing means it is only testing something sometimes (because of the pipe raciness). So I ended up doing this as just a separate preparatory fix, which is in the patch below. I see you picked up the earlier diff as jk/pager-exit-fix. This one gives much more explanation, and adjusts the test to actually cover this case. It can be applied directly on top of ab/pager-exit-log, replacing what's in jk/pager-exit-fix. And then I'd apply the other two (which are currently in jk/t7006-sigpipe-tests-fix) directly on top of that. IMHO it should all just be a single series that graduates together, but we could do the other test fixups separately (they are fixing races, but I think the races err on the side of the tests doing nothing, not failing erroneously). -- >8 -- Subject: run-command: unify signal and regular logic for wait_or_whine() Since 507d7804c0 (pager: don't use unsafe functions in signal handlers, 2015-09-04), we have a separate code path in wait_or_whine() for the case that we're in a signal handler. But that code path misses some of the cases handled by the main logic. This was improved in be8fc53e36 (pager: properly log pager exit code when signalled, 2021-02-02), but that covered only case: actually returning the correct error code. But there are some other cases: - if waitpid() returns failure, we wouldn't notice and would look at uninitialized garbage in the status variable; it's not clear if it's possible to trigger this or not - if the process exited by signal, then we would still report "-1" rather than the correct signal code This latter case even had a test added in be8fc53e36, but it doesn't work reliably. It sets the pager command to: >pager-used; test-tool sigchain The latter command will die by signal, but because there are multiple commands, there will be a shell in between. And it's the shell whose waitpid() call will see the signal death, and it will then exit with code 143, which is what Git will see. To make matters even more confusing, some shells (such as bash) will realize that there's nothing for the shell to do after test-tool finishes, and will turn it into an exec. So the test was only checking what it thought when /bin/sh points to a shell like bash (we're relying on the shell used internally by Git to spawn sub-commands here, so even running the test under bash would not be enough). This patch adjusts the tests to explicitly call "exec" in the pager command, which produces a consistent outcome regardless of shell. Note that without the code change in this patch it _should_ fail reliably, but doesn't. That test, like its siblings, tries to trigger SIGPIPE in the git-log process writing to the pager, but only do so racily. That will be fixed in a follow-on patch. For the code change here, we have two options: - we can teach the in_signal code to handle WIFSIGNALED() - we can stop returning early when in_signal is set, and instead annotate individual calls that we need to skip in this case The former is a simpler patch, but means we're essentially duplicating all of the logic. So instead I went with the latter. The result is a bigger patch, and we do run the risk of new code being added but forgetting to handle in_signal. But in the long run it seems more maintainable. I've skipped any non-trivial calls for the in_signal case, like calling error(). We'll also skip the call to clear_child_for_cleanup(), as we were before. This is arguably the wrong thing to do, since we wouldn't want to try to clean it up again. But: - we can't call it as-is, because it calls free(), which we must avoid in a signal handler (we'd have to pass in_signal so it can skip the free() call) - we'll only go through the list of children to clean once, since our cleanup_children_on_signal() handler pops itself after running (and then re-raises, so eventually we'd just exit). So this cleanup only matters if a process is on the cleanup list _and_ it has a separate handler to clean itself up. Which is questionable in the first place (and AFAIK we do not do). - double-cleanup isn't actually that bad anyway. waitpid() will just return an error, which we won't even report because of in_signal. Signed-off-by: Jeff King <peff@xxxxxxxx> --- run-command.c | 19 +++++++++---------- t/t7006-pager.sh | 2 +- 2 files changed, 10 insertions(+), 11 deletions(-) diff --git a/run-command.c b/run-command.c index 509841bf27..350593928d 100644 --- a/run-command.c +++ b/run-command.c @@ -551,20 +551,17 @@ static int wait_or_whine(pid_t pid, const char *argv0, int in_signal) while ((waiting = waitpid(pid, &status, 0)) < 0 && errno == EINTR) ; /* nothing */ - if (in_signal) { - if (WIFEXITED(status)) - code = WEXITSTATUS(status); - return code; - } if (waiting < 0) { failed_errno = errno; - error_errno("waitpid for %s failed", argv0); + if (!in_signal) + error_errno("waitpid for %s failed", argv0); } else if (waiting != pid) { - error("waitpid is confused (%s)", argv0); + if (!in_signal) + error("waitpid is confused (%s)", argv0); } else if (WIFSIGNALED(status)) { code = WTERMSIG(status); - if (code != SIGINT && code != SIGQUIT && code != SIGPIPE) + if (!in_signal && code != SIGINT && code != SIGQUIT && code != SIGPIPE) error("%s died of signal %d", argv0, code); /* * This return value is chosen so that code & 0xff @@ -575,10 +572,12 @@ static int wait_or_whine(pid_t pid, const char *argv0, int in_signal) } else if (WIFEXITED(status)) { code = WEXITSTATUS(status); } else { - error("waitpid is confused (%s)", argv0); + if (!in_signal) + error("waitpid is confused (%s)", argv0); } - clear_child_for_cleanup(pid); + if (!in_signal) + clear_child_for_cleanup(pid); errno = failed_errno; return code; diff --git a/t/t7006-pager.sh b/t/t7006-pager.sh index 0e7cf75435..032bde278e 100755 --- a/t/t7006-pager.sh +++ b/t/t7006-pager.sh @@ -767,7 +767,7 @@ test_expect_success TTY 'git attempts to page to nonexisting pager command, gets test_expect_success TTY 'git returns SIGPIPE on propagated signals from pager' ' test_when_finished "rm pager-used trace.normal" && - test_config core.pager ">pager-used; test-tool sigchain" && + test_config core.pager ">pager-used; exec test-tool sigchain" && GIT_TRACE2="$(pwd)/trace.normal" && export GIT_TRACE2 && test_when_finished "unset GIT_TRACE2" && -- 2.34.0.639.g83a0ed08a8