Re: [PATCH] trace2: intercept all common signals

Emily Shaffer <nasamuffin@xxxxxxxxxx> · Fri, 10 May 2024 12:34:07 -0700

On Fri, May 10, 2024 at 11:47 AM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Emily Shaffer <emilyshaffer@xxxxxxxxxx> writes:
>
> > From: Emily Shaffer <nasamuffin@xxxxxxxxxx>
> >
> > We already use trace2 to find out about unexpected pipe breakages, which
> > is nice for detecting bugs or system problems, by adding a handler for
> > SIGPIPE which simply writes a trace2 line. However, there are a handful
> > of other common signals which we might want to snoop on:
> >
> >  - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in
> >    frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`)
> >  - SIGHUP, when the network closes unexpectedly (indicating there may be
> >    a problem to solve)
> >
> > There are lots more signals which we might find useful later, but at
> > least let's teach trace2 to report these egregious ones. Conveniently,
> > they're also already covered by the `_common` variants in sigchain.[ch].
> >
> > Sigchain itself is already tested via helper/test-sigchain.c, and trace2
> > is tested in a number of places - let's also add tests demonstrating
> > that sigchain + trace2 works correctly.
> >
> > Signed-off-by: Emily Shaffer <nasamuffin@xxxxxxxxxx>
> > ---
> >  t/helper/test-trace2.c   | 17 +++++++++++++++++
> >  t/t0210-trace2-normal.sh | 22 ++++++++++++++++++++++
> >  trace2.c                 |  2 +-
> >  3 files changed, 40 insertions(+), 1 deletion(-)
> >
> > diff --git a/t/helper/test-trace2.c b/t/helper/test-trace2.c
> > index 1adac29a57..8970956ea8 100644
> > --- a/t/helper/test-trace2.c
> > +++ b/t/helper/test-trace2.c
> > @@ -231,6 +231,22 @@ static int ut_010bug_BUG(int argc UNUSED, const char **argv UNUSED)
> >       BUG("a %s message", "BUG");
> >  }
> >
> > +static int ut_011signal(int argc, const char **argv)
> > +{
> > +     const char *usage_error = "expect <bool common>";
> > +     int common = 0;
> > +
> > +     if (argc != 1 || get_i(&common, argv[0]))
> > +             die("%s", usage_error);
> > +
> > +     /*
> > +      * There is no strong reason SIGSEGV is ignored by trace2 - it's just
> > +      * not included by sigchain_push_common().
> > +      */
> > +     raise(common ? SIGINT : SIGSEGV);
> > +     return 0; /*unreachable*/
> > +}
> > +
> >  /*
> >   * Single-threaded timer test.  Create several intervals using the
> >   * TEST1 timer.  The test script can verify that an aggregate Trace2
> > @@ -482,6 +498,7 @@ static struct unit_test ut_table[] = {
> >       { ut_008bug,      "008bug",    "" },
> >       { ut_009bug_BUG,  "009bug_BUG","" },
> >       { ut_010bug_BUG,  "010bug_BUG","" },
> > +     { ut_011signal,   "011signal","" },
> >
> >       { ut_100timer,    "100timer",  "<count> <ms_delay>" },
> >       { ut_101timer,    "101timer",  "<count> <ms_delay> <threads>" },
> > diff --git a/t/t0210-trace2-normal.sh b/t/t0210-trace2-normal.sh
> > index c312657a12..c34ccc518c 100755
> > --- a/t/t0210-trace2-normal.sh
> > +++ b/t/t0210-trace2-normal.sh
> > @@ -244,6 +244,28 @@ test_expect_success 'bug messages followed by BUG() are written to trace2' '
> >       test_cmp expect actual
> >  '
> >
> > +test_expect_success 'trace2 reports common signals' '
> > +     test_when_finished "rm trace.normal actual" &&
> > +
> > +     # signals are fatal, so expect this to fail
> > +     ! env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 011signal 1 &&
>
> Is it deliberate that this does not use test_must_fail or is it an
> oversight?  The same comment applies to all other uses of "! env".
>
> We often see the use of "env" in conjunction with a test that is
> expected to fail because
>
>     test_must_fail VAR=VAL cmd
>
> simply does not work.  If you are not using test_expect_fail, then
>
>     ! VAR=VAL cmd
>
> should be sufficient, but it would mean that you will be happy even
> if the way "cmd" dies is not in a controlled way (e.g. due to
> receiving a signal).
>
> Ah, perhaps that is it?  Is "test-tool trace2 011signal 1" raise a
> signal to kill itself and after showing the event in the trace
> stream it is expected to die the narual death of receiving the same
> signal by re-raising it?

Yes, it is because test_must_fail expects "natural" death. You can
tell test_must_fail which signal you'd expect to receive, in theory,
but I didn't get it to work (and it will be tricky to provide the
correct signal in shell - I had originally hardcoded signal ints in
sh, but then moved the signal enum->int resolution into
helper/test-trace2.c because the alternative is doing some nasty
grepping on other shell utility outputs, since the signal codes aren't
platform/arch consistent).

Anyway, I will try it without `env`.

>
> If that is what is happening here, not using test_must_fail is
> absolutely the right thing to do, but then I doubt you need "env"
> there.  Also, if we know what signal is raised, then we should also
> know the exit status from this (i.e. signal number plus 128 or
> something) that we want to validate?  I dunno.

We could? But I don't feel strongly about it. If I specify the exit
status, the test will be brittle if we change the exit codes later
(for example, 128 or -1 for all error exits is kind of an antipattern;
it might be nice to ask Git to return meaningful error codes depending
on what went wrong, in the future). We are already checking later on
during the test_grep that we exited due to a fatal signal.

Thanks for the feedback - I'll get going on a v2 and aim to send it
later today, since I don't hear you saying that the patch's overall
goal is objectionable.

While I'm at it, since you pointed out ! instead of test_must_fail, I
wondered if I should change "! test_grep" as well - but when I grep t/
it looks like it's not usual to use `test_must_fail test_grep`, but
instead to use `test_grep ! <omitted pattern> <file>`. I'll change
that too.

I also wonder - do we want to capture SIGKILL as well? Some people may
have muscle memory for `kill -9` (I do, for better or worse) instead
of gentler `kill`. My intent was to notice any indication of user
frustration resulting in manual termination, which would include `kill
-9` too...

 - Emily