Re: [PATCH v2] run-command: don't spam trace2_child_exit()

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Wed, 08 Jun 2022 00:09:14 +0200

On Tue, Jun 07 2022, Josh Steadmon wrote:

> In rare cases[1], wait_or_whine() cannot determine a child process's
> status (and will return -1 in this case). This can cause Git to issue
> trace2 child_exit events despite the fact that the child may still be
> running. In pathological cases, we've seen > 80 million exit events in
> our trace logs for a single child process.
>
> Fix this by only issuing trace2 events in finish_command_in_signal() if
> we get a value other than -1 from wait_or_whine(). This can lead to
> missing child_exit events in such a case, but that is preferable to
> duplicating events on a scale that threatens to fill the user's
> filesystem with invalid trace logs.
>
> [1]: This can happen when:
>
> * waitpid() returns -1 and errno != EINTR
> * waitpid() returns an invalid PID
> * the status set by waitpid() has neither the WIFEXITED() nor
>   WIFSIGNALED() flags
>
> Signed-off-by: Josh Steadmon <steadmon@xxxxxxxxxx>
> ---
> Updated the commit message with more details about when wait_or_whine()
> can fail.
>
>  run-command.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/run-command.c b/run-command.c
> index a8501e38ce..e0fe2418a2 100644
> --- a/run-command.c
> +++ b/run-command.c
> @@ -983,7 +983,8 @@ int finish_command(struct child_process *cmd)
>  int finish_command_in_signal(struct child_process *cmd)
>  {
>  	int ret = wait_or_whine(cmd->pid, cmd->args.v[0], 1);
> -	trace2_child_exit(cmd, ret);
> +	if (ret != -1)
> +		trace2_child_exit(cmd, ret);
>  	return ret;
>  }

This seems like a legitimate issue, but I really don't think we should
sweep this under the rug like this.

 * Why can't we see if we logged such an event already in common_exit(),
   if we didn't we should trace2_child_exit() (or similar). I.e. not
   miss an event, ever.

 * Should this really be an "exit" event, aren't some of these failed
   signal events? Per the "should this be an exit event?" question in my
   related "signal on BUG" series.

 * We should have tests here, e.g. in t0210 to see the exact events we
   emit in certain cases, we really should have a test for this. Perhaps
   we can instrument a simulated failure with some GIT_TEST_* variables?